Coding Archives - Mathew Kiang (.com)
https://mathewkiang.com/category/coding/
Yes, with one "t".

My slides from a guest lecture on data visualization
https://mathewkiang.com/2022/08/02/my-slides-from-a-guest-lecture-on-data-visualization/ (Wed, 03 Aug 2022)

Student’s Tay Distribution
https://mathewkiang.com/2020/12/27/students-tay-distribution/ (Sun, 27 Dec 2020)

Taylor Swift has recorded 9 albums, each of which (except the most recent) has gone multi-platinum. In total, she has sold over 200 million records and won 10 Grammys, an Emmy, 32 AMAs, and 23 Billboard Music Awards. Not bad for somebody who just turned 31.

This year, she’s managed to release two albums — they’re both very good. However, I noticed there seemed to be more profanity than I remembered from her older albums. Here, I’ll use tidytext to see whether she has actually increased her rate of profanity or whether I’m simply misremembering things.

Descriptives

We’ll begin with some simple descriptives. How many words (y-axis) does each album (x-axis) have? Note that the x-axis uses albums as labels, but the ticks are spaced temporally. Below, we show all words (green), words after removing stop words (orange), and distinct words only (purple). It appears that the number of distinct words has stayed more or less constant throughout the discography, while the number of total words increased a bit and then decreased.
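(If you want to follow along, the counting boils down to something like the sketch below. lyrics_df is a hypothetical data frame with one row per lyric line and columns album, track, and line; the real code is linked at the end of the post.)

    library(tidyverse)
    library(tidytext)

    word_counts <- lyrics_df %>%
        unnest_tokens(word, line) %>%   # one row per word
        group_by(album) %>%
        summarize(
            all_words      = n(),                              # total words
            non_stop_words = sum(!word %in% stop_words$word),  # after removing stop words
            distinct_words = n_distinct(word)                  # unique words
        )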

Perhaps the increase in the middle is explained by the middle albums having more tracks? Below, we plot the average number of words per track (green) and the average number of distinct words per track (purple). Between the two plots, I think the more likely explanation for the increase in words in the middle albums is more repetition, not more tracks.

Sentiment

How has sentiment changed over time? That is, have albums gotten more or less sad or happy? Is there a pattern such that the albums get more (or less) happy (or sad) as the album progresses? Below, I plot the average sentiment (y-axis) for each track (x-axis) by album (facets). The lines are fitted generalized additive models.
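(The per-track scores are computed along the lines of the sketch below; it assumes the hypothetical lyrics_df from above and uses the “bing” lexicon, which simply tags words as positive or negative. I’m not claiming this is the exact code.)

    track_sentiment <- lyrics_df %>%
        unnest_tokens(word, line) %>%
        inner_join(get_sentiments("bing"), by = "word") %>%
        mutate(score = ifelse(sentiment == "positive", 1, -1)) %>%
        group_by(album, track) %>%
        summarize(mean_sentiment = mean(score), .groups = "drop")

    ggplot(track_sentiment, aes(x = track, y = mean_sentiment)) +
        geom_point() +
        geom_smooth(method = "gam") +   # the fitted generalized additive models
        facet_wrap(~ album)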

The sentiment across albums appears more or less flat (i.e., tracks do not get more or less positive as the album progresses). The exception is 1989, where the first track is “Welcome to New York” and “welcome” is classified as “positive”. In addition, the sixth track is “Shake it off” and “hater” (repeated in the chorus) is classified as “negative”. Excluding these two tracks results in a flat line.

Given that there appears to be no relationship between average sentiment and track position, maybe the albums themselves (across all lyrics and tracks) have different distributions of sentiment? Below is the density of sentiment for individual lyrics (i.e., lines) by album.
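(In code, that plot is roughly the following, assuming a hypothetical line_sentiment data frame with one mean sentiment score per lyric line.)

    ggplot(line_sentiment, aes(x = score, color = album)) +
        geom_density()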

Most lyrics (in all albums) are fairly neutral. Reputation is especially pointy (yes, that’s the statistical term), but the last three albums have a bit more mass on both sides of 0 than previous albums.

Profanity

Ok. So now the main question. Has profanity increased over time? Below is the rate of profanity per 1,000 words (y-axis) over time/album (x-axis) by word (colors).
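(A sketch of the rate calculation is below. profanity_words is a placeholder character vector standing in for an actual profanity word list, not the list I used.)

    profanity_rate <- lyrics_df %>%
        unnest_tokens(word, line) %>%
        group_by(album) %>%
        mutate(total_words = n()) %>%   # total words per album
        ungroup() %>%
        filter(word %in% profanity_words) %>%
        count(album, word, total_words) %>%
        mutate(rate_per_1k = n / total_words * 1000)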

Yes, there’s more profanity in the recent albums than in the older albums. Maybe 2020 is getting to TSwift too? Check out Folklore and Evermore if you haven’t already.

(Code is here.)

My collaboration network for 2010 to 2020 (+ other plots)
https://mathewkiang.com/2020/12/10/my-collaboration-network-for-2010-to-2020-other-plots/ (Thu, 10 Dec 2020)

In what has become a bit of an annual tradition, here is my collaboration network for 2010 to 2020. This year was rough. Of the two first-author papers published this year, one was pre-pandemic. I think it’s fair to say this wasn’t the level of productivity I was expecting of myself. Hopefully, a few projects still in the pipeline will come out early next year.

All that said, I’m thankful for a strong network of kind collaborators who picked up my slack when necessary, checked in on me even when we didn’t have an active project, and understood when childcare issues caused last minute Zoom cancellations.

You’ll have plenty of time to work with famous, smart, and/or fun people — 2020 was a good reminder of the importance of working with kind people.

 

The first time I made this plot, I noted how many components I had and how disjointed the collaboration networks were. Since then, there’s now (1) a dominant connected component; (2) my NYU component (top middle), which will likely always be disconnected; (3) my Health Policy and Management cluster (top left), which has a reasonable chance of connecting with the rest of the group now that Sara is at Stanford; and (4) a single paper with Alex (middle right), which will almost certainly join the rest of the group at some point. It’s also interesting to note the citation trajectories of the papers in the lower left. A couple of papers seem to get some traction, but for the most part, my papers tend to accrue around 5-10 citations per year.

Below is a plot of collaborations (circles) over time (x-axis) by collaborator (y-axis). I’ve worked fairly consistently with two people, Nancy and Jarvis, for six years, which is pretty wild. Most of my collaborations are bursty, with a rush of papers followed by long dormant periods, but a handful are pretty regular at roughly one paper per year. Most of my collaborators are one-time collaborators.

Another thing we can look at is which of my collaborators also collaborate with each other (conditional on me being on the paper). Below, I show my top ten collaborators (by number of collaborations) with a horizontal bar chart of the number of times we have worked together. The lower-right plot marks each intersecting subset of collaborators with dots and lines, along with how often that subset appears in my collaborations (vertical bars).

For example, there are six papers with Jason, Pam, Nancy, and Jarvis, and an additional two with the same group minus Jason. Sara is the outlier here with 5 collaborations — none of which involve another top-ten collaborator.
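(This kind of figure is an UpSet plot. A minimal sketch with the UpSetR package, assuming a hypothetical named list coauthor_papers that maps each top-ten collaborator to the IDs of the papers we share:)

    library(UpSetR)

    # coauthor_papers is assumed to look like:
    # list(Jarvis = c("paper01", "paper03"), Nancy = c("paper01"), ...)
    upset(fromList(coauthor_papers), order.by = "freq")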

Lastly, for kicks I wanted to see who my “most efficient” collaborator is. That is, conditional on more than one project together, who has the highest average number of citations per project?

The answer (two yellow dots in upper left) is Nishant and Rafa at about 150 citations per project. (The upper right is Jarvis followed by Nancy.)
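(The computation itself is a short dplyr chain; a sketch assuming a hypothetical pubs_df with one row per collaborator-paper pair and a citations column.)

    pubs_df %>%
        group_by(collaborator) %>%
        filter(n() > 1) %>%   # conditional on more than one shared project
        summarize(cites_per_project = mean(citations)) %>%
        arrange(desc(cites_per_project))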

Code is here. Note there are five files and you need to change my_id on line 21 of 01_pull_data.R to your Google Scholar ID.

Comparing daily (direct) COVID-19 deaths to other causes of death
https://mathewkiang.com/2020/11/26/comparing-daily-direct-covid-19-deaths-to-other-causes-of-death/ (Thu, 26 Nov 2020)

It’s easy to get numb at this stage of the pandemic, but a friendly reminder: daily (direct) COVID-19 deaths have been consistently higher than deaths from 8 of the top 10 causes of death (as ranked in 2018) since April.

We’re on track for over 3,000 deaths per day by Christmas (!!) — things are not good.

Code here. (This figure was last updated on 12/27/2020 — at least some of the decline in the last few days is simply due to holiday reporting delays.)

Applying an intro-level networks concept to deleting tweets
https://mathewkiang.com/2020/10/16/code-for-deleting-old-tweets/ (Fri, 16 Oct 2020)

There are a few services out there that will delete your old tweets for you, but I wanted to delete tweets with a bit more control. For example, there are some tweets I need to keep up for whatever reason (e.g., I need them for verification) and a few jokes I’m proud of and don’t want to delete.

If you just want the R code to delete some tweets based on age and likes, here it is (noting that it is based on Chris Albon’s Python script). In this post, I go over a bit of code about what I thought was an interesting problem: given a list of tweets, how can we identify and group threads?

Below, I plot all my tweets over time (x-axis) by the number of “likes” (y-axis) and I highlight in red tweets that are threaded together. Ignore the boxes for now.

Pulling the data using rtweet, you end up with a data frame that looks something like this (only with many, many more rows and columns):

> before_df %>% select(status_id, created_at, screen_name, text, reply_to_status_id)
# A tibble: 510 x 5
   status_id    created_at          screen_name text                           reply_to_status…
   <chr>        <dttm>              <chr>       <chr>                          <chr>           
 1 13167994911… 2020-10-15 17:53:58 mathewkiang "@JosephPalamar Haha — the on… 131679883199713…
 2 13167817880… 2020-10-15 16:43:37 mathewkiang "@khayeswilson Ah \"self-liki… 131678029839121…
 3 13167812579… 2020-10-15 16:41:31 mathewkiang "@khayeswilson AH! This is pe… 131678029839121…
 4 13167755278… 2020-10-15 16:18:44 mathewkiang "I've been coding up a script… NA              
 5 13165350914… 2020-10-15 00:23:20 mathewkiang "https://t.co/7PtlUKWTeU http… NA              
 6 13161332751… 2020-10-13 21:46:40 mathewkiang "Data: Full-time academic job… NA              
 7 13144052234… 2020-10-09 03:20:00 mathewkiang "@simonw Thanks for the info!… 131439639207741…
 8 13143912914… 2020-10-09 02:24:38 mathewkiang "@simonw Does this include da… 131439055526721…
 9 13142896495… 2020-10-08 19:40:45 mathewkiang "Me: This paper has been out … NA              
10 13136475049… 2020-10-07 01:09:06 mathewkiang "@Doc_Courtney If by “passing… 131364337679803…
# … with 500 more rows

To select the tweets you want to delete, it is straightforward to make a rule like: (1) delete all tweets created more than two years ago with fewer than 100 likes (left-most grey box in the plot) or (2) delete all tweets created more than 90 days ago with fewer than 25 likes (bottom grey box in the plot). You could even create a function where the number of likes must be exponentially higher over time. And, obviously, you can create a list of tweets (status_ids) that you want to keep.
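(As a sketch, rules (1) and (2) plus a keep list look something like this. keep_ids is a hypothetical vector of status_ids to never delete; favorite_count is the likes column returned by rtweet.)

    library(tidyverse)
    library(lubridate)

    to_delete <- before_df %>%
        filter(!status_id %in% keep_ids) %>%
        filter(
            (created_at < Sys.time() - years(2) & favorite_count < 100) |
                (created_at < Sys.time() - days(90) & favorite_count < 25)
        )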

However, this assumes all tweets are independent. Things get a bit more complicated if you want to treat a set of tweets as a unit, keeping or deleting the whole set based on the score of a single tweet in the set. If, for example, you string together a Twitter thread, you may want to delete or save the entire thread based only on the first tweet, since deleting the “unliked” tweets in the middle would break up the thread. Twitter doesn’t provide a column that links threads together through a unified ID.

After chatting with Malcolm Barrett about it for a bit, I realized this is a fairly simple network problem. If you imagine the data frame above, where every row is a tweet, as an edge list between the vertices status_id and reply_to_status_id, then you can remove all isolates to get trees of threads (most would just be chains). The key code is here, but to sketch out the broad points:

  1. Take before_df and (a) filter out isolates (i.e., non-threaded tweets) by keeping only tweets that either refer to or are referred to by another tweet in the data frame, and (b) drop replies to other people by removing tweets that start with “@”. Because it’s an edge list, we rename the columns to “from” and “to”, and if there is no terminating vertex (i.e., it’s the first tweet in a thread), we create a self-loop. Note that the result is assigned to thread_df for the next step.
    thread_df <- before_df %>%
            filter(status_id %in% reply_to_status_id |
                       reply_to_status_id %in% status_id,
                   substr(text, 1, 1) != "@") %>%
            select(from = status_id, to = reply_to_status_id) %>%
            mutate(to = ifelse(is.na(to), from, to))  # self-loop for the first tweet in a thread
  2. Now just convert this edge list into a graph and extract all the components (both graph_from_data_frame() and components() come from the igraph package):
    thread_assignments <- thread_df %>%
            graph_from_data_frame(directed = TRUE) %>%
            components()
  3. Now you have a mapping of every threaded tweet ID to a component ID. Below, I take this mapping and create a new ID for each component that is the status_id of the thread’s first tweet (min() works here because status IDs increase over time):
    id_mapping <- thread_df %>%
            select(status_id = from) %>%
            left_join(tibble(
                status_id = names(thread_assignments$membership),
                membership = thread_assignments$membership
            ), by = "status_id") %>%
            group_by(membership) %>%
            mutate(new_status_id = min(status_id)) %>%
            ungroup()

That’s it! With this mapping, you can left_join() the original data frame and perform manipulations on each thread as a group of tweets rather than on each tweet individually. Anyways, check out the gist to see how I deleted the tweets and implemented this. I just thought it was a nice, clean application of an introductory-level network concept to an applied data-cleaning problem.
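(Concretely, the join looks something like this sketch: threaded tweets inherit the ID of their thread’s first tweet, so you can score whole threads at once.)

    before_df %>%
        left_join(id_mapping %>% select(status_id, new_status_id),
                  by = "status_id") %>%
        mutate(thread_id = coalesce(new_status_id, status_id)) %>%   # non-threaded tweets keep their own ID
        group_by(thread_id) %>%
        mutate(thread_likes = max(favorite_count)) %>%   # score the whole thread together
        ungroup()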

After deleting old and boring tweets and keeping tweets I liked (taking into account groups), I’m left with the black points above. The grey points were tweets that I ended up deleting.

(Disclaimer: There’s almost certainly a better way to do this — I just don’t know it.)

I scraped data of all San Francisco public elementary schools for parents figuring out the school lottery system
https://mathewkiang.com/2020/03/01/i-scraped-data-of-all-san-francisco-public-elementary-schools-for-parents-figuring-out-the-school-lottery-system/ (Sun, 01 Mar 2020)

Collaboration network from 2010 to 2019
https://mathewkiang.com/2019/12/07/collaboration-network-from-2010-to-2019/ (Sat, 07 Dec 2019)

I have been trying to wrap my head around working with temporal networks — not just simple edge activation that changes over time but also evolving node attributes and nodes that may appear and disappear at random. What better way than to work with a small concrete example I’m already very familiar with?

Here is an update to a post I made a little over a year ago about my collaboration network. Each paper or project (blue) is connected to its collaborators (red). The size of each blue node is the cumulative number of citations that paper has received from publication through the current year (shown in the upper left).

Code available at this gist (note that it’s two files).

Quick look at NIH K-award funding
https://mathewkiang.com/2018/12/12/quick-look-at-nih-k-award-funding/ (Wed, 12 Dec 2018)

Motivated by a chat with Maria Glymour, I took a quick look at NIH K-award funding rates. It’s a very exploratory/descriptive look, but all the code is up on my GitHub. I’m hoping to find time to dive into the data more at some point.

Just putting it here, with no commentary, in case others who are applying for K’s might find it useful.

UPDATE: Since this post, I applied for, and received, a K99 — check out that post for more up-to-date numbers and a new Shiny app.

Overall, how much money does the NIH award through these early career mechanisms?

What is the breakdown of K-award funding by institute/center?

Which mechanisms provide the most funding, by institute/center?

Which mechanisms/centers receive the most applications and have the highest success rate?

How have the success rate and funding amount changed over time for the “big K’s”?

Conclusion

I might add more analysis or make a Shiny app in the future. If I do, the GitHub repo will be the place to look.

My Collaboration Network
https://mathewkiang.com/2018/06/17/my-collaboration-network/ (Sun, 17 Jun 2018)

My Twitter timeline is blowing up with #NetSci2018 tweets and awesome visualizations this week, so I was inspired to see if I could quickly make my own “gratuitous collaboration graph” (as Dan would say).

Hover over each node to see the name of the paper (red), co-author (blue), or other project (green for data and orange for software).

(Source page.)

(NOTE: Apparently, this won’t render properly on mobile devices. Sorry.)

Turns out it’s straightforward in R to pull from Google Scholar (via the scholar package) and visualize your collaboration network (via visNetwork). I’ve posted all the code up on a gist and it’s fairly self-explanatory so I won’t go into that part.
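(For the curious, the core idea condenses to something like the sketch below. YOUR_SCHOLAR_ID is a placeholder, and Google Scholar sometimes truncates long author lists, so the real code in the gist does more careful cleaning than this.)

    library(tidyverse)
    library(scholar)
    library(visNetwork)

    my_id <- "YOUR_SCHOLAR_ID"        # placeholder; use your own Scholar ID
    pubs  <- get_publications(my_id)

    # Bipartite edge list: each paper connects to each of its authors
    edges <- pubs %>%
        transmute(from = as.character(title), to = as.character(author)) %>%
        separate_rows(to, sep = ",\\s*")

    nodes <- tibble(id = unique(c(edges$from, edges$to))) %>%
        mutate(label = id,
               group = ifelse(id %in% edges$from, "paper", "coauthor"))

    visNetwork(nodes, edges)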

Random thoughts and things that surprised me:

  • The completely disconnected components roughly correspond to different positions, research phases, or institutions. There are a lot more components than I had anticipated.
  • In the center component is my time spent as a research assistant for Sara Singer in health policy and management. Sara is the hub of that component with spokes out to different groups of people. Not surprising given her work on interdisciplinary teams.
  • In the middle-right is my previous (and still somewhat on-going) work with Nancy Krieger, Jarvis Chen, Brent Coull, and Jason Beckfield. They form a tight cluster and work together often (conditional on me also being a co-author on those projects).
  • Middle-left is my time at NYU with Joseph Palamar and Perry Halkitis. Perry was head of a research center so he is fairly central in that component while Joey was my mentor (and Perry’s student) so all papers with Joey also have Perry on them.
  • Then there are a few early collaborations — a couple stars and a line. Groups of people I don’t typically work with and/or am just starting to publish with.
  • (Then there’s the isolate for an R package related to one of my dissertation papers.)
  • It’s actually not as clear as I expected how I got from one component to another, though that will likely change: many components will be connected as soon as one of my dissertation papers gets published.
  • Maybe I really do lack focus on my research projects.

I made an R package for working with NCHS multiple cause of death data.
https://mathewkiang.com/2017/11/13/i-made-an-r-package-for-working-with-nchs-multiple-cause-of-death-data/ (Mon, 13 Nov 2017)
