Not Public Health Archives - Mathew Kiang (.com)

Plots of my biking in 2022

MV Kiang — Sun, 15 Jan 2023 00:04:55 +0000

One of the best things about living in California is having amazing weather nearly all year long (the current 3-week-stretch-of-non-stop-rain aside). So last year, I decided to capitalize on the weather and made a New Year’s resolution to bike outdoors more. Specifically, I wanted to bike 1,500 miles outdoors in addition to my normal indoor biking of 2,500 miles. (Also, with a side quest of 100,000 feet of cumulative elevation gain.)

Below is a plot of my cumulative distance (and elevation) over the course of the year. I just barely got the distance resolutions with 1,551.7 miles outdoors and 2,506.1 indoors. I missed the elevation resolution by about 16,000 feet (ending at 83,899 feet).

Red represents Peloton rides while blue represents outdoor rides. The vertical shaded areas represent periods where I was away from home or unable to bike. In September, I got COVID, which kept me off the bike for weeks and way behind schedule on my miles. Even when I did manage to get back on the bike, I was unable to do particularly long or intense rides for a few more weeks. Brutal.

Anyways, as a San Diego native, I never really understood the hype around San Francisco: it just seemed crowded with dirty beaches. It’s still crowded with dirty beaches but biking has given me a much greater appreciation for the city. It’s surreal to ride between skyscrapers, then along the Bay overlooking Alcatraz, then through the giant pine trees of the Presidio, over the Golden Gate Bridge, and along the cliffs overlooking the Pacific, all in a single ride. Below is a sample of 36 rides around San Francisco.

A really nice aspect of riding in San Francisco is that you don’t really need a plan. There are enough routes that are connected to things worth checking out that you can just wing it as you ride and change your route depending on how your legs feel that day. This made me curious though. Which areas provide the most options? What areas do I tend to always string together? How many more miles (or feet of climbing) does going down a different branch add? To get at some of these questions, I took different parts of San Francisco and outlined them on a map (left) as well as arbitrarily relocated them on a network representation (right). Note that the color gradient is roughly by latitude, but this doesn’t translate to the network representation.

Below, we can then take a subset of rides that start and end in San Francisco (left) and plot them as a network (right) where the nodes still represent geographic areas, the size and transparency of the node represents the number of visits (in-degree), and the edges represent my biking transitions from one area to another (darker means more of those transitions).
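As a rough illustration of the idea (not the code behind the plots), the sketch below turns an ordered list of areas per ride into an edge list of transitions and counts in-degree per area. The two rides and the choice of area names are hypothetical toy data.

```r
## Hypothetical toy rides: each ride is the ordered sequence of areas visited
rides <- list(
    c("Downtown", "Embarcadero", "Presidio", "Golden Gate Bridge"),
    c("Downtown", "Embarcadero", "Presidio", "Ocean Beach")
)

## One row per transition (area i -> area i + 1) within each ride
edges <- do.call(rbind, lapply(rides, function(r) {
    data.frame(from = head(r, -1), to = tail(r, -1),
               stringsAsFactors = FALSE)
}))

## Edge weight: how often each transition occurs (drawn as darker edges)
edge_counts <- aggregate(n ~ from + to,
                         data = transform(edges, n = 1L),
                         FUN = sum)

## Node size: number of arrivals at each area (in-degree)
in_degree <- table(edges$to)

print(edge_counts)
print(in_degree)
```

The resulting edge list could then be handed to any network plotting package; which one the plots use is not stated in the post.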

Here is a popular route called a Butterlap. It’s beautiful and my go-to route when showing out-of-towners around. It lets them see the main San Francisco spots (the Bay Bridge, the Ferry Building, Fisherman’s Wharf, Crissy Field, Alcatraz, Golden Gate Bridge, the Presidio, Land’s End, the Legion of Honor, Ocean Beach, Golden Gate Park, and downtown SF) and is reasonably flat.

If, in the middle of the ride, you decide you have another 10 miles in you, you can quickly convert this to (what I’ve called) the Butter Lake, which involves a loop around Lake Merced to the south of the city.

My favorite route in San Francisco involves crossing the Golden Gate Bridge and going up Hawk Hill. Hawk Hill is *the* classic SF climb and if you go in the morning on weekdays, it’s not uncommon to see pros or teams training. It’s a great climb with stunning views and a fun, fast descent. In our two representations, the route looks like this.

But there are many days when I get to the bottom of Hawk Hill and my legs decide they just don’t have any climbing in them. So instead, you can add 50 miles and do a Paradise + China Camp loop.

Networks are a useful way of identifying and visualizing these types of decision points. Some more network-based bike metrics to come once I gather more data — any excuse for a few more rides.

The post Plots of my biking in 2022 appeared first on Mathew Kiang (.com).

It finally happened — I got COVID

MV Kiang — Wed, 07 Dec 2022 23:02:20 +0000

Last September, I got COVID. It was wildly unpleasant with serious brain fog that lasted for several weeks even after the other symptoms went away. That said, this did give me the opportunity to make some more plots based on my own data. Below, I show a few metrics of my vital signs (respiratory rate, heart rate, heart rate variability, and body temperature deviation) relative to my exposure (vertical dotted line) for six weeks before and after. The thicker grey lines in the background are the pre- and post-exposure averages for those six weeks.

As you can see, for a few things, even six weeks after exposure, I did not return to my pre-exposure baseline. My respiratory rate was slightly lower, my average heart rate was (and actually still remains) slightly elevated, and my heart rate variability is still lower (higher is better). My temperature is more or less the same.

All of this resulted in decreased physical activity, which I plot below.

I eventually went back to my baseline level of physical activity for all different metrics, but as you can see in the MET minutes metrics, there was a fairly long period of inactivity where it felt like my heart was not ready for intense exercise.

So, COVID-19: 1/10 — would not recommend.

Student’s Tay Distribution

MV Kiang — Sun, 27 Dec 2020 20:42:17 +0000

Taylor Swift has recorded 9 albums, and each of them (except the most recent) has gone multi-platinum. In total, she has sold over 200 million records and won 10 Grammys, an Emmy, 32 AMAs, and 23 Billboard Music Awards. Not bad for somebody who just turned 31.

This year, she’s managed to release two albums — they’re both very good. However, I noticed there seemed to be more profanity than I had remembered on her older albums. Here, I’ll use tidytext to see if she has actually increased her rate of profanity or if I’m simply misremembering things.

Descriptives

We’ll begin with some simple descriptives. How many words (y-axis) does each album (x-axis) have? Note, the x-axis uses albums as references but ticks are spaced temporally. Below, we show all words (green) as well as after removing stop words (orange) and counting only distinct words (purple). It appears that the number of distinct words has more or less stayed the same throughout the discography while the number of total words increased a bit and then decreased.

Perhaps this increase in the middle is explained by having more tracks on each album? Below we plot the average number of words per track (green) and the average number of distinct words per track (purple). Between the two plots, I think the likely explanation for the increase in words in the middle albums is more repetition, not more tracks.

Sentiment

How has sentiment changed over time? That is, have albums gotten more or less sad or happy? Is there a pattern such that the albums get more (or less) happy (or sad) as the album progresses? Below, I plot the average sentiment (y-axis) for each track (x-axis) by album (facets). The lines are fitted generalized additive models.

The sentiment across albums appears more or less flat (i.e., tracks do not get more or less positive as the album progresses). The exception is 1989, where the first track is “Welcome to New York” and “welcome” is classified as “positive”. In addition, the sixth track is “Shake it off” and “hater” (repeated in the chorus) is classified as “negative”. Excluding these two tracks results in a flat line.

Given there appears to be no relationship between average sentiment and track position, maybe albums themselves (across all lyrics and tracks) have different distributions of sentiment? Below is the density of sentiment for individual lyrics (i.e., lines) by album.

Most lyrics (in all albums) are fairly neutral. Reputation is especially pointy (yes, that’s the statistical term), but the last three albums have a bit more mass on both sides of 0 than previous albums.

Profanity

Ok. So now the main question. Has profanity increased over time? Below is the rate of profanity per 1,000 words (y-axis) over time/album (x-axis) by word (colors).
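For reference, here is a minimal base R sketch of how a per-1,000-words rate can be computed. It is not the linked code: the lyrics, album labels, and word list are hypothetical placeholders, and the tokenizer is a simple split on non-letters rather than tidytext.

```r
## Hypothetical toy lyrics: one row per line, tagged with an album label
lyrics <- data.frame(
    album = c("A", "A", "B"),
    line  = c("Hello hello world", "damn it all", "what the hell, damn"),
    stringsAsFactors = FALSE
)

## Hypothetical word list to count
profanity <- c("damn", "hell")

## Lowercase and split each line into words on anything that is not a letter
tokens <- strsplit(tolower(lyrics$line), "[^a-z']+")
words  <- data.frame(
    album = rep(lyrics$album, lengths(tokens)),
    word  = unlist(tokens),
    stringsAsFactors = FALSE
)
words <- words[words$word != "", ]

## Rate per 1,000 words = 1000 * (matching words) / (all words), by album
total_words <- tapply(words$word, words$album, length)
n_profane   <- tapply(words$word %in% profanity, words$album, sum)
rate_per_1k <- 1000 * n_profane / total_words

print(rate_per_1k)
```

Normalizing by total words matters here because, as shown in the descriptives, the albums differ in length.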

Yes, there’s more profanity in the recent albums than in the older albums. Maybe 2020 is getting to TSwift too? Check out Folklore and Evermore if you haven’t already.

(Code is here.)

I wrote a simulation paper about playing Candy Land with toddlers

MV Kiang — Tue, 08 Sep 2020 01:38:40 +0000

Collaboration network from 2010 to 2019

MV Kiang — Sat, 07 Dec 2019 07:19:45 +0000

I have been trying to wrap my head around working with temporal networks — not just simple edge activation that changes over time but also evolving node attributes and nodes that may appear and disappear at random. What better way than to work with a small concrete example I’m already very familiar with?

Here is an update to a post I made, a little over a year ago, about my collaboration network. Each paper or project (blue) is connected to a collaborator (red). The size of the blue node is the cumulative citations that paper has received since publication to the current year (upper left).

Code available at this gist (note that it’s two files).

tldr; San Diego weather is better than Boston weather

MV Kiang — Sun, 03 Dec 2017 01:17:55 +0000

I am taking a break from a crazy couple months of writing and coding by… writing code. Just a quick post comparing weather in Boston (where I am) to weather in San Diego (where I’m from).

While the New York Times may have made the original, most data viz people will recognize the plot above from Tufte’s classic, Visual Display of Quantitative Information. It presents a ton of data in a clear, concise, and appealing way. The background bars show the record high and low daily temperature, the mid-ground bars show the “normal” (though as far as I can tell, normal is never clearly defined) high and low temperature, and the foreground shows the high and low for that year. In addition, we have annotations for days that met or made the record. The original plot even had a subplot for daily precipitation.

Here is a similar plot for Boston:

From all the dots, you can tell it’s been a hot year in San Diego. Boston, on the other hand, is unsurprisingly variable. The joke around here is if you don’t like the weather, you just need to wait a minute.

How about just average highs and lows?

Let’s clear out the unnecessary data and just compare the expected highs and lows:

It’s not a competition, but if it were, San Diego would win. I didn’t bother putting a legend on here because… Boston.

How about rain and snow?

Let’s compare cumulative precipitation:

Each line is one year. Darker colors are more recent years.

+1 San Diego.

Worst part of Boston weather:

Despite the rain/snow and the crazy annual temperature range, what bothers me most about Boston weather is the insane daily variability. Below is a plot of daily temperature change for all 50 years. Again, each line is a year and darker lines are more recent years.

+1 San Diego.

Conclusion

San Diego has better weather than Boston.

The full code is at this gist.

Using R, Wikipedia, and SHERPA/RoMEO to show New England Journal of Medicine‘s pre-print statement is empirically false

MV Kiang — Mon, 09 Oct 2017 00:17:57 +0000

One of the most fundamental aspects of collaborative research is sharing your work with others through pre-print or conference presentations. This isn’t likely to be news to anybody doing collaborative research these days, and many journals have become increasingly permissive with their pre-print policy. For example, Nature released an editorial making it clear, “Nature never wishes to stand in the way of communication between researchers.[…] Communication between researchers includes not only conferences but also preprint servers. The ArXiv preprint server is the medium of choice for (mainly) physicists and astronomers who wish to share drafts of their papers with their colleagues, and with anyone else with sufficient time and knowledge to navigate it. […] If scientists wish to display drafts of their research papers on an established preprint server before or during submission to Nature or any Nature journal, that’s fine by us.”¹ Other prestigious journals have similar policies—for example, The Lancet, Science, PNAS, and BMJ. (The list goes on and on.)

One such journal ~~does~~ did not. New England Journal of Medicine (Figure 1).

UPDATE: Since this post, NEJM has changed their position and pre-prints are allowed.

Figure 1: NEJM’s Pre-print policy. Accessed 9/6/2017.

At the end of their statement, NEJM seeks to comfort authors by assuring them that, “Most medical journals have similar [no pre-print] rules in place”.

Using R, Wikipedia’s List of Medical Journals, and the SHERPA/RoMEO database, we can empirically show this statement to be false.

Defining “most”

Merriam-Webster defines the word “most” as:

2 :the majority of

Defining the problem

Using the M-W definition, we’re going to show that “the majority of” medical journals do not have the same strict no-pre-print policy. That is, given a comprehensive list of medical journals, the majority of them will have more lenient pre-print policies than NEJM.

We will operationalize this with the SHERPA/RoMEO categorization:

Green. Can archive pre-print and post-print or publisher’s version/PDF.
Blue. Can archive post-print (i.e., final draft post-refereeing) or publisher’s version/PDF.
Yellow. Can archive pre-print (i.e., pre-refereeing).
White. Archiving not formally supported.
Gray. Unknown.

NEJM is RoMEO white. We will show that more than 50% of journals are RoMEO green or yellow.

A list of medical journals

Using the rvest package in R, we can quickly scrape the Wikipedia page of medical journals and extract the relevant table:

library(rvest) 
library(tidyverse) 

list_url <- "https://en.wikipedia.org/wiki/List_of_medical_journals" 
list_df <- list_url %>% 
    read_html() %>% 
    html_nodes("table") %>% 
    html_table(fill = TRUE) %>% 
    .[[1]]

This returns:

> glimpse(list_df)
Observations: 308
Variables: 5
$ Name                <chr> "Academic Medicine", "ACIMED", "Acta Anaesthesiolo...
$ Specialty           <chr> "Academic medicine", "Medical informatics", "Anaes...
$ Publisher           <chr> "Association of American Medical Colleges", "Natio...
$ English             <chr> "English", "Spanish", "English", "Portuguese", "En...
$ `Publication Dates` <chr> "1926-present", "1993-present", "1957-present", "1...

The table has the name of the journal, the medical specialty, the publisher, the language, and the journal dates.

Getting pre-print data

Unlike NEJM, SHERPA/RoMEO is completely free, open, and provides a useful API. Using this API, we will loop through all of our journals from the list above and find the RoMEO color for each one.

## Make new columns (store ISSN for verification in future)
romeo_df <- list_df %>% 
    mutate(romeo = NA, 
           issn = NA, 
           api_outcome = NA)
    
for (i in seq_along(romeo_df$Name)) {
    print(romeo_df$Name[i])
    
    ## Build the request URL, replacing spaces in the journal title
    api_url <- "http://www.sherpa.ac.uk/romeo/api29.php?jtitle="
    api_req <- gsub(" ", "%20", sprintf("%s%s", api_url, romeo_df$Name[i]))
    request <- read_xml(api_req)
    
    ## Pull the ISSN and RoMEO color out of the XML response
    temp_issn <- request %>% 
        xml_node("issn") %>% 
        xml_text()
    temp_color <- request %>% 
        xml_node("romeocolour") %>% 
        xml_text()
    
    ## Store results for this journal only (every column is indexed by [i])
    romeo_df$issn[i]        <- ifelse(length(temp_issn) > 0, temp_issn, NA)
    romeo_df$romeo[i]       <- ifelse(length(temp_color) > 0, temp_color, NA)
    romeo_df$api_outcome[i] <- request %>% xml_node("outcome") %>% xml_text()
}
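One caveat with the request URL: swapping spaces for %20 leaves other reserved characters untouched, so a journal title containing an ampersand would be cut off at that point in the query string. A possible alternative, sketched below, percent-encodes the whole title with base R's URLencode(). The helper name build_romeo_url is hypothetical, and the sketch only builds the URL string without calling the API.

```r
## Hypothetical helper: build the request URL for one journal title,
## percent-encoding spaces and reserved characters such as "&"
build_romeo_url <- function(title,
                            api_url = "http://www.sherpa.ac.uk/romeo/api29.php?jtitle=") {
    paste0(api_url, utils::URLencode(title, reserved = TRUE))
}

## Example with a made-up title containing a space and an ampersand
example_url <- build_romeo_url("A & B")
print(example_url)
```

Inside the loop, this would replace the gsub() line; the rest of the request handling stays the same.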

Now we have a new dataframe with RoMEO color (and a couple other variables):

> glimpse(romeo_df, width = 79)
Observations: 308
Variables: 8
$ Name                <chr> "Academic Medicine", "ACIMED", "Acta Anaesthes...
$ Specialty           <chr> "Academic medicine", "Medical informatics", "A...
$ Publisher           <chr> "Association of American Medical Colleges", "N...
$ English             <chr> "English", "Spanish", "English", "Portuguese",...
$ `Publication Dates` <chr> "1926-present", "1993-present", "1957-present"...
$ romeo               <chr> "yellow", NA, "yellow", "green", "yellow", "gr...
$ issn                <chr> "1040-2446", NA, "0001-5172", "0870-399X", "00...
$ api_outcome         <chr> "singleJournal", "singleJournal", "singleJourn...

Results

Of the 308 medical journals in our list, do more than half have RoMEO colors green or yellow?

> mean(romeo_df$romeo %in% c("green", "yellow"))  > .5
[1] TRUE

Yes.

The full table is here:

> table(romeo_df$romeo, useNA = "always")

  blue   gray  green  white yellow   <NA> 
    17     16    119     28     61     67

Thus, NEJM‘s pre-print statement is empirically false. It is possible to get a more accurate estimate of the number of medical journals that allow pre-prints; however, even taking the most conservative approach and allowing that all unknown (5.2%; N=16) or missing values (21.8%; N=67) have the same pre-print policy as NEJM, we show that their statement is factually incorrect.
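The conservative bound can be checked by hand from the table counts above. The sketch below simply re-enters those counts (the label "missing" stands in for the NA column) and treats every gray and missing journal as if it did not allow pre-prints.

```r
## Counts copied from the table above; "missing" is the NA column
counts <- c(blue = 17, gray = 16, green = 119,
            white = 28, yellow = 61, missing = 67)

total <- sum(counts)

## Conservative share: only green and yellow count as allowing pre-prints;
## gray and missing are lumped in with white and blue
share_preprint <- sum(counts[c("green", "yellow")]) / total

print(total)
print(round(100 * share_preprint, 1))
```

That is 180 of 308 journals, or about 58.4%, which is above one half even under the least favorable treatment of the unknown and missing values.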

Conclusion

While NEJM is obviously entitled to hold whatever pre-publication policy they want, suggesting that “most medical journals have similar policies” is verifiably incorrect. Even when taking the most conservative possible estimate (i.e., all NA values are assigned values against the null), NEJM‘s statement is unjustified. Further, other high impact journals have adopted ~~lenient~~ common sense pre-print policies. One can only dream that they will update their editorial response or perhaps update their discussion (last revision was in 1991). Perhaps at the very least, NEJM can update their pre-publication policy page to remove their factually incorrect statement.

Data and code

You can find the data here and the code here.

On graduate student burnout: “It isn’t usually a snap so much as a gradual disintegration.”

MV Kiang — Sat, 07 Jan 2017 15:07:20 +0000

Reporter posts his mobile phone metadata for the public to analyze

MV Kiang — Mon, 17 Aug 2015 13:19:14 +0000

With some cool visualizations as a bonus. Dataset link is at the bottom of the page.

New MetroCard rates and the dreaded “dead zone of change”

MV Kiang — Mon, 23 Mar 2015 22:25:56 +0000

Yesterday, new MTA MetroCard rates went live in New York. I visit often enough that I thought I’d put together a little chart to figure out the best amount of money to put on my MetroCard depending on the number of rides I need, thus helping me avoid the dead zone of change and a stack of useless MetroCards. The chart shows you how much to put in (by the number of rides you need) and in parentheses tells you how much change you will have after using all the rides.

If you live in the city, $22.30 is the sweet spot with 9 rides and zero change at the end of it.

My parameters might be a little outdated, but when I lived in New York, deposits had to be in $0.05 increments. As far as I know, you don’t get a bonus until you deposit at least $5 and the bonus is now 11%. Code is here — adjust as needed.
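For anyone who wants to tinker without the linked code, here is a rough base R sketch of the same calculation, working in cents. The $0.05 increments, $5 bonus threshold, and 11% bonus come from the paragraph above. The $2.75 fare is an assumption (it is the fare that makes $22.30 come out to exactly 9 rides), as are the rounding of the bonus to the nearest cent and the $50 cap on deposits.

```r
## Parameters in cents (fare, rounding rule, and deposit cap are assumptions)
fare      <- 275L   # assumed cost of one ride
step      <- 5L     # deposits come in $0.05 increments
bonus_min <- 500L   # no bonus below a $5 deposit
bonus_pct <- 11L    # bonus percentage

## Every possible deposit up to an arbitrary $50 cap
deposit <- seq(step, 5000L, by = step)

## Bonus rounded to the nearest cent using integer arithmetic
bonus <- ifelse(deposit >= bonus_min,
                (deposit * bonus_pct + 50L) %/% 100L,
                0L)

balance <- deposit + bonus
plan <- data.frame(deposit = deposit,
                   rides   = balance %/% fare,
                   change  = balance %% fare)
plan <- plan[plan$rides >= 1, ]

## For each number of rides, keep the deposit with the least leftover change
## (ties go to the smaller deposit)
plan <- plan[order(plan$rides, plan$change, plan$deposit), ]
best <- plan[!duplicated(plan$rides), ]

print(head(best, 12))
```

Under these assumptions, the row for 9 rides is a $22.30 deposit with zero change, matching the sweet spot mentioned above.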
