tl;dr: San Diego weather is better than Boston weather

I am taking a break from a crazy couple of months of writing and coding by… writing code. Just a quick post comparing weather in Boston (where I am) to weather in San Diego (where I’m from).

While the New York Times may have made the original, most data viz people will recognize the plot above from Tufte’s classic, The Visual Display of Quantitative Information. It presents a ton of data in a clear, concise, and appealing way. The background bars show the record high and low daily temperatures, the mid-ground bars show the “normal” high and low temperatures (though as far as I can tell, “normal” is never clearly defined), and the foreground bars show the high and low for that year. In addition, there are annotations for days that tied or set a record. The original plot even had a subplot for daily precipitation.

Here is a similar plot for Boston:


Replacing decimal points with interpuncts in MS Word

It turns out Microsoft Word’s “Advanced Find and Replace” is quite… well, advanced. With the “Use wildcards” option, you can use a regex-like pattern syntax to do relatively complex find-and-replace operations. For example, The Lancet requires that all decimal points be middle dots (i.e., interpuncts). This is pretty trivial in LaTeX or Rmd, and it turns out it’s equally easy in Word.

Just use ([0-9]{1})(\.)([0-9]{1}) as your search query and \1·\3 as your replacement with the “Use wildcards” option.
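Outside of Word, the same substitution can be sketched with an ordinary regular expression. The Python snippet below is only an illustration and not part of the Word workflow; the function name `to_interpunct` is hypothetical. It uses lookarounds instead of capture groups so that only a period sitting between two digits is touched, while sentence-ending periods are left alone.

```python
import re

def to_interpunct(text: str) -> str:
    """Replace a decimal point that sits between two digits with a middle dot."""
    # (?<=\d) and (?=\d) require a digit on each side without consuming it,
    # so sentence-ending periods and abbreviations are not changed.
    return re.sub(r"(?<=\d)\.(?=\d)", "\u00b7", text)

print(to_interpunct("HR 1.25 (95% CI 1.1-1.4). See section 2."))
```

As with the Word pattern, this would also convert things like version numbers or dotted section references, so it is worth skimming the result before submitting.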

We (as a field) should still be moving over to doing our drafting in Rmd or LaTeX though. The bloat in MS Word makes working with moderately sized manuscripts with figures painful.


Using R, Wikipedia, and SHERPA/RoMEO to show New England Journal of Medicine’s pre-print statement is empirically false

One of the most fundamental aspects of collaborative research is sharing your work with others through pre-prints or conference presentations. This isn’t likely to be news to anybody doing collaborative research these days, and many journals have become increasingly permissive in their pre-print policies. For example, Nature released an editorial making it clear: “Nature never wishes to stand in the way of communication between researchers.[…] Communication between researchers includes not only conferences but also preprint servers. The ArXiv preprint server is the medium of choice for (mainly) physicists and astronomers who wish to share drafts of their papers with their colleagues, and with anyone else with sufficient time and knowledge to navigate it. […] If scientists wish to display drafts of their research papers on an established preprint server before or during submission to Nature or any Nature journal, that’s fine by us.”1 Other prestigious journals have similar policies, including The Lancet, Science, PNAS, and BMJ. (The list goes on and on.)

One prestigious journal does not: the New England Journal of Medicine (Figure 1).


Using a histogram as a legend in choropleths

Despite well-known drawbacks,1 plotting parameters onto maps provides a convenient way of seeing context, patterns, and outliers. However, one of the many problems with choropleths is that the area of a region tends to distort our perception of its value. For example, in the United States, huge (in terms of land mass) counties will tend to have a greater visual impact than small counties (despite often having similar or even smaller population sizes).

One way to address this is to use a histogram as the legend on your map. The histogram shows the raw counts with every region weighted equally, while the map provides the spatial context of the values.
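The counting step behind such a legend can be sketched as follows: bin the regional values with the same breaks used for the map colors, then count regions per class so each legend bar has a height. This is a minimal Python sketch, not the code from the post; the function name `legend_counts`, the rates, and the breaks are all made up for illustration.

```python
from bisect import bisect_right
from collections import Counter

def legend_counts(values, breaks):
    """Count how many regions fall in each color class.

    Class i covers breaks[i] <= value < breaks[i + 1]; the last class
    also includes its upper edge.
    """
    n_classes = len(breaks) - 1
    counts = Counter()
    for v in values:
        if not breaks[0] <= v <= breaks[-1]:
            raise ValueError(f"{v} is outside the legend range")
        # bisect_right finds the class; min() folds the top edge into the last class.
        i = min(bisect_right(breaks, v) - 1, n_classes - 1)
        counts[i] += 1
    return [counts[i] for i in range(n_classes)]

# Hypothetical county-level rates and class breaks (illustration only).
rates = [2.1, 3.4, 3.9, 5.0, 7.2, 7.8, 9.9, 10.0]
breaks = [0, 2.5, 5.0, 7.5, 10.0]
for lo, hi, n in zip(breaks, breaks[1:], legend_counts(rates, breaks)):
    print(f"{lo:>4}-{hi:<4} {'#' * n} ({n})")
```

Each region counts once regardless of its land area, which is exactly the information the map itself tends to distort. In a real plot, the bars would be filled with the corresponding map colors.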


  1. E.g., Gelman and Price 1999 or How to Lie with Maps by Mark Monmonier

Getting SSL certificates on GoDaddy Shared Hosting plans

Since Google’s announcement that they will start publicly shaming unsecured websites in January 2017, everybody has been rushing to try to get their https tags. I’ve also been getting relentless phone calls from GoDaddy salespeople asking me to buy SSL certificates for about $5 per month. I’m stereotypically Asian so $5 per month on a personal blog just seems excessive. I’m not here trying to sell things or get your credit card information — SSL is a nice-to-have-but-not-$5-per-month-nice-to-have item.

Turns out, creating free SSL certificates is not that hard thanks to the good people at EFF. There’s a very helpful blog post by Isabel Castillo that outlines how to do it. Some issues I ran into and their solutions:


Use bash to concatenate files in R

Often, I find I need to loop through directories full of CSV files, sometimes tens of thousands of them, in order to combine them into a single analytical dataset I can use. When it’s only a few dozen, using fread(), read_csv(), or the like can be fine, but nothing is quite as fast as using awk or cat.

Here’s a snippet of code that allows one to use bash in R to concatenate csv files in a directory. People in the lab have found it helpful so maybe others will as well.
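The underlying idea (stream the files together and keep the header row only once) can be illustrated with a short Python sketch. This is not the R/bash snippet referred to above; the function name `concat_csvs` is hypothetical, and the sketch assumes every file shares the same single-line header and ends with a newline.

```python
from pathlib import Path
import shutil

def concat_csvs(src_dir, out_path):
    """Concatenate every .csv in src_dir into out_path, keeping one header.

    Assumes all files share the same one-line header and end with a newline.
    Returns the number of files combined.
    """
    out_path = Path(out_path)
    # Build the file list before opening the output, and skip the output itself.
    files = sorted(
        p for p in Path(src_dir).glob("*.csv")
        if p.resolve() != out_path.resolve()
    )
    with open(out_path, "w", newline="") as out:
        for i, path in enumerate(files):
            with open(path, newline="") as f:
                header = f.readline()
                if i == 0:
                    out.write(header)  # keep the header from the first file only
                shutil.copyfileobj(f, out)  # stream the rest without parsing
    return len(files)
```

Because the rows are copied as raw text rather than parsed, this scales to many files in the same way `cat` or `awk` does; the combined file can then be read once with whatever CSV reader you prefer.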


A visual tour of my publications

I recently came across this paper by Michal Brzezinski about (the lack of) power laws in citation distributions. It made me a little curious about the citations of my own articles so I threw together a little script using James Keirstead’s Scholar package for R. In the plot above, every line represents a single article with time on the x-axis and (cumulative) number of citations on the y-axis.

It’s not super informative, so we can break it down a few ways to graphically explore the data.
