Using R, Wikipedia, and SHERPA/RoMEO to show New England Journal of Medicine’s pre-print statement is empirically false

One of the most fundamental aspects of collaborative research is sharing your work with others through pre-prints or conference presentations. This isn’t likely to be news to anybody doing collaborative research these days, and many journals have become increasingly permissive in their pre-print policies. For example, Nature released an editorial making it clear that “Nature never wishes to stand in the way of communication between researchers.[…] Communication between researchers includes not only conferences but also preprint servers. The ArXiv preprint server is the medium of choice for (mainly) physicists and astronomers who wish to share drafts of their papers with their colleagues, and with anyone else with sufficient time and knowledge to navigate it. […] If scientists wish to display drafts of their research papers on an established preprint server before or during submission to Nature or any Nature journal, that’s fine by us.”1 Other prestigious journals have similar policies, for example The Lancet, Science, PNAS, and BMJ. (The list goes on and on.)

One prestigious journal did not: New England Journal of Medicine (NEJM; Figure 1).

UPDATE: Since this post, NEJM has changed their position and pre-prints are allowed.

Figure 1: NEJM’s Pre-print policy. Accessed 9/6/2017.

At the end of their statement, NEJM seeks to comfort authors by assuring them that, “Most medical journals have similar [no pre-print] rules in place”.

Using R, Wikipedia’s List of Medical Journals, and the SHERPA/RoMEO database, we can empirically show this statement to be false.

Defining “most”

Merriam-Webster defines the word “most” as:

2 :the majority of

Defining the problem

Using the M-W definition, we’re going to show that “the majority of” medical journals do not have the same strict no-pre-print policy. That is, given a comprehensive list of medical journals, the majority of them will have more lenient pre-print policies than NEJM.

We will operationalize this with the SHERPA/RoMEO categorization:

  • Green. Can archive pre-print and post-print or publisher’s version/PDF.
  • Blue. Can archive post-print (i.e., final draft post-refereeing) or publisher’s version/PDF.
  • Yellow. Can archive pre-print (i.e., pre-refereeing).
  • White. Archiving not formally supported.
  • Gray. Unknown.

NEJM is RoMEO white. We will show that more than 50% of journals are RoMEO green or yellow.
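
The operationalization above can be sketched as a small lookup. This is an illustrative helper and not part of the original analysis: the function name allows_preprint is hypothetical, and it assumes that gray (unknown) and missing colors count as not allowing pre-prints, which is the conservative reading.

```r
# Hypothetical helper: TRUE when a RoMEO color permits pre-print archiving.
# Per the categories above, only green and yellow do. Blue, white, gray,
# and NA all count as "no" (the conservative assumption).
allows_preprint <- function(romeo_color) {
    romeo_color %in% c("green", "yellow")
}

example_colors <- c("green", "blue", "yellow", "white", "gray", NA)
allows_preprint(example_colors)
```

Because %in% returns FALSE rather than NA for missing values, taking the mean of this vector gives the share of journals known to allow pre-prints, with every unknown journal counted against the claim.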

A list of medical journals

Using the rvest package in R, we can quickly scrape the Wikipedia page of medical journals and extract the relevant table:

library(rvest) 
library(tidyverse) 

list_url <- "https://en.wikipedia.org/wiki/List_of_medical_journals" 
list_df <- list_url %>% 
    read_html() %>% 
    html_nodes("table") %>% 
    html_table(fill = TRUE) %>% 
    .[[1]]

 

This returns the following output:
> glimpse(list_df)
Observations: 308
Variables: 5
$ Name                <chr> "Academic Medicine", "ACIMED", "Acta Anaesthesiolo...
$ Specialty           <chr> "Academic medicine", "Medical informatics", "Anaes...
$ Publisher           <chr> "Association of American Medical Colleges", "Natio...
$ English             <chr> "English", "Spanish", "English", "Portuguese", "En...
$ `Publication Dates` <chr> "1926-present", "1993-present", "1957-present", "1...

The table has the name of the journal, the medical specialty, the publisher, the language, and the journal dates.

Getting pre-print data

Unlike NEJM, SHERPA/RoMEO is completely free, open, and provides a useful API. Using this API, we will loop through all of our journals from the list above and find the RoMEO color for each one.

## Make new columns (store ISSN for verification in future)
romeo_df <- list_df %>% 
    mutate(romeo = NA, 
           issn = NA, 
           api_outcome = NA)
    
for (i in seq_along(romeo_df$Name)) {
    print(romeo_df$Name[i])
    
    api_url <- "http://www.sherpa.ac.uk/romeo/api29.php?jtitle="
    api_req <- gsub(" ", "%20", sprintf("%s%s", api_url, romeo_df$Name[i]))
    request <- read_xml(api_req)
    
    temp_issn <- request %>% 
        xml_node("issn") %>% 
        xml_text()
    temp_color <- request %>% 
        xml_node("romeocolour") %>% 
        xml_text()
    
    romeo_df$issn[i]        <- ifelse(length(temp_issn) > 0, temp_issn, NA)
    romeo_df$romeo[i]       <- ifelse(length(temp_color) > 0, temp_color, NA)
    ## Index with [i]; otherwise every row is overwritten on each iteration
    romeo_df$api_outcome[i] <- request %>% xml_node("outcome") %>% xml_text()
}
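
One caveat about the loop above: gsub() only escapes spaces, so a title containing a character such as "&" would produce a malformed query. A minimal sketch of a stricter alternative follows, assuming base R's utils::URLencode() is acceptable here. The helper name build_romeo_url and the second example title are made up for illustration, and the sketch only builds the URL without calling the API.

```r
# Hypothetical helper: build a title query URL with full percent-encoding.
# utils::URLencode(reserved = TRUE) also escapes reserved characters such
# as "&", which a space-only gsub() would leave untouched.
build_romeo_url <- function(journal_title,
                            api_url = "http://www.sherpa.ac.uk/romeo/api29.php?jtitle=") {
    paste0(api_url, utils::URLencode(journal_title, reserved = TRUE))
}

build_romeo_url("Academic Medicine")
build_romeo_url("Health & Place")  # example title containing an ampersand
```

Whether this would change any of the matches reported here is not something the sketch can tell you; it only removes one way for a query to fail silently.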

Now we have a new dataframe with the RoMEO color (and a couple of other variables):

> glimpse(romeo_df, width = 79)
Observations: 308
Variables: 8
$ Name                <chr> "Academic Medicine", "ACIMED", "Acta Anaesthes...
$ Specialty           <chr> "Academic medicine", "Medical informatics", "A...
$ Publisher           <chr> "Association of American Medical Colleges", "N...
$ English             <chr> "English", "Spanish", "English", "Portuguese",...
$ `Publication Dates` <chr> "1926-present", "1993-present", "1957-present"...
$ romeo               <chr> "yellow", NA, "yellow", "green", "yellow", "gr...
$ issn                <chr> "1040-2446", NA, "0001-5172", "0870-399X", "00...
$ api_outcome         <chr> "singleJournal", "singleJournal", "singleJourn...

Results

Of the 308 medical journals in our list, do more than half have RoMEO colors green or yellow?

> mean(romeo_df$romeo %in% c("green", "yellow"))  > .5
[1] TRUE

Yes.

The full table is here:

> table(romeo_df$romeo, useNA = "always")

  blue   gray  green  white yellow   <NA> 
    17     16    119     28     61     67

Thus, NEJM’s pre-print statement is empirically false: 180 of the 308 journals (58.4%) are RoMEO green or yellow. It is possible to get a more accurate estimate of the number of medical journals that allow pre-prints; however, even taking the most conservative approach and assuming that all unknown (5.2%; N=16) or missing values (21.8%; N=67) have the same pre-print policy as NEJM, we show that their statement is factually incorrect.
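
The arithmetic behind the conservative estimate can be reproduced directly from the table counts. The snippet below is a sketch using only the numbers printed above; the variable names are illustrative, and the "known" share (dropping gray and missing journals) is an extra sensitivity check rather than a figure from the original analysis.

```r
# Counts copied from the table above.
romeo_counts <- c(blue = 17, gray = 16, green = 119,
                  white = 28, yellow = 61, missing = 67)

n_total    <- sum(romeo_counts)                         # all journals in the list
n_preprint <- sum(romeo_counts[c("green", "yellow")])   # journals allowing pre-prints

# Conservative share: gray and missing journals count as "no pre-prints".
conservative_share <- n_preprint / n_total

# Share among journals with a known color (drops gray and missing).
n_known     <- n_total - sum(romeo_counts[c("gray", "missing")])
known_share <- n_preprint / n_known

round(c(conservative = conservative_share, known = known_share), 3)
```

The conservative share comes to about 0.584, and the share among journals with a known color comes to 0.8, so the conclusion does not depend on how the unknown journals are treated.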

Conclusion

While NEJM is obviously entitled to hold whatever pre-publication policy they want, suggesting that “most medical journals have similar policies” is verifiably incorrect. Even when taking the most conservative possible estimate (i.e., treating every unknown or missing value as a journal that prohibits pre-prints), NEJM’s statement is unjustified. Further, other high-impact journals have adopted lenient, common-sense pre-print policies. One can only dream that they will update their editorial response, or perhaps their discussion (last revised in 1991). At the very least, NEJM can update their pre-publication policy page to remove their factually incorrect statement.

Data and code

You can find the data here and the code here.