Use bash to concatenate files in R

Often I need to loop through directories full of csv files, sometimes tens of thousands of them, to combine them into a single analytical dataset. When it's only a few dozen files, looping `fread()`, `read_csv()`, or the like is fine, but nothing is quite as fast as using `awk` or `cat`.

Here’s a snippet of code that allows one to use bash in R to concatenate csv files in a directory. People in the lab have found it helpful so maybe others will as well.

## Merge files using bash before importing into R

bash_merge <- function(folder, awk = TRUE, joined_file = "0000-merged.csv") {
    # Uses bash `awk` or `cat` to merge all csv files in `folder`.
    # In general, faster than looping `fread()` or `read_csv()`.
    # Note: `cat` simply stacks files, so it duplicates header rows;
    # use `awk` (the default) when the files have a header row.
    # Note: if run twice, the merged file itself matches `*.csv`,
    # so delete it (or write it elsewhere) before re-running.
    original_wd <- getwd()
    on.exit(setwd(original_wd))  # restore working directory even on error
    setwd(folder)
    if (awk) {
        # Print every line, but skip the first line (the header) of
        # every file after the first one.
        system(paste0("awk 'FNR==1 && NR!=1{next;}{print}' *.csv > ",
                      joined_file))
    } else {
        system(paste0("cat *.csv > ", joined_file))
    }
}
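To see why the `awk` branch handles headers correctly, here is a minimal sketch you can run directly in a terminal (the `/tmp/awk_demo` directory and sample files are made up for illustration). `FNR` is awk's per-file line counter and `NR` is the global one, so `FNR==1 && NR!=1` is true only on the header line of every file after the first:

```shell
# Create two small csv files with identical headers (sample data)
mkdir -p /tmp/awk_demo
printf 'id,value\n1,10\n2,20\n' > /tmp/awk_demo/a.csv
printf 'id,value\n3,30\n' > /tmp/awk_demo/b.csv

# Keep the header from the first file only; skip it in later files
awk 'FNR==1 && NR!=1{next;}{print}' \
    /tmp/awk_demo/a.csv /tmp/awk_demo/b.csv > /tmp/awk_demo/merged.csv

cat /tmp/awk_demo/merged.csv
# id,value
# 1,10
# 2,20
# 3,30
```

Running `cat` on the same two files instead would leave a stray `id,value` line in the middle of the data, which is exactly why the function defaults to `awk = TRUE`.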

It can obviously be modified for any type of delimited file. I haven't benchmarked this rigorously, but in a typical use case I see at least a 100x speedup compared to `fread()` or `read_csv()` loops. Reading one big file is almost always faster than reading thousands of little ones and reallocating more memory as you go.