I often have to read in a lot of `csv` files, sometimes tens of thousands of them, in order to combine them into a single analytical dataset I can use. When it's only a few dozen, using `fread()`, `read_csv()`, or the like is fine, but nothing is quite as fast as using `awk` or `cat`.

Here's a snippet of code that lets you call `bash` from `R` to concatenate the `csv` files in a directory. People in the lab have found it helpful, so maybe others will as well.
```r
## Merge files using bash before importing into R
bash_merge <- function(folder, awk = TRUE, joined_file = "0000-merged.csv") {
    # Uses bash `awk` or `cat` to merge files.
    # In general, faster than looping `fread()` or `read_csv()`.
    # Note: `cat` doesn't work if there is a header row.
    original_wd <- getwd()
    setwd(folder)
    if (awk) {
        # Print every line, but skip the first (header) line of every
        # file after the first one so the header appears only once.
        system(paste0("awk 'FNR==1 && NR!=1{next;}{print}' *.csv > ",
                      joined_file))
    } else {
        system(paste0("cat *.csv > ", joined_file))
    }
    setwd(original_wd)
}
```
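Usage is just a call to the function followed by a single read. A minimal sketch, assuming a hypothetical `./data/` folder full of `csv` files with identical headers:

```r
library(data.table)

## Merge everything in ./data/ into one file, then read that file once
bash_merge("./data/", awk = TRUE, joined_file = "0000-merged.csv")
merged_df <- fread("./data/0000-merged.csv")
```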
It can obviously be modified for any type of delimited file. I have no formal benchmarks to back this up, but in a typical use case I see at least a 100x speedup compared to `fread()` or `read_csv()` loops. Reading in one big file is almost always faster than reading in thousands of little ones and reallocating more memory as you go.
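If you want to check the claim on your own data, a rough timing comparison might look something like this. This is only a sketch; the `./data/` path is made up, and the numbers will obviously depend on your files and disk:

```r
library(data.table)

files <- list.files("./data/", pattern = "\\.csv$", full.names = TRUE)

## (1) Loop over files with fread() and bind the results
t_loop <- system.time({
    df_loop <- rbindlist(lapply(files, fread))
})

## (2) Merge with bash first, then read the single merged file
t_bash <- system.time({
    bash_merge("./data/", awk = TRUE, joined_file = "0000-merged.csv")
    df_bash <- fread("./data/0000-merged.csv")
})

print(rbind(loop = t_loop, bash = t_bash))
```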