csv files, sometimes tens of thousands of them, in order to combine them into a single analytical dataset. When it's only a few dozen files, looping over fread(), read_csv(), or the like is fine, but nothing I've tried is as fast as awk or cat.
Here's a snippet of code that lets you call bash from R to concatenate the csv files in a directory. People in the lab have found it helpful, so maybe others will as well.
## Merge csv files using bash before importing into R
bash_merge <- function(folder, awk = TRUE, joined_file = "0000-merged.csv") {
  # Uses bash `awk` or `cat` to merge files.
  # In general, faster than looping `fread()` or `read_csv()`.
  # Note: `cat` doesn't work if the files have a header row.
  original_wd <- getwd()
  on.exit(setwd(original_wd), add = TRUE)  # restore the working directory even if something fails
  setwd(folder)

  if (awk) {
    # Print every line, but skip the header row of every file after the first
    system(paste0("awk 'FNR==1 && NR!=1{next;}{print}' *.csv > ",
                  joined_file))
  } else {
    system(paste0("cat *.csv > ", joined_file))
  }
}
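Typical usage looks something like the sketch below; the folder path and file names are made up for illustration, and the merged file is read back in with data.table::fread().

library(data.table)

# Hypothetical directory containing the per-file csv extracts
csv_dir <- "~/data/daily_extracts"

# Merge everything into one file, then read that single file back in
bash_merge(csv_dir, awk = TRUE, joined_file = "0000-merged.csv")
merged <- fread(file.path(csv_dir, "0000-merged.csv"))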
It can obviously be modified for other file types. I haven't benchmarked this rigorously, but in my typical use cases it runs at least 100 times faster than looping over fread() or read_csv(). Reading in one big file is almost always faster than reading in thousands of little ones and reallocating more memory as you go.
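For reference, this is roughly the loop-and-bind pattern that comparison is against, sketched with the same made-up folder path as above:

library(data.table)

# Read each csv individually and bind the results together
files  <- list.files("~/data/daily_extracts", pattern = "\\.csv$", full.names = TRUE)
merged <- rbindlist(lapply(files, fread), use.names = TRUE)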