{"id":1278,"date":"2016-11-09T13:34:36","date_gmt":"2016-11-09T18:34:36","guid":{"rendered":"http:\/\/mathewkiang.com\/?p=1278"},"modified":"2020-01-11T16:11:27","modified_gmt":"2020-01-11T21:11:27","slug":"use-bash-to-concatenate-files-in-r","status":"publish","type":"post","link":"https:\/\/mathewkiangcom.local\/2016\/11\/09\/use-bash-to-concatenate-files-in-r\/","title":{"rendered":"Use bash to concatenate files in R"},"content":{"rendered":"<p>Often, I need to loop through directories full of <code>csv<\/code> files, sometimes tens of thousands of them, to combine them into a single analytical dataset. When there are only a few dozen, looping <code>fread()<\/code>, <code>read_csv()<\/code>, or the like is fine, but nothing is quite as fast as <code>awk<\/code> or <code>cat<\/code>.<\/p>\n

<p>Here’s a snippet of code that lets you call <code>bash<\/code> from <code>R<\/code> to concatenate the <code>csv<\/code> files in a directory. People in the lab have found it helpful, so maybe others will as well.<\/p>\n


<pre>## Merge files using bash before importing into R\n\nbash_merge <- function(folder, awk = TRUE, joined_file = "0000-merged.csv") {\n    # Uses bash `awk` or `cat` to merge files.\n    # In general, faster than looping `fread()` or `read_csv()`.\n    # `awk` keeps the header row of the first file and skips the rest;\n    # `cat` blindly concatenates, so only use it when there is no header row.\n    original_wd <- getwd()\n    setwd(folder)\n    on.exit(setwd(original_wd), add = TRUE)  # restore wd even on error\n    if (awk) {\n        system(paste0("awk 'FNR==1 && NR!=1{next;}{print}' *.csv > ",\n                      joined_file))\n    } else {\n        system(paste0("cat *.csv > ", joined_file))\n    }\n}<\/pre>\n
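To make the `awk` program concrete, here is a minimal sketch of what it does (directory and file names below are illustrative, not from the post). `FNR==1` is true on the first line of every input file, while `NR!=1` excludes the very first line of the whole stream, so only the repeated headers are skipped:

```shell
# Demo: the awk one-liner keeps the first header and drops the repeats.
mkdir -p /tmp/awk_merge_demo && cd /tmp/awk_merge_demo
printf 'id,value\n1,a\n2,b\n' > part1.csv
printf 'id,value\n3,c\n' > part2.csv

# Skip line 1 of every file except the first file in the stream.
awk 'FNR==1 && NR!=1{next;}{print}' part*.csv > merged.csv
cat merged.csv
```

Note that the output file should not match the input glob, or a re-run would fold the merged file back into itself; that is why the function above defaults to a distinctive name.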

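To see why the comments above warn that `cat` only works on header-less files, a quick sketch (file names are again illustrative): `cat` copies every line of every file, so each file's header row survives into the merged output.

```shell
# Demo: cat keeps every header row, not just the first one.
mkdir -p /tmp/cat_merge_demo && cd /tmp/cat_merge_demo
printf 'id,value\n1,a\n' > part1.csv
printf 'id,value\n2,b\n' > part2.csv

cat part*.csv > merged_cat.csv
grep -c 'id,value' merged_cat.csv   # counts header rows in the output
```

With headerless files, though, `cat` is the simpler and slightly faster choice.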
<p>It can obviously be modified for any type of file. I haven’t benchmarked this rigorously, but in a typical use case I see on the order of a 100x speedup compared to <code>fread()<\/code> or <code>read_csv()<\/code> loops. Reading in one big file is almost always faster than reading in thousands of little ones and reallocating more memory as you go.<\/p>\n","protected":false},"excerpt":{"rendered":"

Often, I find I need to loop through directories full of csv files, sometimes tens of thousands of them, in order to combine them into a single analytical dataset I can use. When it’s only a few dozen, using fread(), read_csv, or the like can be fine, but nothing is quite as fast as using awk or cat. Here’s a snippet of code that allows one to use bash in R to concatenate csv files in a directory. People in the lab have found it helpful so maybe others will as well.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[132],"tags":[],"_links":{"self":[{"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/posts\/1278"}],"collection":[{"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/comments?post=1278"}],"version-history":[{"count":0,"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/posts\/1278\/revisions"}],"wp:attachment":[{"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/media?parent=1278"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/categories?post=1278"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mathewkiangcom.local\/wp-json\/wp\/v2\/tags?post=1278"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}