Reading multiple csv files into data frames and concatenating (or row binding) them into one data frame is a task we routinely face. In R, there are many ways of doing it. But which is the best, and why? I think the best way is the simplest and most high-level one: something that is easy to read, write, and edit. Here are three variants of what I think is the right way. The first is close to a base R approach (except for the use of read_csv and the beloved pipe), the second uses purrr and dplyr, and the third uses purrr alone.
library(readr)
library(tibble)
library(purrr)
library(dplyr)
# Make K csv files, each with N rows --------------------------------------
K <- 100; N <- 1000
lapply(seq(K), function(i){
  data_df <- tibble(x = rnorm(N), y = rnorm(N), z = rnorm(N))
  write_csv(data_df, paste0('data_', i, '.csv'))
})
# Get the file list -------------------------------------------------------
file_list <- list.files(pattern = 'data_[0-9]+\\.csv')
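One practical aside: reading a hundred files with read_csv prints a column specification message for each one. If you are on readr 2.0 or later (a version assumption on my part), you can silence this by passing show_col_types = FALSE, for example via a small wrapper:

# A quiet reader to use in place of read_csv below (optional)
read_csv_quietly <- function(file) {
  read_csv(file, show_col_types = FALSE)
}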
# Read in the data frames and concatenate them: Version 1 -----------------
data_df1 <- lapply(file_list, read_csv) %>%
  do.call(rbind, .)
# Read in the data frames and concatenate them: Version 2 -----------------
data_df2 <- map(file_list, read_csv) %>%
  bind_rows()
# Read in the data frames and concatenate them: Version 3 -----------------
data_df3 <- map_dfr(file_list, read_csv)
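If you are on purrr 1.0.0 or later (an assumption about your installed version), map_dfr is marked as superseded there, and the equivalent recommended spelling is a map followed by list_rbind:

# Version 3, purrr >= 1.0.0 style
data_df3 <- map(file_list, read_csv) %>%
  list_rbind()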
# But are they all the same? ----------------------------------------------
all_equal(data_df1, data_df2)
all_equal(data_df2, data_df3)
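A caveat here: dplyr's all_equal is deprecated in recent releases (an assumption: dplyr 1.1.0 or later), so a more future-proof check is base R's all.equal on plain data frames:

# Base R alternative to dplyr::all_equal
all.equal(as.data.frame(data_df1), as.data.frame(data_df2))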
# Delete the files (not that you'd usually want to do this) ---------------
file.remove(file_list)