I have a lot of files in a lot of nested directories: 347,430 directories and 379,286 files.
To anonymize what I’m doing, we’ll say I have two types of files: `apples.csv.gz` and `oranges.csv.gz` (if you don’t know, "gz" means the file is compressed with gzip).

Each `apples.csv.gz` has rows that contain either `sour` or `sweet`. I want to know how many lines (rows) of `sweet` are found in all `apples.csv.gz` files.

- Note: each `apples.csv.gz` has thousands of lines containing `sweet` and thousands of other lines.
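If you want to follow along without the real data, here’s a minimal sketch (my own addition, with made-up paths and contents) that builds a tiny tree of gzipped CSVs in the same shape:

```r
# Hypothetical toy fixture, not the real data
dir.create("top/directory/a/b", recursive = TRUE)
dir.create("top/directory/c", recursive = TRUE)

write_gz <- function(path, lines) {
  con <- gzfile(path, "w")  # gzfile() writes gzip-compressed text
  writeLines(lines, con)
  close(con)
}

write_gz("top/directory/a/b/apples.csv.gz",
         c("id,flavor", "1,sweet", "2,sour", "3,sweet"))
write_gz("top/directory/c/apples.csv.gz",
         c("id,flavor", "1,sour", "2,sweet"))
write_gz("top/directory/c/oranges.csv.gz",
         c("id,flavor", "1,sweet"))
```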
The first step is to find the paths to all the `apples.csv.gz` files.

```r
apple_paths <- list.files(path = "top/directory",
                          pattern = "apples\\.csv\\.gz$",  # pattern takes a regex, not a glob
                          recursive = TRUE,
                          full.names = TRUE)
```
Note: `readLines()` handles decompression before reading.

```r
res <- lapply(apple_paths,
              function(x){
                temp <- readLines(x)          # read the (decompressed) lines
                temp <- grep("sweet", temp)   # indices of lines containing "sweet"
                temp <- length(temp)          # count them
                return(temp)
              })
sum(unlist(res))  # lapply() returns a list, so flatten it before summing
```
This will take a really long time!
Some ways we could speed things up:

- Use {data.table}’s `fread()` to read the files faster.
  - Pros: fast, and files can be pre-filtered with grep (see the sketch after this list).
- Make the function parallel.
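As a rough illustration of that first option (not the route we end up taking), `fread()` accepts a shell command through its `cmd` argument, so the grep pre-filter runs before anything reaches R. This is only a sketch, assuming every file has at least one matching line (each file here has thousands):

```r
library(data.table)

res <- lapply(apple_paths,
              function(x){
                # zgrep filters the compressed file; fread() only parses the matching lines.
                # header = FALSE because the filtered output has no header row.
                dt <- fread(cmd = paste("zgrep 'sweet'", shQuote(x)),
                            header = FALSE)
                nrow(dt)
              })
sum(unlist(res))
```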
We’re gonna mostly do the second and avoid reading the data into R at all. To increase speed we’re going to decompress, grep, and count outside R using `zgrep` and Unix’s `wc` (Unix: think Mac and Linux). What R will really help with here is running everything in parallel.
```r
library(parallel)  # comes with R

# Use all cpus minus one... unless there's only one cpu
numCores <- parallel::detectCores()
numCores <- ifelse(numCores > 2,
                   numCores - 1,
                   1)

res <- mclapply(apple_paths,
                function(x){
                  # zgrep decompresses and filters; wc -l counts the matching lines.
                  # shQuote() protects against spaces or odd characters in the path.
                  readLines(pipe(paste0("zgrep 'sweet' ", shQuote(x), " | wc -l")))
                },
                mc.cores = numCores)

res <- unlist(res)
res <- as.integer(res)
sum(res)
```
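One further tweak worth knowing about (my note, not part of the pipeline above): `grep -c` already counts matching lines, so `zgrep -c` can replace the `| wc -l` step entirely:

```r
res <- mclapply(apple_paths,
                function(x){
                  # zgrep -c prints one number: the count of matching lines
                  readLines(pipe(paste0("zgrep -c 'sweet' ", shQuote(x))))
                },
                mc.cores = numCores)
sum(as.integer(unlist(res)))
```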