
@chasemc
Last active January 18, 2021 21:04
Count how many lines/rows contain X, from many files

Counting Lines

The Problem

I have a lot of files in a lot of nested directories: 347,430 directories and 379,286 files.

To anonymize what I’m doing, we’ll say I have two types of files: apples.csv.gz and oranges.csv.gz. (The .gz extension means the file is compressed with gzip.)

Each apples.csv.gz has rows that contain either sour or sweet.

I want to know how many lines (rows) containing sweet are found across all apples.csv.gz files. Note: each apples.csv.gz has thousands of lines containing sweet, plus thousands of other lines.

The Easy Part

The first step is to find the paths to all the apples.csv.gz files.

apple_paths <- list.files(path = "top/directory",
                          pattern = "apples\\.csv\\.gz$",  # a regex, not a glob
                          recursive = TRUE,
                          full.names = TRUE)

The slow lapply

note: readLines() handles decompression before reading
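
A quick, self-contained check of that, using a throwaway temp file:

```r
# Write a tiny gzip-compressed file, then read it back with plain readLines();
# file() connections detect the gzip magic number and decompress transparently
tf <- tempfile(fileext = ".csv.gz")
con <- gzfile(tf)
writeLines(c("sweet,1", "sour,2"), con)
close(con)

readLines(tf)  # c("sweet,1", "sour,2")
```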

res <- lapply(apple_paths, 
              function(x){
                # read the (decompressed) lines, keep those matching "sweet"
                length(grep("sweet", readLines(x)))
              })
sum(unlist(res))  # sum() can't take a list directly

This will take a really long time!

The Need for Speed

Some ways we could speed things up:

  • Use {data.table}’s fread() to read the files faster.
    • Pro: it’s fast, and files can be pre-filtered with grep.
  • Make the function parallel.

We’re going to mostly do the second, and avoid reading the data into R at all.
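
For the record, the first option might look something like this. This is only a sketch: the path is hypothetical, and it assumes {data.table} is installed and zgrep is on your PATH.

```r
library(data.table)

# fread(cmd = ...) reads the stdout of a shell command, so zgrep can
# pre-filter the rows before they ever reach R
one_file <- fread(cmd = paste("zgrep 'sweet'",
                              shQuote("top/directory/a/apples.csv.gz")),
                  header = FALSE)
nrow(one_file)  # number of "sweet" rows in that one file
```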

Pipes

To increase speed we’re going to decompress, grep, and count outside R, using zgrep and Unix’s wc (Unix: think Mac and Linux). What R will really help with here is running everything in parallel.
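
Before parallelizing, here’s that pipeline on a single hand-made file (a temp file with made-up contents), so you can see what each worker will do:

```r
# Build a small gzip-compressed sample file (hypothetical data)
tf <- tempfile(fileext = ".csv.gz")
con <- gzfile(tf)
writeLines(c("sweet,1", "sour,2", "sweet,3"), con)
close(con)

# Decompress, filter, and count entirely outside R;
# readLines(pipe(...)) only captures the final count that wc prints
cmd <- paste0("zgrep 'sweet' ", shQuote(tf), " | wc -l")
as.integer(readLines(pipe(cmd)))  # 2
```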

library(parallel) # comes with R

# Use all CPUs minus one, but never fewer than one
numCores <- parallel::detectCores()  # can return NA on some platforms
numCores <- if (is.na(numCores)) 1L else max(1L, numCores - 1L)

# mclapply() forks, so mc.cores > 1 works on Mac/Linux but not on Windows
res <- mclapply(apple_paths,
                function(x){
                  # shQuote() protects paths containing spaces or quotes
                  readLines(pipe(paste0("zgrep 'sweet' ", shQuote(x), " | wc -l")))
                },
                mc.cores = numCores)

res <- unlist(res)
res <- as.integer(res)

sum(res)
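
As a small variant, zgrep’s -c flag makes it do the counting itself, dropping wc from the pipeline (assuming your zgrep supports -c, which GNU and BSD versions do). Same made-up sample file as a demo:

```r
# Build a small gzip-compressed sample file (hypothetical data)
tf <- tempfile(fileext = ".csv.gz")
con <- gzfile(tf)
writeLines(c("sweet,1", "sour,2", "sweet,3"), con)
close(con)

# zgrep -c prints the count of matching lines directly, so no `| wc -l`
as.integer(readLines(pipe(paste("zgrep -c 'sweet'", shQuote(tf)))))  # 2
```

One fewer process per file adds up when you’re looping over hundreds of thousands of files.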