I recently worked on a project where I needed to pull in a reasonably sized data set only to filter away the large majority of it. Needless to say, the memory implications of scaling this up were concerning. So here's a quick demo of a problem where we want to pull in a list of all the words that start with "B" from a dictionary. The two methods are 1) pull in the full dictionary, then filter it down to the "B" words, and 2) filter the data on ingress. Option 2's maximum memory consumption is approximately 6% of option 1's.
library(pryr)
DICTIONARY <- '/etc/dictionaries-common/words'
printf <- function(...) print(sprintf(...))
# Read in the entire dictionary then filter it down to the 'B' words.
post_filter <- function() {
  # Pull in the entire dictionary.
  full_dictionary <- read.table(DICTIONARY, col.names = c('words'),
                                colClasses = c('character'))[, 1]
  print('The dictionary starts with the words...')
  print(head(full_dictionary))
  printf('The full dictionary object is %s Bytes', object_size(full_dictionary))
  b_words <- grep('^b', full_dictionary, value = TRUE, ignore.case = TRUE)
  printf('There are %d words that start with `B`', length(b_words))
  printf('This subset of `B` words is now %s Bytes', object_size(b_words))
}
post_filter()
[1] "-----------------------------------------------"
[1] " Get the full dictionary, then filter"
[1] "-----------------------------------------------"
[1] "The dictionary starts with the words..."
[1] "A" "A's" "AA's" "AB's" "ABM's" "AC's"
[1] "The full dictionary object is 6059640 Bytes"
[1] "There are 6068 words that start with `B`"
[1] "This subset of `B` words is now 366832 Bytes"
library(pryr)
DICTIONARY <- '/etc/dictionaries-common/words'
printf <- function(...) print(sprintf(...))
pre_filter <- function() {
  # Let grep filter the dictionary before R ever sees it.
  b_pipe <- pipe(paste("grep -i '^B'", DICTIONARY))
  b_words <- read.table(b_pipe, col.names = c('words'),
                        colClasses = c('character'))[, 1]
  printf('There are %d words that start with `B`', length(b_words))
  printf('The set of `B` words is %s Bytes', object_size(b_words))
}
pre_filter()
[1] "-----------------------------------------------"
[1] " Filter the data on the way in with pipe"
[1] "-----------------------------------------------"
[1] "There are 6068 words that start with `B`"
[1] "The set of `B` words is 366832 Bytes"
So both methods end up with a character vector of the same 6,068 "B" words, but the post-filter option bloats to an object of over 6 MB along the way, while the pre-filter method never holds more than the 367 kB subset in R.
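As a sanity check, the filter that pipe() hands off can be exercised directly in the shell. Here's a sketch with a made-up four-word file standing in for the real dictionary path:

```shell
# A hypothetical mini-dictionary, standing in for /etc/dictionaries-common/words.
printf 'apple\nBanana\nberry\ncherry\n' > /tmp/mini_words

# The same case-insensitive filter used inside pipe();
# -c counts the matching lines instead of printing them.
grep -ci '^b' /tmp/mini_words   # prints 2 (Banana, berry)
```

Run against the real dictionary, the count from `grep -ci '^b'` should match the length reported by both R functions.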