@jeroen
Last active January 27, 2016 13:11
Line-by-line import of JSON records with RCurl
batchimport <- function(url, batchsize = 1e6) {
  # Shared state between the write callback and clearstack()
  stack <- batches <- list()
  i <- j <- 1
  size <- 0

  # Parse the buffered text into a batch of records. Chunks arrive at
  # arbitrary boundaries, so the final (possibly incomplete) line is
  # carried over on the stack unless all = TRUE.
  clearstack <- function(all = FALSE) {
    bigstr <- do.call(paste0, stack)
    lines <- strsplit(bigstr, "\n")[[1]]
    message("Adding ", if (all) length(lines) else length(lines) - 1, " records.")
    json <- paste0("[", paste0(if (all) lines else head(lines, -1), collapse = ","), "]")
    batches[[j]] <<- jsonlite::fromJSON(json, validate = TRUE)
    j <<- j + 1
    stack <<- list(tail(lines, 1))
    i <<- 2
    size <<- nchar(stack[[1]])
  }

  # Line-by-line reading in RCurl: buffer incoming chunks until they
  # exceed batchsize bytes, then parse them as one batch.
  RCurl::getURL(url = url, write = function(str, ...) {
    stack[[i]] <<- str
    size <<- size + nchar(str)
    i <<- i + 1
    if (size > batchsize) clearstack()
  })
  clearstack(TRUE)
  jsonlite::rbind.pages(batches)
}
# Import lines, 1MB at a time.
test <- batchimport("https://jeroenooms.github.io/data/diamonds.json", 1e6)
nrow(test)
# compare with original
data(diamonds, package="ggplot2")
nrow(diamonds)
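The core trick in clearstack() can be seen in isolation: complete lines are joined with commas and wrapped in brackets, turning a batch of one-object-per-line records into a single JSON array that jsonlite::fromJSON() simplifies to a data frame. A minimal sketch with made-up record values:

```r
# Two complete NDJSON lines (values made up for illustration)
lines <- c('{"carat": 0.23, "price": 326}',
           '{"carat": 0.21, "price": 327}')

# Join with commas and wrap in brackets: the lines become one JSON
# array that fromJSON() simplifies to a data frame.
json <- paste0("[", paste0(lines, collapse = ","), "]")
df <- jsonlite::fromJSON(json)
df$price  # 326 327
```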
@sckott
sckott commented Sep 7, 2014

very cool

@jeroen
jeroen commented Sep 8, 2014

Note that the data in this example consists of separate lines of JSON records, not one large JSON blob. This is very common for data exported from NoSQL databases.
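This line-delimited format is what jsonlite's built-in streaming reader handles as well; a minimal sketch using stream_in() on an in-memory connection (record values are made up, and this assumes a jsonlite version that ships stream_in):

```r
# Made-up line-delimited (NDJSON) records, one JSON object per line,
# as commonly exported by NoSQL databases.
lines <- c(
  '{"carat": 0.23, "cut": "Ideal"}',
  '{"carat": 0.21, "cut": "Premium"}',
  '{"carat": 0.31, "cut": "Good"}'
)

# stream_in() batches and parses the records internally; pagesize
# plays a role analogous to batchsize in the function above.
df <- jsonlite::stream_in(textConnection(lines), verbose = FALSE)
nrow(df)  # 3
```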
