# My input files have short header lines, then CSV data, then short footer lines.
# I'm currently trimming the short lines with an external call to sed,
# but I want a pure-R solution for portability.
# This version works nicely on small examples but gets very slow on large files,
# because append() grows the list, triggering a memory reallocation, for every line.
# Suggestions for speed improvement requested.
read.longline = function(file){
    f = file(file, "r")
    lines = list()
    repeat{ # read short headers & discard
        l = readLines(f, n=1)
        if(length(l) > 0 && nchar(l) > 65){
            # We've found the first data row.
            # Leave it on the stack to process in the next loop.
            pushBack(l, f)
            break
        }
    }
    repeat{ # read long lines, add to CSV, break when short lines start again
        l = readLines(f, n=1)
        if(length(l) > 0 && nchar(l) > 65){
            # Naive implementation!
            # Likely to be VERY slow because we're growing lines every time.
            lines = append(lines, l)
        }else{
            # Either we've hit a short line == beginning of PGP block,
            # or empty line == end of the file.
            # Either way we're done.
            break
        }
    }
    close(f)
    # Now stitch lines together into a dataframe
    txtdat = do.call("paste", c(lines, sep="\n"))
    return(read.csv(text=txtdat, stringsAsFactors=FALSE))
}
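Not part of the original gist, but a quick way to see why the append() pattern above hurts: growing a list one element at a time copies the whole list on every iteration, so the total cost is quadratic in the number of lines read.
grow_by_append = function(n){
    out = list()
    for(i in seq_len(n)) out = append(out, list(i))  # copies the whole list each time
    out
}
grow_preallocated = function(n){
    out = vector("list", n)  # allocate once up front
    for(i in seq_len(n)) out[[i]] = i
    out
}
system.time(grow_by_append(20000))     # slows down roughly quadratically with n
system.time(grow_preallocated(20000))  # stays near-instant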
What about reading the file into a vector of strings, then identifying the first data line? For example (this is off the top of my head, I haven't run it):
fileContents <- readLines(f)
firstDataLine <- min(which(nchar(fileContents) > 65))
return(read.csv(connection(fileContents), skip=firstDataLine-1))
Is connection() part of a package I don't have, or pseudocode?
This gets me to within a few percent of plain read.csv's speed, instead of several times slower:
read.longlines = function(file){
    f = file(file, "r")
    f_lines = readLines(f)
    close(f)
    # Data rows are the long lines; everything shorter is header/footer junk.
    wanted = which(nchar(f_lines) > 65)
    firstData = min(wanted)
    lastData = max(wanted)
    # Paste the kept lines back into one newline-separated string for read.csv.
    return(read.csv(
        text=do.call(
            what="paste",
            args=c(as.list(f_lines[firstData:lastData]), sep="\n")),
        stringsAsFactors=FALSE))
}
It runs in under a minute on a 150,000-line test file, which isn't blazing but is probably good enough for my current purposes. Thanks!
You're welcome. Sorry, I meant textConnection. Here's some revised code -- this time, I actually tested it!
fileContents <- readLines("test.txt")
firstDataLine <- min(which(nchar(fileContents) >= MIN_DATA_LINE_LENGTH))
lastDataLine <- max(which(nchar(fileContents) >= MIN_DATA_LINE_LENGTH))
con <- textConnection(fileContents)
read.table(con, skip=firstDataLine-1, nrows=lastDataLine-firstDataLine,sep=",",header=TRUE)
It took me a while to convince myself, but it looks as if pasting the lines together first is significantly faster than (at least this variant of) using a textConnection: https://gist.github.com/infotroph/cec9a9fb0158530d817f is a self-contained demo comparing the speed of different reading approaches with no filtering.
On my machine, a few thousand lines take milliseconds with a pasted string and minutes with a vector of lines, textConnection or not. What especially surprises me is that both approaches have a similar memory footprint, and the vector read seems to be entirely CPU-bound. No idea what it spends all those cycles on...
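For concreteness, a minimal self-contained version of that comparison (this isn't the linked gist, just a sketch with made-up data; absolute timings will depend on machine and file size):
lines_vec  <- replicate(5000, paste(sample(0:9, 20, replace = TRUE), collapse = ","))
one_string <- paste(lines_vec, collapse = "\n")
system.time(a <- read.csv(text = one_string, header = FALSE))  # one pasted string with embedded newlines
system.time(b <- read.csv(text = lines_vec,  header = FALSE))  # vector of lines (read.csv builds a textConnection internally)
system.time({                                                  # or build the textConnection yourself
    con <- textConnection(lines_vec)
    d <- read.csv(con, header = FALSE)
    close(con)
})
identical(a, b)  # same data frame either way; only the timing differs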
I'm still not sure if this is general or if my test case is pathological, but bottom line: if you came here looking to pass a vector of lines through a textConnection into read.csv, try pasting them all together as one string with internal newlines and passing that into readr::read_csv instead; maybe you'll get a speedup.
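Concretely, assuming `lines` holds the character vector of data lines, the pasted-string variants look like this (note: how readr accepts literal strings varies by version; recent readr prefers the string wrapped in I()):
one_csv <- paste(lines, collapse = "\n")                     # one big string with embedded newlines
dat  <- readr::read_csv(I(one_csv))                          # readr route; I() marks literal data
dat2 <- read.csv(text = one_csv, stringsAsFactors = FALSE)   # base-R route used in the demo above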
Best suggestion so far: scan the file twice, once to find the start and end lines and once with read.csv() at those offsets.
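A minimal sketch of that two-pass approach (not the exact code from the gist; the 65-character threshold and the header-row assumption are carried over from the snippets above):
read.twopass = function(file, min_chars = 65){
    # Pass 1: find the long lines, i.e. the CSV header row plus the data rows
    wanted = which(nchar(readLines(file)) > min_chars)
    firstData = min(wanted)
    lastData = max(wanted)
    # Pass 2: let read.csv re-read the file at those offsets,
    # skipping the short header lines and stopping before the footer lines
    read.csv(file,
        skip = firstData - 1,          # lines before the CSV header row
        nrows = lastData - firstData,  # data rows following the header row
        stringsAsFactors = FALSE)
}
The nrows cutoff is what keeps read.csv from trying to parse the short footer lines.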
This is still about 2x slower than read.csv() on a pre-cleaned version of the same data, so I suspect (without having profiled it) that essentially all of this version's remaining bottleneck is in reading the file.