```r
library(microbenchmark)
options(digits = 3)
```
# Reading a complete file with R
This is a short exploration of the most efficient way to read a complete file
(including newlines) into R. Previously I'd used `readLines()` plus `paste()`,
but that's clearly the least efficient option.
Here are the options:
* Use `readLines()` and `paste()`
```r
read_file1 <- function(path) {
  paste0(paste0(readLines(path), collapse = "\n"), "\n")
}
```
* Find out the size of the file and then use `readChar()`
```r
read_file2 <- function(path) {
  size <- file.info(path)$size
  readChar(path, size, useBytes = TRUE)
}
```
* As above, but using `readBin()`, then converting to a character vector.
Unfortunately you can't read into a character vector directly because
reading with `what = "character"` is limited to 10000 characters.
```r
read_file3 <- function(path) {
  size <- file.info(path)$size
  rawToChar(readBin(path, "raw", size))
}
```
* A safer approach that doesn't rely on the size reported by `file.info()`; the size is only used to guess how many chunks to preallocate. This avoids race conditions where the file changes between asking for its size and reading it. (Suggested by [@klmr](http://twitter.com/klmr))
```r
read_file4 <- function(path, chunk_size = 10000) {
  con <- file(path, "rb", raw = TRUE)
  on.exit(close(con))
  # Guess approximate number of chunks
  n <- file.info(path)$size / chunk_size
  chunks <- vector("list", n)
  i <- 1L
  chunks[[i]] <- readBin(con, "raw", n = chunk_size)
  while (length(chunks[[i]]) == chunk_size) {
    i <- i + 1L
    chunks[[i]] <- readBin(con, "raw", n = chunk_size)
  }
  rawToChar(unlist(chunks, use.names = FALSE))
}
```
* An alternative would be to use C++. This version was supplied by [@tim_yates](http://twitter.com/tim_yates/status/372369074019258370)
```r
library(Rcpp)
sourceCpp("read-file.cpp")
```
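The base R readers above are not quite interchangeable: `read_file1()` always ends its result with a newline, while the byte-oriented readers return exactly what is on disk. Here is a minimal sketch of that edge case. The temporary file and its contents are made up for illustration, and two of the readers are repeated so the block runs on its own.

```r
# Repeat two of the readers so this sketch is self-contained
read_file1 <- function(path) {
  paste0(paste0(readLines(path), collapse = "\n"), "\n")
}
read_file2 <- function(path) {
  size <- file.info(path)$size
  readChar(path, size, useBytes = TRUE)
}

# Hypothetical test file, written as raw bytes WITHOUT a trailing newline
tmp <- tempfile()
writeBin(charToRaw("line one\nline two"), tmp)

# readLines() warns about the incomplete final line; silence it for the demo
from_lines <- suppressWarnings(read_file1(tmp))
from_chars <- read_file2(tmp)

identical(from_lines, from_chars)                # FALSE
identical(from_lines, paste0(from_chars, "\n"))  # TRUE: only the final "\n" differs

unlink(tmp)
```

If byte-for-byte fidelity matters, this suggests preferring the `readChar()` or `readBin()` variants over the `readLines()` one.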
We'll compare the results on a file included with R:
```r
path <- file.path(R.home("doc"), "COPYING")
file.info(path)$size / 1024
## [1] 17.59
```

First we need to check they all return the same results. (They won't if the file doesn't include a trailing newline.)

```r
stopifnot(identical(read_file1(path), read_file2(path)))
stopifnot(identical(read_file1(path), read_file3(path)))
stopifnot(identical(read_file1(path), read_file4(path)))
stopifnot(identical(read_file1(path), read_file_cpp(path)))
stopifnot(identical(read_file1(path), read_file_cpp2(path)))
```

The benchmarking results are clear: `readChar()` is the best base R option, and by median time it is about six times faster than `readLines()` for this file. The safer approach using chunked `readBin()` reads is about 50% slower than `readChar()`. The C++ functions are both fast (more than 2x faster than `readChar()` and at least 14x faster than `readLines()`) and safe.

```r
library(microbenchmark)
microbenchmark(
  readLines = read_file1(path),
  readChar = read_file2(path),
  readBin = read_file3(path),
  chunked_read = read_file4(path),
  Rcpp = read_file_cpp(path),
  Rcpp2 = read_file_cpp2(path)
)
## Unit: microseconds
##          expr  min     lq median     uq  max neval
##     readLines 1415 1430.5 1442.0 1565.0 1631   100
##      readChar  213  225.5  232.0  244.5 2708   100
##       readBin  233  244.0  254.0  265.5  478   100
##  chunked_read  325  335.0  345.5  360.0  391   100
##          Rcpp   82   90.0  100.0  109.0  182   100
##         Rcpp2   73   76.0   84.0   89.5  114   100
```
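The relative speeds can be read off the median column of the benchmark output. As a small sketch, the medians below are copied by hand from that output (in microseconds) and expressed relative to `readChar()`. The variable names are only illustrative.

```r
# Median timings in microseconds, copied from the microbenchmark output
medians <- c(
  readLines = 1442, readChar = 232, readBin = 254,
  chunked_read = 345.5, Rcpp = 100, Rcpp2 = 84
)

# Time relative to readChar(): values above 1 are slower, below 1 are faster
ratio_vs_readChar <- round(medians / medians[["readChar"]], 2)
ratio_vs_readChar
# readLines 6.22, readChar 1.00, readBin 1.09,
# chunked_read 1.49, Rcpp 0.43, Rcpp2 0.36
```

These ratios come from a single run on one small file, so they are best treated as rough guides and not as general constants.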
Same comment I made to Tim -- change the signature to `CharacterVector read_file_cpp2(std::string fname)` to skip one more assignment.