Skip to content

Instantly share code, notes, and snippets.

@peterhurford
Last active November 5, 2024 08:13
Show Gist options
  • Save peterhurford/0d62f49fd43b6cf078168c043412f70a to your computer and use it in GitHub Desktop.
Save peterhurford/0d62f49fd43b6cf078168c043412f70a to your computer and use it in GitHub Desktop.
What's the fastest way to determine the number of rows of a CSV in R?
# What's the fastest way to determine the number of rows of a CSV in R?
# ...Reading the entire CSV to only get the dimensions is likely too slow. Is there a faster way?
# Benchmarks done on a EC2 r3.8xlarge
# Cowritten with Abel Castillo <github.com/abelcastilloavant>
m <- 1000000
d <- data.frame(id = seq(m), a = rnorm(m), b = runif(m))
dim(d)
# [1] 1000000 3
pryr::object_size(d)
# 20 MB
readr::write_csv(d, "tmp.csv")
microbenchmark::microbenchmark(
{lines <- 0; f <- file("tmp.csv", "r"); while (TRUE) {
line <- readLines(f, n = 1)
if (length(line) == 0) { break }; lines <- lines + 1}
print(lines - 1) }, # 2784.9ms
{ sqldf::read.csv.sql("tmp.csv", "select count(*) from file")[[1]] }, # 2103.0ms
{ length(readLines("tmp.csv")) - 1 }, # 1750.3ms
{ length(count.fields("tmp.csv")) - 1 }, # 1519.1ms
{ R.utils::countLines("tmp.csv")[[1]] - 1 }, # 1071.3ms
{ dim(data.table::fread("tmp.csv"))[[1]] }, # 493.4ms
{ NROW(data.table::fread("tmp.csv")) }, # 472.7ms
{ dim(readr::read_csv("tmp.csv"))[[1]] }, # 414.4ms
{ NROW(readr::read_csv("tmp.csv")) }, # 391.7ms
{ length(readr::count_fields("tmp.csv", tokenizer = readr::tokenizer_csv())) - 1 }, # 254.8ms
{ as.integer(strsplit(system("wc -l tmp.csv", intern = TRUE), " ")[[1]][[1]]) - 1 }, # 24.9ms
{ as.numeric(system("cat tmp.csv | wc -l", intern = TRUE)) - 1 }, # 17.9ms
times = 4)
@randy3k
Copy link

randy3k commented Jun 5, 2020

Wow, vroom::vroom_lines is close to wc -l.

m <- 1000000
d <- data.frame(id = seq(m), a = rnorm(m), b = runif(m))

readr::write_csv(d, "tmp.csv")

bench::mark(         
 readLines = { length(readLines("tmp.csv")) - 1 },                                  
 count.fields = { length(count.fields("tmp.csv")) - 1 },                                                    
 `readr::read_csv` = { dim(readr::read_csv("tmp.csv", col_types = readr::cols()))[[1]] },        
 `readr::count_fields` = { length(readr::count_fields("tmp.csv", tokenizer = readr::tokenizer_csv())) - 1 },
 `vroom::vroom_lines` = { length(vroom::vroom_lines("tmp.csv", altrep = TRUE, progress = FALSE)) - 1L }
 )
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 5 x 6
#>   expression               min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 readLines              1.39s    1.39s     0.722   38.25MB     1.44
#> 2 count.fields        587.81ms 587.81ms     1.70    11.63MB     0   
#> 3 readr::read_csv     577.05ms 577.05ms     1.73    27.91MB     0   
#> 4 readr::count_fields 244.89ms 258.77ms     3.86     3.82MB     1.93
#> 5 vroom::vroom_lines   26.91ms  29.82ms    31.9       2.7MB     0

Created on 2020-06-05 by the reprex package (v0.3.0)

@peterhurford
Copy link
Author

@randy3k wow nice!

@clap-hands-in-the-r
Copy link

really nice! thank you for sharing. see you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment