@billyfung
Created August 31, 2017 22:08
fread with data.table 1.10.5
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-08-22 22:20:41 UTC; travis
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> fread('demand_full.csv', verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
[1] Check arguments
Using 4 threads (omp_get_max_threads()=4, nth=4)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
[2] Opening the file
Opening file demand_full.csv
File opened, size = 1.185GB (1272708830 bytes).
Memory mapping ... ok
[3] Detect and skip BOM
[4] Detect end-of-line character(s)
Detected eol as \n only, the UNIX and Mac standard.
[6] Skipping initial rows if needed
Positioned on line 1 starting: <<2017-02-26,1,BOB1101,24.122>>
[7] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 4 fields using quote rule 0
Detected 4 columns on line 1. This line is either column names or first data row. Line starts as: <<2017-02-26,1,BOB1101,24.122>>
Quote rule picked = 0
[8] Determine column names
Some fields on line 1 are not type character. Treating as a data row and using default column names.
[9] Detect column types
Number of sampling jump points = 101 because (1272708829 bytes from row 1 to eof) / (2 * 2763 jump0size) == 230312
Type codes (jump 000) : 6265 Quote rule 0
Type codes (jump 100) : 6265 Quote rule 0
=====
Sampled 10048 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 1272708829
Line length: mean=28.46 sd=0.74 min=27 max=34
Estimated number of rows: 1272708829 / 28.46 = 44721260
Initial alloc = 49193386 rows (44721260 + 10%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[10] Apply user overrides on column types
After 0 type and 0 drop user overrides : 6265
[11] Allocate memory for the datatable
Allocating 4 column slots (4 - 0 dropped) with 49193386 rows
[12] Read the data
Read 99%. ETA 00:00
[13] Finalizing the datatable
Read 44777716 rows x 4 columns from 1.185GB (1272708830 bytes) file in 00:10.180 wall clock time
Thread buffers were grown 0 times (if all 4 threads each grew once, this figure would be 4)
Final type counts
0 : drop
0 : bool8
1 : int32
0 : int32
0 : int64
1 : float64
2 : string
=============================
0.001s ( 0%) Memory map 1.185GB file
0.001s ( 0%) sep=',' ncol=4 and header detection
0.039s ( 0%) Column type detection using 10048 sample rows
1.655s ( 16%) Allocation of 44777716 rows x 4 cols (1.283GB)
8.484s ( 83%) Reading 1216 chunks of 0.998MB (36777 rows) using 4 threads
= 0.138s ( 1%) Finding first non-embedded \n after each jump
+ 6.113s ( 60%) Parse to row-major thread buffers
+ 2.156s ( 21%) Transpose
+ 0.078s ( 1%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
10.180s Total
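
Step [8] of the log reports that line 1 was treated as a data row and default column names were used, so the result comes back with columns V1..V4. A minimal sketch of supplying names and types up front follows. The column names are hypothetical placeholders (the gist does not say what the fields mean), and the single sample row is copied from the "Positioned on line 1" message; the types mirror the final type counts above (one int32, one float64, two string).

```r
library(data.table)

# One row copied from the verbose log ("Positioned on line 1 starting: ...").
sample_csv <- "2017-02-26,1,BOB1101,24.122\n"

# Hypothetical names: the file has no header, so these are placeholders only.
cols <- c("date", "id", "code", "value")

dt <- fread(
  input      = sample_csv,  # a string containing \n is parsed as data, not as a filename
  header     = FALSE,       # line 1 is data, as the verbose output concluded
  col.names  = cols,
  colClasses = c("character", "integer", "character", "numeric")
)

str(dt)
```

For the real file, the same call with input = 'demand_full.csv' should apply; passing colClasses explicitly only pins the types that sampling already detected in step [9].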