data.table fread fixed width file reader
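# For reading fixed-width files, i.e. files whose columns have no delimiter (compare read_fwf() in the readr package).
# col_widths is a numeric vector of column widths (e.g., c(8, 4, 2, 9)).
# input_file is a character string giving the path to the input file (e.g., "./data/read.txt").
# Requires gawk and GNU parallel on the PATH, plus the data.table package.
# On a 300 MB file with 143 columns, timings on a 2018 MacBook Pro were as follows:
#   read_fwf from the readr package: 10.8 sec
#   non-parallel use of gawk: 10.5 sec
#   parallel use of gawk: 4.4 sec (the function below)
library(data.table)

flat_fread <- function(col_widths, input_file){
  # gawk's FIELDWIDTHS expects space-separated widths, e.g. "8 4 2 9"
  col_spec <- paste0(col_widths, collapse = " ")
  # parallel splits the file into chunks; gawk rebuilds each record with OFS="," so fread can parse it as CSV
  # caveats: the pattern $1=$1 may skip records whose first field evaluates to zero,
  # and row order may differ from the file unless parallel's --keep-order is added
  gawk_string <- paste0("parallel --pipepart -a ", input_file, " gawk \\'\\$1=\\$1\\' FIELDWIDTHS=\\'", col_spec, "\\' OFS=,")
  dt <- fread(cmd = gawk_string, header = FALSE, stringsAsFactors = FALSE)
  return(dt)
}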
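
# Illustrative sketch (not part of the original function): preview the shell command
# that the function assembles, without running gawk or GNU parallel.
# example_widths, example_file, example_spec and example_cmd are hypothetical names
# chosen for illustration only.
example_widths <- c(8, 4, 2, 9)
example_file <- "./data/read.txt"
example_spec <- paste0(example_widths, collapse = " ")
example_cmd <- paste0("parallel --pipepart -a ", example_file, " gawk \\'\\$1=\\$1\\' FIELDWIDTHS=\\'", example_spec, "\\' OFS=,")
# Print the command as the shell would receive it. The backslashes appear to be there
# so the quotes and dollar signs survive the first shell and reach the shell that
# parallel starts for each chunk.
cat(example_cmd, "\n")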