@pe3
Last active September 15, 2017 11:31
A bug in the readr R package

The data comes from the Finnish Ministry of Justice election results.

curl -O http://tulospalvelu.vaalit.fi/K2012/data/k-2012_ehd_maa.csv.zip
unzip k-2012_ehd_maa.csv.zip
head -n 1 kv-2012_teat_maa.csv > scandinavian.csv
iconv -f ISO-8859-1 -t UTF-8 scandinavian.csv > scandinavian-utf.csv
$ file -I scandinavian.csv 
scandinavian.csv: text/plain; charset=iso-8859-1
$ file -I scandinavian-utf.csv
scandinavian-utf.csv: text/plain; charset=utf-8
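For reference, the only byte-level difference between the two files is how the non-ASCII character Ä is stored: one byte (c4) in ISO-8859-1 and two bytes (c3 84) in UTF-8. The following is a minimal sketch that makes this visible, assuming a POSIX shell with `iconv` and `od` available. The variable names are illustrative and not part of the original report.

```shell
# "VÄNST" written with octal escapes so the script itself stays pure ASCII.
# \303\204 is the UTF-8 byte sequence for Ä.
utf8_hex=$(printf 'V\303\204NST' | od -An -tx1 | tr -d ' \n')

# The same string after conversion to ISO-8859-1, where Ä is the single byte c4.
latin1_hex=$(printf 'V\303\204NST' | iconv -f UTF-8 -t ISO-8859-1 | od -An -tx1 | tr -d ' \n')

echo "UTF-8:      $utf8_hex"
echo "ISO-8859-1: $latin1_hex"
```

Any comparison that works on raw bytes rather than on decoded strings will therefore treat the two spellings of VÄNST as different values.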
library("readr")
library("dplyr")
#####################################################################
# works as expected when column types are chr and file is iso-8859-1
read_delim(
  "scandinavian.csv",
  ";",
  locale = locale("fi", encoding = "iso-8859-1"),
  escape_double = FALSE,
  col_names = FALSE,
  trim_ws = TRUE
) %>% glimpse() %>% select(X11, X12)
# X11 X12
# <chr> <chr>
# 1 VAS VÄNST
#####################################################################
# works as expected when column types are fctr and file is UTF-8
read_delim(
  "scandinavian-utf.csv",
  ";",
  locale = locale("fi", encoding = "UTF-8"),
  escape_double = FALSE,
  col_names = FALSE,
  col_types = cols(X11 = col_factor(c("VAS")), X12 = col_factor(c("VÄNST"))),
  trim_ws = TRUE
) %>% glimpse() %>% select(X11, X12)
# X11 X12
# <chr> <chr>
# 1 VAS VÄNST
#####################################################################
# fails when column types are fctr and file is iso-8859-1: X12 is read as NA
read_delim(
  "scandinavian.csv",
  ";",
  locale = locale("fi", encoding = "iso-8859-1"),
  escape_double = FALSE,
  col_names = FALSE,
  col_types = cols(X11 = col_factor(c("VAS")), X12 = col_factor(c("VÄNST"))),
  trim_ws = TRUE
) %>% glimpse() %>% select(X11, X12)
# X11 X12
# <chr> <chr>
# 1 VAS NA
# should see:
# X11 X12
# <chr> <chr>
# 1 VAS VÄNST
K ;01;091;A;001A;HEL;HEL;06;01;01;VAS ;VÄNST ;0002;Kruununhaka A ;Kronohagen A ;Riku ;Ahola ;1;031;valtiotieteiden kandidaatti asiakasneuvoja ;091;Helsinki ;Helsingfors ; ; ; ; ; ; ;0000000;0000000;0000000;0000;0000;0000;0;0000000000;V;T;20121029180444;