@kbroman
Last active July 2, 2018 21:43
Weird problem with read.csv in R
gives_error.csv
id,chrom,cM,bp
1_444,1,0,44
MSAT1.1,1,162,263
5-727,5,34,727,,,
5-784,5,35,784,,,
5_472,5,24,472
gives_extra_rows.csv
id,chrom,cM,bp
1_444,1,0,44
1_112,1,2,112
T1G11,1,7,
MSAT1.1,1,162,263
5-727,5,34,727,,,
5-784,5,35,784,,,
5_472,5,24,472
# weird read.csv problem
# gives_extra_rows.csv has extra trailing commas on two lines;
# read.csv reads it without complaint but returns extra rows
print(x <- read.csv("gives_extra_rows.csv"))
# gives_error.csv is the same file with two lines omitted
# (neither of them a problem line); read.csv gives an error
y <- read.csv("gives_error.csv")

The read.csv function in R is just read.table with different defaults: sep=",", header=TRUE, fill=TRUE, and a few others (quote, dec, comment.char).
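A quick way to check this is to read the same file both ways and compare. This is a minimal sketch: the temporary file and the names `via_csv` and `via_table` are just for illustration, and the two data rows are borrowed from the files above.

```r
# Write a small, well-formed CSV to a temporary file.
f <- tempfile(fileext = ".csv")
writeLines(c("id,chrom,cM,bp",
             "1_444,1,0,44",
             "T1G11,1,7,"), f)

# read.csv with its defaults ...
via_csv <- read.csv(f)

# ... and read.table with the same settings spelled out.
via_table <- read.table(f, header = TRUE, sep = ",", quote = "\"",
                        dec = ".", fill = TRUE, comment.char = "")

# The two data frames should be the same.
identical(via_csv, via_table)
```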

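Also, read.table looks at the "head" of the file (the first five lines) to determine the number of columns. In gives_extra_rows.csv, the lines with the trailing commas come after those first five lines, so they don't affect the column count. In gives_error.csv they fall within the first five lines, so read.table sees more columns than the header has names, which is presumably where the error comes from.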
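To see what that five-line peek would find, count.fields reports the number of fields on each line. The sketch below recreates gives_extra_rows.csv in a temporary file (the name `n_fields` is mine) and only mimics the idea; the actual logic inside read.table is more involved.

```r
# Recreate gives_extra_rows.csv in a temporary file.
f <- tempfile(fileext = ".csv")
writeLines(c("id,chrom,cM,bp",
             "1_444,1,0,44",
             "1_112,1,2,112",
             "T1G11,1,7,",
             "MSAT1.1,1,162,263",
             "5-727,5,34,727,,,",
             "5-784,5,35,784,,,",
             "5_472,5,24,472"), f)

# Number of comma-separated fields on each line, header included.
n_fields <- count.fields(f, sep = ",", quote = "\"", comment.char = "")
print(n_fields)

# Largest field count in the first five lines versus the whole file.
max(n_fields[1:5])
max(n_fields)
```

If the two maxima differ, some later line is wider than anything in the first five lines.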

fill=TRUE could work okay if some of the later rows in a file have fewer columns than the rows at the top, but it results in garbage if, as here, later rows have extra columns: the extra fields get wrapped around into new rows, which are then padded with NAs.
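One way to spot the wrap-around is to compare the number of rows read.csv returns with the number of data lines in the file. This is a sketch on a temporary copy of gives_extra_rows.csv; `n_data_lines` is a name I made up.

```r
# Recreate gives_extra_rows.csv in a temporary file.
f <- tempfile(fileext = ".csv")
writeLines(c("id,chrom,cM,bp",
             "1_444,1,0,44",
             "1_112,1,2,112",
             "T1G11,1,7,",
             "MSAT1.1,1,162,263",
             "5-727,5,34,727,,,",
             "5-784,5,35,784,,,",
             "5_472,5,24,472"), f)

x <- read.csv(f)
print(x)

# Data lines in the file (everything except the header).
n_data_lines <- length(readLines(f)) - 1L

# More rows than data lines means some fields were wrapped into new rows.
nrow(x)
n_data_lines
```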

My conclusion: Don't use read.csv, and when you use read.table, don't use fill=TRUE.
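One possible way to follow that advice is to check the field counts up front and refuse to read a ragged file. The helper `read_csv_strict` below is hypothetical (not part of R); it is only a sketch built from count.fields and read.table with fill=FALSE, and the line numbers it reports count non-blank lines, since count.fields skips blank ones by default.

```r
# Hypothetical helper: stop if any line has a different number of
# fields than the header, otherwise read with fill = FALSE.
read_csv_strict <- function(file) {
  n_fields <- count.fields(file, sep = ",", quote = "\"", comment.char = "")
  bad <- which(n_fields != n_fields[1])
  if (length(bad) > 0) {
    stop("lines with a different number of fields than the header: ",
         paste(bad, collapse = ", "))
  }
  read.table(file, header = TRUE, sep = ",", quote = "\"",
             fill = FALSE, comment.char = "")
}

# A well-formed file is read as usual.
good <- tempfile(fileext = ".csv")
writeLines(c("id,chrom,cM,bp", "1_444,1,0,44", "5_472,5,24,472"), good)
print(read_csv_strict(good))

# A file with trailing commas on one line is rejected with an error.
ragged <- tempfile(fileext = ".csv")
writeLines(c("id,chrom,cM,bp", "1_444,1,0,44", "5-727,5,34,727,,,"), ragged)
result <- tryCatch(read_csv_strict(ragged), error = function(e) e)
print(conditionMessage(result))
```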
