CSV (comma-separated values) files organise their data by separating them with newlines and commas. If a desired piece of data, say a string of text, itself contains a comma, then this string must be surrounded by double quotes:
5,7,8
-- three values
"5,7",8
-- two values
If the string also contains double quote (") characters, they must be escaped with a preceding double quote, so
"i want this text in double quotes"
could become
"i want ""this text"" in double quotes"
Note that, although escaping commas with double quotes is RFC 4180 standard, Microsoft doesn't follow this recommendation, instead just surrounding text which contains commas with double quotes. Microsoft also strips newline characters from data before exporting them to CSV. So the only escaped character we need to worry about is the double quote.
Because of this, a string of text will always be delimited by a pair of double quote characters and contain an even number of double quote characters. By treating the escape sequences (in this case, only "") as special groups in regex, we can easily capture double quoted text in CSV.
To capture strings in double quotes (including commas and escaped double quotes):
(?:,|\n|^) # all values must start at the beginning of the file,
# the end of the previous line, or at a comma
( # single capture group for ease of use; CSV can be either...
" # ...(A) a double quoted string, beginning with a double quote (")
(?: # character, containing any number (0+) of
(?:"")* # escaped double quotes (""), or
[^"]* # non-double quote characters
)* # in any order and any number of times
" # and ending with a double quote character
| # ...or (B) a non-quoted value
[^",\n]* # containing any number of characters which are not
# double quotes ("), commas (,), or newlines (\n)
| # ...or (C) a single newline or end-of-file character,
# used to capture empty values at the end of
(?:\n|$) # the file or at the ends of lines
)
(?:,|\n|^)("(?:(?:"")*[^"]*)*"|[^",\n]*|(?:\n|$))
"(?:,|\\n|^)(\"(?:(?:\"\")*[^\"]*)*\"|[^\",\\n]*|(?:\\n|$))"
10:28 |andrew@bioinfo-wstation-01 parse| jshell
| Welcome to JShell -- Version 9.0.1
| For an introduction type: /help intro
jshell> import java.util.regex.Matcher
jshell> import java.util.regex.Pattern
jshell> String regex = "(?:,|\\n|^)(\"(?:(?:\"\")*[^\"]*)*\"|[^\",\\n]*|(?:\\n|$))"
jshell> String test1 = "\"a\",\"b\",c,\"d\",e,f,,\"g\""
jshell> String test2 = "\"\"\"test\"\" one\",test' two,\"\"\"test\"\" 'three'\",\"\"\"test 'four'\"\"\""
jshell> String test3 = "a,,\"c,c\",\" \"\"d\"\"\",\",\"\",\",f,,"
jshell> Matcher matcher = Pattern.compile(regex).matcher(test1)
jshell> int i = 0; while(matcher.find()) { System.out.println(i + ": " + matcher.group(1)); ++i; }
i ==> 0
0: "a"
1: "b"
2: c
3: "d"
4: e
5: f
6:
7: "g"
jshell> Matcher matcher = Pattern.compile(regex).matcher(test2)
jshell> i = 0; while(matcher.find()) { System.out.println(i + ": " + matcher.group(1)); ++i; }
i ==> 0
0: """test"" one"
1: test' two
2: """test"" 'three'"
3: """test 'four'"""
jshell> Matcher matcher = Pattern.compile(regex).matcher(test3)
jshell> i = 0; while(matcher.find()) { System.out.println(i + ": " + matcher.group(1)); ++i; }
i ==> 0
0: a
1:
2: "c,c"
3: " ""d"""
4: ","","
5: f
6:
7:
jshell>
Test 1, above, shows that the parser correctly handles double quoted values, unquoted values, and empty values (,,).
Test 2 shows that it correctly handles escaped double quotes, embedded single quotes, unmatched single quotes, and combinations of these.
Test 3 shows that the parser can correctly handle commas within double quotes, whitespace at the beginning or end of quoted values, unmatched escaped quotes (ie. only one set of "" within a quoted value), and empty values at the end of lines.
The top answer for this softwareengineering.stackexchange post recommends some "edge cases", all of which this parser passes, except the "escaped commas" (",
) which are not standard in Microsoft-style CSV files.
(Inspired by this stackoverflow post.)
This is some very impressive Regex-fu. Works with Qt's regex class.