Skip to content

Instantly share code, notes, and snippets.

@awwsmm
Last active August 2, 2023 03:36
Show Gist options
  • Save awwsmm/886ac0ce0cef517ad7092915f708175f to your computer and use it in GitHub Desktop.
Save awwsmm/886ac0ce0cef517ad7092915f708175f to your computer and use it in GitHub Desktop.
Regex for parsing Microsoft-style CSV data

Parse Microsoft-style CSV data with regex

Background

CSV (comma-separated values) files organise their data by separating them with newlines and commas. If a desired piece of data, say a string of text, itself contains a comma, then this string must be surrounded by double quotes:

5,7,8 -- three values

"5,7",8 -- two values

If the string also contains double quote (") characters, they must be escaped with a preceding double quote, so

"i want this text in double quotes"

could become

"i want ""this text"" in double quotes"

Note that, although escaping commas with double quotes is RFC 4180 standard, Microsoft doesn't follow this recommendation, instead just surrounding text which contains commas with double quotes. Microsoft also strips newline characters from data before exporting them to CSV. So the only escaped character we need to worry about is the double quote.

Because of this, a string of text will always be delimited by a pair of double quote characters and contain an even number of double quote characters. By treating the escape sequences (in this case, only "") as special groups in regex, we can easily capture double quoted text in CSV.

Explanation of regex

To capture strings in double quotes (including commas and escaped double quotes):

(?:,|\n|^)      # all values must start at the beginning of the file,  
                #   the end of the previous line, or at a comma  
(               # single capture group for ease of use; CSV can be either...  
  "             # ...(A) a double quoted string, beginning with a double quote (")  
    (?:         #        character, containing any number (0+) of  
      (?:"")*   #          escaped double quotes (""), or  
      [^"]*     #          non-double quote characters  
    )*          #        in any order and any number of times  
  "             #        and ending with a double quote character  

  |             # ...or (B) a non-quoted value  

  [^",\n]*      # containing any number of characters which are not  
                # double quotes ("), commas (,), or newlines (\n)  

  |             # ...or (C) a single newline or end-of-file character,  
                #           used to capture empty values at the end of  
  (?:\n|$)      #           the file or at the ends of lines  
)

In a more condensed form:

Regex:

(?:,|\n|^)("(?:(?:"")*[^"]*)*"|[^",\n]*|(?:\n|$))

Escaped regex (" and \n instead of " and \n):

"(?:,|\\n|^)(\"(?:(?:\"\")*[^\"]*)*\"|[^\",\\n]*|(?:\\n|$))"

Tests with java

Test 1:

In CSV: "a","b",c,"d",e,f,,"g"

escaped: "\"a\",\"b\",c,\"d\",e,f,,\"g\""

Test 2:

In CSV: """test"" one",test' two,"""test"" 'three'","""test 'four'"""

Escaped: "\"\"\"test\"\" one\",test' two,\"\"\"test\"\" 'three'\",\"\"\"test 'four'\"\"\""

Test 3:

In CSV: a,,"c,c"," ""d""",","",",f,,

Escaped: "a,,\"c,c\",\" \"\"d\"\"\",\",\"\",\",f,,"

10:28 |andrew@bioinfo-wstation-01 parse| jshell
|  Welcome to JShell -- Version 9.0.1
|  For an introduction type: /help intro

jshell> import java.util.regex.Matcher

jshell> import java.util.regex.Pattern

jshell> String regex = "(?:,|\\n|^)(\"(?:(?:\"\")*[^\"]*)*\"|[^\",\\n]*|(?:\\n|$))"

jshell> String test1 = "\"a\",\"b\",c,\"d\",e,f,,\"g\""

jshell> String test2 = "\"\"\"test\"\" one\",test' two,\"\"\"test\"\" 'three'\",\"\"\"test 'four'\"\"\""

jshell> String test3 = "a,,\"c,c\",\" \"\"d\"\"\",\",\"\",\",f,,"

jshell> Matcher matcher = Pattern.compile(regex).matcher(test1)

jshell> int i = 0; while(matcher.find()) { System.out.println(i + ": " + matcher.group(1)); ++i; }
i ==> 0
0: "a"
1: "b"
2: c
3: "d"
4: e
5: f
6:
7: "g"

jshell> Matcher matcher = Pattern.compile(regex).matcher(test2)

jshell> i = 0; while(matcher.find()) { System.out.println(i + ": " + matcher.group(1)); ++i; }
i ==> 0
0: """test"" one"
1: test' two
2: """test"" 'three'"
3: """test 'four'"""

jshell> Matcher matcher = Pattern.compile(regex).matcher(test3)

jshell> i = 0; while(matcher.find()) { System.out.println(i + ": " + matcher.group(1)); ++i; }
i ==> 0
0: a
1:
2: "c,c"
3: " ""d"""
4: ","","
5: f
6:
7:

jshell>

Discussion

Test 1, above, shows that the parser correctly handles double quoted values, unquoted values, and empty values (,,).

Test 2 shows that it correctly handles escaped double quotes, embedded single quotes, unmatched single quotes, and combinations of these.

Test 3 shows that the parser can correctly handle commas within double quotes, whitespace at the beginning or end of quoted values, unmatched escaped quotes (ie. only one set of "" within a quoted value), and empty values at the end of lines.

The top answer for this softwareengineering.stackexchange post recommends some "edge cases", all of which this parser passes, except the "escaped commas" (",) which are not standard in Microsoft-style CSV files.

(Inspired by this stackoverflow post.)

Comments, corrections, and suggestions welcome.

@ademarest
Copy link

This is some very impressive Regex-fu. Works with Qt's regex class.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment