Last active July 19, 2019 19:36
Sed script to fix a CSV file with unescaped newlines.
#!/bin/sed -nrf
# First, replace blank fields with quoted blanks for consistency (longest runs first).
s_,,,,_,"","","",_g
s_,,,_,"","",_g
s_,,_,"",_g
s_,$_,""_   # then handle a blank trailing field the same way
/^([^"]|",|"")/ {   # line starts with a non-quote, `",` or `""`: it continues an incomplete record
    x               # swap: previous line [see (*) below] into pattern space, this continuation into hold space
    G               # append the held continuation to the pattern space, separated by \n
    s,\n,\\n,m      # escape the embedded newline as a literal \n
    /[^"]"$/ p      # print the record if it now ends with a closing "
    h               # hold the joined record in case the next line continues it too, i.e. the record spans more than 2 lines (*)
    d               # start processing the next line
}                   # end of multi-line handling
/[^"]"$/ p          # print the line if it ends with a closing "
h                   # put the pattern space into hold space (*)
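The swap-and-append step (`x` followed by `G`) is what stitches a broken record back together. The snippet below is a reduced sketch of just that idea, not the script itself: it assumes GNU sed, only treats lines starting with a non-quote character as continuations, and uses a made-up variable name (`joined`) for illustration.

```shell
# Reduced sketch of the hold-space join (assumes GNU sed).
# Input: one record whose second field is broken across two physical lines.
joined=$(printf '"2","hello\nworld","there"\n' | sed -n '
/^[^"]/ {
    x
    G
    s,\n,\\n,
}
/[^"]"$/ p
h
')
# x  : bring the held partial record back, park the continuation in hold space
# G  : append the continuation after a newline
# s  : turn that real newline into a literal backslash-n
printf '%s\n' "$joined"
```

With the two-line input above this should print the record on a single line, with the line break replaced by a literal `\n`.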
foo.csv
1 | asdf | jkl;
---|---|---
2 | hello world | there
3 | foo bar boo | baz
4 | whatever | wherever
5 | text with "quoted quotes" in it | too
6 | some more | data here
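The rendered table hides where the line breaks actually fall. The following is a reconstruction of the raw file, inferred from the `\n` markers in the script's output, so treat the exact layout as an assumption rather than a copy of the original. The scratch directory and the variable names (`workdir`, `line_count`, `record_count`) are purely illustrative.

```shell
# Work in a scratch directory so nothing in the current directory is overwritten.
workdir=$(mktemp -d) && cd "$workdir"

# Presumed raw contents of foo.csv: six records, several broken across lines.
cat > foo.csv <<'EOF'
"1","asdf","jkl;"
"2","hello
world","there"
"3","foo
bar
boo","baz"
"4","whatever
","wherever"
"5","text with
""quoted quotes"" in it","too"
"6","some more","data here"
EOF

# Physical lines versus records (a record starts with a quoted number here).
line_count=$(wc -l < foo.csv)
record_count=$(grep -c '^"[0-9]' foo.csv)
echo "physical lines: $line_count, logical records: $record_count"
```

Six records spread over eleven physical lines is exactly the situation the script is meant to repair.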
$ ./fix-multi foo.csv
"1","asdf","jkl;"
"2","hello\nworld","there"
"3","foo\nbar\nboo","baz"
"4","whatever\n","wherever"
"5","text with\n""quoted quotes"" in it","too"
"6","some more","data here"
View foo.csv in raw mode to see the line breaks; apparently GitHub's CSV renderer does a pretty damn good job of handling multiline records! 😛
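One quick way to sanity-check the repaired output is to confirm that every line now carries the same number of fields. The sketch below pastes the output shown earlier into a here-document and counts fields with awk. It assumes every field is quoted and that no field contains the sequence `","`, which holds for this sample but not for CSV in general. The variable names (`fixed`, `field_counts`) are illustrative.

```shell
# Repaired output from the sample run, inlined so the check is self-contained.
fixed=$(cat <<'EOF'
"1","asdf","jkl;"
"2","hello\nworld","there"
"3","foo\nbar\nboo","baz"
"4","whatever\n","wherever"
"5","text with\n""quoted quotes"" in it","too"
"6","some more","data here"
EOF
)

# Split each line on the "," separator between quoted fields,
# then list the distinct field counts that occur.
field_counts=$(printf '%s\n' "$fixed" | awk -F'","' '{ print NF }' | sort -u)
printf '%s\n' "$field_counts"
```

A single value of 3 means every record ended up on one line with three fields. More than one distinct value would point to a record that is still split or was joined incorrectly.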