Skip to content

Instantly share code, notes, and snippets.

@minusInfinite
Last active August 5, 2021 11:59
Show Gist options
  • Select an option

  • Save minusInfinite/ead10cf7f9fe7ceacdc6186f56523052 to your computer and use it in GitHub Desktop.

Select an option

Save minusInfinite/ead10cf7f9fe7ceacdc6186f56523052 to your computer and use it in GitHub Desktop.
Python, CSVs and Newlines

Python, CSVs and Newlines

During my day-to-day, I work with a lot of CSV like data. The data is functional when import into Excel or using RBQL via VS Code, but now and then, you will click "Text to Column", and some data moves or Excel attempt to reformat it incorrectly.

More often than I can count, I've found this is due to lines either not having the same amount of commas or a newline between quotations. Having these discrepancies can make life difficult. While manually editing these files is possible when the data set is 60,000 or more lines, time spent on manual correction is costly.

Out of curiosity, I sort a solution with Python. Not out of any love for the language itself, it was offered as a frequent solution when searching "Fix CSV with issue" in Google. That said, it's been an interesting endeavour across many recommendations offered between Python 2,3 as well as possible just making a batch, PowerShell or bash script.

The most difficult one was finding a method of removing newlines from a CSV "cell" between quotations. Often people would create a replacement of the row and attempt to use .script() to remove the newlines. While this did remove any spaces where the line existed it would not close the quotation.

For example;

"hello
world!"

Wouldn't become "hello world!"

This was a similar experience with .replace(). Other suggestions were to create a temp script that would be filled with what was missing and inject it back into the line. I found this method to have mixed results, while it was possible to resolve the main issue of not having a single line. It was also likely it truncate the data, removing vital information.

Pouring over Google results, Stackoverflow.com and the official Python docs I eventually come across .splitlines() and ended up with the following code block which resolved my issues completely.

with open(INFILE, "r", newline="") as input:
   with open(OUTFILE, "w", newline="") as output:
      w = csv.writer(output, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
      for record in csv.reader(input):
         w.writerow(tuple(' '.join(s.splitlines()) for s in record))

If I'm to understand how it works, it takes the CSV modules, creates a reader against the in file and for each column, it will split the script into a list, then join it all back into one line.

This has resolved my issues with this data set. Given it's a common Stack Overflow question, I gather this may assist someone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment