Skip to content

Instantly share code, notes, and snippets.

@lelandbatey
Last active August 29, 2015 14:25
Show Gist options
  • Save lelandbatey/d043a7da6794e624d1af to your computer and use it in GitHub Desktop.
Save lelandbatey/d043a7da6794e624d1af to your computer and use it in GitHub Desktop.
Remove errant line-feed characters in Windows csv
# To run this file, change the value below where it says ".\sample.csv" to the
# path to the CSV file you want to change. Then, to have this script write to
# a new csv file, pipe it into the Out-File cmdlet. For example, you might run
# this file (AFTER MODIFYING THE PATH TO THE INPUT FILE DOWN BELOW) with an
# invocation like this:
#
# .\remove_lf.ps1 | Out-File .\fixed_version.csv -encoding ASCII
#
# Note the option `-encoding ASCII` above. Leaving this option off might not
# cause any problems, but it may also break the output in some way. See what
# works for you!
# Reads the contents of the csv file into the variable "csv_contents"
#
# Since "Get-Content" does not actually give you a string, but an array of
# string objects, one for each line, and since doing this means interpreting
# the newline characters (which we don't want it to do), we use '-Raw' to have
# it get the raw text. You can read about this here:
# http://powershell.org/wp/2013/10/21/why-get-content-aint-yer-friend/
$csv_contents = Get-Content ".\sample.csv" -Raw
# We want to replace all the "line-feed" characters (ASCII value of 10 in
# decimal) with nothing, which is the same as deleting them. Then we want to
# replace all "carriage return" characters with "carriage-return line-feed"
# characters since that's what Windows interprets as "make a new line".
$no_linefeeds = $csv_contents -replace "`n",""
$proper_newlines = $no_linefeeds -replace "`r","`r`n"
# If you want to replace the bare newlines with spaces instead of deleting
# them, delete the lines above and use these lines:
#
# $no_linefeeds = $csv_contents -replace "`n"," "
# $proper_newlines = $no_linefeeds -replace " `r","`r`n"
$proper_newlines
# Why does this madness happen?
# In the world of ASCII characters, there are two different characters for
# "print a new line", and with subtle differences. Their names and purposes
# date all the way back to the old-school mechanical teletype machines.
# Because they where mechanical, there where some aspects of their operations
# we don't consider now when we use software-based text systems.
#
# A teletype machine, like a typewriter, had an operation that was "move the
# paper that's being fed in up one line so that the space below the current
# character is one `line` below where we where." This operation was called a
# "Line-Feed", often abbreviated as just "LF". The problem was that this
# shifted the paper up and the cursor down a line, but it did not move the
# type head to the start (the left-most edge) of the paper. So if you where in
# the center of the page and hit the button for "Line Feed", you would drop
# down a line but still be floating in the center of the page.
#
# To move to the "beginning" (left-most edge) of the page, there was an
# operator called "Carriage Return", named because it returned the carriage
# holding the paper to it's base position, which is with the typing head at
# the left-most edge of the page. It is usually abbreviated "CR"
#
# Because these where two discrete operations, a "newline" as we think of it
# was actually two values, first a carriage-return to move to the left most
# edge of the page, then a line-feed to move to the line below.
#
# Computers came along and since teletype machines where just electronically
# controllable typewriters, your computer had to send a "CR" followed by an
# "LF" to print a proper newline on the teletype. This led to the standard of
# representing a "newline" character as a "CR" character followed by an "LF"
# character.
#
# Later, some people thought that was dumb and switched to just "LF" for the
# "newline" character. Unfortunately, since Windows was programmed at the time
# when "CRLF" was a the dominant style, and since Windows is so focused on
# backwards compatibility, most Windows things nowadays still require it.
#
# For whatever reason, the CSV files you're getting as input seem to use
# `crlf` as a line-terminator, but they also use single `lf` characters as
# some sort of separator. So while this can just remove those, I've also
# included code above that will replace those bare `lf` characters with
# spaces, so the separation will still be visible.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment