Last active
August 29, 2015 14:25
-
-
Save lelandbatey/d043a7da6794e624d1af to your computer and use it in GitHub Desktop.
Remove errant line-feed characters in Windows csv
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# To run this file, change the value below where it says ".\sample.csv" to the | |
# path to the CSV file you want to change. Then, to have this script write to | |
# a new csv file, pipe it into the Out-File cmdlet. For example, you might run | |
# this file (AFTER MODIFYING THE PATH TO THE INPUT FILE DOWN BELOW) with an | |
# invocation like this: | |
# | |
# .\remove_lf.ps1 | Out-File .\fixed_version.csv -encoding ASCII | |
# | |
# Note the option `-encoding ASCII` above. Leaving this option off might not | |
# cause any problems, but it may also break the output in some way. See what | |
# works for you! | |
# Reads the contents of the csv file into the variable "csv_contents" | |
# | |
# Since "Get-Content" does not actually give you a string, but an array of | |
# string objects, one for each line, and since doing this means interpreting | |
# the newline characters (which we don't want it to do), we use '-Raw' to have | |
# it get the raw text. You can read about this here: | |
# http://powershell.org/wp/2013/10/21/why-get-content-aint-yer-friend/ | |
$csv_contents = Get-Content ".\sample.csv" -Raw | |
# We want to replace all the "line-feed" characters (ASCII value of 10 in | |
# decimal) with nothing, which is the same as deleting them. Then we want to | |
# replace all "carriage return" characters with "carriage-return line-feed" | |
# characters since that's what Windows interprets as "make a new line". | |
$no_linefeeds = $csv_contents -replace "`n","" | |
$proper_newlines = $no_linefeeds -replace "`r","`r`n" | |
# If you want to replace the bare newlines with spaces instead of deleting | |
# them, delete the lines above and use these lines: | |
# | |
# $no_linefeeds = $csv_contents -replace "`n"," " | |
# $proper_newlines = $no_linefeeds -replace " `r","`r`n" | |
$proper_newlines | |
# Why does this madness happen? | |
# In the world of ASCII characters, there are two different characters for | |
# "print a new line", and with subtle differences. Their names and purposes | |
# date all the way back to the old-school mechanical teletype machines. | |
# Because they where mechanical, there where some aspects of their operations | |
# we don't consider now when we use software-based text systems. | |
# | |
# A teletype machine, like a typewriter, had an operation that was "move the | |
# paper that's being fed in up one line so that the space below the current | |
# character is one `line` below where we where." This operation was called a | |
# "Line-Feed", often abbreviated as just "LF". The problem was that this | |
# shifted the paper up and the cursor down a line, but it did not move the | |
# type head to the start (the left-most edge) of the paper. So if you where in | |
# the center of the page and hit the button for "Line Feed", you would drop | |
# down a line but still be floating in the center of the page. | |
# | |
# To move to the "beginning" (left-most edge) of the page, there was an | |
# operator called "Carriage Return", named because it returned the carriage | |
# holding the paper to it's base position, which is with the typing head at | |
# the left-most edge of the page. It is usually abbreviated "CR" | |
# | |
# Because these where two discrete operations, a "newline" as we think of it | |
# was actually two values, first a carriage-return to move to the left most | |
# edge of the page, then a line-feed to move to the line below. | |
# | |
# Computers came along and since teletype machines where just electronically | |
# controllable typewriters, your computer had to send a "CR" followed by an | |
# "LF" to print a proper newline on the teletype. This led to the standard of | |
# representing a "newline" character as a "CR" character followed by an "LF" | |
# character. | |
# | |
# Later, some people thought that was dumb and switched to just "LF" for the | |
# "newline" character. Unfortunately, since Windows was programmed at the time | |
# when "CRLF" was a the dominant style, and since Windows is so focused on | |
# backwards compatibility, most Windows things nowadays still require it. | |
# | |
# For whatever reason, the CSV files you're getting as input seem to use | |
# `crlf` as a line-terminator, but they also use single `lf` characters as | |
# some sort of separator. So while this can just remove those, I've also | |
# included code above that will replace those bare `lf` characters with | |
# spaces, so the separation will still be visible. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment