For this tutorial, we have the following brief: to take a file that contains many millions of lines of text, each one of them a single column, something that looks like this:
POPLKLMNE
GKITLOPOM
QASSLKOPI
== snip ==
...into multiple text files, each with a fixed number of lines, and each with several columns per line instead of one, separated with a comma. In other words, something that looks like this:
POPLKLMNE,GKITLOPOM,QASSLKOPI
MKFOISDGT,IKLFPPMVG,TTQPFPSMX
== snip ==
Also, the files need to have Windows line endings (\r\n) rather than the Unix ones (\n) that our programs will probably give us by default.
Step one is to identify the discrete steps in the problem. We can't do all of this in one go, not unless we're prepared to write a whole new program to do it, which we want to avoid. The steps in this case are: 1. split the file up into chunks with a given number of lines each; 2. transpose rows into columns; 3. convert the line endings to Windows ones.
How we do this depends on whether we want every output file to have the same number of lines. If we do, we can use the split command; running split -l 1000 foo.txt will split foo.txt into lots of separate files, each containing 1,000 lines, and each named uniquely.
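By default, split names its output files xaa, xab, and so on. If you'd like a more descriptive prefix, you can pass one as an extra argument (the chunk- prefix below is just an example):
$ split -l 1000 foo.txt chunk-
That gives you chunk-aa, chunk-ab and so on, still 1,000 lines each.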
If we want to split the file into arbitrarily sized chunks, it gets a bit harder, and we need to do it in multiple steps. First we use head to get the first N lines of the file, then tail to get the remainder. This lets us split the file in two, and we can then repeat the process on the remainder file to keep splitting. So let's say we need to split the file into 1,000 lines, then 2,000 lines, then 1,500 lines. We'd do:
$ head -n 1000 file.txt > split-1.txt
$ tail -n +1001 file.txt > remainder-1.txt
$ head -n 2000 remainder-1.txt > split-2.txt
$ tail -n +2001 remainder-1.txt > remainder-2.txt
$ head -n 1500 remainder-2.txt > split-3.txt
$ rm remainder-?.txt
After running these commands we should have three files (split-1.txt, split-2.txt and split-3.txt) containing the splits that we wanted. Beware of fencepost errors here: the line number we pass to tail with -n + is one past the last line head took, so it's N + 1, not N.
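If you have more than a handful of splits to make, the same head-and-tail dance can be wrapped in a small shell loop. This is only a sketch; the sizes and filenames below are made up to match the example above:
sizes="1000 2000 1500"
rest=file.txt
i=1
for n in $sizes; do
  # take the first $n lines of whatever is left
  head -n "$n" "$rest" > "split-$i.txt"
  # and stash everything after line $n for the next pass
  tail -n "+$((n + 1))" "$rest" > "remainder-$i.txt"
  rest="remainder-$i.txt"
  i=$((i + 1))
done
rm remainder-*.txt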
Next we need to transpose the rows into columns. The easiest way to do this is with a script. I wrote one that does this, and you can install it with:
curl 'https://gist.githubusercontent.com/robmiller/0243ea79350c339e7e2a/raw/f9614a2d90580420f633745e55c9cc8efdac9858/rows2cols' > ~/bin/rows2cols && chmod +x ~/bin/rows2cols
Once it's installed, you can use it as follows:
$ rows2cols -c COLUMNS -s SEPARATOR FILENAME
Where COLUMNS is the number of columns you want on each line, SEPARATOR is the separator character, and FILENAME is the filename to operate on.
So, to split our first file into 4 columns per line, separated by a comma, we'd call it as:
$ rows2cols -c4 -s, split-1.txt > split-1-4cols.txt
(I always try to use new filenames for each stage in the process, so that it's not destructive; if I fuck up, I can just rm split-1-4cols.txt and try again. Never overwrite the original files.)
We can repeat this for each of our files, and it should just work.
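If you'd rather not install a script at all, the standard paste utility can do the same job for a fixed number of columns; this is just an alternative, not what rows2cols does internally. Each - tells paste to read one more line from standard input, so four dashes give four comma-separated columns per row:
$ paste -d, - - - - < split-1.txt > split-1-4cols.txt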
The final step, converting the line endings, is straightforward. A utility called unix2dos ships with most Linux distributions but not OS X; however, we can install it with Homebrew:
$ brew install unix2dos
We then just pass it the filenames we want to convert:
$ unix2dos split-1.txt split-2.txt split-3.txt
And it will convert them in-place. It also comes with a dos2unix command for going the other way, as well as utilities for converting to and from Mac line endings (\r).
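If installing unix2dos isn't an option, a quick one-liner can do the conversion instead. This is just a sketch using perl, and it writes to a new file rather than converting in place (repeat for the other splits):
$ perl -pe 's/\n/\r\n/' split-1.txt > split-1-dos.txt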
We now want to verify three things: first, that the number of lines in each file is correct; second, that the total number of lines across all the files is correct; and third, that the total number of unique lines across all the files is correct.
We can do this as follows. For lines per file and total lines we can use the wc ("word count") utility, passing it -l to tell it to count lines rather than words:
$ wc -l split-?.txt
For total unique lines, we need a few more commands:
$ cat split-?.txt | sort | uniq | wc -l
Here we concatenate the files with cat, then pass them to sort to be sorted alphabetically. Then uniq removes any duplicates (it only removes adjacent duplicate lines, which is why we sort first), and finally wc -l tells us how many lines are left at that point. If there were duplicate lines, we'll see a number lower than the total number of lines above.
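To save eyeballing the numbers, you can compare the two counts directly. A rough sketch, assuming the split-?.txt filenames from above (sort -u is equivalent to sort | uniq):
total=$(cat split-?.txt | wc -l)
unique=$(sort -u split-?.txt | wc -l)
# the two counts should match if every line is unique
if [ "$total" -eq "$unique" ]; then
  echo "no duplicate lines"
else
  echo "$((total - unique)) duplicate line(s)"
fi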
Ta-da! That's it. Lots of little reusable scripts combined to perform a more complex task. Woohoo Unix!
Further reading:
An introduction to pipelines: http://www.december.com/unix/tutor/pipesfilters.html
The GNU coreutils manual: https://www.gnu.org/software/coreutils/manual/coreutils.html#toc_Output-of-entire-files
Sculpting text with regex, grep, sed, awk, emacs and vim: http://matt.might.net/articles/sculpting-text/