For this tutorial, we have the following brief: to take a file that contains many millions of lines of text, each one of them a single column, something that looks like this:
POPLKLMNE
GKITLOPOM
QASSLKOPI
== snip ==
...into multiple text files, each with a fixed number of lines, and each with several columns per line instead of one, separated with a comma. In other words, something that looks like this:
POPLKLMNE,GKITLOPOM,QASSLKOPI
MKFOISDGT,IKLFPPMVG,TTQPFPSMX
== snip ==
Also, the files need to have Windows line endings (\r\n) rather than the Unix ones (\n) that our programs will probably give us by default.
Step one is to identify the discrete steps in the problem. We can't do all of this in one go, not unless we're prepared to write a whole new program to do it, which we want to avoid. The steps in this case are: 1. split the file up into chunks with a given number of lines each; 2. transpose rows into columns; 3. convert the line endings to Windows ones.
How we do this depends on whether we want every output file to have the same number of lines. If we do, we can use the split command; running split -l 1000 foo.txt will split foo.txt into lots of separate files, each containing 1,000 lines, and each named uniquely.
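By default, split names its output files xaa, xab, and so on. If you'd like a more descriptive prefix, you can pass one as an extra argument (the chunk- prefix below is just an example):
$ split -l 1000 foo.txt chunk-
That gives you chunk-aa, chunk-ab and so on, still 1,000 lines each.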
If we want to split the file into arbitrarily sized chunks, it gets a bit harder, and we need to do it in multiple steps. First we use head to get the first N lines of the file, then tail to get the remainder. This lets us split the file in two, and we can then repeat the process on the remainder file to keep splitting. So let's say we need to split the file into 1,000 lines, then 2,000 lines, then 1,500 lines. We'd do:
$ head -n 1000 file.txt > split-1.txt
$ tail -n +1001 file.txt > remainder-1.txt
$ head -n 2000 remainder-1.txt > split-2.txt
$ tail -n +2001 remainder-1.txt > remainder-2.txt
$ head -n 1500 remainder-2.txt > split-3.txt
$ rm remainder-?.txt
After running these commands we should have three files (split-1.txt, split-2.txt and split-3.txt) containing the splits that we wanted. Beware of fencepost errors here: the line number we pass to tail with -n + is one past the last line head took, so it's N + 1, not N.
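If you have more than a handful of splits to make, the same head-and-tail dance can be wrapped in a small shell loop. This is only a sketch; the sizes and filenames below are made up to match the example above:
sizes="1000 2000 1500"
rest=file.txt
i=1
for n in $sizes; do
  # take the first $n lines of whatever is left
  head -n "$n" "$rest" > "split-$i.txt"
  # and stash everything after line $n for the next pass
  tail -n "+$((n + 1))" "$rest" > "remainder-$i.txt"
  rest="remainder-$i.txt"
  i=$((i + 1))
done
rm remainder-*.txt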
Next we need to transpose the rows into columns. The easiest way to do this is with a script. I wrote one that does this, and you can install it with:
curl 'https://gist.githubusercontent.com/robmiller/0243ea79350c339e7e2a/raw/f9614a2d90580420f633745e55c9cc8efdac9858/rows2cols' > ~/bin/rows2cols && chmod +x ~/bin/rows2cols
Once it's installed, you can use it as follows:
$ rows2cols -c COLUMNS -s SEPARATOR FILENAME
Where COLUMNS is the number of columns you want on each line, SEPARATOR is the separator character, and FILENAME is the filename to operate on.
So, to split our first file into 4 columns per line, separated by a comma, we'd call it as:
$ rows2cols -c4 -s, split-1.txt > split-1-4cols.txt
(I always try to use new filenames for each stage in the process, so that it's not destructive; if I fuck up, I can just rm split-1-4cols.txt and try again. Never overwrite the original files.)
We can repeat this for each of our files, and it should just work.
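If you'd rather not install a script at all, the standard paste utility can do the same job for a fixed number of columns; this is just an alternative, not what rows2cols does internally. Each - tells paste to read one more line from standard input, so four dashes give four comma-separated columns per row:
$ paste -d, - - - - < split-1.txt > split-1-4cols.txt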
The final step, converting the line endings, is straightforward. A utility called unix2dos ships with most Linux distributions but not OS X; however, we can install it with Homebrew:
$ brew install unix2dos
We then just pass it the filenames we want to convert:
$ unix2dos split-1.txt split-2.txt split-3.txt
And it will convert them in-place. It also comes with a dos2unix command for going the other way, as well as utilities for converting to and from Mac line endings (\r).
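If installing unix2dos isn't an option, a quick one-liner can do the conversion instead. This is just a sketch using perl, and it writes to a new file rather than converting in place (repeat for the other splits):
$ perl -pe 's/\n/\r\n/' split-1.txt > split-1-dos.txt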
We now want to verify three things: first, that the number of lines in each file is correct; second, that the total number of lines across all the files is correct; and third, that the total number of unique lines across all the files is correct.
We can do this as follows. For lines per file and total lines we can use the wc ("word count") utility, passing it -l to tell it to count lines rather than words:
$ wc -l split-?.txt
For total unique lines, we need a few more commands:
$ cat split-?.txt | sort | uniq | wc -l
Here we concatenate the files with cat, then pass them to sort to be sorted alphabetically. Then uniq removes any duplicates (it only removes adjacent duplicate lines, which is why we sort first), and finally wc -l tells us how many lines are left at that point. If there were duplicate lines, we'll see a number lower than the total number of lines above.
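To save eyeballing the numbers, you can compare the two counts directly. A rough sketch, assuming the split-?.txt filenames from above (sort -u is equivalent to sort | uniq):
total=$(cat split-?.txt | wc -l)
unique=$(sort -u split-?.txt | wc -l)
# the two counts should match if every line is unique
if [ "$total" -eq "$unique" ]; then
  echo "no duplicate lines"
else
  echo "$((total - unique)) duplicate line(s)"
fi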
Ta-da! That's it. Lots of little reusable scripts combined to perform a more complex task. Woohoo Unix!
Further reading:
An introduction to pipelines: http://www.december.com/unix/tutor/pipesfilters.html
The GNU coreutils manual: https://www.gnu.org/software/coreutils/manual/coreutils.html#toc_Output-of-entire-files
Sculpting text with regex, grep, sed, awk, emacs and vim: http://matt.might.net/articles/sculpting-text/