Published on: 2018-01-22 05:36:09 at lorainekv.com
Recently, I was playing around with the split
command-line tool on Mac OS X, and I decided to chop a 4000-line file into 4000 separate single-line files. However, when I attempted to run split -l1
, I ran into a funny error:
split: too many files
Curious to see if any splitting had occurred, I ran ls
and sure enough, a huge list of filenames appeared, such as:
xaa
xab
...
xzy
xzz
Now I could see why you'd run out of unique filenames - there are only 26 letters in the alphabet and these filenames were only three letters long. Also, they all seemed to begin with the letter "x".
I checked the manual for split's defaults and confirmed what I was seeing:
each file into which the file is split is named by the prefix followed by a lexically ordered suffix using suffix_length characters in the range 'a-z'. If -a is not specified, two letters are used as the suffix....with the prefix 'x' and with suffixes as above.
Got it, so running split
with the defaults for prefix name and suffix length will give me filenames that always start with the letter "x" followed by two-letter alphabetical permutations composed of a-z letters, with repeats allowed. I say "repeats allowed" because I noticed filenames such as xaa
and xbb
in the output.
Side node: The reason why I say "permutations" rather than "combinations" is because letter order matters. For example, xab
and xba
are two distinct and legitimate filenames. Here's a nice explanation about the difference between permutations and combinations.
So how many filenames can you get from the BSD split
tool using the defaults? There are permutation formulas out there for repeating values and non-repeating values. Based on split's behavior, I wanted to use the repeating values formula:
n^r where n equals the number of possible values (26 for a-z) and r equals the number of values (2, since there are only 2 letters after "x" in the filename).
26^2 = 676
So the total number of filename permutations allowed with BSD split's defaults should be 676.
To double check, I ran ls | wc -l
to get the total number of files in my split_test
directory. The output was 677. If you subtract my original input file, input.txt
, then you have 676, or the number of permutations split would allow before running out of filenames!
Neat. But I still wanted my 4000 files.
While 26^2 permutations doesn't support 4000 different filenames, I wondered if I could increase r to 3. Then, I'd have 17,576 different filename permutations to play with - more than enough.
Earlier, I remembered the manual mentioning suffix length:
-a suffix_length
Use suffix_length letters to form the suffix of the file name.
So I passed 3 in with the -a
flag and guess what? I got my 4000 files!
split -l1 -a3 input.txt
ls | wc -l
4001
But that was a lot of work. It would be great if split
would just handle these permutations and suffix lengths by default!
In fact, I vaguely remember splitting large files into smaller ones with numerical filenames, which I prefer. I also remember not having to worry about suffixes in the past. But numerical filenames didn't seem to be an option with split
installed on Mac OS X - there was no mention of it in the manual.
Turns out that I was remembering GNU split
from using the Debian OS two years ago, a different flavor of the split
tool with different defaults and behaviors.
You can snag the GNU version of several command-line utilities using brew install coreutils
. For further reading, there are other articles out there comparing more BSD and GNU commands differences like grep
and sed
. By default, the new commands are installed with a "g" prefix, so I ran gsplit -l1 input.txt
to try it out.
No errors! GNU split automatically adjusts the filename suffix length so that I have enough to cover 4000 filenames.
I wanted to take a closer look at the default permutations used by gsplit
and compare them to split
. So I ran them separately in two different directories and then made a list of all the filenames like this:
ls ~/split_test/ | awk '$1 ~/^x/' | sort > ~/split_test/sorted_filenames.txt
ls ~/gsplit_test/ | awk '$1 ~/^x/' | sort > ~/gsplit_test/sorted_filenames.txt
The above commands list all the files in each test directory, print only the filenames beginning with "x", then sort the output. The filenames are then written to a manifest called sorted_filenames.txt
I double-checked the line counts to ensure they were a complete manifests of both split
and gsplit
's filenames:
wc -l split_test/sorted_filenames.txt gsplit_test/sorted_filenames.txt
676 split_test/sorted_filenames.txt
4000 gsplit_test/sorted_filenames.txt
But when I used vimdiff
to compare the manifests, I noticed something a little odd.
I promised you a mystery, here it is:
While gsplit
's permutations contained three-letter filenames starting with x, vimdiff
immediately highlighted 26 missing three-letter permutations: xza-xzz. Weird!
This screenshot highlights the missing three-letter permutations from running
vimdiff split_test/sorted_filenames.txt gsplit_test/sorted_filenames.txt
My assumption was that gsplit
somehow knew that it was about to run out of name permutations 26 files in advance, so at 26^r - 26. Then it would preemptively switch to using a new two-letter prefix "xz" with longer, three-letter suffixes. The new pattern would accommodate a total of 17,576 filenames (26^3). Nothing in man gsplit
described this automatic prefix and suffix length increase, so I decided to keep playing around.
I created a directory called moar_gsplit_permutations
and a much larger input file that was 456,977 lines long, one more than 26^4 permutations. I wanted to check the filenames at 26^3 - 26 and again at 26^4 - 26 to see if the filenames got longer.
To generate my input file:
for n in {1..456977}; do echo a >> input.txt; done
(This took about 40 seconds to write)
Next I split it into 456,977 separate files:
gsplit -l1 input.txt
(Note: Before you do something like this, check your available hard drive space first. :D )
I wandered off to get a snack. The gsplit
took nearly 9 minutes to run and generated 1.7 GB of files.
Next I generated a sorted manifest of the filenames:
ls ~/moar_gsplit_permutations/ | awk '$1 ~/^x/' | sort > ~/moar_gsplit_permutations/sorted_filenames.txt
I opened up sorted_filenames.txt
and jumped to line 650, to see if the three-letter filenames got exhausted early like before. They did!
...
649 xyy
650 xyz
651 xzaaa
652 xzaab
...
If the pattern holds, then the next increase in filename prefixes and suffixes should happen at line 17550, which is 26 fewer than 26^3 permutations.
Bingo! After line 17550, the filenames shift to xzzaaaa
..., with "xzz" as the prefix and a four-letter suffix.
...
17549 xzyzy
17550 xzyzz
17551 xzzaaaa
17552 xzzaaab
...
Next, I jumped to line 456950, which is 26^n - 26. Yet again, there is a pattern shift, where both the prefix and suffix each gain a letter.
...
456949 xzzyzzy
456950 xzzyzzz
456951 xzzzaaaaa
456952 xzzzaaaab
...
Well that's neat!
In summary, gsplit
's default file naming behavior is to add a letter to the prefix and suffix of a filename whenever it reaches 26^r - 26 files (with r being the current length of the suffix), so you don't need to worry about running out of filenames (just disk space, haha).
I'm sure someone somewhere has already written about this default naming behavior before, but it was really fun to go through the process and figure out how the naming worked.
In practice, my personal preference is gsplit
's numeric suffixes because they're easier for me to navigate. Something like:
gsplit --numeric-suffixes=1 -a4 -l1 input.txt custom_prefix_
That yields some nice, sequential filenames with four-digit suffixes (for my original input of 4000 lines):
custom_prefix_0001
custom_prefix_0002
...
custom_prefix_3999
custom_prefix_4000
Getting out of my numeric naming comfort zone gave me a chance to experiment with the default lexical (alphabetic) naming, which turned out to be really interesting!