Exploring permutations and a mystery with BSD and GNU split filenames

Published on: 2018-01-22 05:36:09 at lorainekv.com

Recently, I was playing around with the split command-line tool on Mac OS X, and I decided to chop a 4000-line file into 4000 separate single-line files. However, when I attempted to run split -l1, I ran into a funny error:

split: too many files

Curious to see if any splitting had occurred, I ran ls and sure enough, a huge list of filenames appeared, such as:

xaa
xab
...
xzy
xzz

Now I could see why you'd run out of unique filenames - there are only 26 letters in the alphabet and these filenames were only three letters long. Also, they all seemed to begin with the letter "x".

BSD split's filename defaults

I checked the manual for split's defaults and confirmed what I was seeing:

each file into which the file is split is named by the prefix followed by a lexically ordered suffix using suffix_length characters in the range 'a-z'. If -a is not specified, two letters are used as the suffix....with the prefix 'x' and with suffixes as above.

Got it, so running split with the defaults for prefix name and suffix length will give me filenames that always start with the letter "x" followed by two-letter alphabetical permutations composed of a-z letters, with repeats allowed. I say "repeats allowed" because I noticed filenames such as xaa and xbb in the output.

Side node: The reason why I say "permutations" rather than "combinations" is because letter order matters. For example, xab and xba are two distinct and legitimate filenames. Here's a nice explanation about the difference between permutations and combinations.

Some permutation math

So how many filenames can you get from the BSD split tool using the defaults? There are permutation formulas out there for repeating values and non-repeating values. Based on split's behavior, I wanted to use the repeating values formula:

n^r where n equals the number of possible values (26 for a-z) and r equals the number of values (2, since there are only 2 letters after "x" in the filename).

26^2 = 676

So the total number of filename permutations allowed with BSD split's defaults should be 676.

To double check, I ran ls | wc -l to get the total number of files in my split_test directory. The output was 677. If you subtract my original input file, input.txt, then you have 676, or the number of permutations split would allow before running out of filenames!

Neat. But I still wanted my 4000 files.

Moar permutations pls

While 26^2 permutations doesn't support 4000 different filenames, I wondered if I could increase r to 3. Then, I'd have 17,576 different filename permutations to play with - more than enough.

Earlier, I remembered the manual mentioning suffix length:

-a suffix_length

Use suffix_length letters to form the suffix of the file name.

So I passed 3 in with the -a flag and guess what? I got my 4000 files!

split -l1 -a3 input.txt 
ls | wc -l
4001

But that was a lot of work. It would be great if split would just handle these permutations and suffix lengths by default!

In fact, I vaguely remember splitting large files into smaller ones with numerical filenames, which I prefer. I also remember not having to worry about suffixes in the past. But numerical filenames didn't seem to be an option with split installed on Mac OS X - there was no mention of it in the manual.

Turns out that I was remembering GNU split from using the Debian OS two years ago, a different flavor of the split tool with different defaults and behaviors.

Installing GNU split (and more)

You can snag the GNU version of several command-line utilities using brew install coreutils. For further reading, there are other articles out there comparing more BSD and GNU commands differences like grep and sed. By default, the new commands are installed with a "g" prefix, so I ran gsplit -l1 input.txt to try it out.

No errors! GNU split automatically adjusts the filename suffix length so that I have enough to cover 4000 filenames.

I wanted to take a closer look at the default permutations used by gsplit and compare them to split. So I ran them separately in two different directories and then made a list of all the filenames like this:

ls ~/split_test/ | awk '$1 ~/^x/' | sort > ~/split_test/sorted_filenames.txt
ls ~/gsplit_test/ | awk '$1 ~/^x/' | sort > ~/gsplit_test/sorted_filenames.txt

The above commands list all the files in each test directory, print only the filenames beginning with "x", then sort the output. The filenames are then written to a manifest called sorted_filenames.txt

I double-checked the line counts to ensure they were a complete manifests of both split and gsplit's filenames:

wc -l split_test/sorted_filenames.txt gsplit_test/sorted_filenames.txt 
676 split_test/sorted_filenames.txt
4000 gsplit_test/sorted_filenames.txt

But when I used vimdiff to compare the manifests, I noticed something a little odd.

A mystery! and some experiments

I promised you a mystery, here it is:

While gsplit's permutations contained three-letter filenames starting with x, vimdiff immediately highlighted 26 missing three-letter permutations: xza-xzz. Weird!

This screenshot highlights the missing three-letter permutations from running

vimdiff split_test/sorted_filenames.txt gsplit_test/sorted_filenames.txt

What's the filename pattern?

My assumption was that gsplit somehow knew that it was about to run out of name permutations 26 files in advance, so at 26^r - 26. Then it would preemptively switch to using a new two-letter prefix "xz" with longer, three-letter suffixes. The new pattern would accommodate a total of 17,576 filenames (26^3). Nothing in man gsplit described this automatic prefix and suffix length increase, so I decided to keep playing around.

I created a directory called moar_gsplit_permutations and a much larger input file that was 456,977 lines long, one more than 26^4 permutations. I wanted to check the filenames at 26^3 - 26 and again at 26^4 - 26 to see if the filenames got longer.

To generate my input file:

for n in {1..456977}; do echo a >> input.txt; done

(This took about 40 seconds to write)

Next I split it into 456,977 separate files:

gsplit -l1 input.txt

(Note: Before you do something like this, check your available hard drive space first. :D )

I wandered off to get a snack. The gsplit took nearly 9 minutes to run and generated 1.7 GB of files.

Next I generated a sorted manifest of the filenames:

ls ~/moar_gsplit_permutations/ | awk '$1 ~/^x/' | sort > ~/moar_gsplit_permutations/sorted_filenames.txt

I opened up sorted_filenames.txt and jumped to line 650, to see if the three-letter filenames got exhausted early like before. They did!

...
649 xyy
650 xyz
651 xzaaa
652 xzaab
...

If the pattern holds, then the next increase in filename prefixes and suffixes should happen at line 17550, which is 26 fewer than 26^3 permutations.

Bingo! After line 17550, the filenames shift to xzzaaaa..., with "xzz" as the prefix and a four-letter suffix.

...
17549 xzyzy
17550 xzyzz
17551 xzzaaaa
17552 xzzaaab
...

Next, I jumped to line 456950, which is 26^n - 26. Yet again, there is a pattern shift, where both the prefix and suffix each gain a letter.

...
456949 xzzyzzy
456950 xzzyzzz
456951 xzzzaaaaa
456952 xzzzaaaab
...

Conclusion

Well that's neat!

In summary, gsplit's default file naming behavior is to add a letter to the prefix and suffix of a filename whenever it reaches 26^r - 26 files (with r being the current length of the suffix), so you don't need to worry about running out of filenames (just disk space, haha).

I'm sure someone somewhere has already written about this default naming behavior before, but it was really fun to go through the process and figure out how the naming worked.

In practice, my personal preference is gsplit's numeric suffixes because they're easier for me to navigate. Something like:

gsplit --numeric-suffixes=1 -a4 -l1 input.txt custom_prefix_

That yields some nice, sequential filenames with four-digit suffixes (for my original input of 4000 lines):

custom_prefix_0001
custom_prefix_0002
...
custom_prefix_3999
custom_prefix_4000

Getting out of my numeric naming comfort zone gave me a chance to experiment with the default lexical (alphabetic) naming, which turned out to be really interesting!

blip-lorist/split.md