The problem:
- 1.3 TB of data with 5 billion lines in a 72 GB .gz file
- Need to sort the lines and get a count for each unique line, basically a `sort | uniq -c`
- Have a machine with 24 cores and 128 GB of memory, but not 1.3 TB of free disk space
The solution: `sort | uniq -c` with lots of non-standard options, plus `pigz` to take care of the compression.
Here's the sort part; the `uniq` step I ran as usual (sketched at the end).
#!/bin/bash
INPUT="$1"
OUTPUT="${INPUT%.gz}.sorted.gz"
# Byte-wise collation: much faster than locale-aware sorting
export LC_ALL=C
export LC_COLLATE=C
pigz -d -c -p 4 "$INPUT" | sort -S 50G --parallel 20 -T /mnt/ssd/tmp --compress-program "./pigz.sh" | pigz -b 2048 -p 20 > "$OUTPUT"
where pigz.sh is just
#!/bin/sh
exec pigz -b 2048 -p 20 "$@"
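The wrapper works because of the interface GNU `sort` expects from `--compress-program`: the program is run with no arguments to compress stdin to stdout, and with `-d` to decompress. The `"$@"` simply forwards that `-d` on to `pigz`. A rough illustration (the `chunk.gz` filename is only for the demo):

# Compression: sort runs the wrapper with no arguments, data on stdin/stdout
printf 'some temporary chunk\n' | ./pigz.sh > chunk.gz
# Decompression: sort appends -d, which the wrapper passes straight to pigz
./pigz.sh -d < chunk.gz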
The options are:
- `-S` gives a huge buffer for `sort` to work with (it only takes a few seconds to sort this amount of data!)
- `--parallel` makes `sort` quite a bit faster if you have the cores to spare
- `-T` places the temporary files `sort` produces onto an SSD drive we happen to have
- `--compress-program` tells `sort` to compress these temporary files using `pigz.sh`, which is just a wrapper script around `pigz`
- `pigz -p 20` uses up to 20 cores to compress the data
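And the counting step mentioned at the top is plain `uniq -c` over the sorted stream; a minimal sketch, reusing the variables from the script above (`counts.gz` is just an example output name), with `LC_ALL=C` exported so `uniq` compares bytes the same way `sort` ordered them:

# Count how many times each unique line occurs in the sorted output
pigz -d -c -p 4 "$OUTPUT" | uniq -c | pigz -p 20 > counts.gz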