The problem:
- 1.3TB data with 5B lines in a 72GB .gz file
- Need to sort the lines and get a count for each unique line, basically a sort | uniq -c
- Have a machine with 24 cores, 128GB of memory, but not 1.3TB of free disk space
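For scale, the naive version of the job would be something like the one-liner below (a sketch, not from the original setup; file names are made up). The last constraint is the blocker: plain sort spills uncompressed temporary files roughly the size of the decompressed input, so it would need on the order of 1.3TB of scratch space.

# naive approach: needs ~1.3TB of temp space for sort's spill files
zcat input.gz | sort | uniq -c | gzip > counts.gz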
The solution: sort | uniq -c with lots of non-standard options and pigz to take care of compression. Here's the sort part; uniq I used as usual.
INPUT=$1
OUTPUT=${INPUT%.gz}.sorted.gz
# byte-wise collation is much faster than locale-aware sorting
export LC_ALL=C
export LC_COLLATE=C
# decompress, sort with a big buffer and 20 threads (SSD temp dir, compressed temp files), recompress
pigz -d -c $INPUT -p 4 | sort -S 50G --parallel 20 -T /mnt/ssd/tmp --compress-program "./pigz.sh" | pigz -b 2048 -p 20 > $OUTPUT
where pigz.sh is just
pigz -b 2048 -p 20 $*
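For reference, the whole wrapper file probably looks something like the sketch below (the shebang and exec are additions; the pigz invocation is the one above). sort runs its compress program with no arguments to compress a temporary file from stdin to stdout, and with -d to decompress it, so passing the arguments straight through covers both cases.

#!/bin/sh
# called by sort: no arguments means compress stdin to stdout,
# -d means decompress; either way pigz does the right thing
exec pigz -b 2048 -p 20 $*

The script needs to be executable (chmod +x pigz.sh) and, since it's referenced as ./pigz.sh, it has to sit in the directory sort is started from.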
The options are:
- -S gives a huge buffer for sort to work with (it only takes a few seconds to sort this amount of data!)
- --parallel makes sort quite a bit faster if you have the cores to spare
- -T places the temporary files sort produces onto an SSD drive we happen to have
- --compress-program tells sort to compress these temporary files using pigz.sh, which is just a wrapper script around pigz
- pigz -p 20 uses up to 20 cores to compress the data
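The uniq part isn't shown here; as a rough sketch (file names are assumptions), it's just one more streaming pass over the sorted output. Because uniq -c only collapses adjacent duplicates, it runs in constant memory and needs no temporary space.

# stream the sorted data through uniq -c and compress the counts
pigz -d -c output.sorted.gz -p 4 | uniq -c | pigz -p 20 > counts.gz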