# wc -l bigdata-random.csv
13013886 bigdata-random.csv
# sort --version
2.3-FreeBSD
# gsort --version
sort (GNU coreutils) 8.25
# time sort bigdata-random.csv > /dev/null
53.018u 2.188s 0:55.20 99.9% 76+172k 0+0io 0pf+0w
# time sort --radixsort bigdata-random.csv > /dev/null
45.493u 2.212s 0:47.70 100.0% 76+172k 0+0io 0pf+0w
# time sort --mergesort bigdata-random.csv > /dev/null
80.036u 1.858s 1:21.89 99.9% 76+172k 0+0io 0pf+0w
# time sort --qsort bigdata-random.csv > /dev/null
76.197u 1.732s 1:17.93 99.9% 76+172k 0+0io 0pf+0w
# time sort --heapsort bigdata-random.csv > /dev/null
124.235u 1.732s 2:05.97 99.9% 76+172k 0+0io 0pf+0w
# time gsort bigdata-random.csv > /dev/null
49.379u 6.474s 0:19.62 284.6% 187+172k 0+0io 0pf+0w
# time gsort --parallel=8 bigdata-random.csv > /dev/null
48.868u 6.843s 0:19.33 288.1% 188+172k 0+0io 0pf+0w
# time sort --mmap --radixsort bigdata-random.csv > /dev/null
50.167u 2.023s 0:52.19 99.9% 76+172k 0+0io 0pf+0w
# time sort -s --radixsort bigdata-random.csv > /dev/null
53.176u 2.322s 0:55.50 99.9% 76+172k 0+0io 0pf+0w
Output is sent to /dev/null and the input file is assumed to be cached in memory to avoid I/O latency (the time output indicates no I/O waits). The dataset consists of over 13 million lines (~890MB) of already-randomized CSV records.
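The same warm-cache methodology can be sketched portably on a smaller synthetic file (the file path and line count below are arbitrary placeholders, not from the benchmark):

```shell
# Generate a small random dataset (100k lines of random integers).
awk 'BEGIN { srand(42); for (i = 0; i < 100000; i++) print int(rand() * 1e9) }' > /tmp/rand.csv

# Warm the page cache by reading the file once before timing.
cat /tmp/rand.csv > /dev/null

# Time only the sort; discarding output removes write-side I/O cost.
time sort /tmp/rand.csv > /dev/null
```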
Across all combinations, BSD sort with the --radixsort option and no fancy additions is the fastest single-threaded run; GNU sort only finishes sooner on wall-clock time because it spreads the work across multiple threads (~285% CPU in the runs above).
Nice tests!
Another thing to try is

LC_ALL=C sort ...

because the C locale is trivial; see the man pages and also https://unix.stackexchange.com/a/120108/249511.

Note that radixsort cannot be used for numeric or month sort.
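A quick way to see why the C locale is cheaper: it compares raw bytes, so its collation also differs from language-aware locales (the sample file below is just an illustration):

```shell
# Sample input with mixed case.
printf 'b\nA\na\nB\n' > /tmp/demo.txt

# In the C locale, comparison is plain byte order, so uppercase
# ASCII letters (A=65, B=66) sort before lowercase (a=97, b=98).
LC_ALL=C sort /tmp/demo.txt
# A
# B
# a
# b

# Applied to the benchmark above, the suggestion amounts to:
# LC_ALL=C sort bigdata-random.csv > /dev/null
```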
For the stable sort fancy option, -s, the man page says it stabilizes the sort by disabling the last-resort comparison. So it is not going to help much unless you use -k with a restricted set of key columns; otherwise you will be sorting on all columns anyway. It will guarantee a consistent order between subsequent runs no matter what, but without restricting the keys to sort on there will be no performance side effect.
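A small illustration of that point (the sample data is made up): with keys restricted via -k, -s changes how ties between equal keys are resolved; without -k there are no ties left for it to affect.

```shell
# Two records with an equal first field, in deliberate input order.
printf '1 beta\n1 alpha\n' > /tmp/stable.txt

# Without -s, equal -k1,1 keys fall back to a whole-line comparison.
sort -k1,1 /tmp/stable.txt
# 1 alpha
# 1 beta

# With -s, the last-resort comparison is disabled, so records with
# equal keys keep their original input order.
sort -s -k1,1 /tmp/stable.txt
# 1 beta
# 1 alpha
```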