# wc -l bigdata-random.csv
13013886 bigdata-random.csv
# sort --version
2.3-FreeBSD
# gsort --version
sort (GNU coreutils) 8.25
# time sort bigdata-random.csv > /dev/null
53.018u 2.188s 0:55.20 99.9% 76+172k 0+0io 0pf+0w
# time sort --radixsort bigdata-random.csv > /dev/null
45.493u 2.212s 0:47.70 100.0% 76+172k 0+0io 0pf+0w
# time sort --mergesort bigdata-random.csv > /dev/null
80.036u 1.858s 1:21.89 99.9% 76+172k 0+0io 0pf+0w
# time sort --qsort bigdata-random.csv > /dev/null
76.197u 1.732s 1:17.93 99.9% 76+172k 0+0io 0pf+0w
# time sort --heapsort bigdata-random.csv > /dev/null
124.235u 1.732s 2:05.97 99.9% 76+172k 0+0io 0pf+0w
# time gsort bigdata-random.csv > /dev/null
49.379u 6.474s 0:19.62 284.6% 187+172k 0+0io 0pf+0w
# time gsort --parallel=8 bigdata-random.csv > /dev/null
48.868u 6.843s 0:19.33 288.1% 188+172k 0+0io 0pf+0w
# time sort --mmap --radixsort bigdata-random.csv > /dev/null
50.167u 2.023s 0:52.19 99.9% 76+172k 0+0io 0pf+0w
# time sort -s --radixsort bigdata-random.csv > /dev/null
53.176u 2.322s 0:55.50 99.9% 76+172k 0+0io 0pf+0w
Output is sent to /dev/null and the input file is assumed to be cached in memory to avoid I/O latency (the time output indicates no I/O waits). The dataset consists of over 13 million lines (~890MB) of already-randomized CSV records.
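The same warm-cache methodology can be sketched portably on a smaller synthetic file (the file path and line count below are arbitrary placeholders, not from the benchmark):

```shell
# Generate a small random dataset (100k lines of random integers).
awk 'BEGIN { srand(42); for (i = 0; i < 100000; i++) print int(rand() * 1e9) }' > /tmp/rand.csv

# Warm the page cache by reading the file once before timing.
cat /tmp/rand.csv > /dev/null

# Time only the sort; discarding output removes write-side I/O cost.
time sort /tmp/rand.csv > /dev/null
```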
Across all combinations, BSD sort with the --radixsort option and no fancy additions is the fastest single-threaded run; GNU sort only finishes sooner on wall-clock time because it spreads the work across multiple threads (~285% CPU in the runs above).
Nice tests!
Another thing to try is

LC_ALL=C sort ...

because the C locale is trivial; see the man pages and also https://unix.stackexchange.com/a/120108/249511.

Note that radixsort cannot be used for numeric or month sort.
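A quick way to see why the C locale is cheaper: it compares raw bytes, so its collation also differs from language-aware locales (the sample file below is just an illustration):

```shell
# Sample input with mixed case.
printf 'b\nA\na\nB\n' > /tmp/demo.txt

# In the C locale, comparison is plain byte order, so uppercase
# ASCII letters (A=65, B=66) sort before lowercase (a=97, b=98).
LC_ALL=C sort /tmp/demo.txt
# A
# B
# a
# b

# Applied to the benchmark above, the suggestion amounts to:
# LC_ALL=C sort bigdata-random.csv > /dev/null
```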
For the stable sort fancy option, -s, the man page says it stabilizes the sort by disabling the last-resort comparison. So it is not going to help much unless you use -k with a restricted set of key columns; otherwise you will be sorting on all columns anyway. It will guarantee a consistent order between subsequent runs no matter what, but without restricting the keys to sort on there will be no performance side effect.
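A small illustration of that point (the sample data is made up): with keys restricted via -k, -s changes how ties between equal keys are resolved; without -k there are no ties left for it to affect.

```shell
# Two records with an equal first field, in deliberate input order.
printf '1 beta\n1 alpha\n' > /tmp/stable.txt

# Without -s, equal -k1,1 keys fall back to a whole-line comparison.
sort -k1,1 /tmp/stable.txt
# 1 alpha
# 1 beta

# With -s, the last-resort comparison is disabled, so records with
# equal keys keep their original input order.
sort -s -k1,1 /tmp/stable.txt
# 1 beta
# 1 alpha
```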