Skip to content

Instantly share code, notes, and snippets.

@farrokhi
Last active October 28, 2019 12:05
Show Gist options
  • Save farrokhi/79a6f58aa2f619c13efb94b470c10cf3 to your computer and use it in GitHub Desktop.
Save farrokhi/79a6f58aa2f619c13efb94b470c10cf3 to your computer and use it in GitHub Desktop.
BSD sort vs GNU sort

dataset size

# wc -l bigdata-random.csv
 13013886 bigdata-random.csv

tools

# sort --version
2.3-FreeBSD

# gsort --version
sort (GNU coreutils) 8.25

BSD sort

# time sort bigdata-random.csv > /dev/null
53.018u 2.188s 0:55.20 99.9%    76+172k 0+0io 0pf+0w

# time sort --radixsort bigdata-random.csv > /dev/null
45.493u 2.212s 0:47.70 100.0%   76+172k 0+0io 0pf+0w

# time sort --mergesort bigdata-random.csv > /dev/null
80.036u 1.858s 1:21.89 99.9%    76+172k 0+0io 0pf+0w

# time sort --qsort bigdata-random.csv > /dev/null
76.197u 1.732s 1:17.93 99.9%    76+172k 0+0io 0pf+0w

# time sort --heapsort bigdata-random.csv > /dev/null
124.235u 1.732s 2:05.97 99.9%   76+172k 0+0io 0pf+0w

GNU sort

# time gsort bigdata-random.csv > /dev/null
49.379u 6.474s 0:19.62 284.6%   187+172k 0+0io 0pf+0w

# time gsort --parallel=8 bigdata-random.csv > /dev/null
48.868u 6.843s 0:19.33 288.1%   188+172k 0+0io 0pf+0w

BSD sort with fancy options

# time sort --mmap --radixsort bigdata-random.csv > /dev/null
50.167u 2.023s 0:52.19 99.9%    76+172k 0+0io 0pf+0w

# time sort -s --radixsort bigdata-random.csv > /dev/null
53.176u 2.322s 0:55.50 99.9%    76+172k 0+0io 0pf+0w

Notes and verdict

output is sent to null and input file is assumed to be memory cached to avoid I/O latency (time output indicates no I/O wait). Dataset consists of over 13 million lines (~890MB) of already randomized csv records. In all combinations, BSD sort with --radixsort option without any fancy additions seems to be the fastest.

@davehowell
Copy link

Nice tests!

Another thing to try is LC_ALL=C sort ... because the C locale is trivial, see the man pages and also https://unix.stackexchange.com/a/120108/249511

Note that radixsort cannot be used for numeric or month sort

For the stable sort -s fancy option, the man pages says

maintains the original record order of records that have an equal key

So it is not going to help much unless you use -k with a restricted set of columns. Otherwise you will be sorting on all columns anyway. It will guarantee sort order between subsequent runs no matter what but without restricting the keys-to-sort-on there will be no performance side-effect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment