@darthtrevino
Last active November 16, 2015 06:36
File Sorting Using Shell-Based MapReduce
#!/bin/sh
# Split a large file into fixed-size chunks, sort the chunks in
# parallel background jobs, then merge the sorted results.
TMP=$(mktemp -d)
trap 'rm -rf "$TMP"' EXIT

# Split the input into 5000-line chunks named $TMP/chunk.*
split -l 5000 "$1" "$TMP/chunk."

# Sort each chunk in the background so the sorts run concurrently,
# then wait for all of them to finish.
for file in "$TMP"/chunk.*
do
    sort "$file" > "$file.sorted" &
done
wait

# Merge the already-sorted chunks to standard output.
sort -m "$TMP"/*.sorted
@darthtrevino (Author)

I'm reading through 'The Unix Programming Environment' and wanted to write a shell script to work on my shell chops. This script splits a large file into a series of smaller chunks in a temp directory, sorts those chunks in parallel, and then merges the sorted chunks back together.
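For reference, a minimal usage sketch (the script name parallel_sort.sh and the input file name are placeholders, not part of the gist):

chmod +x parallel_sort.sh
./parallel_sort.sh big_input.txt > big_input.sorted.txt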

I tested this on a MacBook Pro circa 2012 or so with a Core i7 processor and 16 GB of RAM. When operating on a 164 MB file containing Moby Dick pasted several times, there wasn't much of a difference between this and the standard sort command. After reading up on it, it appears that sort is implemented using an external R-way mergesort (http://vkundeti.blogspot.com/2008/03/tech-algorithmic-details-of-unix-sort.html), so my manual data-splitting is redundant.
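For comparison, GNU coreutils sort can be asked to parallelize and to use a larger in-memory buffer directly; a rough sketch, assuming a GNU sort is available (these flags are not POSIX and may be missing from older BSD sorts):

sort --parallel=4 -S 1G big_input.txt > big_input.sorted.txt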

This implementation may make a difference if there's an upper bound to sort's ability to handle very large files. I may test that at a later point.
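A simple way to check would be to time both approaches on the same file; a rough sketch, again assuming the gist script is saved as parallel_sort.sh and big_input.txt is the test file:

time sort big_input.txt > /dev/null
time ./parallel_sort.sh big_input.txt > /dev/null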
