File Sorting Using Shell-Based Mapreduce
#!/bin/sh
TMP=$(mktemp -d)
trap 'rm -rf "$TMP"' EXIT

# Split the input file into 5000-line chunks inside the temp directory.
split -l5000 "$1" "$TMP/chunk."

# Sort each chunk in the background so the chunks are processed in parallel.
for file in "$TMP"/chunk.*
do
	sort "$file" > "$file.sorted" &
done
wait

# Merge the pre-sorted chunks into a single sorted stream on stdout.
sort -m "$TMP"/*.sorted
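A quick way to exercise the script, assuming it is saved as sort-chunks.sh (that filename is my own placeholder, not part of the gist), is to run it against a text file and confirm the output matches a plain sort of the same input:

# Hypothetical invocation; sort-chunks.sh is an assumed name for the script above.
chmod +x sort-chunks.sh
./sort-chunks.sh input.txt > chunked.out
sort input.txt > plain.out
# With the same locale in effect, the two outputs should be identical.
diff chunked.out plain.out && echo "outputs match"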
I'm reading through 'The Unix Programming Environment' and wanted to write some shell script to work on my shell chops. This script splits a large file into a series of smaller chunks in a temp directory. Those chunks are then sorted in parallel and re-merged.
I tested this on a MacBook Pro circa 2012 with a Core i7 processor and 16GB of RAM. When operating on a 164MB file containing Moby Dick pasted several times, there wasn't much of a difference between this and the built-in sort command. After reading up on it, it appears that sort is implemented using an external R-way merge sort (http://vkundeti.blogspot.com/2008/03/tech-algorithmic-details-of-unix-sort.html), so my manual data-splitting is redundant. This implementation may make a difference if there's an upper bound on sort's ability to handle very large files. I may test that at a later point.
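As a sketch (not part of the original gist), a comparison like the one described above can be reproduced by building a large test file and timing both approaches; GNU coreutils sort (as opposed to the stock BSD sort on macOS) also exposes flags for parallelism and buffer size that cover much the same ground as the manual splitting:

# Build a large test file by repeating a source text (paths are illustrative).
for i in $(seq 1 50); do cat mobydick.txt; done > big.txt

# Compare the built-in sort against the chunked script.
time sort big.txt > /dev/null
time ./sort-chunks.sh big.txt > /dev/null

# With GNU sort, parallelism and memory can be tuned directly.
sort --parallel=4 -S 1G big.txt > big.sorted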