@renatoargh
Last active March 14, 2018 21:36
Huge file comparison with redis (you'd better use linux `diff` instead)
#! /bin/bash
set -e
set -u
# redis-cli --raw FLUSHALL >/dev/null 2>&1 # uncomment to clear the database
# md5 hash of a single line, used as the redis key
function hash () {
  printf %s "$1" | md5sum | cut -f1 -d' '
}
# file 1: store the hash of every line in redis
while read -r line; do
  redis-cli --raw SET "$(hash "$line")" 1 >/dev/null 2>&1
done < "$1"
# file 2: print every line whose hash was not stored from file 1
while read -r line; do
  RESULT=$(redis-cli --raw GET "$(hash "$line")")
  if [[ -z "$RESULT" ]]; then
    printf "%s\n" "$line"
  fi
done < "$2"
# redis-cli --raw FLUSHALL >/dev/null 2>&1 # uncomment to clear the database
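As the description itself hints, for inputs that fit comfortably on local disk the same hash-set membership idea can be sketched without Redis at all, using a temporary file of hashes. The file names below (`a.txt`, `b.txt`, `hashes.txt`) are illustrative, not part of the gist:

```shell
#! /bin/bash
set -eu

# sample inputs (hypothetical)
printf '%s\n' alpha beta gamma > a.txt
printf '%s\n' beta delta alpha > b.txt

hash () {
  printf %s "$1" | md5sum | cut -f1 -d' '
}

# file 1: record every line's hash
: > hashes.txt
while read -r line; do
  hash "$line" >> hashes.txt
done < a.txt

# file 2: print lines whose hash was never recorded
while read -r line; do
  if ! grep -qFx "$(hash "$line")" hashes.txt; then
    printf '%s\n' "$line"
  fi
done < b.txt > only-in-b.txt
```

With the sample data above, `only-in-b.txt` ends up containing only `delta`, the one line of file 2 absent from file 1. The trade-off versus Redis is memory and lookup cost: `grep` rescans the hash file per line, which is fine for small files but exactly what the Redis keyspace avoids for huge ones.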

renatoargh commented Mar 14, 2018

Split the big file into multiple smaller files that can be processed in parallel, one per CPU core

CHUNKS=$(awk '/^cpu cores/{print $4}' /proc/cpuinfo | awk '{s+=$1} END {printf "%.0f", s}')
split -d --number=l/"$CHUNKS" "$1" part- # l/N splits into N chunks without breaking lines
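One way to fan those `part-NN` chunks out across the cores is `xargs -P`. This is a self-contained sketch where each job merely counts lines; in the gist each job would instead run the Redis loader over its chunk (the file names `big.txt`, `counts.txt` are illustrative):

```shell
#! /bin/bash
set -eu

seq 1 100 > big.txt                        # sample input (hypothetical)
CHUNKS=$(nproc)                            # one chunk per core
split -d --number=l/"$CHUNKS" big.txt part-

# run one job per chunk, at most $CHUNKS at a time
ls part-* | xargs -P "$CHUNKS" -n 1 wc -l > counts.txt

awk '{s+=$1} END {print s}' counts.txt     # all 100 lines accounted for
```

`xargs -n 1` hands each chunk to its own job and `-P` caps the number running concurrently; because `split` used the `l/N` form, no line straddles two chunks, so the per-chunk results can simply be combined.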
