I needed to generate an English wordlist. SCOWL (Spell Checker Oriented Word Lists) comes with a build script, mk-list, that lets you customize which words are included (see the README). Below is a script that generates a list for each supported size. The output of mk-list seems to be ISO-8859-1 encoded with DOS line endings, so the script converts it to UTF-8 with UNIX line endings.
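To see what that conversion does in isolation, here is a minimal sketch on a single sample word. The function name to_utf8_unix and the sample word are illustrative and not part of SCOWL; the byte \351 is "é" in ISO-8859-1.

```shell
#!/bin/bash
# Hypothetical helper (not part of SCOWL): ISO-8859-1 + CRLF -> UTF-8 + LF
to_utf8_unix() {
  iconv -f ISO-8859-1 -t UTF-8 | tr -d '\r'
}

# "café" as ISO-8859-1 bytes (\351 is é) followed by a DOS line ending
printf 'caf\351\r\n' | to_utf8_unix | od -An -tx1
# The é should appear as the two-byte UTF-8 sequence c3 a9,
# and there should be no 0d byte before the final 0a.
```

Dumping the bytes with od makes the change visible even in a terminal that would render both encodings the same way.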
#!/bin/bash
# Possessive duplicates ("$word" and "$word's") are stripped by default.
# Pass -p as the first arg to retain them.
if [ "$1" == "-p" ]; then
  POSSESSIVE=1
else
  POSSESSIVE=
fi
for SIZE in 10 20 35 40 50 55 60 70 80 95
do
  SCOWL_FILE="scowl-words-$SIZE.txt"
  # mk-list output looks like ISO-8859-1 with DOS line endings;
  # convert it to UTF-8 with UNIX line endings.
  perl mk-list english "$SIZE" |
    iconv -f ISO-8859-1 -t UTF-8 |
    tr -d '\r' |
    if [[ -z "$POSSESSIVE" ]]; then
      # Strip a trailing 's, then drop the duplicates this creates.
      sed -E "s/'s$//" | sort -u
    else
      cat
    fi > "$SCOWL_FILE"
  # Read from stdin so wc prints only the count; trim any padding.
  SCOWL_WORDS=$(wc -l < "$SCOWL_FILE" | tr -d ' ')
  echo "Created '$SCOWL_FILE' ($SCOWL_WORDS words)"
done
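The possessive handling is easiest to understand on a tiny input. The sketch below wraps the same sed and sort -u stage in a function; the name strip_possessives and the sample words are made up for illustration, and LC_ALL=C is added here only so the sort order is predictable (the script above sorts in whatever locale is active).

```shell
#!/bin/bash
# Hypothetical helper (not part of SCOWL): remove a trailing 's from each
# line, then sort and de-duplicate the result.
strip_possessives() {
  sed -E "s/'s$//" | LC_ALL=C sort -u
}

# "cat" and "cat's" collapse into one entry; "dog's" becomes "dog" even
# though "dog" was not in the input; "o'clock" is left alone because the
# apostrophe is not part of a trailing 's.
printf '%s\n' "cat" "cat's" "dog's" "o'clock" | strip_possessives
```

Note that the stage does not check whether the base word is present, so a word that only occurs in its possessive form still ends up in the list as its base form.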
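Running the script prints the word count for each size, which may be useful when choosing one. The first run below strips possessives (the default) and the second keeps them with -p: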
./mk-scowl-dict.sh
Created 'scowl-words-10.txt' (3969 words)
Created 'scowl-words-20.txt' (10746 words)
Created 'scowl-words-35.txt' (38351 words)
Created 'scowl-words-40.txt' (43394 words)
Created 'scowl-words-50.txt' (70703 words)
Created 'scowl-words-55.txt' (76203 words)
Created 'scowl-words-60.txt' (86102 words)
Created 'scowl-words-70.txt' (126304 words)
Created 'scowl-words-80.txt' (273607 words)
Created 'scowl-words-95.txt' (501583 words)
./mk-scowl-dict.sh -p
Created 'scowl-words-10.txt' (4405 words)
Created 'scowl-words-20.txt' (12359 words)
Created 'scowl-words-35.txt' (48853 words)
Created 'scowl-words-40.txt' (55962 words)
Created 'scowl-words-50.txt' (98984 words)
Created 'scowl-words-55.txt' (105210 words)
Created 'scowl-words-60.txt' (119259 words)
Created 'scowl-words-70.txt' (161369 words)
Created 'scowl-words-80.txt' (333866 words)
Created 'scowl-words-95.txt' (644673 words)
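To confirm that the generated files really are UTF-8 with UNIX line endings, a check along these lines could be run afterwards. This is a sketch, not part of SCOWL: the function name check_wordlist is made up, and it assumes that iconv exits with a non-zero status when its input is not valid in the source encoding.

```shell
#!/bin/bash
# Hypothetical check (not part of SCOWL): succeeds only if the file is
# valid UTF-8 and contains no carriage returns.
check_wordlist() {
  local file=$1
  # Re-encoding UTF-8 to UTF-8 fails on byte sequences that are not valid UTF-8.
  iconv -f UTF-8 -t UTF-8 "$file" > /dev/null 2>&1 || return 1
  # Any remaining \r means a DOS line ending survived.
  if LC_ALL=C grep -q $'\r' "$file"; then
    return 1
  fi
  return 0
}

for f in scowl-words-*.txt; do
  [ -e "$f" ] || continue   # skip if the glob matched nothing
  if check_wordlist "$f"; then
    echo "OK   $f"
  else
    echo "FAIL $f"
  fi
done
```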
Works, thanks!