Last active
January 6, 2022 01:41
geekley/3eb0bb8156ce0bede487b87a220b4be7
Asciify a spell-check dictionary (word list). It selects the words in a .dic file that contain non-ASCII characters and outputs ASCII-only versions of them. https://github.com/streetsidesoftware/cspell/issues/1060#issuecomment-1006199819
#!/usr/bin/env bash
# License: "Zero-Clause BSD" <https://opensource.org/licenses/0BSD>
# Requires perl and the Text::Unidecode module (on Ubuntu, it can be installed with: sudo apt install libtext-unidecode-perl).
# Example usage: asciify-dic "$DIC_NAME.dic" > "$DIC_NAME-asciified.dic"
if [[ "$1" == "--help" ]]; then
    echo "Usage: $(basename "$0") INPUT_FILE > OUTPUT_FILE"
    echo "Asciify a .dic file (list of dictionary words)."
    echo ""
    echo "Generates a file with ASCII-only versions of the words that have non-ASCII chars."
    echo "These additional words can be used to make spell-checking accent-insensitive."
    echo "Comment lines beginning with % are left unchanged."
    exit
fi
# Keep only comment lines and words containing non-ASCII characters
grep -P '^\%|[^\x00-\x7F]' "$1" |
# Transliterate words to ASCII, except in comments
# (unidecode is called in void context here, so it rewrites $_ in place)
perl -C -MText::Unidecode -pe 'next if /^\s*%/; unidecode($_)' |
# Remove duplicate lines, except comments
# ([ \t] is used instead of \s, which not every awk implementation supports)
awk '/^[ \t]*%/ || !seen[$0]++'
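
The last stage of the pipeline can be tried on its own, without perl or Text::Unidecode. The following is a minimal sketch, not part of the original script: the function name `dedupe_keep_comments` and the sample words are made up for illustration, and the pattern uses `[ \t]` as a portable stand-in for `\s`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the original script): drop repeated lines,
# but pass every comment line (optional blanks, then %) through untouched.
dedupe_keep_comments() {
    # The left side of || short-circuits for comments, so they are never
    # recorded in seen[] and are always printed.
    awk '/^[ \t]*%/ || !seen[$0]++'
}

# Made-up sample input: a repeated word and a repeated comment line.
result="$(printf '%s\n' '% section' 'cafe' 'cafe' '% section' 'naive' | dedupe_keep_comments)"
printf '%s\n' "$result"
```

Both `% section` lines survive while the second `cafe` is dropped, which is why comments that repeat in the source dictionary still appear in the asciified output.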