Last active
June 21, 2022 17:05
Bash script to remove Arabic diacritics from UTF-8 or Windows-1256 / iso-8859-1 encoded text
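# Bash script to remove Arabic diacritics from UTF-8 or Windows-1256 / iso-8859-1 encoded text
# - Converts the Arabic comma and semicolon to a Latin comma (UTF-8 input;
#   the single-byte branch replaces them with a space instead)
# - Removes diacritic marks (U+064B to U+065E)
# - Replaces runs of spaces with a single space
# - Replaces Alif-with-hamza and Alif-with-madda with plain Alif
#
# Example: removeArabicDialects my_utf8.txt > clear.txt
# Install: Copy this gist into your ~/.bashrc
# Author: Tarek Eldeeb
#
removeArabicDialects () {
  # `file -bi` prints the MIME type and charset; any "utf" match is treated as UTF-8.
  if [[ $(file -bi "$1" | grep -c utf) -gt 0 ]] ; then
    # The bracket expressions below need a UTF-8 locale so that sed sees whole characters.
    sed "s/[$(echo -ne '\u060C\u061B')]/,/g" "$1" | \
    sed "s/[$(echo -ne '\u064B-\u065E')]//g" | \
    sed "s/ \+/ /g" | \
    sed "s/[$(echo -ne '\u0622\u0623\u0625')]/$(echo -ne '\u0627')/g";
  else
    # Single-byte input, assumed to be Windows-1256; tr works on raw bytes.
    # 0xA1 and 0xBA are the Arabic comma and semicolon. The trailing 't' is kept
    # from the original and replaces the letter t; it may have been meant as '\t'.
    tr $'\xA1\xBA.,:t' ' ' < "$1" | \
    tr -d '\356-\377\327\334\340\342\347-\353' | \
    sed "s/ \+/ /g" | \
    tr $'\xc5\xc2\xc3' $'\xc7';
  fi;
}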
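
The UTF-8 branch above relies on sed running in a UTF-8 locale, so that a range such as U+064B to U+065E is read as whole characters. As a rough sketch only, and not part of the original gist, the same four steps can be written against the raw UTF-8 bytes so they do not depend on the locale. The function name `strip_arabic_diacritics_bytes` is hypothetical, and the sketch assumes GNU sed and valid UTF-8 on standard input.

```shell
#!/usr/bin/env bash
# Hypothetical byte-level variant of the UTF-8 branch (not from the original gist).
# Assumes GNU sed and valid UTF-8 input on stdin. LC_ALL=C makes sed treat every
# byte as one character, so the bracket expressions match single bytes.
strip_arabic_diacritics_bytes () {
  # U+060C, U+061B           = D8 8C, D8 9B        -> Latin comma
  # U+064B..U+065E           = D9 8B .. D9 9E      -> removed
  # runs of spaces                                 -> single space
  # U+0622, U+0623, U+0625   = D8 A2, D8 A3, D8 A5 -> U+0627 (D8 A7)
  LC_ALL=C sed \
    -e $'s/\xd8[\x8c\x9b]/,/g' \
    -e $'s/\xd9[\x8b-\x9e]//g' \
    -e 's/  */ /g' \
    -e $'s/\xd8[\xa2\xa3\xa5]/\xd8\xa7/g'
}

# Usage (reads stdin, writes stdout):
#   strip_arabic_diacritics_bytes < my_utf8.txt > clear.txt
```

Because 0xD8 and 0xD9 only ever appear as lead bytes in valid UTF-8, the byte patterns cannot match across a character boundary. The sketch reads standard input rather than a file name, so encoding detection with `file` stays outside it.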