Last active
April 15, 2020 10:42
Convert a text dataset to a wordlist for the Turkish language.
#!/bin/bash
# -*- coding: utf8 -*-
# USAGE: bash dataset_to_wordlist_tr.sh ./input.txt ./output.txt
# This script takes a text file, strips punctuation and digits, converts
# capital letters to lowercase, and removes non-Turkish characters.
# Finally it writes a sorted, unique wordlist for the Turkish language.
# To add or remove allowed characters, edit the character class in the
# sed filter that follows the tr command below.
# Define variables for input and output
dataset=$1
output=$2
# Check inputs
if [[ -z "$dataset" || -z "$output" ]]
then
    echo "No argument supplied. Please give input and output filename." >&2
    exit 1
else
    # Punctuation becomes a space and digits are dropped. Turkish capitals are
    # mapped explicitly (I -> ı, İ -> i) because tr A-Z a-z only handles ASCII.
    sed 's/[[:punct:]]/ /g; s/[0-9]//g' "$dataset" | \
    sed 's/I/ı/g; s/İ/i/g; s/Ş/ş/g; s/Ğ/ğ/g; s/Ü/ü/g; s/Ö/ö/g; s/Ç/ç/g; s/Â/â/g; s/Î/î/g; s/Û/û/g' | \
    tr 'A-Z' 'a-z' | sed 's/[^abcdefghijklmnopqrstuvwxyzşıüğçöâîû ]//g' | \
    tr -s ' ' '\n' | sed '/^[[:space:]]*$/d' | \
    sort -u > "$output"
fi
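
A quick way to check what the pipeline produces is to run the same steps on a few lines of made-up text. The sketch below wraps them in a hypothetical function, `to_wordlist_tr`, which is not part of the script itself. It rests on two assumptions: Turkish capitals are mapped explicitly before `tr`, since `tr A-Z a-z` only converts ASCII letters and would turn `I` into `i` rather than `ı`; and sorting uses `LC_ALL=C` so the order is the same on every machine (byte order, not Turkish alphabetical order).

```shell
#!/bin/bash
# Hypothetical helper for illustration: reads text on stdin and prints one
# cleaned, lowercased word per line, sorted and de-duplicated.
to_wordlist_tr() {
    # Replace punctuation with a space and drop digits.
    sed 's/[[:punct:]]/ /g; s/[0-9]//g' |
    # Turkish case mapping: dotless I -> ı, dotted İ -> i, then other capitals.
    sed 's/I/ı/g; s/İ/i/g; s/Ş/ş/g; s/Ğ/ğ/g; s/Ü/ü/g; s/Ö/ö/g; s/Ç/ç/g' |
    # Lowercase the remaining ASCII capitals.
    tr 'A-Z' 'a-z' |
    # Keep only the allowed letters and spaces.
    sed 's/[^abcdefghijklmnopqrstuvwxyzşıüğçöâîû ]//g' |
    # One word per line; -s squeezes runs of separators.
    tr -s ' ' '\n' |
    # Drop any empty line left at the start of the stream.
    sed '/^$/d' |
    # Byte-order sort (an assumption, chosen for reproducible output).
    LC_ALL=C sort -u
}

# Sample input is invented for this example.
printf 'Işık IŞIK, 2020 İstanbul!\nÇay çay Şeker.\n' | to_wordlist_tr
```

For this sample the function prints `istanbul`, `çay`, `ışık` and `şeker`, one per line. Both spellings of "ışık" collapse into a single entry, and "İstanbul" keeps its dotted `i`.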