@ozturkoktay
Last active April 15, 2020 10:42
Convert a text dataset to a wordlist for the Turkish language.
#!/bin/bash
# -*- coding: utf8 -*-
# USAGE: bash dataset_to_wordlist_tr.sh ./input.txt ./output.txt
# This script takes a text file, removes punctuation and digits, converts
# capital letters to lowercase, and removes non-Turkish characters.
# Finally, it writes a sorted, unique wordlist for the Turkish language.
# To add or remove characters, edit the character class in the sed filter
# that lists the allowed letters.
# Define Variables for input and output
dataset=$1
output=$2
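# Check inputs; print the message to stderr and exit non-zero if one is missing
if [[ -z "$dataset" || -z "$output" ]]
then
echo "No argument supplied. Please give input and output filenames." >&2
exit 1
fi
# Replace punctuation and tabs with spaces and drop digits. Then lowercase:
# Turkish capitals are mapped explicitly (I -> ı, İ -> i) because tr A-Z a-z
# only handles ASCII letters. Finally keep only the allowed characters,
# put one word per line and write the sorted unique words.
sed 's/[[:punct:][:space:]]/ /g' "$dataset" | sed 's/[0-9]//g' | \
sed 's/İ/i/g; s/I/ı/g; s/Ş/ş/g; s/Ü/ü/g; s/Ğ/ğ/g; s/Ç/ç/g; s/Ö/ö/g; s/Â/â/g; s/Î/î/g; s/Û/û/g' | \
tr A-Z a-z | sed 's/[^abcdefghijklmnopqrstuvwxyzşıüğçöâîû ]//g' | \
tr -s ' ' '\n' | sed '/^$/d' | sort -u > "$output"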
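
A quick way to sanity-check the cleaning stages is to run them on a couple of sample lines instead of a full dataset. The sketch below wraps stages modelled on the script's pipeline in a function that reads standard input. The function name `to_wordlist` and the sample text are hypothetical, and the explicit mapping of Turkish capitals (I to ı, İ to i, and so on) is an assumption about the intended behaviour, since `tr A-Z a-z` on its own only changes ASCII letters.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): reads text on stdin
# and prints the cleaned, de-duplicated words, one per line.
# Stages: punctuation -> spaces, drop digits, Turkish-aware lowercase
# (assumed intent), keep only allowed letters, split on spaces, sort unique.
to_wordlist() {
  sed 's/[[:punct:]]/ /g' |
    sed 's/[0-9]//g' |
    sed 's/İ/i/g; s/I/ı/g; s/Ş/ş/g; s/Ü/ü/g; s/Ğ/ğ/g; s/Ç/ç/g; s/Ö/ö/g; s/Â/â/g; s/Î/î/g; s/Û/û/g' |
    tr 'A-Z' 'a-z' |
    sed 's/[^abcdefghijklmnopqrstuvwxyzşıüğçöâîû ]//g' |
    tr -s ' ' '\n' |
    sed '/^$/d' |
    sort -u
}

# Sample input with mixed case, punctuation, a digit and a repeated word.
printf 'Işık, IŞIK! 3 kedi.\nİstanbul çok güzel; kedi\n' | to_wordlist
```

For this sample the output should contain the five words çok, güzel, istanbul, kedi and ışık, one per line. The order of the lines depends on the collation rules of the active locale, so it may differ between systems.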