@mhmd-azeez
Last active January 6, 2022 21:04
KLPT test
▁ئێمە▁ش -> ▁ئێمە▁
دەچینەوە -> چ
▁بۆ▁ -> ▁بۆ▁
ولاتی -> ولات
▁خۆم▁ان -> ▁خۆم▁
لێشیانخۆشبووین -> شیانخۆشبوو
لێشتانخۆشبووم -> شتانخۆشبوو
from klpt.stem import Stem
from klpt.tokenize import Tokenize
from klpt.preprocess import Preprocess

text = """
ئێمەش دەچینەوە بۆ ولاتی خۆمان
لێشیانخۆشبووین
لێشتانخۆشبووم
"""

# Clean up the Sorani (Arabic script) text before tokenizing.
preprocessor = Preprocess("Sorani", "Arabic", numeral="Latin")
text = preprocessor.normalize(text)
text = preprocessor.standardize(text)
text = preprocessor.unify_numerals(text)

stemmer = Stem("Sorani", "Arabic")
tokenizer = Tokenize("Sorani", "Arabic")

# Print each token next to the first element of the stemmer's result.
for token in tokenizer.word_tokenize(text):
    print(token, '->', stemmer.stem(token)[0])
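
In the output above, several tokens and stems carry the `▁` character (U+2581), which appears to be a separator inserted by the tokenizer. A minimal sketch of stripping it for display, in plain Python with no KLPT calls, could look like the following. The helper name `strip_markers` is hypothetical and not part of KLPT, and treating `▁` purely as a separator is an assumption based on the sample output.

```python
# Hypothetical helper, not part of KLPT: removes the U+2581 marker
# that shows up in the tokenizer output above.
MARKER = "\u2581"  # the "▁" character


def strip_markers(token: str) -> str:
    """Return the token with every U+2581 marker removed."""
    return token.replace(MARKER, "")


if __name__ == "__main__":
    # Tokens copied from the sample output above.
    for token in ["▁ئێمە▁ش", "▁بۆ▁", "▁خۆم▁ان"]:
        print(token, "->", strip_markers(token))
```

Whether the markers should be removed before or after stemming depends on how the stemmer treats them, which this gist does not establish.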