KLPT test
Gist mhmd-azeez/21770e507931bf3eb965ade9d5eebecf, last active January 6, 2022.
Output of the script below (token -> stem):

```
▁ئێمە▁ش -> ▁ئێمە▁
دەچینەوە -> چ
▁بۆ▁ -> ▁بۆ▁
ولاتی -> ولات
▁خۆم▁ان -> ▁خۆم▁
لێشیانخۆشبووین -> شیانخۆشبوو
لێشتانخۆشبووم -> شتانخۆشبوو
```
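The stems above carry `▁` characters (U+2581, LOWER ONE EIGHTH BLOCK, commonly used as a subword/morpheme boundary marker). If you only want the bare stem, a minimal stdlib-only helper can strip them; `strip_markers` is a hypothetical name, not part of the KLPT API:

```python
def strip_markers(stem: str) -> str:
    """Remove U+2581 boundary markers from a stemmer output string."""
    return stem.replace("\u2581", "")

# Example: "▁ئێمە▁" becomes the bare stem "ئێمە"
print(strip_markers("▁ئێمە▁"))
```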
```python
from klpt.stem import Stem
from klpt.tokenize import Tokenize
from klpt.preprocess import Preprocess

text = """
ئێمەش دەچینەوە بۆ ولاتی خۆمان
لێشیانخۆشبووین
لێشتانخۆشبووم
"""

# Normalize, standardize, and unify numerals for Sorani Kurdish in Arabic script
preprocessor = Preprocess("Sorani", "Arabic", numeral="Latin")
text = preprocessor.normalize(text)
text = preprocessor.standardize(text)
text = preprocessor.unify_numerals(text)

# Tokenize the preprocessed text and print each token with its first stem candidate
stemmer = Stem("Sorani", "Arabic")
tokenizer = Tokenize("Sorani", "Arabic")
for token in tokenizer.word_tokenize(text):
    print(token, '->', stemmer.stem(token)[0])
```