Created July 14, 2022 09:42
Create custom user dictionary for use in MeCab through Fugashi
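# Generate custom MeCab dictionary to be used with unidic-lite
import sys
from fugashi.fugashi import build_dictionary
import unidic_lite
# Assumption: the first operand was lost from the source; sys.argv[0] is used as the program-name slot
args = (
    sys.argv[0]
    + f" -f utf8 -t utf8 -d {unidic_lite.DICDIR} -u custom.dic custom_entries.csv"
)
# Assumption: the build call itself is missing from the source; this writes custom.dic
build_dictionary(args)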
# REMOVE COMMENTS FIRST (MeCab doesn't allow comments)
# see for a description of the fields
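Since MeCab rejects comment lines, one option is to keep an annotated copy of the entries and strip the comments just before building. The helper below is a minimal sketch of that step, not part of the original gist: the name `strip_comments` is hypothetical, and it assumes that comments are whole lines starting with `#` and that no real entry begins with `#`.

```python
from pathlib import Path


def strip_comments(src: str, dst: str) -> int:
    """Copy src to dst without blank lines or '#' comment lines.

    Assumption: comments occupy whole lines starting with '#', and no
    dictionary entry has a surface form that begins with '#'.
    Returns the number of entry lines written.
    """
    kept = [
        line
        for line in Path(src).read_text(encoding="utf-8").splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    # MeCab reads one entry per line; end every line with a newline.
    Path(dst).write_text("".join(f"{line}\n" for line in kept), encoding="utf-8")
    return len(kept)
```

For example, `strip_comments("custom_entries.annotated.csv", "custom_entries.csv")` would produce the comment-free file that the build script reads; the annotated file name is only an illustration.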
from fugashi import Tagger # type: ignore
# use the generated custom dictionary (plus unidic-lite)
TAGGER = Tagger("-Owakati -u custom.dic")
print(TAGGER("皆様こんにちは。本日はですね とっても特別なお客様にお伺いしたいと思いま〜す"))
# expected output: [皆, 様, こんにちは, 。, 本日, は, です, ね,  , とっても, 特別, な, お, 客, 様, に, お, 伺い, し, たい, と, 思い, ま, 〜, す]