Gist garfieldnate/af522ed545a11fbc170a725be2b7a735, created July 14, 2022 09:42
Create a custom user dictionary for use in MeCab through fugashi
# Generate custom MeCab dictionary to be used with unidic-lite
import sys
from fugashi.fugashi import build_dictionary
import unidic_lite
args = (
    sys.argv[0]
    + f" -f utf8 -t utf8 -d {unidic_lite.DICDIR} -u custom.dic custom_entries.csv"
)
print(args)
build_dictionary(args)
# REMOVE COMMENTS FIRST (MeCab doesn't allow comments)
# see https://twitter.com/zakki/status/920977351059554304 for a description of the fields
かい,830,830,6319,助詞,終助詞,*,*,*,*,カイ,かい,かい,カイ,かい,カイ,和,*,*,*,*
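Since MeCab does not accept comment lines, the two `#` lines above have to be removed before the build script is run. A minimal sketch of one way to do that in Python follows. The function names and the input file name are hypothetical, and the expected width of 21 fields is an assumption taken from the sample entry above rather than from MeCab documentation. Note that an entry whose surface form itself begins with `#` would also be dropped by this filter.

```python
import csv
import io
from pathlib import Path

# Assumption: 21 comma-separated fields per entry, as in the sample row above.
EXPECTED_FIELDS = 21


def strip_comment_lines(text: str) -> str:
    """Drop blank lines and lines starting with '#'; keep entry rows unchanged."""
    kept = [
        line
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    return "\n".join(kept) + "\n" if kept else ""


def rows_with_wrong_field_count(text: str, expected: int = EXPECTED_FIELDS):
    """Return (row_number, field_count) for every row whose width differs."""
    reader = csv.reader(io.StringIO(text))
    return [
        (number, len(row))
        for number, row in enumerate(reader, start=1)
        if len(row) != expected
    ]


def clean_entries(src: str, dst: str):
    """Write a comment-free copy of src to dst and return any suspicious rows."""
    cleaned = strip_comment_lines(Path(src).read_text(encoding="utf-8"))
    Path(dst).write_text(cleaned, encoding="utf-8")
    return rows_with_wrong_field_count(cleaned)
```

Calling `clean_entries("custom_entries_commented.csv", "custom_entries.csv")` (the first name is a placeholder) would produce the file name the build script expects; a non-empty return value points at rows worth checking by hand before building.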
from fugashi import Tagger # type: ignore
# use the generated custom dictionary (plus unidic-lite)
TAGGER = Tagger("-Owakati -u custom.dic")
print(TAGGER("皆様こんにちは。本日はですね とっても特別なお客様にお伺いしたいと思いま〜す"))
# expected output: [皆, 様, こんにちは, 。, 本日, は, です, ね, , とっても, 特別, な, お, 客, 様, に, お, 伺い, し, たい, と, 思い, ま, 〜, す]