I compared the two UniDic v3.1.0 releases, one based on Contemporary Written Japanese (unidic-cwj) and one based on Contemporary Spoken Japanese (unidic-csj). Their lex_3_1.csv files appear to contain the same number of lines, but differ in the weights.
Sample values from unidic-cwj-3.1.0:
日本,5025,3263,3997,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニポン,日本,ニポン,固,*,*,*,*,*,*,国,ニポン,ニポン,ニポン,ニポン,2,*,*,7821676679143936,28455
日本,5025,3263,-1870,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニホン,日本,ニホン,固,*,*,*,*,*,*,国,ニホン,ニホン,ニホン,ニホン,2,*,*,7821668089209344,28455
日本,5383,458,-2903,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*,*,*,国,ニッポン,ニッポン,ニッポン,ニッポン,3,*,*,7821659499274752,28455
Sample values from unidic-csj-3.1.0:
日本,11235,13833,3926,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニポン,日本,ニポン,固,*,*,*,*,*,*,国,ニポン,ニポン,ニポン,ニポン,2,*,*,7821676679143936,28455
日本,3015,5548,-3280,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*,*,*,国,ニッポン,ニッポン,ニッポン,ニッポン,3,*,*,7821659499274752,28455
日本,11235,13833,-2256,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニホン,日本,ニホン,固,*,*,*,*,*,*,国,ニホン,ニホン,ニホン,ニホン,2,*,*,7821668089209344,28455
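To make the difference easier to see, here is a minimal sketch that pairs up the sample entries above and prints fields 2 to 4 side by side. It assumes that field 32 (the long number near the end of each line) identifies the same entry in both dictionaries, which holds for these samples but is not something I have verified for the whole file. The function name `weights_by_id` is made up for this example.

```python
import csv
import io

# Sample lines copied from the two dictionaries above.
CWJ = """\
日本,5025,3263,3997,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニポン,日本,ニポン,固,*,*,*,*,*,*,国,ニポン,ニポン,ニポン,ニポン,2,*,*,7821676679143936,28455
日本,5025,3263,-1870,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニホン,日本,ニホン,固,*,*,*,*,*,*,国,ニホン,ニホン,ニホン,ニホン,2,*,*,7821668089209344,28455
日本,5383,458,-2903,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*,*,*,国,ニッポン,ニッポン,ニッポン,ニッポン,3,*,*,7821659499274752,28455
"""

CSJ = """\
日本,11235,13833,3926,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニポン,日本,ニポン,固,*,*,*,*,*,*,国,ニポン,ニポン,ニポン,ニポン,2,*,*,7821676679143936,28455
日本,3015,5548,-3280,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*,*,*,国,ニッポン,ニッポン,ニッポン,ニッポン,3,*,*,7821659499274752,28455
日本,11235,13833,-2256,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニホン,日本,ニホン,固,*,*,*,*,*,*,国,ニホン,ニホン,ニホン,ニホン,2,*,*,7821668089209344,28455
"""


def weights_by_id(text):
    """Map field 32 (index 31) to the three integers in fields 2 to 4."""
    result = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        result[row[31]] = tuple(int(value) for value in row[1:4])
    return result


cwj = weights_by_id(CWJ)
csj = weights_by_id(CSJ)

# Print the entries present in both samples, written vs. spoken.
for entry_id in sorted(cwj.keys() & csj.keys()):
    print(entry_id, "cwj:", cwj[entry_id], "csj:", csj[entry_id])
```

For the real files you would read each lex_3_1.csv from disk instead of using inline strings; the matching logic stays the same as long as the assumption about field 32 holds.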
More sample entries from lex_3_1.csv. Fields are numbered from 1 (the surface form) to 33, left to right:
日本,3015,5548,-3280,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*,*,*,国,ニッポン,ニッポン,ニッポン,ニッポン,3,*,*,7821659499274752,28455
東京,15145,3295,-1083,名詞,固有名詞,地名,一般,*,*,トウキョウ,トウキョウ,東京,トーキョー,東京,トーキョー,固,*,*,*,*,*,*,地名,トウキョウ,トウキョウ,トウキョウ,トウキョウ,0,*,*,7129242085237248,25936
する,13964,13715,7291,動詞,一般,*,*,五段-ラ行,終止形-一般,スル,擦る,する,スル,する,スル,和,*,*,*,*,*,*,用,スル,スル,スル,スル,1,C1,*,5370023413686955,19536
できる,930,9378,2311,動詞,非自立可能,*,*,上一段-カ行,終止形-一般,デキル,出来る,できる,デキル,できる,デキル,和,*,*,*,*,*,*,用,デキル,デキル,デキル,デキル,2,C1,*,7043480111162027,25624
上手く,13459,5531,5526,形容詞,一般,*,*,形容詞,連用形-一般,ウマイ,旨い,上手く,ウマク,上手い,ウマイ,和,*,*,*,*,*,*,相,ウマク,ウマイ,ウマク,ウマイ,2,C1,*,924972981822081,3365
In the 上手く entry above, field 14 (ウマク) is the pronunciation of the surface form and field 16 (ウマイ) is the pronunciation of the lemma, so field 14 is the one we want.
やっぱ,2949,4014,3112,副詞,*,*,*,*,*,ヤハリ,矢張り,やっぱ,ヤッパ,やっぱ,ヤッパ,和,*,*,*,*,*,*,相,ヤッパ,ヤッパ,ヤッパ,ヤッパ,"1,3",*,*,10551471959581184,38386
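As a quick check of which field holds which pronunciation, the following sketch pulls fields 1, 14 and 16 out of a single line. It uses Python's `csv` module rather than a plain `split(",")`, because some fields are quoted and contain commas (see `"1,3"` in the やっぱ entry). The function name `pronunciations` is hypothetical.

```python
import csv


def pronunciations(line):
    """Return (surface, field 14, field 16) for one lex_3_1.csv line."""
    fields = next(csv.reader([line]))
    # Field numbers in this write-up are 1-based; Python indexes are 0-based.
    return fields[0], fields[13], fields[15]


line = "上手く,13459,5531,5526,形容詞,一般,*,*,形容詞,連用形-一般,ウマイ,旨い,上手く,ウマク,上手い,ウマイ,和,*,*,*,*,*,*,相,ウマク,ウマイ,ウマク,ウマイ,2,C1,*,924972981822081,3365"
print(pronunciations(line))  # ('上手く', 'ウマク', 'ウマイ')
```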
For a full description of each field (apart from the weights), see https://github.com/polm/unidic-py and https://pypi.org/project/unidic/. As far as I can tell, the fields in lex_3_1.csv line up like so:
1 surface
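2 left context ID (MeCab connection ID, used to look up connection costs)
3 right context ID (MeCab connection ID)
4 word cost (the weight of the entry itself; lower is preferred)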
5 pos1
6 pos2
7 pos3
8 pos4
9 cType (conjugation type)
10 cForm (conjugation form)
11 lForm (lemma reading, e.g. トウキョウ)
12 lemma (e.g. 擦る for the surface する, or 矢張り for the surface やっぱ. Some cases like place names may be in katakana, e.g. トウキョウ)
13 orth (identical to the surface)
14 pron (pronunciation of the surface, e.g. トーキョー for 東京 or ウマク for 上手く).
15 orthBase (the uninflected form of the word in its current written form, though not necessarily the lemma. So for 彷徨った, the lemma is さ迷う but the orthBase is 彷徨う).
16 pronBase (pronunciation of the lemma, e.g. ウマイ for 上手く).
17 goshu (etymological category/origin)
18 iType
19 iForm
20 fType
21 fForm
22 iConType
23 fConType
24 type
25 kana
26 kanaBase
27 form
28 formBase
29 aType
30 aConType
31 aModType
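32 lid (lexeme ID; not a weight, since it is identical in the cwj and csj samples above)
33 lemma_id (lemma ID; also identical in both dictionaries)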
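The list above can be turned into a small parser. This is only a sketch: the names for fields 5 to 31 follow the list, while `left_id`, `right_id`, `cost`, `lid` and `lemma_id` are my assumptions based on the usual MeCab dictionary CSV layout and the unidic-py field names, and `parse_entry` is a made-up helper, not part of any library.

```python
import csv

# One name per CSV field, in order (33 in total).
FIELD_NAMES = [
    "surface", "left_id", "right_id", "cost",  # names for 2-4 are assumptions
    "pos1", "pos2", "pos3", "pos4", "cType", "cForm",
    "lForm", "lemma", "orth", "pron", "orthBase", "pronBase",
    "goshu", "iType", "iForm", "fType", "fForm", "iConType", "fConType",
    "type", "kana", "kanaBase", "form", "formBase",
    "aType", "aConType", "aModType",
    "lid", "lemma_id",  # names for 32-33 are assumptions
]


def parse_entry(line):
    """Parse one lex_3_1.csv line into a dict keyed by field name."""
    # csv.reader handles quoted fields such as "1,3" in aType.
    values = next(csv.reader([line]))
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}")
    return dict(zip(FIELD_NAMES, values))


entry = parse_entry(
    'やっぱ,2949,4014,3112,副詞,*,*,*,*,*,ヤハリ,矢張り,やっぱ,ヤッパ,やっぱ,ヤッパ,'
    '和,*,*,*,*,*,*,相,ヤッパ,ヤッパ,ヤッパ,ヤッパ,"1,3",*,*,10551471959581184,38386'
)
print(entry["surface"], entry["lemma"], entry["pron"], entry["aType"])
```

All values are kept as strings here; converting the numeric fields to integers is left out because `*` is used as a placeholder in many columns.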