@shirakaba
Last active August 9, 2023 01:42
UniDic fields

I compared the UniDic v3.1.0 releases based on Contemporary Written Japanese (unidic-cwj) and Contemporary Spoken Japanese (unidic-csj). They appear to contain the same number of lines in lex_3_1.csv, but differ in the weights (columns 2-4 in the samples below).

Sample values from unidic-cwj-3.1.0:

日本,5025,3263,3997,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニポン,日本,ニポン,固,*,*,*,*,*,*,国,ニポン,ニポン,ニポン,ニポン,2,*,*,7821676679143936,28455
日本,5025,3263,-1870,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニホン,日本,ニホン,固,*,*,*,*,*,*,国,ニホン,ニホン,ニホン,ニホン,2,*,*,7821668089209344,28455
日本,5383,458,-2903,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*,*,*,国,ニッポン,ニッポン,ニッポン,ニッポン,3,*,*,7821659499274752,28455

Sample values from unidic-csj-3.1.0:

日本,11235,13833,3926,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニポン,日本,ニポン,固,*,*,*,*,*,*,国,ニポン,ニポン,ニポン,ニポン,2,*,*,7821676679143936,28455
日本,3015,5548,-3280,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*,*,*,国,ニッポン,ニッポン,ニッポン,ニッポン,3,*,*,7821659499274752,28455
日本,11235,13833,-2256,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニホン,日本,ニホン,固,*,*,*,*,*,*,国,ニホン,ニホン,ニホン,ニホン,2,*,*,7821668089209344,28455
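
To check which columns actually differ between the two releases, the sample rows above can be paired up and compared column by column. This is only a sketch over the six rows quoted here, not over the full lex_3_1.csv files, and it assumes that column 32 is a stable per-entry ID that is safe to join on (the samples are consistent with that, but I have not verified it for the whole dictionary). The helper name `index_by_id` is my own.

```python
import csv

# The three 日本 rows quoted above from each release.
CWJ = """\
日本,5025,3263,3997,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニポン,日本,ニポン,固,*,*,*,*,*,*,国,ニポン,ニポン,ニポン,ニポン,2,*,*,7821676679143936,28455
日本,5025,3263,-1870,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニホン,日本,ニホン,固,*,*,*,*,*,*,国,ニホン,ニホン,ニホン,ニホン,2,*,*,7821668089209344,28455
日本,5383,458,-2903,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*,*,*,国,ニッポン,ニッポン,ニッポン,ニッポン,3,*,*,7821659499274752,28455
"""
CSJ = """\
日本,11235,13833,3926,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニポン,日本,ニポン,固,*,*,*,*,*,*,国,ニポン,ニポン,ニポン,ニポン,2,*,*,7821676679143936,28455
日本,3015,5548,-3280,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*,*,*,国,ニッポン,ニッポン,ニッポン,ニッポン,3,*,*,7821659499274752,28455
日本,11235,13833,-2256,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニホン,日本,ニホン,固,*,*,*,*,*,*,国,ニホン,ニホン,ニホン,ニホン,2,*,*,7821668089209344,28455
"""


def index_by_id(text):
    """Map column 32 (assumed to be a unique entry ID) to the parsed row."""
    return {row[31]: row for row in csv.reader(text.splitlines())}


cwj = index_by_id(CWJ)
csj = index_by_id(CSJ)

# For each entry present in both releases, list the 1-indexed columns that differ.
diffs = {
    entry_id: [i + 1 for i, (a, b) in enumerate(zip(cwj[entry_id], csj[entry_id])) if a != b]
    for entry_id in sorted(cwj.keys() & csj.keys())
}

for entry_id, columns in diffs.items():
    print(entry_id, columns)
```

For these samples, every paired entry differs only in columns 2, 3 and 4; the rows also appear in a different order in the two files, which is why they are joined on an ID rather than compared line by line.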

0000000000000000000000000000000000000000011111111111111111111111111111111111111111111222222222222222222222222222222222222222222333333333333333333333333333
 1    2    3     4    5    6     7   8 9 0   1      2   3    4      5    6    7  8 9 0 1 2 3 4    5      6      7       8    9 0 1        2           3
日本,3015,5548,-3280,名詞,固有名詞,地名,国,*,*,ニッポン,日本,日本,ニッポン,日本,ニッポン,固,*,*,*,*,*,*,国,ニッポン,ニッポン,ニッポン,ニッポン,3,*,*,7821659499274752,28455

                                                                       *             *
 1    2    3     4    5     6     7   8  9 0     1        2      3     4      5      6     7 8 9 0 1 2 3  4      5        6         7       8     9 0 1        2           3
東京,15145,3295,-1083,名詞,固有名詞,地名,一般,*,*,トウキョウ,トウキョウ,東京,トーキョー,東京,トーキョー,固,*,*,*,*,*,*,地名,トウキョウ,トウキョウ,トウキョウ,トウキョウ,0,*,*,7129242085237248,25936

                                                                  *        *
 1    2     3     4   5    6  7 8    9         0     1    2   3   4   5    6  7 8 9 0 1 2 3  4  5    6   7   8  9 0  1        2           3
する,13964,13715,7291,動詞,一般,*,*,五段-ラ行,終止形-一般,スル,擦る,する,スル,する,スル,和,*,*,*,*,*,*,用,スル,スル,スル,スル,1,C1,*,5370023413686955,19536

                                                                             *          *
  1    2   3    4    5     6     7 8      9         0       1    2     3     4     5    6    7 8 9 0 1 2 3  4   5    6     7      8   9 0 1        2           3
できる,930,9378,2311,動詞,非自立可能,*,*,上一段-カ行,終止形-一般,デキル,出来る,できる,デキル,できる,デキル,和,*,*,*,*,*,*,用,デキル,デキル,デキル,デキル,2,C1,*,7043480111162027,25624

                                                                      *          *
  1      2   3    4    5     6  7 8   9        0      1    2    3     4     5    6    7 8 9 0 1 2 3  4   5    6     7      8  9 0  1        2          3
上手く,13459,5531,5526,形容詞,一般,*,*,形容詞,連用形-一般,ウマイ,旨い,上手く,ウマク,上手い,ウマイ,和,*,*,*,*,*,*,相,ウマク,ウマイ,ウマク,ウマイ,2,C1,*,924972981822081,3365
14: pronunciation of the surface (ウマク)
16: pronunciation of the lemma (ウマイ)
... so to get the reading of the text as written, we want 14.

                                                       *           *
  1     2   3    4    5  6 7 8 9 0   1     2     3     4    5      6   7 8 9 0 1 2 3 4    5     6    7     8     9   0 1        2             3
やっぱ,2949,4014,3112,副詞,*,*,*,*,*,ヤハリ,矢張り,やっぱ,ヤッパ,やっぱ,ヤッパ,和,*,*,*,*,*,*,相,ヤッパ,ヤッパ,ヤッパ,ヤッパ,"1,3",*,*,10551471959581184,38386
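
Note that the やっぱ row above has a quoted value, "1,3", in column 29, so splitting on commas miscounts the fields. A minimal sketch of the difference, using Python's standard csv module on that one row:

```python
import csv
import io

# The やっぱ row quoted above; column 29 is the quoted value "1,3".
line = 'やっぱ,2949,4014,3112,副詞,*,*,*,*,*,ヤハリ,矢張り,やっぱ,ヤッパ,やっぱ,ヤッパ,和,*,*,*,*,*,*,相,ヤッパ,ヤッパ,ヤッパ,ヤッパ,"1,3",*,*,10551471959581184,38386'

# A naive split cuts the quoted value in two and yields one field too many.
naive = line.split(",")

# csv.reader honours the quotes and returns the expected 33 fields.
fields = next(csv.reader(io.StringIO(line)))

print(len(naive), len(fields))
print(fields[13])  # column 14 (1-indexed): pronunciation of the surface
print(fields[28])  # column 29: the quoted value, with the quotes removed
```

The column numbers used in these notes are 1-indexed, so column 14 is `fields[13]` in Python.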

For a full description of each field (apart from the weights), see https://github.com/polm/unidic-py and https://pypi.org/project/unidic/. As far as I can tell, the fields in lex_3_1.csv line up like so:

1 surface
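2 left context ID (MeCab dictionary format; an index into the connection-cost matrix)
3 right context ID (MeCab dictionary format; an index into the connection-cost matrix)
4 word cost (the weight MeCab assigns to this entry; lower is preferred)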
5 pos1
6 pos2
7 pos3
8 pos4
9 cType (conjugation type)
10 cForm (conjugation form)
11 lForm (lemma reading, e.g. トウキョウ)
12 lemma (e.g. 擦る for the surface する, or 矢張り for the surface やっぱ. Some cases like place names may be in katakana, e.g. トウキョウ)
13 orth (identical to the surface)
14 pron (pronunciation of the surface, e.g. トーキョー for 東京 or ウマク for 上手く).
15 orthBase (the uninflected form of the word in its current written form, though not necessarily the lemma. So for 彷徨った, the lemma is さ迷う but the orthBase is 彷徨う).
16 pronBase (pronunciation of the lemma, e.g. ウマイ for 上手く).
17 goshu (etymological category/origin)
18 iType
19 iForm
20 fType
21 fForm
22 iConType
23 fConType
24 type
25 kana
26 kanaBase
27 form
28 formBase
29 aType
30 aConType
31 aModType
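32 lid (lexicon entry ID rather than a weight; it differs for each of the three 日本 entries above)
33 lemma_id (lemma ID rather than a weight; it is 28455 for all three 日本 entries above)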
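
The list above can be turned into a lookup table so that fields are read by name instead of by position. This is a minimal sketch, not part of UniDic or unidic-py: `FIELD_NAMES` and `parse_entry` are my own names, the labels for columns 5-31 follow the list above, and the labels for columns 2-4 and 32-33 (`left_id`, `right_id`, `cost`, `lid`, `lemma_id`) are assumptions on my part.

```python
import csv
import io

# Column labels in file order. Columns 5-31 follow the list above; the labels
# for columns 2-4 and 32-33 are assumed, not taken from the CSV itself.
FIELD_NAMES = [
    "surface", "left_id", "right_id", "cost",
    "pos1", "pos2", "pos3", "pos4", "cType", "cForm",
    "lForm", "lemma", "orth", "pron", "orthBase", "pronBase", "goshu",
    "iType", "iForm", "fType", "fForm", "iConType", "fConType",
    "type", "kana", "kanaBase", "form", "formBase",
    "aType", "aConType", "aModType", "lid", "lemma_id",
]


def parse_entry(line: str) -> dict[str, str]:
    """Parse one lex_3_1.csv row into a dict keyed by field name."""
    values = next(csv.reader(io.StringIO(line)))
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}")
    return dict(zip(FIELD_NAMES, values))


# The 上手く row quoted earlier in these notes.
entry = parse_entry(
    "上手く,13459,5531,5526,形容詞,一般,*,*,形容詞,連用形-一般,ウマイ,旨い,上手く,ウマク,"
    "上手い,ウマイ,和,*,*,*,*,*,*,相,ウマク,ウマイ,ウマク,ウマイ,2,C1,*,924972981822081,3365"
)

print(entry["pron"], entry["pronBase"])    # reading as written vs. base-form reading
print(entry["lemma"], entry["orthBase"])   # lemma vs. uninflected written form
```

The length check is there because a row split with a plain `str.split(",")` would silently shift every field after a quoted value such as "1,3".
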
Copyright (c) 2011-2021, The UniDic Consortium
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the
distribution.
* Neither the name of the UniDic Consortium nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.