@fasiha
Last active March 31, 2024 17:49
Hiragana frequency in readings of sentences in Tono, et al., *A Frequency Dictionary of Japanese* (2013; Core5000 Anki deck)

Overview

Nayr's Japanese Core5000 Anki deck (discussion) contains pronunciations of all five thousand or so sentences in A Frequency Dictionary of Japanese by Yukio Tono, Makoto Yamazaki, and Kikuo Maekawa (2013), which lists the five thousand most frequent words in Japanese according to the latest corpus research. I analyzed the hiragana readings of these sentences to build a frequency table of hiragana occurrences, counting digraphs (yōon) like きゃ and ちょ as single units. The two attached tables show the results in modern hiragana order and sorted by frequency.

Technical notes

I parsed a file containing those sentences, with readings annotated in hiragana (core5k-sentences.md), using the following script and a helper file (kana.txt):

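# Work on a scratch copy, because each kana is deleted once it has been counted.
cp core5k-sentences.md sacrifice.md
# Skip the blank lines in kana.txt and process one kana per iteration.
sed '/^$/d' kana.txt | while read -r i; do
  printf '%s : ' "$i"
  # grep -o prints every match on its own line, so wc -l counts occurrences.
  grep -o -- "$i" sacrifice.md | wc -l
  # Remove the counted kana so that, e.g., きゃ is not counted again as き.
  sed -i.bak "s/$i//g" sacrifice.md
done | tee count-kana-core.txt

# Total number of occurrences, then each kana's percentage of that total.
SUMMED=$(awk '{sum+=$3} END {print sum}' count-kana-core.txt)
awk -v sum="$SUMMED" '{print $1 $2 " " $3 " : " $3/sum*100}' count-kana-core.txt | tee percent-count-kana-core.txt

# Sort by count (the second ":"-separated field), highest first.
sort -t":" -k2 -n -r percent-count-kana-core.txt > sorted-percent-count-kana-core.txt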
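
To make the intermediate file formats concrete, here is a minimal sketch of the percentage and sorting steps run on a made-up three-line count file. The counts are toy numbers, not taken from the deck, and `percent_sorted` is a hypothetical helper name introduced only for this illustration.

```shell
# Hypothetical helper: read "kana : count" lines on stdin and print
# "kana: count : percent" lines sorted by count, highest first.
percent_sorted() {
  awk '{ kana[NR] = $1; count[NR] = $3; sum += $3 }
       END { for (i = 1; i <= NR; i++)
               print kana[i] ": " count[i] " : " count[i] / sum * 100 }' |
    sort -t: -k2 -n -r
}

# Made-up counts (not from the deck); they sum to 10.
printf '%s\n' 'あ : 2' 'い : 5' 'きゃ : 3' | percent_sorted
# prints:
# い: 5 : 50
# きゃ: 3 : 30
# あ: 2 : 20
```

Each percentage is simply the kana's count divided by the sum of all counts, times 100, which is why the percentage column of a complete table adds up to 100.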

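And here's the helper file kana.txt containing the kana. Note that the digraphs are listed first: the script deletes each kana from the working copy after counting it, so if き were counted before きゃ, every きゃ would be miscounted as a き (leaving a stray ゃ behind).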

きゃ
きゅ
きょ

しゃ
しゅ
しょ

ちゃ
ちゅ
ちょ

にゃ
にゅ
にょ

ひゃ
ひゅ
ひょ

みゃ
みゅ
みょ

りゃ
りゅ
りょ

ぎゃ
ぎゅ
ぎょ

じゃ
じゅ
じょ

びゃ
びゅ
びょ

ぴゃ
ぴゅ
ぴょ

あ
い
う
え
お

か
き
く
け
こ

さ
し
す
せ
そ


た
ち
つ
て
と

な
に
ぬ
ね
の

は
ひ
ふ
へ
ほ

ま
み
む
め
も

や
ゆ
よ

ら
り
る
れ
ろ

わ
を

ん

が
ぎ
ぐ
げ
ご

ざ
じ
ず
ぜ
ぞ

だ
ぢ
づ
で
ど

ば
び
ぶ
べ
ぼ

ぱ
ぴ
ぷ
ぺ
ぽ
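
The ordering requirement for kana.txt can be checked on a tiny input. The following is a minimal sketch, assuming a made-up sample string and a hypothetical `count_in_order` helper (neither is part of the original script); it counts and deletes kana in the order given, the same way the script does.

```shell
# Hypothetical helper: count each kana argument in order, deleting it from a
# scratch copy of the text after counting, as the script above does.
count_in_order() {
  text=$1; shift
  out=
  for kana in "$@"; do
    # grep -o prints every match on its own line, so wc -l counts occurrences;
    # -F matches the kana literally and -a forces text mode in any locale.
    # grep exits non-zero on no match, so "|| true" keeps strict shells happy.
    n=$(printf '%s\n' "$text" | { grep -a -oF -- "$kana" || true; } | wc -l)
    out="$out${out:+ }$kana:$((n))"
    # Delete the counted kana before moving on to the next one.
    text=$(printf '%s\n' "$text" | sed "s/$kana//g")
  done
  printf '%s\n' "$out"
}

sample='きゃくがきた'   # made-up sample: one きゃ plus one standalone き

count_in_order "$sample" きゃ き   # digraph first: prints "きゃ:1 き:1"
count_in_order "$sample" き きゃ   # single kana first: prints "き:2 きゃ:0"
```

With the digraph listed first, both counts are right. With き first, the き inside きゃ is swallowed, so き is overcounted and きゃ is never seen.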

This was done to aid in the construction of a Major or Person-Action-Object memory system.

Hiragana # of occurrences Percentage
あ 1076 1.35922
い 5943 7.5073
う 3551 4.48568
え 713 0.900673
お 1111 1.40343
か 3590 4.53495
き 1433 1.81019
く 2011 2.54033
け 878 1.1091
こ 1938 2.44811
さ 1192 1.50575
し 2828 3.57238
す 1420 1.79377
せ 826 1.04342
そ 662 0.836249
た 3489 4.40736
ち 761 0.961308
つ 1386 1.75082
て 2008 2.53654
と 1631 2.06031
な 1856 2.34453
に 2388 3.01656
ぬ 54 0.0682137
ね 350 0.442126
の 3334 4.21156
は 3331 4.20777
ひ 461 0.582343
ふ 342 0.43202
へ 186 0.234958
ほ 434 0.548236
ま 1500 1.89482
み 576 0.727613
む 227 0.28675
め 535 0.675821
も 869 1.09774
や 416 0.525498
ゆ 215 0.271592
よ 590 0.745298
ら 994 1.25564
り 823 1.03963
る 1693 2.13863
れ 1415 1.78745
ろ 321 0.405492
わ 817 1.03205
を 2012 2.54159
ん 3682 4.65116
が 1980 2.50117
ぎ 180 0.227379
ぐ 80 0.101057
げ 237 0.299382
ご 350 0.442126
ざ 120 0.151586
じ 718 0.906989
ず 172 0.217273
ぜ 117 0.147796
ぞ 106 0.133901
だ 1428 1.80387
ぢ 1 0.00126322
づ 47 0.0593712
で 1463 1.84809
ど 466 0.588659
ば 289 0.36507
び 181 0.228642
ぶ 258 0.32591
べ 217 0.274118
ぼ 110 0.138954
ぱ 57 0.0720033
ぴ 9 0.0113689
ぷ 18 0.0227379
ぺ 5 0.00631608
ぽ 22 0.0277908
きゃ 31 0.0391597
きゅ 96 0.121269
きょ 264 0.333489
しゃ 266 0.336016
しゅ 273 0.344858
しょ 416 0.525498
ちゃ 31 0.0391597
ちゅ 91 0.114953
ちょ 119 0.150323
にゃ 0 0
にゅ 46 0.058108
にょ 1 0.00126322
ひゃ 9 0.0113689
ひゅ 0 0
ひょ 25 0.0315804
みゃ 2 0.00252643
みゅ 0 0
みょ 4 0.00505287
りゃ 3 0.00378965
りゅ 17 0.0214747
りょ 201 0.253906
ぎゃ 3 0.00378965
ぎゅ 4 0.00505287
ぎょ 75 0.0947412
じゃ 25 0.0315804
じゅ 167 0.210957
じょ 464 0.586132
びゃ 1 0.00126322
びゅ 0 0
びょ 22 0.0277908
ぴゃ 0 0
ぴゅ 0 0
ぴょ 8 0.0101057

Hiragana # of occurrences Percentage
い 5943 7.5073
ん 3682 4.65116
か 3590 4.53495
う 3551 4.48568
た 3489 4.40736
の 3334 4.21156
は 3331 4.20777
し 2828 3.57238
に 2388 3.01656
を 2012 2.54159
く 2011 2.54033
て 2008 2.53654
が 1980 2.50117
こ 1938 2.44811
な 1856 2.34453
る 1693 2.13863
と 1631 2.06031
ま 1500 1.89482
で 1463 1.84809
き 1433 1.81019
だ 1428 1.80387
す 1420 1.79377
れ 1415 1.78745
つ 1386 1.75082
さ 1192 1.50575
お 1111 1.40343
あ 1076 1.35922
ら 994 1.25564
け 878 1.1091
も 869 1.09774
せ 826 1.04342
り 823 1.03963
わ 817 1.03205
ち 761 0.961308
じ 718 0.906989
え 713 0.900673
そ 662 0.836249
よ 590 0.745298
み 576 0.727613
め 535 0.675821
ど 466 0.588659
じょ 464 0.586132
ひ 461 0.582343
ほ 434 0.548236
や 416 0.525498
しょ 416 0.525498
ね 350 0.442126
ご 350 0.442126
ふ 342 0.43202
ろ 321 0.405492
ば 289 0.36507
しゅ 273 0.344858
しゃ 266 0.336016
きょ 264 0.333489
ぶ 258 0.32591
げ 237 0.299382
む 227 0.28675
べ 217 0.274118
ゆ 215 0.271592
りょ 201 0.253906
へ 186 0.234958
び 181 0.228642
ぎ 180 0.227379
ず 172 0.217273
じゅ 167 0.210957
ざ 120 0.151586
ちょ 119 0.150323
ぜ 117 0.147796
ぼ 110 0.138954
ぞ 106 0.133901
きゅ 96 0.121269
ちゅ 91 0.114953
ぐ 80 0.101057
ぎょ 75 0.0947412
ぱ 57 0.0720033
ぬ 54 0.0682137
づ 47 0.0593712
にゅ 46 0.058108
ちゃ 31 0.0391597
きゃ 31 0.0391597
ひょ 25 0.0315804
じゃ 25 0.0315804
ぽ 22 0.0277908
びょ 22 0.0277908
ぷ 18 0.0227379
りゅ 17 0.0214747
ひゃ 9 0.0113689
ぴ 9 0.0113689
ぴょ 8 0.0101057
ぺ 5 0.00631608
みょ 4 0.00505287
ぎゅ 4 0.00505287
りゃ 3 0.00378965
ぎゃ 3 0.00378965
みゃ 2 0.00252643
びゃ 1 0.00126322
にょ 1 0.00126322
ぢ 1 0.00126322
みゅ 0 0
ぴゅ 0 0
びゅ 0 0
ひゅ 0 0
ぴゃ 0 0
にゃ 0 0
@LeonardAB

Hi there! Thanks for this one. Such data is very rare, usually they focus on kanji, but I need only the hiragana for doing my research. Do you have data about hiragana bigram frequency? Bigram means combination of two hiraganas that appear consecutively in a text. The above data is unigram. An example of bigram is how often ま is followed by す (ます bigram). Thanks in advance!