Hiragana frequency in readings of sentences in Tono, et al., *A Frequency Dictionary of Japanese* (2013; Core5000 Anki deck)

README.md

Overview

Nayr's Japanese Core5000 Anki deck contains pronunciations of all five thousand or so sentences in A Frequency Dictionary of Japanese by Yukio Tono, Makoto Yamazaki, and Kikuo Maekawa (2013), which lists the top five thousand words in Japanese according to the latest corpus research. I analyzed these sentences to build a frequency table of hiragana occurrences, including digraphs (yōon) such as きゃ and ちょ. The two attached tables show the results in modern hiragana order and sorted by descending frequency.

Technical notes

I parsed a file containing those sentences, with readings annotated in hiragana (core5k-sentences.md), using the following script and a helper file (kana.txt):

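# Work on a scratch copy: each pass deletes the kana it has just counted.
cp core5k-sentences.md sacrifice.md
# Drop the blank separator lines in kana.txt, then handle one kana per pass.
sed '/^$/d' kana.txt | while read -r i; do
  printf '%s : ' "$i"
  # Print every match on its own line and count the lines.
  grep -o -- "$i" sacrifice.md | wc -l
  # Delete the counted kana so that shorter entries cannot match it again.
  sed -i.bak "s/$i//g" sacrifice.md
done | tee count-kana-core.txt

# Sum the counts (third field) to get the denominator for the percentages.
SUMMED=$(awk '{sum+=$3} END {print sum}' count-kana-core.txt)
awk -v sum="$SUMMED" '{print $1 $2 " " $3 " : " $3/sum*100}' count-kana-core.txt | tee percent-count-kana-core.txt

# Sort numerically on the count that follows the first colon, largest first.
sort -k2 -t":" -n -r percent-count-kana-core.txt > sorted-percent-count-kana-core.txt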
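
As a worked check of the percentage step, here is a minimal sketch that computes each kana's share of the total in a single awk call. The function name `percentages` and the three toy counts are assumptions made for illustration; they are not part of the original script or data.

```shell
# Toy stand-in for count-kana-core.txt, one "<kana> : <count>" per line.
# These counts are made up for illustration, not taken from the deck.
toy=$(mktemp)
printf '%s\n' 'あ : 1' 'い : 1' 'う : 2' > "$toy"

# percentages FILE: print kana, count, and percentage, tab-separated.
# awk reads FILE twice: the first pass (NR == FNR) sums the counts in the
# third field, the second pass divides each count by that sum.
percentages() {
  awk 'NR == FNR { sum += $3; next }
       { printf "%s\t%d\t%g\n", $1, $3, $3 / sum * 100 }' "$1" "$1"
}

percentages "$toy"
```

With a total of 4, the toy counts 1, 1 and 2 come out as 25, 25 and 50 percent. The tab-separated output has the same three columns as the tables in this gist.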

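And here's the helper file, kana.txt, containing the kana. Note that the digraphs (yōon such as きょ) are listed first: the script deletes each kana from its working copy (sacrifice.md) after counting it, so if き were counted before きょ, the き inside every きょ would be counted as a plain き as well.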

きゃ
きゅ
きょ

しゃ
しゅ
しょ

ちゃ
ちゅ
ちょ

にゃ
にゅ
にょ

ひゃ
ひゅ
ひょ

みゃ
みゅ
みょ

りゃ
りゅ
りょ

ぎゃ
ぎゅ
ぎょ

じゃ
じゅ
じょ

びゃ
びゅ
びょ

ぴゃ
ぴゅ
ぴょ

あ
い
う
え
お

か
き
く
け
こ

さ
し
す
せ
そ


た
ち
つ
て
と

な
に
ぬ
ね
の

は
ひ
ふ
へ
ほ

ま
み
む
め
も

や
ゆ
よ

ら
り
る
れ
ろ

わ
を

ん

が
ぎ
ぐ
げ
ご

ざ
じ
ず
ぜ
ぞ

だ
ぢ
づ
で
ど

ば
び
ぶ
べ
ぼ

ぱ
ぴ
ぷ
ぺ
ぽ
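
The ordering of kana.txt matters because the counting is done by plain substring matching. The following sketch shows the difference on a single hypothetical reading; the helper names `count_kana` and `strip_kana` and the sample string are assumptions for illustration, not part of the original script.

```shell
# count_kana KANA TEXT: number of non-overlapping occurrences of KANA in TEXT.
# awk's gsub returns how many substitutions it made, which is the count.
count_kana() {
  printf '%s\n' "$2" | awk -v k="$1" '{ n += gsub(k, "") } END { print n + 0 }'
}

# strip_kana KANA TEXT: TEXT with every occurrence of KANA deleted.
strip_kana() {
  printf '%s\n' "$2" | sed "s/$1//g"
}

# Hypothetical reading, not taken from the deck: it contains the digraph
# きょ once and a standalone き once.
sample='きょうはきのう'

# Naive order: counting き first also matches the き inside きょ.
naive=$(count_kana 'き' "$sample")

# Digraph-first order: count きょ, delete it, then count the remaining き.
digraph=$(count_kana 'きょ' "$sample")
rest=$(strip_kana 'きょ' "$sample")
plain=$(count_kana 'き' "$rest")

echo "naive き: $naive, きょ: $digraph, plain き: $plain"
```

The naive count reports two き, while the digraph-first order reports one きょ and one き, which is the split the tables in this gist rely on.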

This was done to aid in constructing a Major or Person-Action-Object (PAO) memory system.

hiragana-tono.md

Hiragana	# of occurrences	Percentage
あ	1076	1.35922
い	5943	7.5073
う	3551	4.48568
え	713	0.900673
お	1111	1.40343

か	3590	4.53495
き	1433	1.81019
く	2011	2.54033
け	878	1.1091
こ	1938	2.44811

さ	1192	1.50575
し	2828	3.57238
す	1420	1.79377
せ	826	1.04342
そ	662	0.836249

た	3489	4.40736
ち	761	0.961308
つ	1386	1.75082
て	2008	2.53654
と	1631	2.06031

な	1856	2.34453
に	2388	3.01656
ぬ	54	0.0682137
ね	350	0.442126
の	3334	4.21156

は	3331	4.20777
ひ	461	0.582343
ふ	342	0.43202
へ	186	0.234958
ほ	434	0.548236

ま	1500	1.89482
み	576	0.727613
む	227	0.28675
め	535	0.675821
も	869	1.09774

や	416	0.525498
ゆ	215	0.271592
よ	590	0.745298

ら	994	1.25564
り	823	1.03963
る	1693	2.13863
れ	1415	1.78745
ろ	321	0.405492

わ	817	1.03205
を	2012	2.54159

ん	3682	4.65116

が	1980	2.50117
ぎ	180	0.227379
ぐ	80	0.101057
げ	237	0.299382
ご	350	0.442126

ざ	120	0.151586
じ	718	0.906989
ず	172	0.217273
ぜ	117	0.147796
ぞ	106	0.133901

だ	1428	1.80387
ぢ	1	0.00126322
づ	47	0.0593712
で	1463	1.84809
ど	466	0.588659

ば	289	0.36507
び	181	0.228642
ぶ	258	0.32591
べ	217	0.274118
ぼ	110	0.138954

ぱ	57	0.0720033
ぴ	9	0.0113689
ぷ	18	0.0227379
ぺ	5	0.00631608
ぽ	22	0.0277908

きゃ	31	0.0391597
きゅ	96	0.121269
きょ	264	0.333489

しゃ	266	0.336016
しゅ	273	0.344858
しょ	416	0.525498

ちゃ	31	0.0391597
ちゅ	91	0.114953
ちょ	119	0.150323

にゃ	0	0
にゅ	46	0.058108
にょ	1	0.00126322

ひゃ	9	0.0113689
ひゅ	0	0
ひょ	25	0.0315804

みゃ	2	0.00252643
みゅ	0	0
みょ	4	0.00505287

りゃ	3	0.00378965
りゅ	17	0.0214747
りょ	201	0.253906

ぎゃ	3	0.00378965
ぎゅ	4	0.00505287
ぎょ	75	0.0947412

じゃ	25	0.0315804
じゅ	167	0.210957
じょ	464	0.586132

びゃ	1	0.00126322
びゅ	0	0
びょ	22	0.0277908

ぴゃ	0	0
ぴゅ	0	0
ぴょ	8	0.0101057

sorted-hiragana-tono.md

Hiragana	# of occurrences	Percentage
い	5943	7.5073
ん	3682	4.65116
か	3590	4.53495
う	3551	4.48568
た	3489	4.40736
の	3334	4.21156
は	3331	4.20777
し	2828	3.57238
に	2388	3.01656
を	2012	2.54159
く	2011	2.54033
て	2008	2.53654
が	1980	2.50117
こ	1938	2.44811
な	1856	2.34453
る	1693	2.13863
と	1631	2.06031
ま	1500	1.89482
で	1463	1.84809
き	1433	1.81019
だ	1428	1.80387
す	1420	1.79377
れ	1415	1.78745
つ	1386	1.75082
さ	1192	1.50575
お	1111	1.40343
あ	1076	1.35922
ら	994	1.25564
け	878	1.1091
も	869	1.09774
せ	826	1.04342
り	823	1.03963
わ	817	1.03205
ち	761	0.961308
じ	718	0.906989
え	713	0.900673
そ	662	0.836249
よ	590	0.745298
み	576	0.727613
め	535	0.675821
ど	466	0.588659
じょ	464	0.586132
ひ	461	0.582343
ほ	434	0.548236
や	416	0.525498
しょ	416	0.525498
ね	350	0.442126
ご	350	0.442126
ふ	342	0.43202
ろ	321	0.405492
ば	289	0.36507
しゅ	273	0.344858
しゃ	266	0.336016
きょ	264	0.333489
ぶ	258	0.32591
げ	237	0.299382
む	227	0.28675
べ	217	0.274118
ゆ	215	0.271592
りょ	201	0.253906
へ	186	0.234958
び	181	0.228642
ぎ	180	0.227379
ず	172	0.217273
じゅ	167	0.210957
ざ	120	0.151586
ちょ	119	0.150323
ぜ	117	0.147796
ぼ	110	0.138954
ぞ	106	0.133901
きゅ	96	0.121269
ちゅ	91	0.114953
ぐ	80	0.101057
ぎょ	75	0.0947412
ぱ	57	0.0720033
ぬ	54	0.0682137
づ	47	0.0593712
にゅ	46	0.058108
ちゃ	31	0.0391597
きゃ	31	0.0391597
ひょ	25	0.0315804
じゃ	25	0.0315804
ぽ	22	0.0277908
びょ	22	0.0277908
ぷ	18	0.0227379
りゅ	17	0.0214747
ひゃ	9	0.0113689
ぴ	9	0.0113689
ぴょ	8	0.0101057
ぺ	5	0.00631608
みょ	4	0.00505287
ぎゅ	4	0.00505287
りゃ	3	0.00378965
ぎゃ	3	0.00378965
みゃ	2	0.00252643
びゃ	1	0.00126322
にょ	1	0.00126322
ぢ	1	0.00126322
みゅ	0	0
ぴゅ	0	0
びゅ	0	0
ひゅ	0	0
ぴゃ	0	0
にゃ	0	0

LeonardAB commented Jan 15, 2016

Hi there! Thanks for this one. Such data is very rare; most frequency lists focus on kanji, but I need only the hiragana for my research. Do you have data on hiragana bigram frequency? A bigram is a pair of hiragana that appear consecutively in a text, whereas the data above is unigram. An example of a bigram count is how often ま is followed by す (the ます bigram). Thanks in advance!

fasiha/README.md