Given that LibriVox contains enough English content for a speech processing corpus, LibriSpeech, to be built from it, I've wondered how much content LibriVox has in languages other than English.
I've downloaded the contents of the LibriVox JSON API, separated the audiobooks by language, and summed up their lengths, obtaining a language breakdown expressed in spoken time.
This gave over 60 thousand hours for English, thousands of hours each for German, Dutch, French, and Spanish, and hundreds of hours for several other languages.
Not only is data of amounts comparable to LibriSpeech available for four more languages; even for English, LibriSpeech has tapped only a tiny fraction of LibriVox's potential as labeled ASR training data.
The full table of the languages in LibriVox is:
| Language | Number of books | Total length | Average length |
|---|---|---|---|
English | 10685 | 64010:0:32 | 5:59:26 |
German | 536 | 3180:16:32 | 5:56:0 |
Dutch | 197 | 2228:11:37 | 11:18:38 |
French | 175 | 1120:17:57 | 6:24:6 |
Spanish | 198 | 1108:33:10 | 5:35:55 |
Multilingual | 111 | 499:55:30 | 4:30:13 |
Italian | 53 | 260:30:6 | 4:54:54 |
Portuguese | 53 | 243:53:15 | 4:36:5 |
Church Slavonic | 8 | 136:25:18 | 17:3:9 |
Polish | 22 | 135:9:51 | 6:8:37 |
Hebrew | 21 | 122:44:50 | 5:50:42 |
Russian | 28 | 99:8:37 | 3:32:27 |
Japanese | 37 | 97:40:18 | 2:38:23 |
Latin | 14 | 82:4:5 | 5:51:43 |
Finnish | 14 | 60:23:37 | 4:18:49 |
Swedish | 12 | 52:21:50 | 4:21:49 |
Chinese | 20 | 50:58:50 | 2:32:56 |
Danish | 10 | 43:46:45 | 4:22:40 |
Ancient Greek | 24 | 41:57:52 | 1:44:54 |
Greek | 10 | 22:18:49 | 2:13:52 |
Esperanto | 6 | 21:23:38 | 3:33:56 |
Hungarian | 2 | 18:46:49 | 9:23:24 |
Arabic | 1 | 17:6:45 | 17:6:45 |
Tamil | 2 | 11:39:12 | 5:49:36 |
Bulgarian | 4 | 11:31:5 | 2:52:46 |
Middle English | 2 | 7:54:42 | 3:57:21 |
Korean | 3 | 4:22:30 | 1:27:30 |
Ukrainian | 1 | 3:53:45 | 3:53:45 |
Latvian | 1 | 3:25:41 | 3:25:41 |
Tagalog | 4 | 2:44:44 | 0:41:11 |
Dholuo/Luo | 1 | 1:1:25 | 1:1:25 |
Javanese | 1 | 0:59:29 | 0:59:29 |
Urdu | 1 | 0:57:13 | 0:57:13 |
Telugu | 1 | 0:55:3 | 0:55:3 |
Bisaya/Cebuano | 1 | 0:14:46 | 0:14:46 |
Old English | 1 | 0:1:39 | 0:1:39 |
It would have been possible to crawl the LibriVox website and obtain the data by parsing the HTML. But that is a tedious process to set up, and site owners usually dislike it for good reasons. Fortunately, I didn't need to do this, as LibriVox offers an API.
The API provides output in a variety of formats, with XML being the default. I chose to use JSON as this enables me to use jq for processing.
This is an example API call for one audiobook. If you omit the id param, you get a list of 50 different audiobooks. You can control the number of audiobooks returned and the offset into the list via two URL params, offset and limit.
To download the JSON info about the whole set of books, I first had to find the maximum offset for which the API still returned results. I obtained it by first exponentially increasing the offset (adding zeros to the 1), stopping once I reached a point where the API no longer returned results, and then bisecting between the last working offset and the first failing one. This gave me a maximum offset of 12900.
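The probing procedure can be sketched as follows. Here `api_has_results` is a hypothetical stub standing in for an actual API request (in practice it would be a wget or curl call that checks for a non-empty book list):

```shell
# Stub: pretend the API returns results for offsets up to 12900.
# In reality this would issue an API request and inspect the response.
api_has_results() { [ "$1" -le 12900 ]; }

# Phase 1: grow the offset exponentially ("adding zeros to the 1")
# until the API stops returning results.
hi=1
while api_has_results "$hi"; do hi=$((hi * 10)); done
lo=$((hi / 10))

# Phase 2: bisect between the last working and first failing offset.
while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    if api_has_results "$mid"; then lo=$mid; else hi=$mid; fi
done
echo "maximum offset with results: $lo"
```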
All I had to do now was to download the JSON metadata in chunks. I chose a chunk size of 100, which requires 130 chunks for offsets 0 through 12900. I also made the script sleep 5 seconds between API calls so as not to disrupt the server:
mkdir -p librivox-api-json
for i in {0..129}; do wget "https://librivox.org/api/feed/audiobooks/?format=json&limit=100&offset=${i}00" -O librivox-api-json/$(printf "%03d" $i).json; sleep 5; done
This created a bunch of *.json files on the local disk. Now I only had to process them.
There is a simple method to obtain the number of books per language:
cat librivox-api-json/*.json | jq '.[] | .[] | .language' | sort | uniq -c | sort -nr
Now for the total length breakdown. First, we observe that the length of each book is stored in an hh:mm:ss-like format. We can filter the entire dataset down to language and length only:
cat librivox-api-json/*.json | jq '.[] | .[] | { language: .language, totaltime: .totaltime } ' > 01-language-lengths.json
Now we can group the entries into languages:
cat 01-language-lengths.json | jq -s 'group_by(.language)'
Now we have to add up the entries to obtain sum values for the durations. Unfortunately, while jq has date support, its support for durations is poor. Therefore, we need to do this manually: first convert each duration to an integer number of seconds, add those up, and then convert the sums back into hh:mm:ss format.
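For illustration, the hh:mm:ss to seconds conversion can be written as a plain shell function; the jq filter in the next step does the equivalent split-and-multiply:

```shell
# Convert an hh:mm:ss string to an integer number of seconds,
# mirroring the split/multiply arithmetic in the jq filter.
# The 10# prefix forces base-10 so leading zeros aren't read as octal.
hms_to_sec() {
    IFS=: read -r h m s <<< "$1"
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

hms_to_sec "5:59:26"   # → 21566
```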
The conversion to seconds is performed here:
cat 01-language-lengths.json | jq -s 'group_by(.language) | map({language: .[0].language, totaltime: map(.totaltime | select(. != "") | split(":") | map(tonumber) | (.[0] * 3600) + (.[1] * 60) + .[2]) })' > 02-lengths-integral.json
Now we can sum them up:
cat 02-lengths-integral.json | jq 'map({language: .language, totaltimesum: .totaltime | add, bookcount: .totaltime | length}) | .[] | select(.totaltimesum != null)' > 03-lengths-sum.json
And convert them back to normal time format:
cat 03-lengths-sum.json | jq -s 'def hreadable(t): [((t / 3600) | floor), (((t % 3600) / 60) | floor ), t % 60] | map(tostring) | join(":"); map(. + {hreadabletime: hreadable(.totaltimesum), averagetime: hreadable(.totaltimesum/.bookcount)}) | sort_by(-.totaltimesum)' > 04-lengths-sum-hreadable.json
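Note that this `hreadable` does not zero-pad minutes and seconds, which is why the table contains entries like `5:56:0`. A padded variant, sketched here as a shell function rather than in jq:

```shell
# Convert an integer number of seconds back to hh:mm:ss,
# zero-padding minutes and seconds (unlike the jq hreadable above).
sec_to_hms() {
    printf '%d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

sec_to_hms 21360   # → 5:56:00
```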
We can use this to obtain a markdown table from the data:
cat 04-lengths-sum-hreadable.json | jq -r '.[] | "|" + .language + "|" + (.bookcount | tostring) + "|" + .hreadabletime + "|" + .averagetime + "|"' | (printf "|Language | Number of books | Total length|Average length|\n|-|-|-|-|\n" && cat )
The resulting markdown table is included in this document.
I conducted this study on November 1st, 2018.
@est31 - any idea if there's an easy way to get the text from the books?