Given that LibriVox contains enough English content for a speech processing corpus, LibriSpeech, to be built from it, I've wondered how much content LibriVox has in languages other than English.
I've downloaded the contents of the LibriVox JSON API, separated the audiobooks by language, and summed up their lengths, obtaining a language breakdown expressed in spoken time.
This gave over 60 thousand hours for English, thousands of hours each for German, Dutch, French, and Spanish, and hundreds of hours for several other languages.
Not only is data of amounts comparable to LibriSpeech available for four more languages; even for English, LibriSpeech has tapped only a tiny fraction of LibriVox's potential as labeled ASR training data.
The full table of the languages in LibriVox is:
| Language | Number of books | Total length | Average length |
|---|---|---|---|
English | 10685 | 64010:0:32 | 5:59:26 |
German | 536 | 3180:16:32 | 5:56:0 |
Dutch | 197 | 2228:11:37 | 11:18:38 |
French | 175 | 1120:17:57 | 6:24:6 |
Spanish | 198 | 1108:33:10 | 5:35:55 |
Multilingual | 111 | 499:55:30 | 4:30:13 |
Italian | 53 | 260:30:6 | 4:54:54 |
Portuguese | 53 | 243:53:15 | 4:36:5 |
Church Slavonic | 8 | 136:25:18 | 17:3:9 |
Polish | 22 | 135:9:51 | 6:8:37 |
Hebrew | 21 | 122:44:50 | 5:50:42 |
Russian | 28 | 99:8:37 | 3:32:27 |
Japanese | 37 | 97:40:18 | 2:38:23 |
Latin | 14 | 82:4:5 | 5:51:43 |
Finnish | 14 | 60:23:37 | 4:18:49 |
Swedish | 12 | 52:21:50 | 4:21:49 |
Chinese | 20 | 50:58:50 | 2:32:56 |
Danish | 10 | 43:46:45 | 4:22:40 |
Ancient Greek | 24 | 41:57:52 | 1:44:54 |
Greek | 10 | 22:18:49 | 2:13:52 |
Esperanto | 6 | 21:23:38 | 3:33:56 |
Hungarian | 2 | 18:46:49 | 9:23:24 |
Arabic | 1 | 17:6:45 | 17:6:45 |
Tamil | 2 | 11:39:12 | 5:49:36 |
Bulgarian | 4 | 11:31:5 | 2:52:46 |
Middle English | 2 | 7:54:42 | 3:57:21 |
Korean | 3 | 4:22:30 | 1:27:30 |
Ukrainian | 1 | 3:53:45 | 3:53:45 |
Latvian | 1 | 3:25:41 | 3:25:41 |
Tagalog | 4 | 2:44:44 | 0:41:11 |
Dholuo/Luo | 1 | 1:1:25 | 1:1:25 |
Javanese | 1 | 0:59:29 | 0:59:29 |
Urdu | 1 | 0:57:13 | 0:57:13 |
Telugu | 1 | 0:55:3 | 0:55:3 |
Bisaya/Cebuano | 1 | 0:14:46 | 0:14:46 |
Old English | 1 | 0:1:39 | 0:1:39 |
It would have been possible to crawl the LibriVox website and obtain the data by parsing the HTML. But that is a tedious process to set up, and site owners usually dislike it for good reasons. Fortunately, I didn't need to do this, as LibriVox offers an API.
The API provides output in a variety of formats, with XML being the default. I chose to use JSON as this enables me to use jq for processing.
This is an example API call for one audiobook. If you omit the id param, you get a list of 50 different audiobooks. You can control the number of audiobooks returned and the offset into the list via two URL params, offset and limit.
To download the JSON info about the whole set of books, I first had to find the maximum offset for which the API still returned results. I obtained it by first exponentially increasing the offset (adding zeros to the 1), stopping once I reached a point where the API no longer returned results, and then bisecting between the last working offset and the first failing one. This gave me a maximum offset of 12900.
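The probing procedure can be sketched as follows. Here `api_has_results` is a hypothetical stub standing in for an actual API request (in practice it would be a wget or curl call that checks for a non-empty book list):

```shell
# Stub: pretend the API returns results for offsets up to 12900.
# In reality this would issue an API request and inspect the response.
api_has_results() { [ "$1" -le 12900 ]; }

# Phase 1: grow the offset exponentially ("adding zeros to the 1")
# until the API stops returning results.
hi=1
while api_has_results "$hi"; do hi=$((hi * 10)); done
lo=$((hi / 10))

# Phase 2: bisect between the last working and first failing offset.
while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    if api_has_results "$mid"; then lo=$mid; else hi=$mid; fi
done
echo "maximum offset with results: $lo"
```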
All I had to do now was to download the JSON metadata in chunks. I chose a chunk size of 100, which requires 130 chunks for offsets 0 through 12900. I also made the script sleep 5 seconds between API calls so as not to disrupt the server:
mkdir -p librivox-api-json
for i in {0..129}; do wget "https://librivox.org/api/feed/audiobooks/?format=json&limit=100&offset=${i}00" -O librivox-api-json/$(printf "%03d" $i).json; sleep 5; done
This created a bunch of *.json files on the local disk. Now I only had to process them.
There is a simple method to obtain the number of books per language:
cat librivox-api-json/*.json | jq '.[] | .[] | .language' | sort | uniq -c | sort -nr
Now for the total length breakdown. First, we observe that the length of each book is stored in an hh:mm:ss-like format. We can filter the entire dataset down to language and length only:
cat librivox-api-json/*.json | jq '.[] | .[] | { language: .language, totaltime: .totaltime } ' > 01-language-lengths.json
Now we can group the entries into languages:
cat 01-language-lengths.json | jq -s 'group_by(.language)'
Now we have to add up the entries to obtain sum values for the durations. Unfortunately, while jq has date support, its support for durations is poor. Therefore, we need to do this manually: first convert each duration to an integer number of seconds, add those up, and then convert the sums back into hh:mm:ss format.
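For illustration, the hh:mm:ss to seconds conversion can be written as a plain shell function; the jq filter in the next step does the equivalent split-and-multiply:

```shell
# Convert an hh:mm:ss string to an integer number of seconds,
# mirroring the split/multiply arithmetic in the jq filter.
# The 10# prefix forces base-10 so leading zeros aren't read as octal.
hms_to_sec() {
    IFS=: read -r h m s <<< "$1"
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

hms_to_sec "5:59:26"   # → 21566
```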
The conversion to seconds is performed here:
cat 01-language-lengths.json | jq -s 'group_by(.language) | map({language: .[0].language, totaltime: map(.totaltime | select(. != "") | split(":") | map(tonumber) | (.[0] * 3600) + (.[1] * 60) + .[2]) })' > 02-lengths-integral.json
Now we can sum them up:
cat 02-lengths-integral.json | jq 'map({language: .language, totaltimesum: .totaltime | add, bookcount: .totaltime | length}) | .[] | select(.totaltimesum != null)' > 03-lengths-sum.json
And convert them back to normal time format:
cat 03-lengths-sum.json | jq -s 'def hreadable(t): [((t / 3600) | floor), (((t % 3600) / 60) | floor ), t % 60] | map(tostring) | join(":"); map(. + {hreadabletime: hreadable(.totaltimesum), averagetime: hreadable(.totaltimesum/.bookcount)}) | sort_by(-.totaltimesum)' > 04-lengths-sum-hreadable.json
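Note that this `hreadable` does not zero-pad minutes and seconds, which is why the table contains entries like `5:56:0`. A padded variant, sketched here as a shell function rather than in jq:

```shell
# Convert an integer number of seconds back to hh:mm:ss,
# zero-padding minutes and seconds (unlike the jq hreadable above).
sec_to_hms() {
    printf '%d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

sec_to_hms 21360   # → 5:56:00
```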
We can use this to obtain a markdown table from the data:
cat 04-lengths-sum-hreadable.json | jq -r '.[] | "|" + .language + "|" + (.bookcount | tostring) + "|" + .hreadabletime + "|" + .averagetime + "|"' | (printf "|Language | Number of books | Total length|Average length|\n|-|-|-|-|\n" && cat )
The resulting markdown table is included in this document.
I conducted this study on November 1st, 2018.
@est31 - any idea if there's an easy way to get the text from the books?