@est31
Created November 1, 2018 17:25

Beyond LibriSpeech: about the amount of spoken content stored in LibriVox

Overview

Given that LibriVox contains enough English content for a speech processing corpus, LibriSpeech, to be built from it, I wondered how much content LibriVox holds in languages other than English.

I downloaded the contents of LibriVox's JSON API, separated the audiobooks by language, and summed up their lengths, obtaining a per-language breakdown of spoken time.

This gave over 60 thousand hours for English, thousands of hours each for German, Dutch, French, and Spanish, and hundreds of hours for several other languages.

Not only is data of an amount comparable to LibriSpeech available for four more languages; for English itself, LibriSpeech has tapped only a tiny fraction of LibriVox's potential as labeled ASR training data.

Table

The full table of the languages in LibriVox is:

|Language|Number of books|Total length|Average length|
|-|-|-|-|
|English|10685|64010:0:32|5:59:26|
|German|536|3180:16:32|5:56:0|
|Dutch|197|2228:11:37|11:18:38|
|French|175|1120:17:57|6:24:6|
|Spanish|198|1108:33:10|5:35:55|
|Multilingual|111|499:55:30|4:30:13|
|Italian|53|260:30:6|4:54:54|
|Portuguese|53|243:53:15|4:36:5|
|Church Slavonic|8|136:25:18|17:3:9|
|Polish|22|135:9:51|6:8:37|
|Hebrew|21|122:44:50|5:50:42|
|Russian|28|99:8:37|3:32:27|
|Japanese|37|97:40:18|2:38:23|
|Latin|14|82:4:5|5:51:43|
|Finnish|14|60:23:37|4:18:49|
|Swedish|12|52:21:50|4:21:49|
|Chinese|20|50:58:50|2:32:56|
|Danish|10|43:46:45|4:22:40|
|Ancient Greek|24|41:57:52|1:44:54|
|Greek|10|22:18:49|2:13:52|
|Esperanto|6|21:23:38|3:33:56|
|Hungarian|2|18:46:49|9:23:24|
|Arabic|1|17:6:45|17:6:45|
|Tamil|2|11:39:12|5:49:36|
|Bulgarian|4|11:31:5|2:52:46|
|Middle English|2|7:54:42|3:57:21|
|Korean|3|4:22:30|1:27:30|
|Ukrainian|1|3:53:45|3:53:45|
|Latvian|1|3:25:41|3:25:41|
|Tagalog|4|2:44:44|0:41:11|
|Dholuo/Luo|1|1:1:25|1:1:25|
|Javanese|1|0:59:29|0:59:29|
|Urdu|1|0:57:13|0:57:13|
|Telugu|1|0:55:3|0:55:3|
|Bisaya/Cebuano|1|0:14:46|0:14:46|
|Old English|1|0:1:39|0:1:39|

Methodology used

It would have been possible to crawl the LibriVox website and obtain the data by parsing the HTML. But that is a very tedious process to set up, and site owners usually dislike it, for good reasons. Fortunately, I didn't need to do this, as LibriVox offers an API.

The API provides output in a variety of formats, with XML being the default. I chose to use JSON as this enables me to use jq for processing.

This is an example API call for one audiobook. If you omit the id param, you get a list of 50 audiobooks. You can control how many audiobooks you get, and the starting position in the list, via the two URL params limit and offset.
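As a minimal sketch of the URL shape (the helper name `api_url` is made up; the endpoint and the format/limit/offset parameters are those described above):

```shell
# Hypothetical helper: builds the list-endpoint URL for a given limit and offset.
api_url() {
  printf 'https://librivox.org/api/feed/audiobooks/?format=json&limit=%s&offset=%s\n' "$1" "$2"
}

# e.g. request books 100..149:
api_url 50 100
# → https://librivox.org/api/feed/audiobooks/?format=json&limit=50&offset=100
```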

To download the json info about the whole set of books, I first had to find the maximum offset for which the API still returned results. I obtained it by first increasing the offset exponentially (adding zeros to the 1), stopping once the API no longer returned results, and then bisecting between the last working offset and the first failing one. This gave a maximum offset of 12900.
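That search can be sketched in shell. Here `has_results` is a mock standing in for an actual API probe (in reality it would be a request that checks for a non-empty response); for the sketch, we pretend the maximum valid offset is 12900:

```shell
# Mock probe: pretend the API returns results up to offset 12900.
has_results() { [ "$1" -le 12900 ]; }

# Phase 1: grow the offset exponentially until the API stops returning.
hi=1
while has_results "$hi"; do hi=$((hi * 10)); done
lo=$((hi / 10))

# Phase 2: bisect between the last good and the first bad offset.
while [ $((hi - lo)) -gt 1 ]; do
  mid=$(((lo + hi) / 2))
  if has_results "$mid"; then lo=$mid; else hi=$mid; fi
done
echo "$lo"   # → 12900
```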

All that remained was to download the json metadata in chunks. I chose a chunk size of 100 and used 130 chunks (offsets 0 through 12900). I also made the loop sleep 5 seconds between API calls so as not to overload the server:

mkdir -p librivox-api-json
for i in {0..129}; do wget "https://librivox.org/api/feed/audiobooks/?format=json&limit=100&offset=${i}00" -O librivox-api-json/$(printf "%03d" $i).json; sleep 5; done

This created a bunch of *.json files on the local disk. Now I only had to process them.

There is a simple method to obtain the number of books per language:

cat librivox-api-json/*.json | jq '.[] | .[] | .language' | sort | uniq -c | sort -nr

Now, on to the total length breakdown. First, observe that the length of each book is stored in an hh:mm:ss-like format. We can filter the entire dataset down to language and length only:

cat librivox-api-json/*.json | jq '.[] | .[] | { language: .language, totaltime: .totaltime } ' > 01-language-lengths.json

Now we can group the entries into languages:

cat 01-language-lengths.json | jq -s 'group_by(.language)'

Now we have to add up the entries to obtain summed durations. Unfortunately, while jq has date support, its support for durations is poor. Therefore, we do this manually: first convert each duration to an integer number of seconds, sum in that representation, and then convert back into hh:mm:ss format.
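As a standalone illustration of the two conversions (plain shell arithmetic, separate from the jq pipeline; the function names are made up):

```shell
# Convert "hh:mm:ss" to a total number of seconds.
hms_to_secs() {
  echo "$1" | { IFS=: read -r h m s; echo $((h * 3600 + m * 60 + s)); }
}

# Convert a number of seconds back to "hh:mm:ss" (unpadded, like the jq output below).
secs_to_hms() {
  echo "$(($1 / 3600)):$(($1 % 3600 / 60)):$(($1 % 60))"
}

hms_to_secs 5:59:26   # → 21566
secs_to_hms 21566     # → 5:59:26
```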

The conversion to integer seconds is performed here:

cat 01-language-lengths.json | jq -s 'group_by(.language) | map({language: .[0].language, totaltime: map(.totaltime | select(. != "") | split(":") | map(tonumber) | (.[0] * 3600) + (.[1] * 60) + .[2]) })' > 02-lengths-integral.json

Now we can sum them up:

cat 02-lengths-integral.json | jq 'map({language: .language, totaltimesum: .totaltime | add, bookcount: .totaltime | length}) | .[] | select(.totaltimesum != null)' > 03-lengths-sum.json

And convert them back to normal time format:

cat 03-lengths-sum.json | jq -s 'def hreadable(t): [((t / 3600) | floor), (((t % 3600) / 60) | floor ), t % 60] | map(tostring) | join(":"); map(. + {hreadabletime: hreadable(.totaltimesum), averagetime: hreadable(.totaltimesum/.bookcount)}) | sort_by(-.totaltimesum)' > 04-lengths-sum-hreadable.json

We can use this to obtain a markdown table from the data:

cat 04-lengths-sum-hreadable.json | jq -r '.[] | "|" + .language + "|" + (.bookcount | tostring) + "|" + .hreadabletime + "|" + .averagetime + "|"' | (printf "|Language | Number of books | Total length|Average length|\n|-|-|-|-|\n" && cat )

The resulting markdown table is included in this document.

I conducted this study on November 1st, 2018.

@JRMeyer

JRMeyer commented Feb 4, 2019

@est31 - any idea if there's an easy way to get the text from the books?
