In progress: Attempting to write a set of command-line steps to process IBM Watson Speech-to-Text data to transcribe an old video about UNIX.
via this HN discussion: https://news.ycombinator.com/item?id=10789019
The discussion of the spell checker starts at the 5th minute with Brian Kernighan: https://www.youtube.com/watch?v=XvDZLjaCJuw&t=5m15s
It continues at the 13th minute with Lorinda Cherry: https://youtu.be/XvDZLjaCJuw?t=13m47s
$ youtube-dl https://www.youtube.com/watch?v=XvDZLjaCJuw \
--keep-video \
--extract-audio \
--audio-format wav \
--audio-quality 0 \
--restrict-filenames
Output:
[youtube] Setting language
[youtube] Confirming age
[youtube] XvDZLjaCJuw: Downloading webpage
[youtube] XvDZLjaCJuw: Downloading video info webpage
[youtube] XvDZLjaCJuw: Extracting video information
[download] Destination: UNIX_-_Making_Computers_Easier_To_Use_--_AT_T_Archives_film_from_1982_Bell_Laboratories-XvDZLjaCJuw.mp4
Cut a 10-second sample at a sample rate of 16000 Hz, the max allowed for Microsoft's Oxford:
$ avconv -ss 00:15:15 \
-t 00:00:10 \
-i UNIX_-_Making_Computers_Easier_To_Use_--_AT_T_Archives_film_from_1982_Bell_Laboratories-XvDZLjaCJuw.wav \
-ar 16000 \
unixaudio-sample.wav
Your application must endpoint the audio to determine start and end of speech. The endpoints specify to the service the start and end of the request. You may not upload more than 10 seconds of audio in any one request and the total request duration cannot exceed 14 seconds.
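Because each request is capped at 10 seconds, transcribing more than this one sample would mean cutting the audio into consecutive windows. A minimal sketch of that bookkeeping, not something these notes actually did: `chunk_offsets`, `input.wav`, and the `chunk-N.wav` names are made up for illustration, and the total length is assumed to be known in whole seconds.

```shell
# Hypothetical helper: print the start offset (in seconds) of each chunk
# needed to cover TOTAL seconds of audio in CHUNK-second pieces.
chunk_offsets() {
  total=$1
  chunk=${2:-10}   # defaults to the 10-second per-request limit
  start=0
  while [ "$start" -lt "$total" ]; do
    echo "$start"
    start=$((start + chunk))
  done
}

# Example use (not run here): cut a hypothetical 60-second input.wav into chunks.
# for s in $(chunk_offsets 60 10); do
#   avconv -ss "$s" -t 10 -i input.wav -ar 16000 "chunk-$s.wav"
# done
```

Fixed windows will cut through words at the boundaries, so this is only a starting point; the "endpointing" the service asks for would mean splitting on silence instead.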
$ curl -X POST \
-d 'grant_type=client_credentials' \
-d 'client_id=dansfootest' \
-d "client_secret=$OXFORD_CLIENT_SECRET" \
-d "scope=https://speech.platform.bing.com" \
https://oxford-speech.cloudapp.net/token/issueToken
{
"access_token": "XXXX.YYYY.ZZZZZ",
"expires_in": "600",
"scope": "https://speech.platform.bing.com",
"token_type": "jwt"
}
$ FILENAME=unixaudio-sample.wav
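The recognize request that follows reads `$OXFORD_ACCESS_TOKEN`, which these notes never set. One way to fill it from the token response, as a rough sketch: `extract_access_token` and `token.json` are hypothetical names, and the `sed` pattern assumes the token contains no escaped quotes.

```shell
# Hypothetical helper: read the token JSON on stdin and print the
# value of its "access_token" field (plain sed, no JSON parser).
extract_access_token() {
  sed -n 's/.*"access_token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example use (not run here): token.json is an assumed file holding the token response.
# OXFORD_ACCESS_TOKEN=$(extract_access_token < token.json)
```

The token response reports `expires_in` as 600, so the variable would need refreshing for longer sessions.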
$ curl --request POST \
-H "Authorization: Bearer $OXFORD_ACCESS_TOKEN" \
-H 'Content-Type: audio/wav; samplerate=16000' \
-H "Content-Length: $(wc -c < "$FILENAME" | tr -d ' ')" \
--data-binary "@$FILENAME" \
"https://speech.platform.bing.com/recognize?version=3.0&requestid=$(uuid)&instanceid=$(uuid)&device.os=osx&locale=en-US&format=json&appID=D4D52672-91D7-4C74-8AD8-42B1D98141A5&scenarios=ulm&result.profanitymarkup=0&maxnbest=3"
{
"header": {
"lexical": "jack and i can check on my check in again i'll get up get out my spelling ass off i wanna show you another example i have a desktop",
"name": "jack and I can check on my check in again I'll get up get out my spelling ass off I wanna show you another example I have a desktop",
"properties": {
"HIGHCONF": "1",
"requestid": "abc-def-jkl-mno-xyz"
},
"scenario": "ulm",
"status": "success"
},
"results": [
{
"confidence": "0.6608185",
"lexical": "jack and i can check on my check in again i'll get up get out my spelling ass off i wanna show you another example i have a desktop",
"name": "jack and I can check on my check in again I'll get up get out my spelling ass off I wanna show you another example I have a desktop",
"properties": {
"HIGHCONF": "1"
},
"scenario": "ulm"
}
],
"version": "3.0"
}
https://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/speech-to-text/api/v1/#recognize
$ curl -u "$WATSON_USERNAME":"$WATSON_PASSWORD" \
-H "content-type: audio/wav" \
--data-binary @"$FILENAME" \
"https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?continuous=true&max_alternatives=1&timestamps=true&word_confidence=true"
{
"result_index": 0,
"results": [
{
"alternatives": [
{
"confidence": 0.0,
"timestamps": [],
"transcript": "yeah "
}
],
"final": true
},
{
"alternatives": [
{
"confidence": 0.83,
"timestamps": [
[
"and",
0.83,
1.0
],
[
"I",
1.0,
1.08
],
[
"can",
1.08,
1.26
],
[
"run",
1.26,
1.45
],
[
"a",
1.45,
1.48
],
[
"check",
1.48,
1.78
],
[
"on",
1.78,
1.89
],
[
"my",
1.89,
2.07
],
[
"tax",
2.07,
2.6
],
[
"and",
2.74,
2.89
],
[
"again",
2.89,
3.23
],
[
"I'll",
3.23,
3.4
],
[
"get",
3.4,
3.57
],
[
"up",
3.57,
3.68
],
[
"get",
3.77,
3.95
],
[
"out",
3.95,
4.11
],
[
"my",
4.11,
4.21
],
[
"spelling",
4.21,
4.65
],
[
"errors",
4.65,
5.08
]
],
"transcript": "and I can run a check on my tax and again I'll get up get out my spelling errors ",
"word_confidence": [
[
"and",
1.0
],
[
"I",
1.0
],
[
"can",
0.7779403827740854
],
[
"run",
0.26502452550612693
],
[
"a",
0.15611111025600333
],
[
"check",
1.0
],
[
"on",
1.0
],
[
"my",
1.0
],
[
"tax",
0.4133005116503248
],
[
"and",
1.0
],
[
"again",
0.8481161519123086
],
[
"I'll",
0.7681370465998621
],
[
"get",
1.0
],
[
"up",
0.30116076025256117
],
[
"get",
1.0
],
[
"out",
1.0
],
[
"my",
1.0
],
[
"spelling",
1.0
],
[
"errors",
0.9969654445124339
]
]
}
],
"final": true
},
{
"alternatives": [
{
"confidence": 0.875,
"timestamps": [
[
"%HESITATION",
6.16,
6.52
],
[
"let",
7.1,
7.3
],
[
"me",
7.3,
7.37
],
[
"show",
7.37,
7.53
],
[
"you",
7.53,
7.64
],
[
"another",
7.64,
7.92
],
[
"example",
7.92,
8.49
]
],
"transcript": "%HESITATION let me show you another example ",
"word_confidence": [
[
"%HESITATION",
0.5481795076045378
],
[
"let",
0.771839641482555
],
[
"me",
0.9999999999999625
],
[
"show",
0.970320300917179
],
[
"you",
0.9999999999999634
],
[
"another",
0.9999999999999755
],
[
"example",
0.9894367798458484
]
]
}
],
"final": true
},
{
"alternatives": [
{
"confidence": 0.973,
"timestamps": [
[
"I",
8.98,
9.14
],
[
"have",
9.14,
9.27
],
[
"a",
9.27,
9.31
],
[
"desk",
9.31,
9.65
],
[
"jockey",
9.65,
9.92
]
],
"transcript": "I have a desk jockey ",
"word_confidence": [
[
"I",
0.9999999999999918
],
[
"have",
0.802175018932835
],
[
"a",
0.9999999999999989
],
[
"desk",
0.9999999999999988
],
[
"jockey",
0.999999999999999
]
]
}
],
"final": true
}
]
}
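To get from this JSON to readable text, the `transcript` fields can be pulled out line by line. A rough sketch, assuming the response is pretty-printed with one key per line as shown here; `watson_transcripts` and `watson.json` are hypothetical names, and a real JSON parser would be more robust (this pattern does not unescape quotes inside a transcript).

```shell
# Hypothetical helper: print the value of every "transcript" field found in
# pretty-printed Watson JSON on stdin, one transcript per output line.
watson_transcripts() {
  sed -n 's/^[[:space:]]*"transcript"[[:space:]]*:[[:space:]]*"\(.*\)"[[:space:]]*,\{0,1\}[[:space:]]*$/\1/p'
}

# Example use (not run here): watson.json is an assumed file holding the response.
# watson_transcripts < watson.json
```

Each transcript ends with a trailing space, so piping the result through `tr -d '\n'` would join the segments into one line of text.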