Note: This gist refers to an older gist that shows the AWS Transcribe API: https://gist.github.com/dannguyen/9b8c51f5bb853209f19f1a0f18f0f74c
I went into the AWS Transcribe console, which has an interface for real-time transcription here: https://console.aws.amazon.com/transcribe/home?region=us-east-1#realTimeTranscription
Then I used my phone to play out this snippet of the 2008 VP presidential debate, featuring speech from Biden and Palin: https://twitter.com/dancow/status/1313951588428517385
The result is the JSON attached below: transcript.json
The interactive panel looks like this btw:
Even for a minute of speech, transcript.json is HUGE. The root object appears to be a big list of dicts, with each dict looking like this:
{
  "Transcript": {
    "Results": [...]
  }
},
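Here's a minimal sketch of how you'd poke at that structure in Python (assuming the root object really is a JSON array of those dicts, and that the file is saved locally as transcript.json):

```python
import json

# Load the raw dump saved from the real-time console
with open("transcript.json") as f:
    events = json.load(f)

print(f"{len(events)} events in transcript.json")

# Each event wraps a Transcript dict holding a Results list
for event in events[:3]:
    results = event["Transcript"]["Results"]
    print(f"  event with {len(results)} result(s)")
```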
It seems that when the real-time streaming API is used, the service adjusts/reinterprets the processed audio as more data streams in, and each reinterpretation generates a new Transcript object. In other words, I should've used the standard non-streaming (batch) API for this gist exercise.
In any case, when AWS Transcribe feels pretty good about a processed chunk, it apparently sets "isPartial": false -- check out biden-chunk.json to see the transcribed 20-second excerpt, which contains speaker identification.
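So filtering on that flag is one way to pull out just the finalized interpretations. A rough sketch, assuming the key is spelled as it appears in my dump (the AWS SDK docs spell it IsPartial, so the lookup below checks both):

```python
import json

with open("transcript.json") as f:
    events = json.load(f)

total = 0
final_results = []
for event in events:
    for result in event["Transcript"]["Results"]:
        total += 1
        # The console dump quoted above spells the flag "isPartial";
        # the SDK docs spell it "IsPartial", so check both casings
        partial = result.get("isPartial", result.get("IsPartial", True))
        if partial is False:
            final_results.append(result)

print(f"{len(final_results)} finalized results out of {total}")
```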
However, even though this portion of the audio had only Biden speaking, more than one Speaker label was assigned to the words, e.g.
{
  "Content": "tax",
  "EndTime": 8.38,
  "Speaker": "0",
  "StartTime": 8.01,
  "Type": "pronunciation",
  "VocabularyFilterMatch": false
},
{
  "Content": "cuts",
  "EndTime": 8.65,
  "Speaker": "2",
  "StartTime": 8.39,
  "Type": "pronunciation",
  "VocabularyFilterMatch": false
}
Hard to tell if it's a limitation of the service, or the poor way I delivered the audio (holding my phone up to my laptop mic).
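One way to gauge how noisy the labeling is: tally the Speaker values across the word-level items. A sketch, assuming biden-chunk.json is a flat list of items shaped like the two shown above (in the full streaming output they'd be nested under each result's Alternatives, so adjust the path accordingly):

```python
import json
from collections import Counter

with open("biden-chunk.json") as f:
    items = json.load(f)

# Count words attributed to each Speaker label; punctuation items
# carry no Speaker, so only look at pronunciation items
speakers = Counter(
    item.get("Speaker") for item in items if item.get("Type") == "pronunciation"
)
print(speakers)  # a lopsided count like {'0': 50, '2': 2} would suggest label noise
```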