- Cloud Vendor Based NoOps
- Transcription
- Diarization
- Language Detection
Model | Description |
---|---|
command_and_search | Best for short queries such as voice commands or voice search. |
phone_call | Best for audio that originated from a phone call (typically recorded at an 8 kHz sampling rate). |
video | Best for audio that originated from video or includes multiple speakers. Ideally the audio is recorded at a 16 kHz or greater sampling rate. This is a premium model that costs more than the standard rate. |
default | Best for audio that is not one of the specific audio models. For example, long-form audio. Ideally the audio is high-fidelity, recorded at a 16 kHz or greater sampling rate. |
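The table can be read as a small decision rule. A minimal sketch, assuming a hypothetical helper `suggest_model`; its argument names and the order of the checks are illustrative choices and not part of the Speech-to-Text API. Only the returned model names come from the table.

```python
def suggest_model(short_query=False, from_phone=False, from_video=False,
                  multiple_speakers=False):
    """Return a model name from the table above (hypothetical helper)."""
    if short_query:
        return "command_and_search"  # voice commands or voice search
    if from_phone:
        return "phone_call"          # typically 8 kHz phone audio
    if from_video or multiple_speakers:
        return "video"               # premium model, costs more than standard
    return "default"                 # anything else, e.g. long-form audio


print(suggest_model(from_phone=True))
print(suggest_model())
```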
-
Create a project and a service account, download the service account key file, and enable the API (see before-you-begin)
-
Authenticate CLI session with
gcloud auth login
-
Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the location of the service account key file
export GOOGLE_APPLICATION_CREDENTIALS=$PWD/cognitive-aab254879251.json
-
Check the currently active project
$ gcloud config get-value project
bungabunga-123456
-
Set the current project
$ gcloud projects list
PROJECT_ID NAME PROJECT_NUMBER
cognitive-254305 cognitive 711533833686
bungabunga-123456 bungabunga 400688388535
$ gcloud config set project cognitive-254305
Updated property [core/project].
$ gcloud config get-value project
cognitive-254305
- Enable API
- Check that the required APIs (texttospeech.googleapis.com and speech.googleapis.com) are available, enable them, and check that they are enabled
$ gcloud services list --available --filter "texttospeech.googleapis.com OR speech.googleapis.com"
NAME TITLE
speech.googleapis.com Cloud Speech-to-Text API
texttospeech.googleapis.com Cloud Text-to-Speech API
$ gcloud services enable texttospeech.googleapis.com speech.googleapis.com
Operation "operations/acf.fa58e2e5-1830-41f2-aafe-991d716fe61f" finished successfully.
$ gcloud services list --enabled --filter "texttospeech.googleapis.com OR speech.googleapis.com"
NAME TITLE
speech.googleapis.com Cloud Speech-to-Text API
texttospeech.googleapis.com Cloud Text-to-Speech API
-
"An asynchronous Speech-to-Text API request to the LongRunningRecognize method is identical in form to a synchronous Speech-to-Text API request."
-
The payload size limit: 10485760 bytes.
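The limit matters when audio is sent inline in the request instead of being referenced from a bucket. A rough sketch, assuming the audio is embedded as base64 in a JSON body (which inflates it by about a third) and ignoring the small overhead of the remaining JSON; `base64_size` and `fits_inline` are hypothetical helper names.

```python
import math

PAYLOAD_LIMIT_BYTES = 10485760  # limit quoted above (10 MiB)


def base64_size(raw_bytes):
    """Size in bytes of raw_bytes of data after base64 encoding (with padding)."""
    return 4 * math.ceil(raw_bytes / 3)


def fits_inline(raw_bytes, limit=PAYLOAD_LIMIT_BYTES):
    """True if base64-encoded audio of this raw size stays within the limit."""
    return base64_size(raw_bytes) <= limit


# Example sizes in bytes; larger files need to go through a storage bucket.
for size in (5_000_000, 75_000_000):
    print(size, "fits inline:", fits_inline(size))
```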
-
Get a sample audio file
$ wget https://ia803009.us.archive.org/29/items/hpr2798/hpr2798.wav
--2019-10-10 18:19:50-- https://ia803009.us.archive.org/29/items/hpr2798/hpr2798.wav
Resolving ia803009.us.archive.org (ia803009.us.archive.org)... 207.241.233.29
Connecting to ia803009.us.archive.org (ia803009.us.archive.org)|207.241.233.29|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75388322 (72M) [audio/x-wav]
Saving to: 'hpr2798.wav'
hpr2798.wav 100%[==============================================================================================>] 71.90M 2.56MB/s in 1m 52s
2019-10-10 18:21:43 (659 KB/s) - 'hpr2798.wav' saved [75388322/75388322]
- Check the audio file properties, such as the sampling rate
- "Sample rates between 8000 Hz and 48000 Hz are supported within Cloud Speech-to-Text."
- The check below uses audio_metadata
(env-audio) $ python lsaudio.py ../data/hpr2798.wav
Duration sec: 854.7273015873016
Sample rate: 44100
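The source of lsaudio.py is not shown here. A minimal stand-in, assuming an uncompressed PCM WAV file and using Python's standard `wave` module in place of audio_metadata; the function names `describe_wav` and `rate_supported` are hypothetical, and the range check uses the 8000 to 48000 Hz bounds quoted above.

```python
import wave


def describe_wav(path):
    """Return (duration_seconds, sample_rate_hz, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return frames / rate, rate, wav.getnchannels()


def rate_supported(sample_rate_hz):
    """Check a sample rate against the supported range quoted above."""
    return 8000 <= sample_rate_hz <= 48000


def report(path):
    """Print a summary in the same spirit as the lsaudio.py output."""
    duration, rate, channels = describe_wav(path)
    print("Duration sec:", duration)
    print("Sample rate:", rate)
    print("Channels:", channels)
    print("Rate supported:", rate_supported(rate))
```

Calling `report("../data/hpr2798.wav")` would print the duration and sample rate of the downloaded file.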
- Cut the file to size from a specific starting point, and convert to WAV format with one audio channel excluding any video (if there was any)
- Audio options:
-aframes number set the number of audio frames to output
-aq quality set audio quality (codec-specific)
-ar rate set audio sampling rate (in Hz)
-ac channels set number of audio channels
-an disable audio
-acodec codec force audio codec ('copy' to copy stream)
-vol volume change audio volume (256=normal)
-af filter_graph set audio filters
$ ffmpeg -hide_banner -i ../data/hpr2798.wav -ss 00:01:14 -t 59 -ac 1 -vn audio2.wav
Guessed Channel Layout for Input Stream #0.0 : mono
Input #0, wav, from 'hpr2798.wav':
Metadata:
album : Hacker Public Radio
artist : knightwise
comment : http://hackerpublicradio.org Explicit; Knightwise waxes nostalgically on the early days of podcasting and wonders if we all sold out?
genre : Podcast
title : Should Podcasters be Pirates ?
track : 2798
date : 2019
Duration: 00:14:14.73, bitrate: 705 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, mono, s16, 705 kb/s
Stream mapping:
Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Output #0, wav, to 'audio2.wav':
Metadata:
IPRD : Hacker Public Radio
IART : knightwise
ICMT : http://hackerpublicradio.org Explicit; Knightwise waxes nostalgically on the early days of podcasting and wonders if we all sold out?
IGNR : Podcast
INAM : Should Podcasters be Pirates ?
IPRT : 2798
ICRD : 2019
ISFT : Lavf58.33.100
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, mono, s16, 705 kb/s
Metadata:
encoder : Lavc58.59.102 pcm_s16le
size= 5082kB time=00:00:59.00 bitrate= 705.6kbits/s speed=1.4e+03x
video:0kB audio:5082kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.006764%
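The size and bitrate in the ffmpeg summary can be cross-checked with simple arithmetic, assuming uncompressed 16-bit PCM (pcm_s16le) and ignoring the WAV header and metadata. A small sketch; the helper names are hypothetical.

```python
def pcm_bytes(duration_s, sample_rate_hz, channels=1, bits_per_sample=16):
    """Raw PCM payload size in bytes, excluding the WAV header."""
    return duration_s * sample_rate_hz * channels * bits_per_sample // 8


def pcm_bitrate_kbits(sample_rate_hz, channels=1, bits_per_sample=16):
    """Bitrate of uncompressed PCM in kbit/s."""
    return sample_rate_hz * channels * bits_per_sample / 1000


size = pcm_bytes(59, 44100)  # 59 seconds, mono, 44100 Hz
print(size, "bytes, about", round(size / 1024), "kB")
print(pcm_bitrate_kbits(44100), "kbit/s")
```

For the 59-second mono clip at 44100 Hz this gives 5203800 bytes (about 5082 kB) and 705.6 kbit/s, in line with the ffmpeg summary and below the 10485760-byte payload limit.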
- Check the converted audio file properties, such as the sampling rate
$ hexdump -Cn 48 audio2.wav
00000000 52 49 46 46 b0 68 4f 00 57 41 56 45 66 6d 74 20 |RIFF.hO.WAVEfmt |
00000010 10 00 00 00 01 00 01 00 44 ac 00 00 88 58 01 00 |........D....X..|
00000020 02 00 10 00 4c 49 53 54 2c 01 00 00 49 4e 46 4f |....LIST,...INFO|
(env-audio) $ python lsaudio.py ../data/audio2.wav
Duration sec: 59.0
Sample rate: 44100
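The hexdump can also be decoded by hand: in a canonical WAV header the fmt chunk stores the audio format, channel count, sample rate, byte rate, block alignment and bits per sample as little-endian integers. A sketch that decodes the first 36 bytes of the dump above, assuming that canonical layout (fmt chunk directly after the RIFF/WAVE header); `parse_wav_header` is a hypothetical name.

```python
import struct

# First 36 bytes from the hexdump above.
HEADER = bytes.fromhex(
    "52494646" "b0684f00" "57415645" "666d7420"  # "RIFF", size, "WAVE", "fmt "
    "10000000" "0100" "0100"                      # fmt size 16, PCM, 1 channel
    "44ac0000" "88580100" "0200" "1000"           # rate, byte rate, align, bits
)


def parse_wav_header(header):
    """Decode the fixed-size part of a canonical RIFF/WAVE header."""
    (riff, riff_size, wave_id, fmt_id, fmt_size, audio_format, channels,
     sample_rate, byte_rate, block_align, bits) = struct.unpack(
        "<4sI4s4sIHHIIHH", header[:36])
    if riff != b"RIFF" or wave_id != b"WAVE" or fmt_id != b"fmt ":
        raise ValueError("not a canonical RIFF/WAVE header")
    return {
        "audio_format": audio_format,  # 1 means uncompressed PCM
        "channels": channels,
        "sample_rate": sample_rate,
        "byte_rate": byte_rate,
        "block_align": block_align,
        "bits_per_sample": bits,
    }


print(parse_wav_header(HEADER))
```

The decoded values (1 channel, 44100 Hz, 16 bits per sample) agree with the lsaudio.py output.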
- Run gcloud ml speech recognize; this example uses US English (en-US), in other cases select the corresponding English variant (a dozen to select from)
$ gcloud ml speech recognize audio2.wav --language-code='en-US' | tee result$RANDOM.json
{
"results": [
{
"alternatives": [
{
"confidence": 0.96187854,
"transcript": "checking in with another show for HPR in the car on my way to a client's going to be a short show I'm think I'm going to be there in 10 minutes but I want to do you know shoot something up the flagpole you wanted to talk about the state of podcasting these days these days I sound old because in podcasting terms I am I've been around since 2004 mm started producing show since 2005 and have been listening to podcast daily since 2004 I came across my archives from shows that I used to download back then and listen to which I had burned to a CD and put them on my nose and I've started streaming them while at work the last couple of weeks and I've had a ball listening to old podcast episodes"
}
]
}
]
}
- Review the results from the JSON output file
$ jq -r '.results[].alternatives[]|.confidence,.transcript' result26358.json
0.96501887
checking in with another show for HPR in the car on my way to a client's going to be a short show I'm think I'm going to be there in 10 minutes but I want to do you know shoot something up the flagpole you're wanted to talk about the state of podcasting these days these days I sound old because in podcasting terms I am I've been around since 2004 mm started producing show since 2005 and have been listening to podcast daily since 2004 I came across my archives from shows that I used to download back then and listen to which I had burned to a CD and put them on my nose and I've started streaming them while at work the last couple of weeks and I've had a ball listening to old podcast episodes
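The same extraction as the jq filter can be done in Python. A minimal sketch using only the standard library; it assumes the response shape shown above (`results[].alternatives[]` with `confidence` and `transcript`). `iter_alternatives` is a hypothetical helper name, and the sample transcript is shortened for illustration.

```python
import json


def iter_alternatives(response):
    """Yield (confidence, transcript) for every alternative in a response."""
    for result in response.get("results", []):
        for alt in result.get("alternatives", []):
            yield alt.get("confidence"), alt.get("transcript", "")


# Shortened sample in the shape of the gcloud output above.
sample = json.loads(
    '{"results": [{"alternatives": [{"confidence": 0.96187854, '
    '"transcript": "checking in with another show for HPR"}]}]}'
)

for confidence, transcript in iter_alternatives(sample):
    print(confidence)
    print(transcript)
```

For a saved result file, load it with `json.load(open(path))` and pass the resulting dict to `iter_alternatives`.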
- Transfer the audio file to a GCP bucket
$ gsutil cp data/audio2.wav gs://$(gcloud config get-value project)
Copying file://data/audio2.wav [Content-Type=audio/x-wav]...
| [1 files][ 5.0 MiB/ 5.0 MiB] 383.4 KiB/s
Operation completed over 1 objects/5.0 MiB.
- Run
gcloud ml speech recognize-long-running
$ gcloud ml speech recognize-long-running gs://$(gcloud config get-value project)/audio2.wav --async --language-code='en-US'
Check operation [operations/5263634183516942311] for status.
{
"name": "5263634183516942311"
}
- Check when ready with
gcloud ml speech operations wait
$ gcloud ml speech operations wait "5263634183516942311" | tee result$RANDOM.json
Waiting for operation [operations/5263634183516942311] to complete...done.
{
"@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
"results": [
{
"alternatives": [
{
"confidence": 0.961879,
"transcript": "checking in with another show for HPR in the car on my way to a client's going to be a short show I'm think I'm going to be there in 10 minutes but I want to do you know shoot something up the flagpole you wanted to talk about the state of podcasting these days these days I sound old because in podcasting terms I am I've been around since 2004 mm started producing show since 2005 and have been listening to podcast daily since 2004 I came across my archives from shows that I used to download back then and listen to which I had burned to a CD and put them on my nose and I've started streaming them while at work the last couple of weeks and I've had a ball listening to old podcast episodes"
}
]
}
]
}
- Check the outcome with
gcloud ml speech operations describe
$ gcloud ml speech operations describe "5263634183516942311" | tee result$RANDOM.json
{
"done": true,
"metadata": {
"@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
"lastUpdateTime": "2019-10-10T15:42:16.492003Z",
"progressPercent": 100,
"startTime": "2019-10-10T15:41:57.151427Z"
},
"name": "5263634183516942311",
"response": {
"@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
"results": [
{
"alternatives": [
{
"confidence": 0.961879,
"transcript": "checking in with another show for HPR in the car on my way to a client's going to be a short show I'm think I'm going to be there in 10 minutes but I want to do you know shoot something up the flagpole you wanted to talk about the state of podcasting these days these days I sound old because in podcasting terms I am I've been around since 2004 mm started producing show since 2005 and have been listening to podcast daily since 2004 I came across my archives from shows that I used to download back then and listen to which I had burned to a CD and put them on my nose and I've started streaming them while at work the last couple of weeks and I've had a ball listening to old podcast episodes"
}
]
}
]
}
}
- Review the results from the JSON output file
$ jq -r '.response.results[].alternatives[]|.confidence,.transcript' result25359.json
0.961879
checking in with another show for HPR in the car on my way to a client's going to be a short show I'm think I'm going to be there in 10 minutes but I want to do you know shoot something up the flagpole you wanted to talk about the state of podcasting these days these days I sound old because in podcasting terms I am I've been around since 2004 mm started producing show since 2005 and have been listening to podcast daily since 2004 I came across my archives from shows that I used to download back then and listen to which I had burned to a CD and put them on my nose and I've started streaming them while at work the last couple of weeks and I've had a ball listening to old podcast episodes
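The describe output wraps the recognition results in an operation object. A sketch of unwrapping it, assuming the fields shown above (`done`, `response.results`); the function name `operation_transcripts` is hypothetical. The check for an `error` field is an assumption based on the general shape of Google long-running operations and does not appear in the output above.

```python
def operation_transcripts(operation):
    """Return the transcripts of a finished operation, or None if not done."""
    if not operation.get("done", False):
        return None  # still running, poll again later
    if "error" in operation:  # assumed field, not shown in the output above
        raise RuntimeError(f"operation failed: {operation['error']}")
    results = operation.get("response", {}).get("results", [])
    return [
        alt.get("transcript", "")
        for result in results
        for alt in result.get("alternatives", [])
    ]


# Shortened sample in the shape of the describe output above.
sample_operation = {
    "done": True,
    "name": "5263634183516942311",
    "response": {
        "results": [
            {"alternatives": [{"confidence": 0.961879,
                               "transcript": "checking in with another show"}]}
        ]
    },
}
print(operation_transcripts(sample_operation))
```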
- multiple-languages
- v1p1beta1/RecognitionConfig
- "Optional A list of up to 3 additional BCP-47 language tags, listing possible alternative languages of the supplied audio"
- Transfer the audio file to a GCP bucket
$ gsutil cp ../data/audio[12].wav gs://$(gcloud config get-value project)
Copying file://../data/audio1.wav [Content-Type=audio/x-wav]...
Copying file://../data/audio2.wav [Content-Type=audio/x-wav]...
\ [2 files][ 10.3 MiB/ 10.3 MiB]
Operation completed over 2 objects/10.3 MiB.
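- Create JSON formatted request file (request.json). The uri must be a gs:// path; replace $(gcloud config get-value project) with the actual bucket name, because curl sends the file as-is without shell expansion
{
  "config": {
    "encoding": "LINEAR16",
    "languageCode": "en-US",
    "alternativeLanguageCodes": [ "en-AU", "en-GB", "en-IE" ],
    "model": "command_and_search"
  },
  "audio": {
    "uri": "gs://$(gcloud config get-value project)/audio2.wav"
  }
}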
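Because curl sends request.json verbatim, a shell substitution written inside the file is not expanded. One way around this is to generate the file with the bucket name filled in. A sketch, assuming the bucket is named after the project as in the gsutil step above; `build_request` is a hypothetical helper, `cognitive-254305` is used only as an example value, and the limit of three alternative languages follows the RecognitionConfig quote above.

```python
import json


def build_request(bucket, obj, language="en-US", alternatives=(), model=None):
    """Build a recognize request body for audio stored in a bucket."""
    if len(alternatives) > 3:
        raise ValueError("at most 3 alternative language codes are allowed")
    config = {"encoding": "LINEAR16", "languageCode": language}
    if alternatives:
        config["alternativeLanguageCodes"] = list(alternatives)
    if model:
        config["model"] = model
    return {"config": config, "audio": {"uri": f"gs://{bucket}/{obj}"}}


request = build_request(
    "cognitive-254305",  # example bucket name, assumed equal to the project ID
    "audio2.wav",
    alternatives=["en-AU", "en-GB", "en-IE"],
    model="command_and_search",
)
print(json.dumps(request, indent=2))
```

Redirecting the printed JSON to request.json gives a file that curl can send with `-d @request.json`.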
- Run curl to access the API
$ curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" -d @request.json https://speech.googleapis.com/v1p1beta1/speech:recognize | tee result$RANDOM.json
{
"results": [
{
"alternatives": [
{ "transcript": "checking in with another show for HP are in the car on my way to a clients can be a short show on think I'm going to be there in 10 minutes but I want to do you know should something up the flagpole here to talk about the state of podcast in these days these days I sound old because in podcasting terms I am I've been around since 2004 to 2000 started producing shows since 2005 and have been listening to podcast daily since 2004 I came across my own archive from show that I used to download back then and listen to which I had burnt to a CD and I put them on mine and I started screaming them while at work the last couple of weeks and listening to Old Podcast episode",
"confidence": 0.94958663
}
],
"languageCode": "en-gb"
}
]
}
- Review the results from the JSON output file
$ jq -r '.results[]|.languageCode,.alternatives[].confidence,.alternatives[].transcript' result31483.json
en-gb
0.94958663
checking in with another show for HP are in the car on my way to a clients can be a short show on think I'm going to be there in 10 minutes but I want to do you know should something up the flagpole here to talk about the state of podcast in these days these days I sound old because in podcasting terms I am I've been around since 2004 to 2000 started producing shows since 2005 and have been listening to podcast daily since 2004 I came across my own archive from show that I used to download back then and listen to which I had burnt to a CD and I put them on mine and I started screaming them while at work the last couple of weeks and listening to Old Podcast episode
- multiple-voices
- supported-features-languages
- "Cloud Speech-to-Text only supports speaker diarization for transcribing phone calls"
- Transfer the audio file to a GCP bucket
$ gsutil cp data/audio2.wav gs://$(gcloud config get-value project)
Copying file://data/audio2.wav [Content-Type=audio/x-wav]...
| [1 files][ 5.0 MiB/ 5.0 MiB] 383.4 KiB/s
Operation completed over 1 objects/5.0 MiB.
- Create JSON formatted request file (request.json). The uri must be a gs:// path; replace $(gcloud config get-value project) with the actual bucket name, because curl sends the file as-is without shell expansion
- diarizationSpeakerCount (optional): if set, specifies the estimated number of speakers in the conversation. If not set, defaults to '2'.
{
  "config": {
    "encoding": "LINEAR16",
    "languageCode": "en-US",
    "diarizationConfig": {
      "enableSpeakerDiarization": true
    },
    "model": "phone_call"
  },
  "audio": {
    "uri": "gs://$(gcloud config get-value project)/audio2.wav"
  }
}
- Run curl to access the API
$ curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" -d @request.json https://speech.googleapis.com/v1p1beta1/speech:recognize > result$RANDOM.json
$ ls -ltr|tail -1
-rw-r--r-- 1 bjro staff 43322 Oct 11 14:58 result28054.json
- Review the results from the JSON output file
$ jq -r '.results[].alternatives[].words[]|select(.speakerTag==1)|.word' result28054.json |tr '\n' ' '; echo
checking in with another show for HP are in the car on my way to a client's going to be a short show I'm think I'm going to be there in 10 minutes but I wanted to you know shoot something up the flag pole here wanted to talk about the state of these days I found old because in podcasting terms I am I've been around since 2004 2000 started producing show since 2005 and a big listing to podcast and 2004 and I came across my own archive from shows that I used to download back then listening to old podcast episodes of
$ jq -r '.results[].alternatives[].words[]|select(.speakerTag==2)|.word' result28054.json |tr '\n' ' '; echo
podcasting days day 80 since and listen to which I had burn to a CD and I put them on my nose and I've started screaming them while at work the last couple of weeks and I've had up Paul
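The two jq filters can be combined into one pass that groups words by speaker. A sketch, assuming the word entries carry `word` and `speakerTag` fields as the filters above imply; `words_by_speaker` is a hypothetical helper, words without a tag are collected under 0, and the sample input is a made-up toy response, not real API output.

```python
from collections import defaultdict


def words_by_speaker(response):
    """Map each speakerTag to the space-joined words attributed to it."""
    speakers = defaultdict(list)
    for result in response.get("results", []):
        for alt in result.get("alternatives", []):
            for entry in alt.get("words", []):
                speakers[entry.get("speakerTag", 0)].append(entry["word"])
    return {tag: " ".join(words) for tag, words in speakers.items()}


# Toy input for illustration only (not taken from a real response).
toy_response = {
    "results": [
        {"alternatives": [{"words": [
            {"word": "checking", "speakerTag": 1},
            {"word": "in", "speakerTag": 1},
            {"word": "podcasting", "speakerTag": 2},
        ]}]}
    ]
}

for tag, text in sorted(words_by_speaker(toy_response).items()):
    print(tag, text)
```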