Cognitive Artificial Intelligence

Cloud Vendor Based NoOps

Use Cases

Transcription
Diarization
Language Detection

GCP (Google Cloud Platform)

Model	Description
command_and_search	Best for short queries such as voice commands or voice search.
phone_call	Best for audio that originated from a phone call (typically recorded at an 8 kHz sampling rate).
video	Best for audio that originated from video or includes multiple speakers. Ideally the audio is recorded at a 16 kHz or greater sampling rate. This is a premium model that costs more than the standard rate.
default	Best for audio that does not fit one of the specific audio models, for example long-form audio. Ideally the audio is high-fidelity, recorded at a 16 kHz or greater sampling rate.

Create a project and a service account, download the service account key file, and enable the API (see before-you-begin)
Authenticate the CLI session with gcloud auth login
Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the location of the service account key file
```
export GOOGLE_APPLICATION_CREDENTIALS=$PWD/cognitive-aab254879251.json
```

Check the currently active project

$ gcloud config get-value project 
bungabunga-123456

Set the current project

$ gcloud projects list
PROJECT_ID          NAME                PROJECT_NUMBER
cognitive-254305    cognitive           711533833686
bungabunga-123456   bungabunga          400688388535

$ gcloud config set project cognitive-254305
Updated property [core/project].

$ gcloud config get-value project 
cognitive-254305

APIs

Enable API
Check that the required APIs (texttospeech.googleapis.com and speech.googleapis.com) are available, enable them, then confirm that they are enabled

$ gcloud services list --available --filter "texttospeech.googleapis.com OR speech.googleapis.com"
NAME                         TITLE
speech.googleapis.com        Cloud Speech-to-Text API
texttospeech.googleapis.com  Cloud Text-to-Speech API

$ gcloud services enable texttospeech.googleapis.com speech.googleapis.com
Operation "operations/acf.fa58e2e5-1830-41f2-aafe-991d716fe61f" finished successfully.

$ gcloud services list --enabled --filter "texttospeech.googleapis.com OR speech.googleapis.com"  
NAME                         TITLE
speech.googleapis.com        Cloud Speech-to-Text API
texttospeech.googleapis.com  Cloud Text-to-Speech API

Transcription (short/sync)

Transcribing short audio files (less than a minute)
"An asynchronous Speech-to-Text API request to the LongRunningRecognize method is identical in form to a synchronous Speech-to-Text API request."
The payload size limit is 10485760 bytes (10 MiB).
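A quick pre-flight check can catch a file that is too long or too large before a request is sent. The following is a minimal sketch, assuming an uncompressed PCM WAV input and treating the two limits quoted above (under a minute, 10485760 bytes) as constants. The function name fits_sync_request is hypothetical, and the file size is only a proxy: a payload that carries the audio base64-encoded inline will be larger than the file itself.

```python
import os
import wave

MAX_SYNC_BYTES = 10485760   # payload size limit quoted above
MAX_SYNC_SECONDS = 60       # "less than a minute"


def fits_sync_request(path):
    """Return (ok, size_bytes, duration_sec) for a PCM WAV file.

    Hypothetical helper: it compares the file on disk against the
    limits above, it does not contact the Speech-to-Text API.
    """
    size_bytes = os.path.getsize(path)
    with wave.open(path, "rb") as wav:
        duration_sec = wav.getnframes() / wav.getframerate()
    ok = size_bytes <= MAX_SYNC_BYTES and duration_sec < MAX_SYNC_SECONDS
    return ok, size_bytes, duration_sec
```

A file that fails this check is a candidate for the long/async path instead.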
Get a sample audio file

$ wget https://ia803009.us.archive.org/29/items/hpr2798/hpr2798.wav
--2019-10-10 18:19:50--  https://ia803009.us.archive.org/29/items/hpr2798/hpr2798.wav
Resolving ia803009.us.archive.org (ia803009.us.archive.org)... 207.241.233.29
Connecting to ia803009.us.archive.org (ia803009.us.archive.org)|207.241.233.29|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75388322 (72M) [audio/x-wav]
Saving to: 'hpr2798.wav'

hpr2798.wav                                   100%[==============================================================================================>]  71.90M  2.56MB/s    in 1m 52s  

2019-10-10 18:21:43 (659 KB/s) - 'hpr2798.wav' saved [75388322/75388322]

Check the audio file, such as for sampling rate
- "Sample rates between 8000 Hz and 48000 Hz are supported within Cloud Speech-to-Text."
- The example below uses the audio_metadata Python package, called from the script lsaudio.py

(env-audio) $ python lsaudio.py ../data/hpr2798.wav
Duration sec:	 854.7273015873016
Sample rate:	 44100
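The lsaudio.py script itself is not listed here. As a rough stand-in, the sketch below reports the same two values using Python's standard wave module instead of audio_metadata, so it only handles uncompressed PCM WAV files. The function names describe_wav and format_report are assumptions made for illustration.

```python
import wave


def describe_wav(path):
    """Return (duration_sec, sample_rate_hz, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return wav.getnframes() / rate, rate, wav.getnchannels()


def format_report(path):
    """Format the values in the same layout as the output shown above."""
    duration, rate, _channels = describe_wav(path)
    return f"Duration sec:\t {duration}\nSample rate:\t {rate}"
```

Calling print(format_report("../data/hpr2798.wav")) from a script would produce a report in the layout above.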

Cut the file to size from a specific starting point, and convert to WAV format with one audio channel excluding any video (if there was any)
- Audio options:
  - -aframes number: set the number of audio frames to output
  - -aq quality: set audio quality (codec-specific)
  - -ar rate: set audio sampling rate (in Hz)
  - -ac channels: set number of audio channels
  - -an: disable audio
  - -acodec codec: force audio codec ('copy' to copy stream)
  - -vol volume: change audio volume (256=normal)
  - -af filter_graph: set audio filters

$ ffmpeg -hide_banner -i ../data/hpr2798.wav -ss 00:01:14 -t 59 -ac 1 -vn audio2.wav
Guessed Channel Layout for Input Stream #0.0 : mono
Input #0, wav, from 'hpr2798.wav':
  Metadata:
    album           : Hacker Public Radio
    artist          : knightwise
    comment         : http://hackerpublicradio.org Explicit; Knightwise waxes nostalgically on the early days of podcasting and wonders if we all sold out?
    genre           : Podcast
    title           : Should Podcasters be Pirates ?
    track           : 2798
    date            : 2019
  Duration: 00:14:14.73, bitrate: 705 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, mono, s16, 705 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Output #0, wav, to 'audio2.wav':
  Metadata:
    IPRD            : Hacker Public Radio
    IART            : knightwise
    ICMT            : http://hackerpublicradio.org Explicit; Knightwise waxes nostalgically on the early days of podcasting and wonders if we all sold out?
    IGNR            : Podcast
    INAM            : Should Podcasters be Pirates ?
    IPRT            : 2798
    ICRD            : 2019
    ISFT            : Lavf58.33.100
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, mono, s16, 705 kb/s
    Metadata:
      encoder         : Lavc58.59.102 pcm_s16le
size=    5082kB time=00:00:59.00 bitrate= 705.6kbits/s speed=1.4e+03x    
video:0kB audio:5082kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.006764%
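The size and bitrate that ffmpeg reports can be cross-checked by hand, since uncompressed PCM has a fixed number of bytes per second. A small worked example, using only the values visible in the ffmpeg output above (44100 Hz, 16-bit samples, one channel, 59 seconds):

```python
SAMPLE_RATE_HZ = 44100   # from the ffmpeg stream line
BITS_PER_SAMPLE = 16     # pcm_s16le
CHANNELS = 1             # -ac 1
DURATION_SEC = 59        # -t 59
SYNC_LIMIT_BYTES = 10485760


def pcm_size_bytes(duration_sec, sample_rate_hz, bits_per_sample, channels):
    """Size of the raw PCM audio data, excluding the WAV header."""
    return duration_sec * sample_rate_hz * (bits_per_sample // 8) * channels


audio_bytes = pcm_size_bytes(DURATION_SEC, SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS)
bitrate_kbps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS / 1000

print(audio_bytes)                     # 5203800
print(round(audio_bytes / 1024))       # 5082, matching "audio:5082kB"
print(bitrate_kbps)                    # 705.6, matching "705.6kbits/s"
print(audio_bytes < SYNC_LIMIT_BYTES)  # True
```

At this sample rate a mono 16-bit clip of just under a minute stays well below the 10485760 byte limit.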

Check the converted audio file, such as for sampling rate

$ hexdump -Cn 48 audio2.wav 
00000000  52 49 46 46 b0 68 4f 00  57 41 56 45 66 6d 74 20  |RIFF.hO.WAVEfmt |
00000010  10 00 00 00 01 00 01 00  44 ac 00 00 88 58 01 00  |........D....X..|
00000020  02 00 10 00 4c 49 53 54  2c 01 00 00 49 4e 46 4f  |....LIST,...INFO|

(env-audio) $ python lsaudio.py ../data/audio2.wav
Duration sec:	 59.0
Sample rate:	 44100
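The hexdump can also be decoded directly, because a canonical WAV file starts with a fixed little-endian layout. The sketch below unpacks the first 36 bytes copied from the hexdump above; the function name parse_wav_header is an assumption, and the code presumes the fmt chunk immediately follows the RIFF header, as it does in this file.

```python
import struct

# First 36 bytes of audio2.wav, copied from the hexdump above.
HEADER = bytes.fromhex(
    "52494646b0684f0057415645666d7420"
    "100000000100010044ac000088580100"
    "02001000"
)


def parse_wav_header(header):
    """Decode the RIFF and fmt fields of a canonical PCM WAV header."""
    riff, riff_size, wave_id, fmt_id, _fmt_size = struct.unpack_from("<4sI4s4sI", header, 0)
    if riff != b"RIFF" or wave_id != b"WAVE" or fmt_id != b"fmt ":
        raise ValueError("not a canonical RIFF/WAVE header")
    audio_format, channels, sample_rate, byte_rate, block_align, bits = struct.unpack_from(
        "<HHIIHH", header, 20
    )
    return {
        "riff_size": riff_size,
        "audio_format": audio_format,   # 1 means uncompressed PCM
        "channels": channels,
        "sample_rate": sample_rate,
        "byte_rate": byte_rate,
        "block_align": block_align,
        "bits_per_sample": bits,
    }


print(parse_wav_header(HEADER))
```

The bytes 44 ac 00 00 are the sample rate: 0xAC44 is 44100, which agrees with the lsaudio.py output.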

Run gcloud ml speech recognize; this example uses US English (en-US), for other audio select the matching English variant (about a dozen are available)

$ gcloud ml speech recognize audio2.wav --language-code='en-US' | tee result$RANDOM.json
{
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.96187854,
          "transcript": "checking in with another show for HPR in the car on my way to a client's going to be a short show I'm think I'm going to be there in 10 minutes but I want to do you know shoot something up the flagpole you wanted to talk about the state of podcasting these days these days I sound old because in podcasting terms I am I've been around since 2004 mm started producing show since 2005 and have been listening to podcast daily since 2004 I came across my archives from shows that I used to download back then and listen to which I had burned to a CD and put them on my nose and I've started streaming them while at work the last couple of weeks and I've had a ball listening to old podcast episodes"
        }
      ]
    }
  ]
}

Review the results from the JSON output file

$ jq -r '.results[].alternatives[]|.confidence,.transcript' result26358.json
0.96501887
checking in with another show for HPR in the car on my way to a client's going to be a short show I'm think I'm going to be there in 10 minutes but I want to do you know shoot something up the flagpole you're wanted to talk about the state of podcasting these days these days I sound old because in podcasting terms I am I've been around since 2004 mm started producing show since 2005 and have been listening to podcast daily since 2004 I came across my archives from shows that I used to download back then and listen to which I had burned to a CD and put them on my nose and I've started streaming them while at work the last couple of weeks and I've had a ball listening to old podcast episodes
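Where jq is not available, the same extraction can be done in a few lines of Python. This is a sketch of an equivalent of the jq filter above; the function name transcripts is hypothetical and the sample response is an abbreviated stand-in for the real result file, not actual output.

```python
import json


def transcripts(response):
    """Yield (confidence, transcript) for every alternative in a recognize response."""
    for result in response.get("results", []):
        for alternative in result.get("alternatives", []):
            yield alternative.get("confidence"), alternative.get("transcript")


# Abbreviated sample in the same shape as the JSON shown above.
sample = json.loads(
    '{"results": [{"alternatives": [{"confidence": 0.96187854,'
    ' "transcript": "checking in with another show"}]}]}'
)

for confidence, text in transcripts(sample):
    print(confidence)
    print(text)
```

To read a saved result instead, load it with json.load(open("result26358.json")) and pass the resulting dictionary to transcripts.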

Transcription (long/async)

Transcribing longer audio files (more than a minute)

Transfer the audio file to a GCP bucket

$ gsutil cp data/audio2.wav gs://$(gcloud config get-value project) 
Copying file://data/audio2.wav [Content-Type=audio/x-wav]...
| [1 files][  5.0 MiB/  5.0 MiB]  383.4 KiB/s                                   
Operation completed over 1 objects/5.0 MiB.

Run gcloud ml speech recognize-long-running

$ gcloud ml speech recognize-long-running gs://$(gcloud config get-value project)/audio2.wav --async --language-code='en-US'
Check operation [operations/5263634183516942311] for status.
{
  "name": "5263634183516942311"
}

Wait for the operation to complete with gcloud ml speech operations wait

$ gcloud ml speech operations wait "5263634183516942311" | tee result$RANDOM.json
Waiting for operation [operations/5263634183516942311] to complete...done.                                                                                                          
{
  "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.961879,
          "transcript": "checking in with another show for HPR in the car on my way to a client's going to be a short show I'm think I'm going to be there in 10 minutes but I want to do you know shoot something up the flagpole you wanted to talk about the state of podcasting these days these days I sound old because in podcasting terms I am I've been around since 2004 mm started producing show since 2005 and have been listening to podcast daily since 2004 I came across my archives from shows that I used to download back then and listen to which I had burned to a CD and put them on my nose and I've started streaming them while at work the last couple of weeks and I've had a ball listening to old podcast episodes"
        }
      ]
    }
  ]
}

Check the outcome with gcloud ml speech operations describe

$ gcloud ml speech operations describe "5263634183516942311" | tee result$RANDOM.json
{
  "done": true,
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "lastUpdateTime": "2019-10-10T15:42:16.492003Z",
    "progressPercent": 100,
    "startTime": "2019-10-10T15:41:57.151427Z"
  },
  "name": "5263634183516942311",
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
    "results": [
      {
        "alternatives": [
          {
            "confidence": 0.961879,
            "transcript": "checking in with another show for HPR in the car on my way to a client's going to be a short show I'm think I'm going to be there in 10 minutes but I want to do you know shoot something up the flagpole you wanted to talk about the state of podcasting these days these days I sound old because in podcasting terms I am I've been around since 2004 mm started producing show since 2005 and have been listening to podcast daily since 2004 I came across my archives from shows that I used to download back then and listen to which I had burned to a CD and put them on my nose and I've started streaming them while at work the last couple of weeks and I've had a ball listening to old podcast episodes"
          }
        ]
      }
    ]
  }
}

Review the results from the JSON output file

$ jq -r '.response.results[].alternatives[]|.confidence,.transcript' result25359.json
0.961879
checking in with another show for HPR in the car on my way to a client's going to be a short show I'm think I'm going to be there in 10 minutes but I want to do you know shoot something up the flagpole you wanted to talk about the state of podcasting these days these days I sound old because in podcasting terms I am I've been around since 2004 mm started producing show since 2005 and have been listening to podcast daily since 2004 I came across my archives from shows that I used to download back then and listen to which I had burned to a CD and put them on my nose and I've started streaming them while at work the last couple of weeks and I've had a ball listening to old podcast episodes
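The describe output also carries timing metadata, which is handy for judging how long a job took. The sketch below summarizes an operation document; the function name summarize_operation is hypothetical, the sample dictionary reuses the metadata values shown above with the response body omitted, and the timestamp format assumes fractional seconds are present as they are in that output.

```python
from datetime import datetime

# Assumes timestamps always carry fractional seconds, as in the output above.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def summarize_operation(operation):
    """Return completion state, progress and elapsed seconds for an operation."""
    metadata = operation.get("metadata", {})
    started = datetime.strptime(metadata["startTime"], TIME_FORMAT)
    updated = datetime.strptime(metadata["lastUpdateTime"], TIME_FORMAT)
    return {
        "done": operation.get("done", False),
        "progress": metadata.get("progressPercent", 0),
        "elapsed_sec": (updated - started).total_seconds(),
    }


# Metadata copied from the describe output above; response body omitted.
operation = {
    "done": True,
    "metadata": {
        "lastUpdateTime": "2019-10-10T15:42:16.492003Z",
        "progressPercent": 100,
        "startTime": "2019-10-10T15:41:57.151427Z",
    },
    "name": "5263634183516942311",
}

print(summarize_operation(operation))
```

For this run the 59 second clip was processed in roughly 19 seconds.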

Language Detection

multiple-languages
v1p1beta1/RecognitionConfig
- alternativeLanguageCodes: "Optional. A list of up to 3 additional BCP-47 language tags, listing possible alternative languages of the supplied audio"

Transfer the audio files to a GCP bucket

$ gsutil cp ../data/audio[12].wav gs://$(gcloud config get-value project)  
Copying file://../data/audio1.wav [Content-Type=audio/x-wav]...
Copying file://../data/audio2.wav [Content-Type=audio/x-wav]...                 
\ [2 files][ 10.3 MiB/ 10.3 MiB]                                                
Operation completed over 2 objects/10.3 MiB.

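Create JSON formatted request file (request.json); the uri must be a full gs:// path, and a shell substitution such as $(gcloud config get-value project) is not expanded inside the file, so the bucket name (here the project ID) is written out

{ "config": { "encoding":"LINEAR16", "languageCode": "en-US", "alternativeLanguageCodes": [ "en-AU", "en-GB", "en-IE" ], "model": "command_and_search" }, "audio": { "uri":"gs://cognitive-254305/audio2.wav" } }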
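Writing the request by hand makes it easy to slip on the URI or exceed the limit of three alternative languages. A small generator avoids both. This is a sketch only: the function name build_language_request is hypothetical, and the bucket name assumes the bucket is named after the project ID, as in the gsutil command above.

```python
import json


def build_language_request(bucket, object_name, primary, alternatives,
                           model="command_and_search"):
    """Build a recognize request body with alternative language codes."""
    if len(alternatives) > 3:
        raise ValueError("at most 3 alternative language codes are allowed")
    return {
        "config": {
            "encoding": "LINEAR16",
            "languageCode": primary,
            "alternativeLanguageCodes": list(alternatives),
            "model": model,
        },
        "audio": {"uri": f"gs://{bucket}/{object_name}"},
    }


# Bucket name assumed to equal the project ID used in this walkthrough.
request = build_language_request(
    "cognitive-254305", "audio2.wav", "en-US", ["en-AU", "en-GB", "en-IE"]
)
print(json.dumps(request, indent=2))
```

Redirecting the printed JSON to request.json gives a file that curl can send with -d @request.json.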

Run curl to access the API

$ curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" -d @request.json https://speech.googleapis.com/v1p1beta1/speech:recognize | tee result$RANDOM.json
{
  "results": [
    {
      "alternatives": [
        { "transcript": "checking in with another show for HP are in the car on my way to a clients can be a short show on think I'm going to be there in 10 minutes but I want to do you know should something up the flagpole here to talk about the state of podcast in these days these days I sound old because in podcasting terms I am I've been around since 2004 to 2000 started producing shows since 2005 and have been listening to podcast daily since 2004 I came across my own archive from show that I used to download back then and listen to which I had burnt to a CD and I put them on mine and I started screaming them while at work the last couple of weeks and listening to Old Podcast episode",
          "confidence": 0.94958663
        }
      ],
      "languageCode": "en-gb"
    }
  ]
}

Review the results from the JSON output file

$ jq -r '.results[]|.languageCode,.alternatives[].confidence,.alternatives[].transcript' result31483.json 
en-gb
0.94958663
checking in with another show for HP are in the car on my way to a clients can be a short show on think I'm going to be there in 10 minutes but I want to do you know should something up the flagpole here to talk about the state of podcast in these days these days I sound old because in podcasting terms I am I've been around since 2004 to 2000 started producing shows since 2005 and have been listening to podcast daily since 2004 I came across my own archive from show that I used to download back then and listen to which I had burnt to a CD and I put them on mine and I started screaming them while at work the last couple of weeks and listening to Old Podcast episode

Diarization

multiple-voices
supported-features-languages
"Cloud Speech-to-Text only supports speaker diarization for transcribing phone calls"

Transfer the audio file to a GCP bucket

$ gsutil cp data/audio2.wav gs://$(gcloud config get-value project) 
Copying file://data/audio2.wav [Content-Type=audio/x-wav]...
| [1 files][  5.0 MiB/  5.0 MiB]  383.4 KiB/s                                   
Operation completed over 1 objects/5.0 MiB.

Create JSON formatted request file (request.json); the uri must be a full gs:// path with the bucket name (here the project ID) written out, because a shell substitution is not expanded inside the file
- diarizationSpeakerCount (optional) If set, specifies the estimated number of speakers in the conversation. If not set, defaults to '2'.

{ "config": { "encoding":"LINEAR16", "languageCode": "en-US", "diarizationConfig": { "enableSpeakerDiarization": true }, "model": "phone_call" }, "audio": { "uri":"gs://cognitive-254305/audio2.wav" } }

Run curl to access the API

$ curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" -d @request.json https://speech.googleapis.com/v1p1beta1/speech:recognize > result$RANDOM.json

$ ls -ltr|tail -1
-rw-r--r--    1 bjro  staff   43322 Oct 11 14:58 result28054.json

Review the results from the JSON output file

$ jq -r '.results[].alternatives[].words[]|select(.speakerTag==1)|.word' result28054.json |tr '\n' ' '; echo
checking in with another show for HP are in the car on my way to a client's going to be a short show I'm think I'm going to be there in 10 minutes but I wanted to you know shoot something up the flag pole here wanted to talk about the state of these days I found old because in podcasting terms I am I've been around since 2004 2000 started producing show since 2005 and a big listing to podcast and 2004 and I came across my own archive from shows that I used to download back then listening to old podcast episodes of 

$ jq -r '.results[].alternatives[].words[]|select(.speakerTag==2)|.word' result28054.json |tr '\n' ' '; echo
podcasting days day 80 since and listen to which I had burn to a CD and I put them on my nose and I've started screaming them while at work the last couple of weeks and I've had up Paul
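Running one jq filter per speaker works for two voices but gets tedious beyond that. The sketch below groups all words by their speakerTag in a single pass. The function name words_by_speaker is hypothetical, the sample words are made up for illustration, and words without a speakerTag field are skipped on the assumption that they are untagged duplicates.

```python
from collections import defaultdict


def words_by_speaker(response):
    """Return {speakerTag: "joined words"} from a diarization response."""
    speakers = defaultdict(list)
    for result in response.get("results", []):
        for alternative in result.get("alternatives", []):
            for word in alternative.get("words", []):
                if "speakerTag" in word:
                    speakers[word["speakerTag"]].append(word["word"])
    return {tag: " ".join(words) for tag, words in speakers.items()}


# Made-up sample in the shape the jq filters above expect.
sample = {
    "results": [
        {
            "alternatives": [
                {
                    "words": [
                        {"word": "checking", "speakerTag": 1},
                        {"word": "in", "speakerTag": 1},
                        {"word": "podcasting", "speakerTag": 2},
                        {"word": "days", "speakerTag": 2},
                    ]
                }
            ]
        }
    ]
}

for tag, text in sorted(words_by_speaker(sample).items()):
    print(tag, text)
```

Note that the source audio here is a single-speaker podcast, so the split into two speakers above is an artifact of the default speaker count rather than a real second voice.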

realBjornRoden/cognitive-actions-audio-gcp.md
