@realBjornRoden
Created October 13, 2019 12:55

Cognitive Artificial Intelligence

  • Cloud Vendor Based NoOps

Use Cases

  1. Transcription
  2. Diarization
  3. Language Detection

GCP (Google Cloud Platform)

Model               Description
command_and_search  Best for short queries such as voice commands or voice search.
phone_call          Best for audio that originated from a phone call (typically recorded at an 8 kHz sampling rate).
video               Best for audio that originated from video or includes multiple speakers. Ideally the audio is recorded at a 16 kHz or greater sampling rate. This is a premium model that costs more than the standard rate.
default             Best for audio that is not one of the specific audio models, for example long-form audio. Ideally the audio is high-fidelity, recorded at a 16 kHz or greater sampling rate.
  1. Create a project and a service account, download the service account key file, and enable the API (see the "Before you begin" section of the Speech-to-Text documentation)

  2. Authenticate CLI session with gcloud auth login

  3. Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the location of the service account key file

    export GOOGLE_APPLICATION_CREDENTIALS=$PWD/cognitive-aab254879251.json
    
  4. Check the currently active project

    $ gcloud config get-value project 
    bungabunga-123456
    
  5. Set the current project

    $ gcloud projects list
    PROJECT_ID          NAME                PROJECT_NUMBER
    cognitive-254305    cognitive           711533833686
    bungabunga-123456   bungabunga          400688388535
    
    $ gcloud config set project cognitive-254305
    Updated property [core/project].
    
    $ gcloud config get-value project 
    cognitive-254305
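
Step 3 can be sanity-checked before any API call is made. The following is a minimal sketch, not part of the original setup: the function name check_credentials is hypothetical, and it assumes the key file is JSON whose "type" field is "service_account", which is how downloaded service account keys are normally laid out.

```python
import json
import os


def check_credentials(env=None):
    """Return None if the key file looks usable, else a short problem description."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return f"key file not found: {path}"
    try:
        with open(path, encoding="utf-8") as fh:
            key = json.load(fh)
    except ValueError:
        return f"key file is not valid JSON: {path}"
    # Assumption: service account keys carry "type": "service_account".
    if not isinstance(key, dict) or key.get("type") != "service_account":
        return f"key file does not look like a service account key: {path}"
    return None


# Prints "ok" when the variable points at a plausible key file.
print(check_credentials() or "ok")
```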
    

APIs

  • Enable API
  • Check that the required APIs (texttospeech.googleapis.com and speech.googleapis.com) are available, enable them, and verify that they are enabled
$ gcloud services list --available --filter "texttospeech.googleapis.com OR speech.googleapis.com"
NAME                         TITLE
speech.googleapis.com        Cloud Speech-to-Text API
texttospeech.googleapis.com  Cloud Text-to-Speech API

$ gcloud services enable texttospeech.googleapis.com speech.googleapis.com
Operation "operations/acf.fa58e2e5-1830-41f2-aafe-991d716fe61f" finished successfully.

$ gcloud services list --enabled --filter "texttospeech.googleapis.com OR speech.googleapis.com"  
NAME                         TITLE
speech.googleapis.com        Cloud Speech-to-Text API
texttospeech.googleapis.com  Cloud Text-to-Speech API

Transcription (short/sync)

  • Transcribing short audio files (less than a minute)

  • "An asynchronous Speech-to-Text API request to the LongRunningRecognize method is identical in form to a synchronous Speech-to-Text API request."

  • The payload size limit is 10485760 bytes (10 MiB).

  • Get a sample audio file

$ wget https://ia803009.us.archive.org/29/items/hpr2798/hpr2798.wav
--2019-10-10 18:19:50--  https://ia803009.us.archive.org/29/items/hpr2798/hpr2798.wav
Resolving ia803009.us.archive.org (ia803009.us.archive.org)... 207.241.233.29
Connecting to ia803009.us.archive.org (ia803009.us.archive.org)|207.241.233.29|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75388322 (72M) [audio/x-wav]
Saving to: 'hpr2798.wav'

hpr2798.wav                                   100%[==============================================================================================>]  71.90M  2.56MB/s    in 1m 52s  

2019-10-10 18:21:43 (659 KB/s) - 'hpr2798.wav' saved [75388322/75388322]
  • Check the audio file properties, such as the sampling rate
    • "Sample rates between 8000 Hz and 48000 Hz are supported within Cloud Speech-to-Text."
    • The example below uses audio_metadata (via the lsaudio.py script)
(env-audio) $ python lsaudio.py ../data/hpr2798.wav
Duration sec:	 854.7273015873016
Sample rate:	 44100
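
The lsaudio.py script itself is not shown here. As a rough stand-in, the sketch below reads the same two values with Python's standard wave module instead of audio_metadata. The function names describe_wav and is_supported_rate are hypothetical, and the approach only works for uncompressed PCM WAV files. The rate check uses the 8000 Hz to 48000 Hz range quoted above.

```python
import wave


def describe_wav(path):
    """Return (duration_seconds, sample_rate_hz, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        channels = wav.getnchannels()
    return frames / rate, rate, channels


def is_supported_rate(rate_hz):
    """True if the rate falls within the range quoted from the documentation."""
    return 8000 <= rate_hz <= 48000
```

Usage would look like `duration, rate, channels = describe_wav("../data/hpr2798.wav")`, followed by printing the values and `is_supported_rate(rate)`.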
  • Cut the file to size from a specific starting point, and convert it to WAV format with one audio channel, excluding any video stream (if present)
    • Audio options (from the ffmpeg help output):
        -aframes number     set the number of audio frames to output
        -aq quality         set audio quality (codec-specific)
        -ar rate            set audio sampling rate (in Hz)
        -ac channels        set number of audio channels
        -an                 disable audio
        -acodec codec       force audio codec ('copy' to copy stream)
        -vol volume         change audio volume (256=normal)
        -af filter_graph    set audio filters
$ ffmpeg -hide_banner -i ../data/hpr2798.wav -ss 00:01:14 -t 59 -ac 1 -vn audio2.wav
Guessed Channel Layout for Input Stream #0.0 : mono
Input #0, wav, from 'hpr2798.wav':
  Metadata:
    album           : Hacker Public Radio
    artist          : knightwise
    comment         : http://hackerpublicradio.org Explicit; Knightwise waxes nostalgically on the early days of podcasting and wonders if we all sold out?
    genre           : Podcast
    title           : Should Podcasters be Pirates ?
    track           : 2798
    date            : 2019
  Duration: 00:14:14.73, bitrate: 705 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, mono, s16, 705 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Output #0, wav, to 'audio2.wav':
  Metadata:
    IPRD            : Hacker Public Radio
    IART            : knightwise
    ICMT            : http://hackerpublicradio.org Explicit; Knightwise waxes nostalgically on the early days of podcasting and wonders if we all sold out?
    IGNR            : Podcast
    INAM            : Should Podcasters be Pirates ?
    IPRT            : 2798
    ICRD            : 2019
    ISFT            : Lavf58.33.100
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, mono, s16, 705 kb/s
    Metadata:
      encoder         : Lavc58.59.102 pcm_s16le
size=    5082kB time=00:00:59.00 bitrate= 705.6kbits/s speed=1.4e+03x    
video:0kB audio:5082kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.006764%
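
The sizes reported by ffmpeg can be cross-checked with simple arithmetic, which also shows why a 59-second mono cut stays under the 10485760-byte payload limit. This is a back-of-the-envelope sketch: it counts raw PCM bytes only and ignores the WAV header and any encoding overhead added when the request is sent, so treat the final number as an approximation.

```python
SAMPLE_RATE_HZ = 44100          # from the ffmpeg stream info
BYTES_PER_SAMPLE = 2            # s16 means 16-bit samples
CHANNELS = 1                    # -ac 1
DURATION_S = 59                 # -t 59
PAYLOAD_LIMIT_BYTES = 10485760  # limit stated for synchronous requests


def pcm_bytes(rate_hz, bytes_per_sample, channels, seconds):
    """Size of uncompressed PCM audio in bytes."""
    return rate_hz * bytes_per_sample * channels * seconds


audio_bytes = pcm_bytes(SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS, DURATION_S)
bitrate_kbits = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * 8 * CHANNELS / 1000
bytes_per_second = pcm_bytes(SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS, 1)
max_seconds = PAYLOAD_LIMIT_BYTES // bytes_per_second

print(audio_bytes)                        # 5203800, about 5082 kB
print(bitrate_kbits)                      # 705.6
print(audio_bytes < PAYLOAD_LIMIT_BYTES)  # True
print(max_seconds)                        # 118
```

At this sample rate and bit depth the byte limit alone would allow roughly 118 seconds of raw mono audio, so the one-minute duration limit for synchronous requests is the tighter constraint here.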
  • Check the converted audio file properties, such as the sampling rate
$ hexdump -Cn 48 audio2.wav 
00000000  52 49 46 46 b0 68 4f 00  57 41 56 45 66 6d 74 20  |RIFF.hO.WAVEfmt |
00000010  10 00 00 00 01 00 01 00  44 ac 00 00 88 58 01 00  |........D....X..|
00000020  02 00 10 00 4c 49 53 54  2c 01 00 00 49 4e 46 4f  |....LIST,...INFO|

(env-audio) $ python lsaudio.py ../data/audio2.wav
Duration sec:	 59.0
Sample rate:	 44100
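
The hexdump above can be read by hand: the first 36 bytes are the RIFF header followed by the "fmt " chunk. The sketch below decodes exactly those bytes (copied from the dump) with the standard struct module, assuming the canonical PCM WAV layout. The function name parse_wav_header is hypothetical.

```python
import struct

# First 36 bytes of audio2.wav, copied from the hexdump output.
HEADER = bytes.fromhex(
    "52494646b0684f0057415645666d7420"
    "100000000100010044ac000088580100"
    "02001000"
)


def parse_wav_header(data):
    """Decode the RIFF header and 'fmt ' chunk of a canonical PCM WAV file."""
    (riff, riff_size, wave_id, fmt_id, fmt_size, audio_format,
     channels, sample_rate, byte_rate, block_align, bits) = struct.unpack(
        "<4sI4s4sIHHIIHH", data[:36])
    if riff != b"RIFF" or wave_id != b"WAVE" or fmt_id != b"fmt ":
        raise ValueError("not a canonical RIFF/WAVE header")
    return {
        "riff_size": riff_size,
        "audio_format": audio_format,  # 1 means uncompressed PCM
        "channels": channels,
        "sample_rate": sample_rate,
        "byte_rate": byte_rate,
        "block_align": block_align,
        "bits_per_sample": bits,
    }


info = parse_wav_header(HEADER)
print(info["sample_rate"], info["channels"], info["bits_per_sample"])  # 44100 1 16
```

The bytes `44 ac 00 00` are the little-endian encoding of 44100, which matches the sample rate reported by lsaudio.py.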
  • Run gcloud ml speech recognize; this example uses US English (en-US); for other audio, select the matching language code (about a dozen English variants are available)
$ gcloud ml speech recognize audio2.wav --language-code='en-US' | tee result$RANDOM.json
{
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.96187854,
          "transcript": "checking in with another show for HPR in the car on my way to a client's going to be a short show I'm think I'm going to be there in 10 minutes but I want to do you know shoot something up the flagpole you wanted to talk about the state of podcasting these days these days I sound old because in podcasting terms I am I've been around since 2004 mm started producing show since 2005 and have been listening to podcast daily since 2004 I came across my archives from shows that I used to download back then and listen to which I had burned to a CD and put them on my nose and I've started streaming them while at work the last couple of weeks and I've had a ball listening to old podcast episodes"
        }
      ]
    }
  ]
}
  • Review the results from the JSON output file
$ jq -r '.results[].alternatives[]|.confidence,.transcript' result26358.json
0.96501887
checking in with another show for HPR in the car on my way to a client's going to be a short show I'm think I'm going to be there in 10 minutes but I want to do you know shoot something up the flagpole you're wanted to talk about the state of podcasting these days these days I sound old because in podcasting terms I am I've been around since 2004 mm started producing show since 2005 and have been listening to podcast daily since 2004 I came across my archives from shows that I used to download back then and listen to which I had burned to a CD and put them on my nose and I've started streaming them while at work the last couple of weeks and I've had a ball listening to old podcast episodes
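
Where jq is not available, the same review can be done in Python. This is a small sketch that mirrors the jq filter above; the function name iter_alternatives is hypothetical, and the sample response is a shortened stand-in rather than real API output.

```python
import json


def iter_alternatives(response):
    """Yield (confidence, transcript) for each alternative in a recognize response."""
    for result in response.get("results", []):
        for alt in result.get("alternatives", []):
            yield alt.get("confidence"), alt.get("transcript", "")


# Shortened stand-in for the contents of a result file.
sample = json.loads(
    '{"results": [{"alternatives": '
    '[{"confidence": 0.96, "transcript": "checking in with another show"}]}]}'
)
for confidence, transcript in iter_alternatives(sample):
    print(confidence)
    print(transcript)
```

For a real file, load it with `json.load(open("result26358.json"))` and pass the resulting dictionary to `iter_alternatives`.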

Transcription (long/async)


  • Transfer the audio file to a Cloud Storage bucket (here, a bucket named after the project ID)
$ gsutil cp data/audio2.wav gs://$(gcloud config get-value project) 
Copying file://data/audio2.wav [Content-Type=audio/x-wav]...
| [1 files][  5.0 MiB/  5.0 MiB]  383.4 KiB/s                                   
Operation completed over 1 objects/5.0 MiB.                                      
  • Run gcloud ml speech recognize-long-running
$ gcloud ml speech recognize-long-running gs://$(gcloud config get-value project)/audio2.wav --async --language-code='en-US'
Check operation [operations/5263634183516942311] for status.
{
  "name": "5263634183516942311"
}
  • Wait for the operation to finish with gcloud ml speech operations wait
$ gcloud ml speech operations wait "5263634183516942311" | tee result$RANDOM.json
Waiting for operation [operations/5263634183516942311] to complete...done.                                                                                                          
{
  "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.961879,
          "transcript": "checking in with another show for HPR in the car on my way to a client's going to be a short show I'm think I'm going to be there in 10 minutes but I want to do you know shoot something up the flagpole you wanted to talk about the state of podcasting these days these days I sound old because in podcasting terms I am I've been around since 2004 mm started producing show since 2005 and have been listening to podcast daily since 2004 I came across my archives from shows that I used to download back then and listen to which I had burned to a CD and put them on my nose and I've started streaming them while at work the last couple of weeks and I've had a ball listening to old podcast episodes"
        }
      ]
    }
  ]
}
  • Check the outcome with gcloud ml speech operations describe
$ gcloud ml speech operations describe "5263634183516942311" | tee result$RANDOM.json
{
  "done": true,
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "lastUpdateTime": "2019-10-10T15:42:16.492003Z",
    "progressPercent": 100,
    "startTime": "2019-10-10T15:41:57.151427Z"
  },
  "name": "5263634183516942311",
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
    "results": [
      {
        "alternatives": [
          {
            "confidence": 0.961879,
            "transcript": "checking in with another show for HPR in the car on my way to a client's going to be a short show I'm think I'm going to be there in 10 minutes but I want to do you know shoot something up the flagpole you wanted to talk about the state of podcasting these days these days I sound old because in podcasting terms I am I've been around since 2004 mm started producing show since 2005 and have been listening to podcast daily since 2004 I came across my archives from shows that I used to download back then and listen to which I had burned to a CD and put them on my nose and I've started streaming them while at work the last couple of weeks and I've had a ball listening to old podcast episodes"
          }
        ]
      }
    ]
  }
}
  • Review the results from the JSON output file
$ jq -r '.response.results[].alternatives[]|.confidence,.transcript' result25359.json
0.961879
checking in with another show for HPR in the car on my way to a client's going to be a short show I'm think I'm going to be there in 10 minutes but I want to do you know shoot something up the flagpole you wanted to talk about the state of podcasting these days these days I sound old because in podcasting terms I am I've been around since 2004 mm started producing show since 2005 and have been listening to podcast daily since 2004 I came across my archives from shows that I used to download back then and listen to which I had burned to a CD and put them on my nose and I've started streaming them while at work the last couple of weeks and I've had a ball listening to old podcast episodes
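
The output of operations describe wraps the results in a "response" field and adds a "done" flag, so a script has to check the flag before reading the transcript. The sketch below shows one way to do that. The function name operation_transcripts is hypothetical, and the handling of an "error" field assumes the standard long-running operation shape, in which a finished operation carries either "response" or "error".

```python
def operation_transcripts(operation):
    """Return a list of (confidence, transcript), or None if not finished yet."""
    if not operation.get("done"):
        return None
    if "error" in operation:
        # Assumption: failed operations report an "error" object with a message.
        raise RuntimeError(operation["error"].get("message", "operation failed"))
    results = operation.get("response", {}).get("results", [])
    return [
        (alt.get("confidence"), alt.get("transcript", ""))
        for result in results
        for alt in result.get("alternatives", [])
    ]
```

Calling it on the parsed describe output returns the same confidence and transcript pairs that the jq filter prints; a return value of None means the operation should be polled again later.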

Language Detection


  • Transfer the audio files to a Cloud Storage bucket (here, a bucket named after the project ID)
$ gsutil cp ../data/audio[12].wav gs://$(gcloud config get-value project)  
Copying file://../data/audio1.wav [Content-Type=audio/x-wav]...
Copying file://../data/audio2.wav [Content-Type=audio/x-wav]...                 
\ [2 files][ 10.3 MiB/ 10.3 MiB]                                                
Operation completed over 2 objects/10.3 MiB.                                     
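  • Create a JSON formatted request file (request.json); the audio uri must be a literal gs:// path, because shell substitutions such as $(gcloud config get-value project) are not expanded inside a JSON file
{ "config": { "encoding":"LINEAR16", "languageCode": "en-US", "alternativeLanguageCodes": [ "en-AU", "en-GB", "en-IE" ], "model": "command_and_search" }, "audio": { "uri":"gs://cognitive-254305/audio2.wav" } }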
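
A request body like this can also be generated rather than typed by hand, which avoids quoting mistakes and makes the bucket name a parameter. The following is a sketch only: build_request is a hypothetical helper, and it assumes the bucket is named after the project ID, as in the gsutil command above.

```python
import json


def build_request(bucket, obj, language="en-US", alternatives=(), model=None):
    """Build a recognize request body that points at an object in a bucket."""
    config = {"encoding": "LINEAR16", "languageCode": language}
    if alternatives:
        config["alternativeLanguageCodes"] = list(alternatives)
    if model:
        config["model"] = model
    return {"config": config, "audio": {"uri": f"gs://{bucket}/{obj}"}}


request = build_request(
    "cognitive-254305",           # bucket named after the project ID
    "audio2.wav",
    alternatives=["en-AU", "en-GB", "en-IE"],
    model="command_and_search",
)
print(json.dumps(request, indent=2))
```

Redirecting the printed output to request.json gives a file that curl can send with `-d @request.json`.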
  • Run curl to access the API
$ curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" -d @request.json https://speech.googleapis.com/v1p1beta1/speech:recognize | tee result$RANDOM.json
{
  "results": [
    {
      "alternatives": [
        { "transcript": "checking in with another show for HP are in the car on my way to a clients can be a short show on think I'm going to be there in 10 minutes but I want to do you know should something up the flagpole here to talk about the state of podcast in these days these days I sound old because in podcasting terms I am I've been around since 2004 to 2000 started producing shows since 2005 and have been listening to podcast daily since 2004 I came across my own archive from show that I used to download back then and listen to which I had burnt to a CD and I put them on mine and I started screaming them while at work the last couple of weeks and listening to Old Podcast episode",
          "confidence": 0.94958663
        }
      ],
      "languageCode": "en-gb"
    }
  ]
}

  • Review the results from the JSON output file
$ jq -r '.results[]|.languageCode,.alternatives[].confidence,.alternatives[].transcript' result31483.json 
en-gb
0.94958663
checking in with another show for HP are in the car on my way to a clients can be a short show on think I'm going to be there in 10 minutes but I want to do you know should something up the flagpole here to talk about the state of podcast in these days these days I sound old because in podcasting terms I am I've been around since 2004 to 2000 started producing shows since 2005 and have been listening to podcast daily since 2004 I came across my own archive from show that I used to download back then and listen to which I had burnt to a CD and I put them on mine and I started screaming them while at work the last couple of weeks and listening to Old Podcast episode

Diarization


  • Transfer the audio file to a Cloud Storage bucket (here, a bucket named after the project ID)
$ gsutil cp data/audio2.wav gs://$(gcloud config get-value project) 
Copying file://data/audio2.wav [Content-Type=audio/x-wav]...
| [1 files][  5.0 MiB/  5.0 MiB]  383.4 KiB/s                                   
Operation completed over 1 objects/5.0 MiB.                                      
  • Create a JSON formatted request file (request.json); the audio uri must be a literal gs:// path, because shell substitutions such as $(gcloud config get-value project) are not expanded inside a JSON file
    • diarizationSpeakerCount (optional): if set, specifies the estimated number of speakers in the conversation. If not set, defaults to '2'.
{ "config": { "encoding":"LINEAR16", "languageCode": "en-US", "diarizationConfig": { "enableSpeakerDiarization": true }, "model": "phone_call" }, "audio": { "uri":"gs://cognitive-254305/audio2.wav" } }
  • Run curl to access the API
$ curl -s -H "Content-Type: application/json" -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" -d @request.json https://speech.googleapis.com/v1p1beta1/speech:recognize > result$RANDOM.json

$ ls -ltr|tail -1
-rw-r--r--    1 bjro  staff   43322 Oct 11 14:58 result28054.json
  • Review the results from the JSON output file
$ jq -r '.results[].alternatives[].words[]|select(.speakerTag==1)|.word' result28054.json |tr '\n' ' '; echo
checking in with another show for HP are in the car on my way to a client's going to be a short show I'm think I'm going to be there in 10 minutes but I wanted to you know shoot something up the flag pole here wanted to talk about the state of these days I found old because in podcasting terms I am I've been around since 2004 2000 started producing show since 2005 and a big listing to podcast and 2004 and I came across my own archive from shows that I used to download back then listening to old podcast episodes of 

$ jq -r '.results[].alternatives[].words[]|select(.speakerTag==2)|.word' result28054.json |tr '\n' ' '; echo
podcasting days day 80 since and listen to which I had burn to a CD and I put them on my nose and I've started screaming them while at work the last couple of weeks and I've had up Paul 
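
The two jq commands above can be folded into one pass that groups words by speaker tag. This is a minimal sketch with a hypothetical function name, words_by_speaker; like the jq filters, it skips words that carry no speakerTag field.

```python
from collections import defaultdict


def words_by_speaker(response):
    """Map each speakerTag to the space-joined words attributed to that speaker."""
    speakers = defaultdict(list)
    for result in response.get("results", []):
        for alt in result.get("alternatives", []):
            for word in alt.get("words", []):
                if "speakerTag" in word:
                    speakers[word["speakerTag"]].append(word["word"])
    return {tag: " ".join(words) for tag, words in speakers.items()}
```

Applied to the parsed result file, `words_by_speaker(data)[1]` and `words_by_speaker(data)[2]` correspond to the two jq outputs shown above.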