@guest271314
Last active September 13, 2024 04:47
Exploiting Web Speech API to execute arbitrary native code

If you have used Web Speech API on a desktop browser, you have either 1) used Google voices, the default on Chrome, which send your text input to a remote Google server that returns the audio to the browser; or 2) used Brailcom's Speech Dispatcher by launching Chrome with the --enable-speech-dispatcher flag.

With all of the advertising and slogans about "artificial intelligence", one might think that Text-To-Speech and Speech-To-Text would be happening in the browser by now. They're not. See Re: Issue 263510047: Release TTS and STT source code and Google voices as FOSS.

Speech Dispatcher, which both Chrome and Firefox depend on as the local speech synthesis interface, creates a socket connection between the application and the browser, exposed via the window.speechSynthesis object.

If you create a module for Speech Dispatcher locally, for example, using piper, you'll notice the GenericExecuteSynth option, which lets us execute arbitrary code by calling window.speechSynthesis.speak() in the browser.

At piper.conf

Debug 0

GenericExecuteSynth "printf %s \'$DATA\' | /home/user/ext/piper/piper -q --length_scale 1 --sentence_silence 0 --model /home/user/ext/$VOICE --output_raw | tee /home/user/tab/output.pcm | printf 0 | aplay -q -r 0 -R 0 -T 0 -f S16_LE -t raw -"

# only use medium quality voices to respect the 22050 rate for aplay in the command above.

# spd-say -o piper -l en -t male1 "test"

# arecord -d 1 -t raw -r 22050 -f S16_LE silence.pcm

GenericCmdDependency "piper"
GenericCmdDependency "aplay"
GenericCmdDependency "printf"
GenericSoundIconFolder "/usr/share/sounds/sound-icons/"

GenericPunctSome "--punct=\"()<>[]{}\""
GenericPunctMost "--punct=\"()[]{};:\""
GenericPunctAll "--punct"

# Get rid of extended silence between sentences
GenericDelimiters "|"

GenericLanguage  "en" "en_US" "utf-8"

AddVoice        "en"    "MALE1"         "en_US-hfc_male-medium.onnx"
AddVoice        "en"    "FEMALE1"       "en_US-hfc_female-medium.onnx"

DefaultVoice    "en_US-hfc_male-medium.onnx"

#GenericRateForceInteger 1
#GenericRateAdd 1
#GenericRateMultiply 100
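To make the role of the $DATA and $VOICE placeholders in GenericExecuteSynth concrete, here is a minimal sketch of the substitution step. It is a simplified, hypothetical model: expandSynthCommand is not part of Speech Dispatcher, the template is shortened, and the real generic module applies its own escaping and supports more placeholders than the two shown.

```javascript
// Hypothetical, simplified model of placeholder expansion. The real
// Speech Dispatcher generic module is not implemented this way line for
// line; this only illustrates that the utterance text and voice name
// end up inside a shell command line.
function expandSynthCommand(template, values) {
  // Replace each $DATA / $VOICE with the matching value. A function
  // replacer is used so "$" inside the values is inserted literally.
  return template.replace(/\$(DATA|VOICE)/g, (_, name) => values[name]);
}

// Shortened version of the GenericExecuteSynth line from piper.conf.
const template =
  "printf %s '$DATA' | piper --model /home/user/ext/$VOICE --output_raw";

const command = expandSynthCommand(template, {
  DATA: "test",
  VOICE: "en_US-hfc_male-medium.onnx",
});

console.log(command);
```

The point of the sketch is only that whatever is assigned to SpeechSynthesisUtterance.text and whichever voice is selected become part of the command the module runs.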

We won't be doing anything malicious here.

Web Speech API provides no means to capture the live audio output of window.speechSynthesis.speak(); see "MediaStream, ArrayBuffer, Blob audio result from speak() for recording?".

If you're inclined, use your imagination to test how far-ranging you can go. I doubt you'll find any limitations to what you can execute simply by executing window.speechSynthesis.speak() on a Chromium-based browser (Chrome, Brave, Edge, Opera).

We're just going to create a raw PCM file from the text set to SpeechSynthesisUtterance.text, use tee to write that file to a local directory that is also an unpacked Web extension directory, play 1 second of silence using aplay, then use fetch() to get the local file from the Web extension directory, where

"web_accessible_resources": [{
   "resources": ["*.wav", "*.pcm"],
   "matches": ["<all_urls>"],
   "extensions": [ ]
}]

is in the manifest.json. We then convert the 16-bit PCM samples to 32-bit floats and play them in the browser using Web Audio API.
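Because piper is run with --output_raw, the fetched file has no header to describe it, so the sample format has to be assumed. The sketch below assumes headerless signed 16-bit little-endian mono samples at 22050 Hz (a medium-quality voice, per the comment in piper.conf) and shows the frame count, duration, and a manual conversion to floats. The helper name s16leToFloat32 is made up for illustration, and dividing by 32768 is one common scaling convention rather than a statement about how any particular browser converts.

```javascript
// Assumption: headerless signed 16-bit little-endian mono PCM at 22050 Hz.
const SAMPLE_RATE = 22050;
const BYTES_PER_SAMPLE = 2;

// Hypothetical helper: convert raw s16le bytes to floats in [-1, 1).
function s16leToFloat32(arrayBuffer) {
  const view = new DataView(arrayBuffer);
  const frames = Math.floor(arrayBuffer.byteLength / BYTES_PER_SAMPLE);
  const floats = new Float32Array(frames);
  for (let i = 0; i < frames; i++) {
    // The second argument `true` selects little-endian byte order.
    floats[i] = view.getInt16(i * BYTES_PER_SAMPLE, true) / 32768;
  }
  return floats;
}

// Build a tiny three-sample buffer explicitly as little-endian bytes.
const raw = new ArrayBuffer(3 * BYTES_PER_SAMPLE);
const writer = new DataView(raw);
[0, 32767, -32768].forEach((sample, i) =>
  writer.setInt16(i * BYTES_PER_SAMPLE, sample, true)
);

const floats = s16leToFloat32(raw);
const durationSeconds = floats.length / SAMPLE_RATE;

console.log(floats.length, durationSeconds);
```

The resulting Float32Array can be copied into an AudioBuffer with getChannelData(0).set(floats) where WebCodecs AudioData is not available.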

At DevTools console

{
  let voices = speechSynthesis.getVoices().filter( ({name}) => name.includes("piper"));
  let u = new SpeechSynthesisUtterance();
  u.onend = async (e) => {
    console.log(e.type);
    const pcm = await (await fetch("chrome-extension://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/output.pcm")).arrayBuffer();

    let ac = new AudioContext({
      latencyHint: 0,
      sampleRate: 22050
    });
    
    let ad = new AudioData({
      sampleRate: 22050,
      numberOfChannels: 1,
      numberOfFrames: pcm.byteLength / 2,
      timestamp: 0,
      format: "s16",
      data: new Int16Array(pcm)
    });
    
    let ab = new ArrayBuffer(ad.allocationSize({
      planeIndex: 0,
      format: 'f32'
    }));
    
    ad.copyTo(ab, {
      planeIndex: 0,
      format: 'f32'
    });
    
    let floats = new Float32Array(ab);
    
    let buffer = new AudioBuffer({
      length: floats.length,
      numberOfChannels: 1,
      sampleRate: 22050,
    });
    
    buffer.getChannelData(0).set(floats);
    
    let absn = new AudioBufferSourceNode(ac,{
      buffer
    });
    
    absn.connect(ac.destination);
    
    absn.addEventListener("ended", async (e) => {
      await ac.close();
    });
    
    absn.start();

  }
  u.voice = voices[1];
  u.text = `von Braun believed in testing. I cannot
emphasize that term enough – test, test,
test. Test to the point it breaks.

- Ed Buckbee, NASA Public Affairs Officer, Chasing the Moon`.replace(/\n/g, " ");
  speechSynthesis.speak(u);

  console.log(JSON.stringify(voices.map( ({default: _default, lang, localService, name, voiceURI}) => ({
    _default,
    lang,
    localService,
    name,
    voiceURI
  })), null, 2));
}
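One caveat with the snippet above: speechSynthesis.getVoices() can return an empty list until the voiceschanged event has fired, and voices[1] assumes at least two matching voices exist. A minimal sketch of a more defensive selection follows; getVoicesReady and pickVoice are hypothetical helper names, and the synth parameter is expected to behave like window.speechSynthesis.

```javascript
// Hypothetical helper: resolve with the voice list, waiting for the
// "voiceschanged" event when the list is still empty.
function getVoicesReady(synth) {
  return new Promise((resolve) => {
    const voices = synth.getVoices();
    if (voices.length > 0) {
      resolve(voices);
      return;
    }
    synth.addEventListener(
      "voiceschanged",
      () => resolve(synth.getVoices()),
      { once: true }
    );
  });
}

// Hypothetical helper: first voice whose name contains `needle`,
// or null when nothing matches.
function pickVoice(voices, needle) {
  return voices.find(({ name }) => name.includes(needle)) ?? null;
}
```

In the console this could be used as u.voice = pickVoice(await getVoicesReady(speechSynthesis), "piper"), checking for null before calling speechSynthesis.speak(u).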

That's it. This is another way to execute arbitrary local or remote code from the browser using a Web API, in this case the Web Speech API, which has not been updated substantially for around a decade.

Have fun hacking Web Speech API.
