A React hook for browser speech-to-text that behaves like typing.
The Web Speech API (SpeechRecognition) gives you a stream of results, but wiring it into a text input is surprisingly fiddly:
- Interim results need to stream in so the user sees text appear as they speak — but they also need to be replaced when the recognizer refines them or finalizes them.
- Manual edits must be respected. If the user deletes text or types something, speech results shouldn't overwrite their changes.
- Browsers kill recognition silently. Chrome stops listening after ~60s of silence. The user still sees the mic "on" but nothing happens. You need auto-restart.
- Session restarts must not replay old text. Each new `SpeechRecognition` instance starts with a fresh result list. If you're tracking the wrong state, old text reappears or current text vanishes.
The hook tracks one number: `interimLenRef` — how many characters at the end of the text string are interim (not yet finalized).
```
"I was saying that maybe we should "
  ← interimLenRef = 0 (no interim)

"I was saying that maybe we should consider the"
                                   ^^^^^^^^^^^^
  interimLenRef = 12 (interim suffix)

"I was saying that maybe we should consider the options"
                                   ^^^^^^^^^^^^^^^^^^^^
  interimLenRef = 20 (interim updated)

"I was saying that maybe we should consider the options "
  ← interim finalized, interimLenRef = 0
```
When a new speech result arrives, `applyUpdate` slices off the old interim suffix and appends the new final + interim text:

```typescript
const base = prev.slice(0, prev.length - interimLenRef.current);
interimLenRef.current = interim.length;
return base + final + interim;
```

When the user manually edits (types, deletes, pastes), `onManualEdit()` sets `interimLenRef` to 0. The next speech result just appends — it never touches what the user edited.
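As a self-contained illustration, the same merge logic can be written as a standalone function replaying the sequence from the diagram above (the `merge` helper and module-level `interimLen` are stand-ins for the hook's `applyUpdate` and `interimLenRef`, not its actual exports):

```typescript
// Stand-in for interimLenRef: length of the current interim suffix.
let interimLen = 0;

function merge(prev: string, finalText: string, interim: string): string {
  const base = prev.slice(0, prev.length - interimLen); // drop old interim suffix
  interimLen = interim.length;                          // remember new suffix length
  return base + finalText + interim;
}

let text = 'I was saying that maybe we should ';
text = merge(text, '', 'consider the');          // interim appears
text = merge(text, '', 'consider the options');  // recognizer refines the interim
text = merge(text, 'consider the options ', ''); // interim finalized
// text === "I was saying that maybe we should consider the options "
```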
Chrome/WebKit will silently end recognition after a silence timeout. The hook detects this via onend and restarts automatically if the user hasn't toggled off. Each new recognition session tracks its own processedCount so it only emits its own results.
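A sketch of that restart loop, under assumed names (`shouldListen`, `startSession`) rather than the hook's actual internals; the constructor is injected so the logic can run outside a browser:

```typescript
type UpdateFn = (finalText: string, interim: string) => void;

// Mirrors whether the user has toggled the mic on.
let shouldListen = true;

function startSession(Recognition: new () => any, onUpdate: UpdateFn): any {
  const recognition = new Recognition();
  recognition.continuous = true;
  recognition.interimResults = true;
  let processedCount = 0; // per-session: results already finalized and emitted

  recognition.onresult = (event: any) => {
    let finalText = '';
    let interim = '';
    for (let i = processedCount; i < event.results.length; i++) {
      const transcript = event.results[i][0].transcript;
      if (event.results[i].isFinal) {
        finalText += transcript;
        processedCount = i + 1; // never re-emit a finalized result
      } else {
        interim += transcript;
      }
    }
    onUpdate(finalText, interim);
  };

  recognition.onend = () => {
    // Browser ended the session (e.g. silence timeout).
    // Restart unless the user explicitly toggled off.
    if (shouldListen) startSession(Recognition, onUpdate);
  };

  recognition.start();
  return recognition;
}
```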
```tsx
import { useState, useCallback } from 'react';
import { useDictation } from './use-dictation';

function ChatInput() {
  const [text, setText] = useState('');
  const { recording, toggle, applyUpdate, onManualEdit, supported } = useDictation();

  const handleUpdate = useCallback((final: string, interim: string) => {
    setText(prev => applyUpdate(prev, final, interim));
  }, [applyUpdate]);

  return (
    <div>
      <textarea
        value={text}
        onChange={e => { onManualEdit(); setText(e.target.value); }}
      />
      {supported && (
        <button onClick={() => toggle(handleUpdate)}>
          {recording ? 'Stop' : 'Mic'}
        </button>
      )}
    </div>
  );
}
```

```typescript
const { ... } = useDictation({ lang: 'es-ES' }); // default: 'en-US'
```

| Return value | Type | Description |
|---|---|---|
| `recording` | `boolean` | Whether recognition is active |
| `toggle(onUpdate)` | `(cb) => void` | Start/stop. Pass your update callback |
| `applyUpdate(prev, final, interim)` | `string` | Pure string transform — merges speech into text |
| `onManualEdit()` | `() => void` | Call from `onChange` to reset interim tracking |
| `supported` | `boolean` | Whether `SpeechRecognition` exists in this browser |
**Why not own the text state?** The hook could manage text internally and return `[text, setText]`. But that makes it hard to compose with other things that touch the same input — paste handlers, autocomplete, form libraries. Instead, `applyUpdate` is a pure string transform. You own the state; the hook just knows how to merge speech into it.

**Why not take a ref?** Coupling to a DOM element would limit where you can use it (what about contenteditable? Zustand? A non-React renderer?). The suffix-tracking approach is DOM-independent.

**Why no cursor-position insertion?** Tracking cursor position across speech results, user edits, and React re-renders is complex and fragile. The suffix approach covers the common case (chat input, search box, any append-oriented input) without that complexity. If you need mid-text insertion, you'd want a different approach entirely (probably `execCommand` or `InputEvent`-based).

**Why `toggle(onUpdate)` instead of a stable callback?** The update handler often closes over `setText`, which may change. Passing it at toggle-time means you don't need to worry about stale closures — the latest handler is captured in a ref internally.
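Stripped of React, that ref capture is just a mutable box the recognition handlers read through (illustrative names; inside the hook the box would be a `useRef`):

```typescript
type UpdateFn = (finalText: string, interim: string) => void;

// Mutable box holding the latest update callback.
const onUpdateRef: { current: UpdateFn | null } = { current: null };

function toggle(onUpdate: UpdateFn): void {
  onUpdateRef.current = onUpdate; // capture the latest handler at toggle time
  // ...start or stop recognition here
}

// Handlers never close over a specific callback; they read the box,
// so they always invoke the most recently passed handler.
function emitResult(finalText: string, interim: string): void {
  onUpdateRef.current?.(finalText, interim);
}
```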
- Desktop Chrome (Mac/Windows/Linux): Works well. Uses Google's cloud speech service.
- iOS (any browser): All iOS browsers use WebKit. `continuous` mode is less reliable — recognition stops more aggressively on silence. The auto-restart logic handles this, but users may notice brief gaps.
- Firefox: `SpeechRecognition` is not supported; `supported` will be `false`.
- Safari (macOS): Supported since Safari 14.1. Uses Apple's on-device recognition.
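The `supported` flag presumably reduces to a feature check along these lines — Chrome and Safari expose the constructor under a `webkit` prefix (`detectSpeechRecognition` is an illustrative name, not an export of the hook):

```typescript
// Returns true if a SpeechRecognition constructor exists on the given
// window-like object, including the webkit-prefixed variant.
function detectSpeechRecognition(win: any): boolean {
  return Boolean(win.SpeechRecognition || win.webkitSpeechRecognition);
}
```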
MIT