I talk to my computer all day — in a fast mix of Vietnamese and English, full of odd technical names — and it almost always gets it right. People ask how. Here's the whole idea in plain language.
It isn't one piece of magic. It's three simple steps, each one fixing the weakness of the step before it.
My voice first goes to a speech-to-text engine (a service called Soniox). Think of it as a lightning-fast stenographer: it writes down what I say almost the instant I say it.
It's excellent with everyday words. But like any stenographer hearing an unfamiliar name for the first time, it sometimes guesses wrong on unusual ones — a tool called "Claude" comes out as "cloud," or "tmux" becomes "tea mux."
So Step 1 gives me speed, but a rough draft.
That rough draft is then handed to a small, cheap AI model that works like an editor. It doesn't just copy the text — it reads the whole sentence, figures out what I meant from the context, and fixes the slips. "Cross code, fix the bug" becomes "Claude Code, fix the bug." It even smooths my mixed Vietnamese-and-English into one clean instruction.
The clever part: this model only has to tidy up text, not be a genius, so a cheap, fast one is more than enough. And "cheap" means it can run on every single sentence without my ever having to worry about the bill.
So Step 2 gives me understanding.
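For the curious, the editor step can be sketched in a few lines of Python. This is only an illustration, not the real code: `build_editor_prompt`, `correct_transcript`, and `fake_model` are hypothetical names, the prompt wording is invented, and I'm assuming the editor is handed a small list of known slips along with the draft. The real model call is replaced by a stand-in so the sketch runs on its own.

```python
from typing import Callable, Dict


def build_editor_prompt(draft: str, known_slips: Dict[str, str]) -> str:
    """Assemble the instruction given to the editor model for one sentence."""
    hints = "\n".join(
        f'- "{heard}" usually means "{meant}"'
        for heard, meant in sorted(known_slips.items())
    )
    return (
        "You are a transcript editor. Read the whole sentence, work out what "
        "the speaker meant, fix misheard names and jargon, and return one "
        "clean instruction.\n"
        f"Known slips:\n{hints}\n"
        f"Draft: {draft}"
    )


def correct_transcript(
    draft: str,
    known_slips: Dict[str, str],
    call_model: Callable[[str], str],
) -> str:
    """Send the rough draft through the editor model and return its answer."""
    return call_model(build_editor_prompt(draft, known_slips)).strip()


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; it only handles the demo sentence."""
    draft = prompt.rsplit("Draft: ", 1)[1]
    return draft.replace("Cross code", "Claude Code")


slips = {"cross code": "Claude Code", "tea mux": "tmux"}
print(correct_transcript("Cross code, fix the bug", slips, fake_model))
```

Passing the model in as a plain function keeps the sketch independent of any particular provider: swapping in a real small model would mean replacing `fake_model` and nothing else.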
The third step is the part that makes it feel like it truly knows me.
Every night, a little helper quietly reviews the whole day's conversations, spots the words it kept getting wrong, and updates a personal cheat-sheet — my names, my projects, my slang, the particular way I pronounce things. The next morning, the system is a touch smarter about exactly how I talk. Over weeks, it has gradually learned my personal vocabulary, all on its own.
Most voice tools are one-size-fits-all. Mine is tailored to me, and it keeps getting better.
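One way the overnight review could work, sketched in Python under some loud assumptions: the day's log is a list of (rough draft, corrected text) pairs, a slip only counts once it has shown up at least twice, and the cheat-sheet is a simple dictionary. The names `find_slips` and `nightly_update` and the `min_count` threshold are made up for this example; the real helper may work quite differently.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, Iterator, List, Tuple


def find_slips(draft: str, corrected: str) -> Iterator[Tuple[str, str]]:
    """Yield (heard, meant) phrases wherever the editor replaced words."""
    a, b = draft.split(), corrected.split()
    matcher = SequenceMatcher(
        a=[w.lower() for w in a], b=[w.lower() for w in b], autojunk=False
    )
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            yield " ".join(a[i1:i2]).lower(), " ".join(b[j1:j2])


def nightly_update(
    cheat_sheet: Dict[str, str],
    day_log: List[Tuple[str, str]],
    min_count: int = 2,  # assumed threshold: ignore one-off corrections
) -> Dict[str, str]:
    """Return a new cheat-sheet with the day's recurring slips added."""
    counts: Counter = Counter()
    for draft, corrected in day_log:
        counts.update(find_slips(draft, corrected))
    updated = dict(cheat_sheet)  # leave the original untouched
    for (heard, meant), n in counts.items():
        if n >= min_count:
            updated[heard] = meant
    return updated


log = [
    ("open tea mux now", "open tmux now"),
    ("restart tea mux", "restart tmux"),
    ("ask cloud about it", "ask Claude about it"),
]
print(nightly_update({}, log))  # {'tea mux': 'tmux'}
```

In this toy log, "tea mux" was fixed twice, so it makes it onto the cheat-sheet, while "cloud" was fixed only once and has to wait for another night.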
The trick isn't a single brilliant component — it's the teamwork:
- a fast listener for speed,
- a cheap editor for understanding, and
- an overnight learner that personalizes everything to me.
Each layer is simple and inexpensive on its own. Stacked together, they add up to a voice assistant that understands my messy, bilingual, jargon-heavy speech — and gets a little better every single day.
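To show how the layers stack, here is a rough Python sketch of the wiring. Everything in it is hypothetical: `VoicePipeline`, `stub_listen`, and `stub_edit` are illustrative names, and the two stubs stand in for the real speech-to-text service and editor model, whose actual interfaces are not shown here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class VoicePipeline:
    listen: Callable[[bytes], str]              # fast listener (stand-in)
    edit: Callable[[str, Dict[str, str]], str]  # cheap editor (stand-in)
    cheat_sheet: Dict[str, str] = field(default_factory=dict)
    day_log: List[Tuple[str, str]] = field(default_factory=list)

    def handle(self, audio: bytes) -> str:
        """Run one utterance through the listener, then the editor."""
        draft = self.listen(audio)
        clean = self.edit(draft, self.cheat_sheet)
        # Keep both versions so the overnight learner can compare them.
        self.day_log.append((draft, clean))
        return clean


def stub_listen(audio: bytes) -> str:
    """Pretend the audio bytes are already the rough transcript."""
    return audio.decode("utf-8")


def stub_edit(draft: str, cheat_sheet: Dict[str, str]) -> str:
    """Toy editor: apply known fixes by plain substitution."""
    for heard, meant in cheat_sheet.items():
        draft = draft.replace(heard, meant)
    return draft


pipeline = VoicePipeline(stub_listen, stub_edit, {"tea mux": "tmux"})
print(pipeline.handle(b"open tea mux"))  # open tmux
```

The overnight learner does not appear in `handle` on purpose: it runs later, reads `day_log`, and writes back into `cheat_sheet`, so it never slows down the live path.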