@JarbasAl
Last active April 8, 2025 17:34

Tasks

| Emoji Task | Type | Flow Example |
|---|---|---|
| 🔊 TTS | Text-to-Speech | text (EN) → audio (EN) |
| 🌐 T2TT / 🌐 MT | Text-to-Text-Translation / Machine Translation | text (FR) → text (EN) |
| 🎤 T2ST | Text-to-Speech-Translation | text (DE) → audio (EN) |
| 🗣️ STT / 🗣️ ASR | Speech-to-Text / Automatic-Speech-Recognition | audio (PT) → text (PT) |
| 🗣️📝 S2TT | Speech-to-Text-Translation | audio (FR) → text (EN) |
| 🗣️🔄 S2ST | Speech-to-Speech-Translation | audio (ES) → audio (EN) |
| 🧠 SLI | Spoken-Language-Identification | audio (unknown) → lang: "pt" |
| 🧾 WLI | Written-Language-Identification | text (unknown) → lang: "en" |
| 🎯 WW | Wake-Word Detection | passive audio → hotword → trigger |
| 🎚️ VAD | Voice-Activity-Detection | audio stream → speech segmenting |
| 🤖 QA | Question-Answering | text (prompt) → text (generated) |
| ✍️ DT | Dialog-Transformer | text (generated) → text (modified) |
| ✍️ UT | Utterance-Transformer | text (prompt) → text (modified) |

| Task | Input type | Output type | Input language == output language |
|---|---|---|---|
| 🤖 Question Answering (QA) | text | text | yes |
| ✍️ Dialog Transformer (DT) | text | text | yes |
| 🌐 Text-to-text-translation (MT) | text | text | no |
| 🔊 Text-to-speech (TTS) | text | audio | yes |
| 🎤 Text-to-speech-translation (T2ST) | text | audio | no |
| 🗣️ Speech-to-text (STT) | audio | text | yes |
| 🗣️🔄 Speech-to-speech-translation (S2ST) | audio | audio | no |
| 🗣️📝 Speech-to-text-translation (S2TT) | audio | text | no |

Plugin combinations

| Combined Plugins | Task Description | Emoji Task |
|---|---|---|
| 🌐 MT + 🔊 TTS | Text-to-Speech Translation | 🎤 T2ST |
| 🗣️ STT + 🌐 MT | Speech-to-Text Translation | 🗣️📝 S2TT |
| 🗣️ STT + 🌐 MT + 🔊 TTS | Speech-to-Speech Translation | 🗣️🔄 S2ST |

🎤 Text → Speech Translation (T2ST)

text (DE)
→ 🧾 detect written lang (DE)
→ 🌐 translate (DE → EN)
→ 🔊 TTS (EN)
= audio (EN)

🗣️📝 Speech → Text Translation (S2TT)

audio (FR)
→ 🧠 detect spoken lang (FR)
→ 🗣️ STT (FR)
→ 🌐 translate (FR → EN)
= text (EN)

🗣️🔄 Speech → Speech Translation (S2ST)

audio (ES)
→ 🧠 detect spoken lang (ES)
→ 🗣️ STT (ES)
→ 🌐 translate (ES → EN)
→ 🔊 TTS (EN)
= audio (EN)
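The combinations above amount to function composition. Here is a minimal sketch, assuming each plugin can be modelled as a plain callable. The signatures and the `fake_*` stand-ins are hypothetical and exist only to make the example runnable; real OVOS plugins expose their own interfaces.

```python
from typing import Callable

# Hypothetical plugin signatures, chosen for this sketch only.
STT = Callable[[bytes, str], str]    # (audio, lang) -> text
MT = Callable[[str, str, str], str]  # (text, source_lang, target_lang) -> text
TTS = Callable[[str, str], bytes]    # (text, lang) -> audio


def make_t2st(mt: MT, tts: TTS):
    """MT + TTS = text-to-speech translation."""
    def t2st(text: str, source_lang: str, target_lang: str) -> bytes:
        return tts(mt(text, source_lang, target_lang), target_lang)
    return t2st


def make_s2tt(stt: STT, mt: MT):
    """STT + MT = speech-to-text translation."""
    def s2tt(audio: bytes, source_lang: str, target_lang: str) -> str:
        return mt(stt(audio, source_lang), source_lang, target_lang)
    return s2tt


def make_s2st(stt: STT, mt: MT, tts: TTS):
    """STT + MT + TTS = speech-to-speech translation."""
    s2tt = make_s2tt(stt, mt)

    def s2st(audio: bytes, source_lang: str, target_lang: str) -> bytes:
        return tts(s2tt(audio, source_lang, target_lang), target_lang)
    return s2st


# Toy stand-ins so the sketch runs without any models.
def fake_stt(audio: bytes, lang: str) -> str:
    return audio.decode("utf-8")  # pretend the audio bytes are the transcript


def fake_mt(text: str, source_lang: str, target_lang: str) -> str:
    return f"[{source_lang}->{target_lang}] {text}"  # tag instead of translating


def fake_tts(text: str, lang: str) -> bytes:
    return text.encode("utf-8")  # pretend the encoded text is audio
```

With these stand-ins, `make_s2st(fake_stt, fake_mt, fake_tts)(b"hola", "es", "en")` returns `b"[es->en] hola"`, which shows the data passing through STT, then MT, then TTS.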

Language Detection

If the input language is not known before inference, it can be detected with SLI and WLI plugins, which allows dynamic-language, multi-user, and multilingual setups.

| Combined Plugins | Task Description | Emoji Task |
|---|---|---|
| 🧾 WLI + 🔊 TTS | Text-to-Speech | 🔊 TTS (multilingual) |
| 🧾 WLI + 🌐 MT | Text Translation | 🌐 MT (multilingual) |
| 🧾 WLI + 🌐 MT + 🔊 TTS | Text-to-Speech Translation | 🎤 T2ST (multilingual) |
| 🧠 SLI + 🗣️ STT | Speech-to-Text | 🗣️ STT (multilingual) |
| 🧠 SLI + 🗣️ STT + 🌐 MT | Speech-to-Text Translation | 🗣️📝 S2TT (multilingual) |
| 🧠 SLI + 🗣️ STT + 🌐 MT + 🔊 TTS | Speech-to-Speech Translation | 🗣️🔄 S2ST (multilingual) |

Generative AI / Persona

| Plugin | Purpose | Position |
|---|---|---|
| ✍️ UT | Normalize / rewrite user input (e.g. “can u pls tell me?” → “please tell me”) | Before 🤖 QA |
| 🤖 QA | Core NLU + response generation (LLM / skill selection / intent) | Middle |
| ✍️ DT | Rewrite generated response (e.g. dry → humorous, formal → friendly) | After 🤖 QA |

For OVOS purposes, consider 🤖 QA equivalent to ovos-core. In OVOS, this step uses intent classification to select a skill, which is responsible for executing an action and generating a dialog.

✍️ DT runs after ovos-core has generated an answer, rewriting it to give it a personality.
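The ordering UT → QA → DT can be sketched as three text-to-text functions applied in sequence. Everything below is a toy stand-in with hypothetical names: the word replacements, the canned answer, and the appended phrase are invented for illustration and do not reflect how ovos-core or any real plugin behaves.

```python
def utterance_transformer(text: str) -> str:
    """UT: normalize user input before it reaches QA (toy word replacements)."""
    replacements = {"u": "you", "pls": "please"}
    return " ".join(replacements.get(word, word) for word in text.lower().split())


def question_answering(text: str) -> str:
    """QA: produce a response (a canned lookup stands in for intents or an LLM)."""
    answers = {"what time is it": "It is noon."}
    return answers.get(text.rstrip("?"), "I do not know.")


def dialog_transformer(text: str) -> str:
    """DT: restyle the generated response (toy 'friendly' persona)."""
    return text + " Hope that helps!"


def persona_reply(text: str) -> str:
    """Apply the stages in the order UT -> QA -> DT."""
    return dialog_transformer(question_answering(utterance_transformer(text)))
```

Here `persona_reply("What time is it?")` returns `"It is noon. Hope that helps!"`: UT lowercases the input, QA answers it, and DT only touches the generated reply, never the user input.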


🎯 Voice Pipeline: STT + Generation + TTS (Variants)

| Pipeline | Description |
|---|---|
| audio → 🎯 WW + 🎚️ VAD + 🗣️ STT → 🤖 QA → 🔊 TTS → audio | Direct spoken Q&A |
| audio → 🎯 WW + 🎚️ VAD + 🗣️ STT → ✍️ UT → 🤖 QA → 🔊 TTS → audio | Input cleanup for better NLU |
| audio → 🎯 WW + 🎚️ VAD + 🗣️ STT → ✍️ UT → 🤖 QA → ✍️ DT → 🔊 TTS → audio | Voice assistant with emotion/tone control |
| audio → 🎯 WW + 🎚️ VAD + 🧠 SLI + 🗣️ STT → 🤖 QA → 🔊 TTS → audio | Multilingual support |
| audio → 🎯 WW + 🎚️ VAD + 🧠 SLI + 🗣️ STT + 🌐 MT + ✍️ UT → 🤖 QA → ✍️ DT + 🌐 MT + 🔊 TTS → audio | Fully featured multilingual, polyglot (bidirectional translation), personalized voice assistant |

🤖 Generative AI Task Flow Examples

Voice Input → Answer (Monolingual)

audio (EN)
→ 🎯 WW + 🎚️ VAD
→ 🗣️ STT (EN)
→ 🤖 QA (EN → EN)
→ 🔊 TTS (EN)
= audio (EN)

Voice Input → Persona Reply (Multilingual)

audio (PT)
→ 🎯 WW + 🎚️ VAD
→ 🧠 SLI: lang="pt"
→ 🗣️ STT (PT)
→ ✍️ UT (normalize)
→ 🌐 MT (PT → EN)
→ 🤖 QA (EN → EN)
→ ✍️ DT (personality/style)
→ 🌐 MT (EN → PT)
→ 🔊 TTS (PT)
= audio (PT)
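The multilingual flow above wraps an English-only QA step in two translation calls: user language to English on the way in, English back to the user language on the way out. A minimal sketch of that bidirectional translation step follows, assuming QA and MT are plain callables. `make_polyglot_qa`, `toy_mt`, and `toy_qa` are hypothetical, and the tiny phrase tables stand in for real translation and answer models.

```python
from typing import Callable


def make_polyglot_qa(
    qa: Callable[[str], str],
    mt: Callable[[str, str, str], str],
    qa_lang: str = "en",
) -> Callable[[str, str], str]:
    """Let a single-language QA step serve users in other languages via MT."""
    def answer(text: str, user_lang: str) -> str:
        if user_lang == qa_lang:
            return qa(text)  # no translation needed
        pivot = mt(text, user_lang, qa_lang)  # e.g. PT -> EN
        reply = qa(pivot)                     # EN -> EN
        return mt(reply, qa_lang, user_lang)  # e.g. EN -> PT
    return answer


# Toy stand-ins (hypothetical) so the sketch runs without any models.
PHRASES = {
    ("pt", "en"): {"que horas são": "what time is it"},
    ("en", "pt"): {"it is noon": "é meio-dia"},
}


def toy_mt(text: str, source_lang: str, target_lang: str) -> str:
    # Unknown phrases pass through untranslated.
    return PHRASES.get((source_lang, target_lang), {}).get(text, text)


def toy_qa(text: str) -> str:
    return {"what time is it": "it is noon"}.get(text, "i do not know")


polyglot_qa = make_polyglot_qa(toy_qa, toy_mt)
```

With these stand-ins, `polyglot_qa("que horas são", "pt")` returns `"é meio-dia"`, and an English query bypasses MT altogether. In the full flow, UT would run before the first MT call and DT before the second, as listed in the steps above.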