Abbreviation | Task Type | Flow Example |
---|---|---|
TTS | Text-to-Speech | text (EN) → audio (EN) |
T2TT / MT | Text-to-Text-Translation / Machine Translation | text (FR) → text (EN) |
T2ST | Text-to-Speech-Translation | text (DE) → audio (EN) |
STT / ASR | Speech-to-Text / Automatic-Speech-Recognition | audio (PT) → text (PT) |
S2TT | Speech-to-Text-Translation | audio (FR) → text (EN) |
S2ST | Speech-to-Speech-Translation | audio (ES) → audio (EN) |
SLI | Spoken-Language-Identification | audio (unknown) → lang: "pt" |
WLI | Written-Language-Identification | text (unknown) → lang: "en" |
WW | Wake-Word Detection | passive audio → hotword → trigger |
VAD | Voice-Activity-Detection | audio stream → speech segmenting |
QA | Question-Answering | text (prompt) → text (generated) |
DT | Dialog-Transformer | text (generated) → text (modified) |
UT | Utterance-Transformer | text (prompt) → text (modified) |

Task | Input Type | Output Type | Input language == output language |
---|---|---|---|
Question Answering (QA) | text | text | yes |
Dialog Transformer (DT) | text | text | yes |
Text-to-text-translation (MT) | text | text | no |
Text-to-speech (TTS) | text | audio | yes |
Text-to-speech-translation (T2ST) | text | audio | no |
Speech-to-text (STT) | audio | text | yes |
Speech-to-speech-translation (S2ST) | audio | audio | no |
Speech-to-text-translation (S2TT) | audio | text | no |
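
The table above can also be read as a lookup from a signature (input type, output type, whether the language is preserved) to a task. The following is a minimal Python sketch of that reading. `TaskSpec`, `TASKS`, and `find_tasks` are hypothetical names chosen for illustration and are not part of any OVOS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    name: str
    input_type: str      # "text" or "audio"
    output_type: str     # "text" or "audio"
    same_language: bool  # True when output language == input language


# One entry per row of the table above (illustrative data only).
TASKS = [
    TaskSpec("QA", "text", "text", True),
    TaskSpec("DT", "text", "text", True),
    TaskSpec("MT", "text", "text", False),
    TaskSpec("TTS", "text", "audio", True),
    TaskSpec("T2ST", "text", "audio", False),
    TaskSpec("STT", "audio", "text", True),
    TaskSpec("S2ST", "audio", "audio", False),
    TaskSpec("S2TT", "audio", "text", False),
]


def find_tasks(input_type, output_type, same_language):
    """Return the names of all tasks that match a given signature."""
    wanted = (input_type, output_type, same_language)
    return [
        t.name
        for t in TASKS
        if (t.input_type, t.output_type, t.same_language) == wanted
    ]
```

For example, `find_tasks("audio", "text", False)` returns only `S2TT`, while `find_tasks("text", "text", True)` returns both `QA` and `DT`, since those two share a signature and differ only in purpose.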

Combined Plugins | Task Description | Resulting Task |
---|---|---|
MT + TTS | Text-to-Speech Translation | T2ST |
STT + MT | Speech-to-Text Translation | S2TT |
STT + MT + TTS | Speech-to-Speech Translation | S2ST |
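
One way to reason about these combinations is to chain plugin signatures and check that each output type feeds the next input type. The sketch below assumes only the three building blocks from the table. `SIGNATURES`, `COMPOSITE_NAMES`, and `compose` are hypothetical names, not OVOS identifiers.

```python
# (input type, output type, changes language?) for each building block.
SIGNATURES = {
    "STT": ("audio", "text", False),
    "MT": ("text", "text", True),
    "TTS": ("text", "audio", False),
}

# Composite task names from the table above, keyed by the chained signature.
COMPOSITE_NAMES = {
    ("text", "audio", True): "T2ST",
    ("audio", "text", True): "S2TT",
    ("audio", "audio", True): "S2ST",
}


def compose(*steps):
    """Chain plugin types and return the signature of the whole chain.

    Raises ValueError when a step cannot consume the previous output.
    """
    if not steps:
        raise ValueError("at least one step is required")
    in_type, out_type, translates = SIGNATURES[steps[0]]
    for step in steps[1:]:
        step_in, step_out, step_translates = SIGNATURES[step]
        if step_in != out_type:
            raise ValueError(f"{step} expects {step_in}, got {out_type}")
        out_type = step_out
        translates = translates or step_translates
    return in_type, out_type, translates
```

With these definitions, `COMPOSITE_NAMES[compose("STT", "MT", "TTS")]` yields `S2ST`, and an ill-typed chain such as `compose("TTS", "MT")` raises `ValueError` because MT cannot consume audio.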

text (DE)
→ WLI: detect written lang (DE)
→ MT: translate (DE → EN)
→ TTS (EN)
= audio (EN)

audio (FR)
→ SLI: detect spoken lang (FR)
→ STT (FR)
→ MT: translate (FR → EN)
= text (EN)

audio (ES)
→ SLI: detect spoken lang (ES)
→ STT (ES)
→ MT: translate (ES → EN)
→ TTS (EN)
= audio (EN)

If the input language is not known before inference, it can be detected with the SLI and WLI plugins, which allows dynamic-language, multi-user, and multilingual setups.

Combined Plugins | Task Description | Resulting Task |
---|---|---|
WLI + TTS | Text-to-Speech | TTS (multilingual) |
WLI + MT | Text Translation | MT (multilingual) |
WLI + MT + TTS | Text-to-Speech Translation | T2ST (multilingual) |
SLI + STT | Speech-to-Text | STT (multilingual) |
SLI + STT + MT | Speech-to-Text Translation | S2TT (multilingual) |
SLI + STT + MT + TTS | Speech-to-Speech Translation | S2ST (multilingual) |
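
The detection step only needs to run when the caller does not already know the language. A minimal sketch of the WLI + MT combination under that assumption follows. All three functions are hypothetical stubs (a toy keyword check and a tagging "translator"), not real plugin calls.

```python
def detect_written_lang(text):
    """Hypothetical WLI stub: a toy keyword check, not a real detector."""
    return "de" if "guten" in text.lower() else "en"


def translate(text, source, target):
    """Hypothetical MT stub: tags the text instead of translating it."""
    if source == target:
        return text
    return f"[{source}->{target}] {text}"


def translate_text(text, target="en", source=None):
    """MT with optional WLI: detect the source language only if not given."""
    if source is None:
        source = detect_written_lang(text)
    return source, translate(text, source, target)
```

Calling `translate_text("Guten Morgen")` runs detection first, while `translate_text("bonjour", source="fr")` skips it and trusts the caller. The same pattern would apply to SLI in front of STT.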

Plugin | Purpose | Position |
---|---|---|
UT | Normalize / rewrite user input (e.g. "can u pls tell me?" → "please tell me") | Before QA |
QA | Core NLU + response generation (LLM / skill selection / intent) | Middle |
DT | Rewrite generated response (e.g. dry → humorous, formal → friendly) | After QA |

For OVOS purposes, consider QA equivalent to ovos-core: this step uses intent classification to select a skill, which is responsible for executing some action and generating a dialog. DT is applied after ovos-core has generated an answer, to rewrite it and give it a personality.
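
The ordering described above (UT before QA, DT after it) amounts to a simple function chain. The sketch below uses toy stand-ins for each stage. None of these functions correspond to real ovos-core or plugin APIs, and the normalization rules and canned answer are assumptions made for illustration.

```python
# Toy abbreviation table for the hypothetical utterance transformer.
ABBREVIATIONS = {"u": "you", "pls": "please"}


def utterance_transformer(utterance):
    """Hypothetical UT: lowercase and expand a few chat abbreviations."""
    words = utterance.lower().split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def question_answering(utterance):
    """Stand-in for the QA step: a canned answer for one known request."""
    if "time" in utterance:
        return "It is noon."
    return "I do not know."


def dialog_transformer(dialog):
    """Hypothetical DT: restyle the generated answer with a friendlier tone."""
    return f"Sure thing! {dialog}"


def handle(utterance):
    """Run the three stages in order: UT, then QA, then DT."""
    normalized = utterance_transformer(utterance)
    answer = question_answering(normalized)
    return dialog_transformer(answer)
```

Because each stage maps text to text, UT and DT can be added or removed without changing the QA step, which matches the way the pipelines are built up one stage at a time.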

Pipeline | Description |
---|---|
audio → WW + VAD + STT → QA → TTS → audio | Direct spoken Q&A |
audio → WW + VAD + STT → UT → QA → TTS → audio | Input cleanup for better NLU |
audio → WW + VAD + STT → UT → QA → DT → TTS → audio | Voice assistant with emotion/tone control |
audio → WW + VAD + SLI + STT → QA → TTS | Multilingual support |
audio → WW + VAD + SLI + STT + MT + UT → QA → DT + MT + TTS | Fully featured multilingual, polyglot (bidirectional translation), personalized voice assistant |

audio (EN)
→ WW + VAD
→ STT (EN)
→ QA (EN → EN)
→ TTS (EN)
= audio (EN)

audio (PT)
→ WW + VAD
→ SLI: lang="pt"
→ STT (PT)
→ UT (normalize)
→ MT (PT → EN)
→ QA (EN → EN)
→ DT (personality/style)
→ MT (EN → PT)
→ TTS (PT)
= audio (PT)
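
The bidirectional translation in the last flow can be sketched as a wrapper around an English-only QA step: translate in, answer, translate back. Everything below is a hypothetical stand-in. The phrase table replaces a real MT plugin, and `answer_in_english` replaces the QA step.

```python
# Hypothetical phrase table standing in for an MT plugin.
PHRASES = {
    ("pt", "en", "que horas são"): "what time is it",
    ("en", "pt", "it is noon"): "é meio-dia",
}


def translate(text, source, target):
    """Toy MT: a no-op for same-language pairs, a table lookup otherwise."""
    if source == target:
        return text
    return PHRASES[(source, target, text)]


def answer_in_english(question):
    """Hypothetical English-only QA step."""
    answers = {"what time is it": "it is noon"}
    return answers.get(question, "i do not know")


def polyglot_answer(text, user_lang, core_lang="en"):
    """Wrap an English-only QA step with translation on both sides."""
    question = translate(text, user_lang, core_lang)  # MT (PT → EN)
    answer = answer_in_english(question)              # QA (EN → EN)
    return translate(answer, core_lang, user_lang)    # MT (EN → PT)
```

When the user language already matches the core language, both translation calls are no-ops, so the same wrapper covers the monolingual flow as well.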