This method outlines a scalable and efficient approach to generating fully offline, real-time text-to-speech (TTS) voices, particularly in low-resource languages or dialects, using no real recordings of the target speaker. It combines donor TTS, voice conversion, and lightweight TTS model training to synthesize high-quality, personalized voices that can run on edge devices like the Raspberry Pi 4.
Many languages and dialects lack open, high-quality, and privacy-respecting TTS solutions. Traditional approaches often require:
- Large corpora of transcribed recordings,
- Significant compute for model training,
- Ongoing cloud connectivity at runtime.
This method addresses these limitations by:
- Removing the need for real speaker recordings (optional target voice),
- Generating training data synthetically,
- Producing compact, embeddable models optimized for fully offline use.
The technique is a three-stage pipeline:
1. Donor speech generation: A high-quality, language-specific TTS system (proprietary or open-source) generates an arbitrarily large set of clean audio/text pairs. This serves as the base dataset for further processing.
2. Voice conversion: Each synthesized utterance is converted into a new voice using voice cloning or speaker adaptation techniques, transferring the prosody, style, and speaker identity of a target voice onto the donor speech.
- If no specific target speaker is desired, a generic or randomly sampled speaker identity can be used.
- For ethical use, all real voices used must be public domain or have documented consent.
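The conversion step above can be sketched as a batch loop over the donor utterances. The `convert` callable is a stand-in for any voice-conversion model (the text names none), and the random sampling implements the generic-speaker option mentioned above.

```python
import random

def convert_dataset(utterances, convert, speaker_pool, target_speaker=None, seed=0):
    """Re-voice every donor utterance with a single speaker identity.

    `utterances`     -- list of (wav_path, transcript) pairs from the donor TTS
    `convert`        -- stand-in for any voice-conversion model: takes a wav
                        path and a speaker ID, returns the converted wav path
    `speaker_pool`   -- speaker IDs to sample from when no target is given
    `target_speaker` -- optional explicit target; per the ethics notes, must be
                        public domain or used with documented consent
    Returns the converted (wav_path, transcript) pairs and the speaker used.
    """
    rng = random.Random(seed)
    speaker = target_speaker if target_speaker is not None else rng.choice(speaker_pool)
    converted = [(convert(wav, speaker), text) for wav, text in utterances]
    return converted, speaker
```

Keeping the transcript paired with each converted file means the output of this stage is immediately usable as a training set for the next one.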
3. Lightweight model training: The cloned dataset is used to train a compact, end-to-end TTS model based on VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), a non-autoregressive architecture well suited to fast, low-memory inference.
- Models are quantized and optimized for real-time inference on CPUs.
- Final models are typically under 100MB and require no internet access.
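A rough arithmetic sketch shows why quantization is what brings the checkpoint under the stated size budget. The 40M parameter count below is illustrative only (the real figure depends on the model configuration); the comparison is float32 weights versus int8 weights, as produced by dynamic quantization.

```python
def model_size_mb(n_params, bytes_per_param):
    """Rough on-disk size of a weights-only checkpoint, in MiB."""
    return n_params * bytes_per_param / (1024 ** 2)

# Illustrative parameter count for a VITS-style model; the exact
# number depends on the configuration chosen.
n_params = 40_000_000

fp32_mb = model_size_mb(n_params, 4)  # float32: 4 bytes per weight
int8_mb = model_size_mb(n_params, 1)  # int8 after dynamic quantization
```

Under these assumptions the float32 checkpoint is roughly 150 MB, while the int8 version is under 40 MB, comfortably inside the 100 MB target and a 4x reduction for free at load time.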
Key benefits:
- Zero real data required: enables voice creation even where no recordings exist for a speaker or language.
- Fully offline execution: models are deployable on edge devices without cloud access.
- Customizable voices: voices can be stylized or personalized without retraining donor models.
- Privacy-preserving: final voices are not direct replicas of real individuals unless explicitly intended and authorized.
- Rapid language expansion: enables TTS deployment in underrepresented or low-resource linguistic communities.
This method respects individual voice rights and emphasizes responsible AI deployment:
- Real human voices are only used with explicit consent or when in the public domain.
- The voice conversion model acts as a filter, intentionally reducing fidelity to the original voice where anonymization is required.
- Final TTS models are typically non-identifiable, ensuring that output speech is synthetic and not a replica of any known speaker.
Example applications include:
- Open-source voice assistants (e.g., OpenVoiceOS)
- Home automation and domotics (e.g., Home Assistant)
- Real-time AAC (Augmentative and Alternative Communication) tools
- Embedded IoT devices and robotics
- Video game NPCs and interactive characters
- Educational and accessibility platforms
This approach has been successfully applied to European Portuguese, resulting in the creation of several unique, high-quality voices for offline deployment. Expansion to other languages and voice types is ongoing.
For more information or to request a demonstration, please contact us. Research inquiries welcome.