<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Text-to-Speech | Ziyang Lin</title><link>https://ziyanglin.netlify.app/en/tags/text-to-speech/</link><atom:link href="https://ziyanglin.netlify.app/en/tags/text-to-speech/index.xml" rel="self" type="application/rss+xml"/><description>Text-to-Speech</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Fri, 27 Jun 2025 07:02:00 +0000</lastBuildDate><image><url>https://ziyanglin.netlify.app/img/icon-192.png</url><title>Text-to-Speech</title><link>https://ziyanglin.netlify.app/en/tags/text-to-speech/</link></image><item><title>Modern TTS Architecture Comparison: In-Depth Analysis of Ten Speech Synthesis Models</title><link>https://ziyanglin.netlify.app/en/post/modern-tts-models/</link><pubDate>Fri, 27 Jun 2025 07:02:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/modern-tts-models/</guid><description>&lt;h2 id="1-kokoro-lightweight-efficient-tts">1. Kokoro: Lightweight Efficient TTS&lt;/h2>
&lt;h3 id="11-architecture-design">1.1 Architecture Design&lt;/h3>
&lt;p>Kokoro adopts a concise and efficient architecture design, with its core structure as follows:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text] --&amp;gt; B[G2P Phoneme Processing - misaki]
B --&amp;gt; C[StyleTTS2 Style Decoder]
C --&amp;gt; D[ISTFTNet Vocoder]
D --&amp;gt; E[Waveform - 24kHz]
&lt;/code>&lt;/pre>
&lt;p>Kokoro's features:&lt;/p>
&lt;ul>
&lt;li>No traditional Encoder (directly processes phonemes)&lt;/li>
&lt;li>Decoder uses feed-forward non-recursive structure (Conv1D/FFN)&lt;/li>
&lt;li>Does not use transformer, autoregression, or diffusion&lt;/li>
&lt;li>Style and prosody are injected as conditional vectors in the decoder&lt;/li>
&lt;li>Uses ISTFTNet as vocoder: lightweight, fast, supports ONNX inference&lt;/li>
&lt;/ul>
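The feed-forward decoder design above (Conv1D/FFN with style injected as a conditioning vector) can be sketched in NumPy. Everything below, the shapes, the random weights, and the add-style-as-bias injection, is an illustrative assumption rather than Kokoro's actual code:

```python
import numpy as np

# Toy sketch of Kokoro-style non-autoregressive decoding: phoneme embeddings
# pass through Conv1D/FFN layers with a style vector injected as a per-frame
# conditioning bias. No recurrence, no attention; all frames in parallel.
rng = np.random.default_rng(0)
T, C, K = 50, 64, 5                        # frames, channels, kernel size

def conv1d(x, w):
    """Length-preserving 1-D convolution over time. x: (T, C), w: (K, C, C)."""
    k = w.shape[0]
    xp = np.concatenate([np.zeros((k - 1, x.shape[1])), x], axis=0)
    return np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0])])

phonemes = rng.normal(size=(T, C))         # phoneme embeddings (no encoder)
style = rng.normal(size=(C,))              # speaker/style conditioning vector
w_conv = rng.normal(size=(K, C, C)) * 0.1
w_ffn = rng.normal(size=(C, 80)) * 0.1     # project to 80-bin mel frames

h = np.maximum(conv1d(phonemes + style, w_conv), 0.0)   # Conv1D + ReLU
mel = h @ w_ffn                            # (T, 80) mel, computed in one pass
```

Because no frame depends on a previously generated frame, the whole mel sequence is produced in one pass, which is the source of both Kokoro's speed and its weak context modeling.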
&lt;h3 id="12-technical-advantages">1.2 Technical Advantages&lt;/h3>
&lt;p>Kokoro provides solutions to multiple pain points of traditional TTS models:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target Issue&lt;/th>
&lt;th>Kokoro's Solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Limited voice style diversity&lt;/td>
&lt;td>Built-in style embedding and multiple speaker options (48+)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>High deployment threshold&lt;/td>
&lt;td>Full Python/PyTorch + ONNX support, one-line pip installation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Slow generation speed&lt;/td>
&lt;td>Uses non-autoregressive structure + lightweight vocoder (ISTFTNet)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Lack of control capability&lt;/td>
&lt;td>Explicitly models pitch/duration/energy and other prosody parameters&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Unclear licensing&lt;/td>
&lt;td>Uses Apache 2.0, commercial-friendly and fine-tunable&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
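As a toy illustration of the explicit prosody parameters the table refers to, the sketch below scales duration by frame repetition and energy by amplitude. The parameter names and scaling rules are assumptions for illustration, not Kokoro's API:

```python
# Hypothetical prosody knobs: duration via frame repetition, energy via
# amplitude scaling. Pitch would be a third, analogous parameter.
def apply_prosody(frames, duration_scale=1.0, energy_scale=1.0):
    repeated = []
    for f in frames:
        repeats = max(1, round(duration_scale))   # stretch or keep each frame
        repeated.extend([f * energy_scale] * repeats)
    return repeated

# Twice as slow, 50% louder: each frame doubled, amplitudes scaled by 1.5
slow_loud = apply_prosody([0.5, 0.8], duration_scale=2, energy_scale=1.5)
```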
&lt;h3 id="13-limitation-analysis">1.3 Limitation Analysis&lt;/h3>
&lt;p>Despite Kokoro's excellence in efficiency and deployment convenience, it has some notable limitations:&lt;/p>
&lt;h4 id="131-strong-structural-parallelism-but-weak-context-modeling">1.3.1 Strong Structural Parallelism but Weak Context Modeling&lt;/h4>
&lt;ul>
&lt;li>No encoder → the model cannot use whole-sentence context; for example, &amp;ldquo;He is happy today&amp;rdquo; and &amp;ldquo;He is angry today&amp;rdquo; receive no natural variation in intonation&lt;/li>
&lt;li>Phonemes are sent directly to the decoder, without linguistic hierarchical structure&lt;/li>
&lt;li>In long texts or sentences with strong contextual dependencies, pausing and rhythm lack semantic awareness&lt;/li>
&lt;li>Parallel generation produces the whole utterance at once rather than token by token, but semantic consistency suffers and the model cannot simulate tone progression across a paragraph&lt;/li>
&lt;/ul>
&lt;h4 id="132-limited-acoustic-modeling-capability">1.3.2 Limited Acoustic Modeling Capability&lt;/h4>
&lt;ul>
&lt;li>Acoustic detail (such as breathiness and intonation contour) falls short of VALL-E, StyleTTS2, and Bark&lt;/li>
&lt;li>Follows the classic TTS route of &amp;ldquo;decoder predicts Mel + vocoder synthesis,&amp;rdquo; whose acoustic precision is approaching its upper limit&lt;/li>
&lt;li>Prosody prediction is controllable but limited in quality (the model itself is small)&lt;/li>
&lt;/ul>
&lt;h4 id="133-tradeoff-between-audio-quality-and-model-complexity">1.3.3 Trade-off Between Audio Quality and Model Complexity&lt;/h4>
&lt;ul>
&lt;li>Sacrifices some audio quality to maintain speed&lt;/li>
&lt;li>May produce artifacts in high-frequency bands, nasal sounds, and plosives&lt;/li>
&lt;li>Limited emotional intensity; cannot produce extreme styles such as roaring or crying&lt;/li>
&lt;/ul>
&lt;h2 id="2-cosyvoice-llmbased-unified-architecture">2. CosyVoice: LLM-Based Unified Architecture&lt;/h2>
&lt;h3 id="21-architecture-design">2.1 Architecture Design&lt;/h3>
&lt;p>CosyVoice adopts a unified architecture design similar to LLMs, integrating text and audio processing into a single framework:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text] --&amp;gt; B[Tokenizer]
B --&amp;gt; C[Text token]
D[Audio] --&amp;gt; E[WavTokenizer]
E --&amp;gt; F[Acoustic token]
C --&amp;gt; G[LLaMA Transformer]
G1[Prosody token] --&amp;gt; G
G2[Speaker prompt] --&amp;gt; G
F --&amp;gt; G
G --&amp;gt; H[Predict Acoustic token]
H --&amp;gt; I[Vocoder]
I --&amp;gt; J[Audio output]
&lt;/code>&lt;/pre>
&lt;p>Main modules and their functions:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Module&lt;/th>
&lt;th>Implementation Details&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Tokenizer&lt;/td>
&lt;td>Uses standard BPE tokenizer, converts text to tokens (supports Chinese-English mixed input)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>WavTokenizer&lt;/td>
&lt;td>Discretizes audio into tokens (replacing traditional Mel), interfaces with Transformer decoder&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Transformer Model&lt;/td>
&lt;td>Multimodal autoregressive Transformer, structure similar to LLaMA, fuses text and audio tokens&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Prosody Token&lt;/td>
&lt;td>Controls &amp;lt;laugh&amp;gt; &amp;lt;pause&amp;gt; &amp;lt;whisper&amp;gt; and other tones through token insertion rather than model structure modeling&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Vocoder&lt;/td>
&lt;td>Supports HiFi-GAN or SNAC: restores waveforms from audio tokens, lightweight, supports low-latency deployment&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
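The unified-token idea in the table can be sketched as one flat sequence over disjoint ID ranges, so that text, prosody, speaker, and acoustic tokens all live in a single autoregressive stream. The base offsets below are assumptions for illustration, not CosyVoice's real vocabulary layout:

```python
# Assumed disjoint ID ranges for each token type within one shared vocabulary.
TEXT_BASE, PROSODY_BASE, SPEAKER_BASE, AUDIO_BASE = 0, 10000, 11000, 12000

def build_sequence(text_ids, prosody_ids, speaker_id, audio_ids):
    # One flat token stream: speaker prompt, then text, prosody, acoustic.
    seq = [SPEAKER_BASE + speaker_id]
    seq.extend(TEXT_BASE + t for t in text_ids)
    seq.extend(PROSODY_BASE + p for p in prosody_ids)
    seq.extend(AUDIO_BASE + a for a in audio_ids)
    return seq

seq = build_sequence([5, 6], [1], 3, [42])
```

A single Transformer can then model all four modalities with ordinary next-token prediction, which is what makes prompt-guided, controllable generation possible.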
&lt;h3 id="22-technical-advantages">2.2 Technical Advantages&lt;/h3>
&lt;p>CosyVoice provides innovative solutions to multiple issues in traditional TTS architectures:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target Issue&lt;/th>
&lt;th>CosyVoice's Solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Complex traditional structure, slow inference&lt;/td>
&lt;td>Uses unified Transformer architecture, no encoder, direct token input/output, simplified structure&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Lack of prosody control&lt;/td>
&lt;td>Inserts prosody tokens (like &amp;lt;laugh&amp;gt;) for expression control, no need to train dedicated emotion models&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Upstream/downstream inconsistency, uncontrollable TTS&lt;/td>
&lt;td>Both text and audio are discretized into tokens, unified modeling logic, supports prompt guidance and controllable generation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>High difficulty in multilingual modeling&lt;/td>
&lt;td>Supports Chinese-English bilingual training, text tokenizer natively supports multiple languages, unified expression at token layer&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Lack of conversational speech capability&lt;/td>
&lt;td>Generation method compatible with LLMs, can integrate chat context to construct speech dialogue system framework&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="23-limitation-analysis">2.3 Limitation Analysis&lt;/h3>
&lt;p>While CosyVoice has significant advantages in unified architecture and flexibility, it also faces some challenges in practical applications:&lt;/p>
&lt;h4 id="231-autoregressive-structure-leads-to-low-parallelism">2.3.1 Autoregressive Structure Leads to Low Parallelism&lt;/h4>
&lt;ul>
&lt;li>The model uses LLM-style token-by-token autoregressive generation&lt;/li>
&lt;li>Must generate sequentially; long sentences cannot be processed in parallel&lt;/li>
&lt;li>Inference is significantly slower than non-autoregressive models like FastSpeech2/StyleTTS2&lt;/li>
&lt;li>The fundamental limitation comes from the Transformer decoder architecture: the next token cannot be predicted until the previous one has been generated&lt;/li>
&lt;/ul>
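The sequential dependency described above can be made concrete with a toy loop. The next_token function is a deterministic stand-in, not CosyVoice's decoder; the point is only that step t consumes steps 0..t-1, so the loop cannot be parallelized over the sequence:

```python
def next_token(prefix):
    # Stand-in for one Transformer decoder step: any deterministic
    # function of the full prefix serves the illustration.
    return (sum(prefix) * 31 + len(prefix)) % 1000

def generate(prompt, n_tokens):
    tokens = list(prompt)
    for _ in range(n_tokens):
        tokens.append(next_token(tokens))   # step t needs steps 0..t-1
    return tokens[len(prompt):]

audio_tokens = generate([101, 202], 5)
```

Non-autoregressive models avoid this loop entirely by predicting every output position at once, which is why their latency scales so much better on long sentences.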
&lt;h4 id="232-prosody-control-mechanism-relies-on-prompts-not-suitable-for-stable-production">2.3.2 Prosody Control Mechanism Relies on Prompts, Not Suitable for Stable Production&lt;/h4>
&lt;ul>
&lt;li>Style control depends on manual insertion of prosody tokens&lt;/li>
&lt;li>Style output quality highly dependent on &amp;ldquo;prompt crafting techniques&amp;rdquo;&lt;/li>
&lt;li>Compared to StyleTTS2's direct input of style vector/embedding, control is less structured, lacking learnability and robustness&lt;/li>
&lt;li>Difficult to automatically build stable output flow in engineering&lt;/li>
&lt;/ul>
&lt;h4 id="233-lacks-speaker-transfer-capability">2.3.3 Lacks Speaker Transfer Capability&lt;/h4>
&lt;ul>
&lt;li>No explicit support for speaker embedding&lt;/li>
&lt;li>Cannot implement voice cloning through reference audio&lt;/li>
&lt;li>Capability clearly insufficient when highly personalized speech is needed (e.g., virtual characters, customer-customized voices)&lt;/li>
&lt;/ul>
&lt;h2 id="3-chattts-modular-diffusion-model">3. ChatTTS: Modular Diffusion Model&lt;/h2>
&lt;h3 id="31-architecture-design">3.1 Architecture Design&lt;/h3>
&lt;p>ChatTTS adopts a modular design approach, combining the advantages of diffusion models:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text] --&amp;gt; B[Text Encoder]
B --&amp;gt; C[Latent Diffusion Duration Predictor - LDDP]
C --&amp;gt; D[Acoustic Encoder - generates speech tokens]
D --&amp;gt; E[HiFi-GAN vocoder]
E --&amp;gt; F[Audio]
&lt;/code>&lt;/pre>
&lt;p>Main modules and their functions:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Module&lt;/th>
&lt;th>Implementation Details&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Tokenizer&lt;/td>
&lt;td>Uses standard BPE tokenizer, converts text to tokens (supports Chinese-English mixed input)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>WavTokenizer&lt;/td>
&lt;td>Discretizes audio into tokens (replacing Mel), as decoder target&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Text Encoder&lt;/td>
&lt;td>Encodes text tokens, provides context vector representation for subsequent modules&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Duration Predictor (LDDP)&lt;/td>
&lt;td>Uses diffusion model to predict token duration, achieving natural prosody (rhythm modeling)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Acoustic Decoder&lt;/td>
&lt;td>Autoregressively generates speech tokens, constructing speech representation frame by frame&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Prosody Token&lt;/td>
&lt;td>Controls &amp;lt;laugh&amp;gt; &amp;lt;pause&amp;gt; &amp;lt;shout&amp;gt; and other tokens, incorporating sentence expression tone and rhythm&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Vocoder&lt;/td>
&lt;td>Supports HiFi-GAN/EnCodec, restores waveforms from speech tokens, flexible deployment&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="32-technical-advantages">3.2 Technical Advantages&lt;/h3>
&lt;p>ChatTTS provides solutions to module dependency and inference pipeline issues in TTS models:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Issue&lt;/th>
&lt;th>ChatTTS's Strategy&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Heavy module dependencies&lt;/td>
&lt;td>Decouples modules for modular training: supports independent training of tokenizer, diffusion-based duration model, vocoder, and connects through intermediate tokens, reducing end-to-end coupling risk&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Long inference pipeline&lt;/td>
&lt;td>Uses unified token expression structure (text token → speech token → waveform), forming standard token flow path, enhancing module collaboration efficiency; supports HiFi-GAN to simplify backend&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>High fine-tuning difficulty&lt;/td>
&lt;td>Explicit control logic: expresses style through prosody token insertion, no need for additional style models, reducing data dependency and fine-tuning complexity&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="33-limitation-analysis">3.3 Limitation Analysis&lt;/h3>
&lt;p>ChatTTS has advantages in modular design but also faces some practical application challenges:&lt;/p>
&lt;h4 id="331-autoregressive-structure-leads-to-low-parallelism">3.3.1 Autoregressive Structure Leads to Low Parallelism&lt;/h4>
&lt;ul>
&lt;li>Uses Transformer Decoder + autoregressive mechanism, generating tokens one by one&lt;/li>
&lt;li>Must wait for the completion of the previous speech token before generating the next one&lt;/li>
&lt;/ul>
&lt;h4 id="332-complex-architecture-multiple-modules-high-maintenance-difficulty">3.3.2 Complex Architecture, Multiple Modules, High Maintenance Difficulty&lt;/h4>
&lt;ul>
&lt;li>Heavy module dependencies: includes tokenizer, diffusion predictor, decoder, vocoder, and other components, difficult to train and optimize uniformly&lt;/li>
&lt;li>Long inference pipeline: errors in any module will affect speech quality and timing control&lt;/li>
&lt;li>High fine-tuning difficulty: control tokens and style embedding effects have strong data dependency&lt;/li>
&lt;/ul>
&lt;h4 id="333-control-tokens-have-weak-interpretability-generation-is-unstable">3.3.3 Control Tokens Have Weak Interpretability, Generation Is Unstable&lt;/h4>
&lt;ul>
&lt;li>Control tokens lack standardization, e.g., [laugh], [pause], [sad] insertions show inconsistent performance, requiring manual parameter tuning&lt;/li>
&lt;li>Token combination effects are complex, multiple control tokens combined may produce unexpected speech effects (such as rhythm disorder)&lt;/li>
&lt;/ul>
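The control-token insertion mechanism discussed in this section can be sketched minimally, using the bracketed [laugh]/[pause] forms mentioned above. The whitespace-based splitting is illustrative only, not ChatTTS's tokenizer:

```python
# Assumed control vocabulary; real systems may define different markers.
CONTROL_TOKENS = {"[laugh]", "[pause]", "[sad]"}

def tokenize_with_controls(text):
    # Tag each whitespace-separated token as control or plain text, so the
    # decoder can condition prosody on the control markers in-stream.
    tokens = []
    for word in text.split():
        kind = "control" if word in CONTROL_TOKENS else "text"
        tokens.append((kind, word))
    return tokens

seq = tokenize_with_controls("that is so funny [laugh] well [pause] anyway")
controls = [w for kind, w in seq if kind == "control"]
```

Because the markers are just tokens in the stream, nothing constrains their combinations, which is exactly why stacked control tokens can produce the unexpected rhythm effects noted above.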
&lt;h2 id="4-chatterbox-multimodule-fusion-model">4. Chatterbox: Multi-Module Fusion Model&lt;/h2>
&lt;h3 id="41-architecture-design">4.1 Architecture Design&lt;/h3>
&lt;p>Chatterbox adopts a multi-module fusion design approach, combining various advanced technologies:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text] --&amp;gt; B[Semantic token encoding]
B --&amp;gt; C[s3gen generates speech tokens]
C --&amp;gt; D[CosyVoice decoding]
D --&amp;gt; E[HiFi-GAN]
E --&amp;gt; F[Audio output]
&lt;/code>&lt;/pre>
&lt;p>Main modules and their functions:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Module&lt;/th>
&lt;th>Algorithm Approach&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Text Encoder (LLM)&lt;/td>
&lt;td>Uses language model (like LLaMA) to encode text&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>s3gen (Speech Semantic Sequence Generator)&lt;/td>
&lt;td>Mimics VALL-E concept, predicts discrete speech tokens&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>t3_cfg (TTS Config)&lt;/td>
&lt;td>Model structure definition, including vocoder type, tokenizer configuration, etc.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>CosyVoice (Decoder)&lt;/td>
&lt;td>Non-autoregressive decoder&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>HiFi-GAN (Vocoder)&lt;/td>
&lt;td>Convolutional + discriminator generator network&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="42-technical-advantages">4.2 Technical Advantages&lt;/h3>
&lt;p>Chatterbox provides solutions to multiple issues in traditional TTS models:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target Issue&lt;/th>
&lt;th>Chatterbox's Strategy&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Difficult prosody control&lt;/td>
&lt;td>Inserts prosody tokens for expression control, no need for additional labels or gating models&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Text and speech structure separation&lt;/td>
&lt;td>Uses discrete speech tokens to connect to unified token pipeline, enhancing upstream-downstream coordination&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Poor multilingual support&lt;/td>
&lt;td>Supports native Chinese-English mixed input, unified token layer expression structure&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Lack of context/dialogue support&lt;/td>
&lt;td>Integrates LLM output token sequences, laying foundation for dialogue speech framework&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="43-limitation-analysis">4.3 Limitation Analysis&lt;/h3>
&lt;p>Chatterbox has innovations in multi-module fusion but also faces some practical application challenges:&lt;/p>
&lt;h4 id="431-intermediate-tokens-lack-transparency">4.3.1 Intermediate Tokens Lack Transparency&lt;/h4>
&lt;ul>
&lt;li>s3gen's speech tokens lack clear interpretability, which hampers later debugging and control of tone, emotion, and other attributes&lt;/li>
&lt;/ul>
&lt;h4 id="432-insufficient-context-management-capability">4.3.2 Insufficient Context Management Capability&lt;/h4>
&lt;ul>
&lt;li>The current design targets single-turn inference; it does not support long dialogue caching and is hard to use in multi-turn voice dialogue agent scenarios&lt;/li>
&lt;/ul>
&lt;h4 id="433-long-chain-dependent-on-multiple-modules">4.3.3 Long Chain, Dependent on Multiple Modules&lt;/h4>
&lt;ul>
&lt;li>The multi-module combination (LLM + s3gen + CosyVoice + vocoder) reduces overall robustness and is difficult to optimize end to end&lt;/li>
&lt;/ul>
&lt;h2 id="5-dia-lightweight-crossplatform-tts">5. Dia: Lightweight Cross-Platform TTS&lt;/h2>
&lt;h3 id="51-architecture-design">5.1 Architecture Design&lt;/h3>
&lt;p>Dia adopts a lightweight design suitable for cross-platform deployment:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text] --&amp;gt; B[Tokenizer]
B --&amp;gt; C[Text Encoder - GPT-style]
C --&amp;gt; D[Prosody Module]
D --&amp;gt; E[Acoustic Decoder - generates speech tokens]
E --&amp;gt; F{Vocoder}
F --&amp;gt;|HiFi-GAN| G[Audio]
F --&amp;gt;|SNAC| G
&lt;/code>&lt;/pre>
&lt;p>Main modules and their functions:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Module&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Text Encoder&lt;/td>
&lt;td>Mostly GPT-style structures, modeling input text; captures context semantics and intonation cues&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Prosody Module&lt;/td>
&lt;td>Controls tone, rhythm, emotional state (possibly embedding + classifier)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Decoder&lt;/td>
&lt;td>Maps encoded semantics to acoustic tokens (possibly codec representation or Mel features)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Vocoder&lt;/td>
&lt;td>Commonly uses HiFi-GAN, converts acoustic tokens to playable audio (.wav or .mp3)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="52-technical-advantages">5.2 Technical Advantages&lt;/h3>
&lt;p>Dia provides solutions to multiple issues in TTS deployment and cross-platform applications:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target Issue&lt;/th>
&lt;th>dia-gguf's Strategy&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Lack of natural dialogue intonation&lt;/td>
&lt;td>Introduces prosody tokens (like &amp;lt;laugh&amp;gt;, &amp;lt;pause&amp;gt;, etc.) to express tonal changes, building dialogue-aware pronunciation style&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>High inference threshold, complex deployment&lt;/td>
&lt;td>Through GGUF format encapsulation + multi-level quantization (Q2/Q4/Q6/F16), supports offline running on CPU, no need for specialized GPU&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Fragmented model deployment formats&lt;/td>
&lt;td>Uses GGUF standard format to encapsulate model parameters and structure information, compatible with TTS.cpp/gguf-connector and other frameworks, achieving cross-platform operation&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
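A back-of-envelope sketch of what the Q2/Q4/Q6/F16 quantization levels above mean for model size. The parameter count is an assumption for illustration, and real GGUF files add per-block scale metadata on top of the raw weight bits:

```python
def weights_mb(n_params, bits):
    # Raw weight storage only: params * bits, converted to mebibytes.
    return n_params * bits / 8 / (1024 ** 2)

n = 1_600_000_000   # assumed 1.6B-parameter model, illustration only
sizes = {q: round(weights_mb(n, b)) for q, b in
         [("Q2", 2), ("Q4", 4), ("Q6", 6), ("F16", 16)]}
```

The roughly 8x spread between Q2 and F16 is what makes CPU-only, offline deployment feasible at the low end, at the audio-quality cost discussed in the next subsection.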
&lt;h3 id="53-limitation-analysis">5.3 Limitation Analysis&lt;/h3>
&lt;p>Dia has advantages in lightweight and cross-platform deployment but also faces some practical application challenges:&lt;/p>
&lt;h4 id="531-acoustic-decoder-may-become-a-bottleneck">5.3.1 Acoustic Decoder May Become a Bottleneck&lt;/h4>
&lt;ul>
&lt;li>If using high-fidelity decoders (such as VQ-VAE or GAN-based vocoders), inference phase efficiency depends on the vocoder itself&lt;/li>
&lt;li>The current gguf-connector is mainly implemented in C++ and is not as efficient as GPU-side HiFi-GAN&lt;/li>
&lt;/ul>
&lt;h4 id="532-lacks-flexible-style-transfer-mechanism">5.3.2 Lacks Flexible Style Transfer Mechanism&lt;/h4>
&lt;ul>
&lt;li>Current version mainly targets single dialogue style, does not support style transfer or emotion control in multi-speaker, multi-emotion scenarios&lt;/li>
&lt;li>No encoder-decoder separation structure, limiting style transfer scalability&lt;/li>
&lt;/ul>
&lt;h4 id="533-clear-tradeoff-between-precision-and-naturalness">5.3.3 Clear Trade-off Between Precision and Naturalness&lt;/h4>
&lt;ul>
&lt;li>Low-bit quantization (like Q2) is fast at inference but prone to speech fragmentation and detail loss, making it unsuitable for high-fidelity scenarios&lt;/li>
&lt;li>If deployed in voice assistant or announcer systems, quality-sensitive users will notice the degradation&lt;/li>
&lt;/ul>
&lt;h2 id="6-orpheus-llmbased-endtoend-tts">6. Orpheus: LLM-Based End-to-End TTS&lt;/h2>
&lt;h3 id="61-architecture-design">6.1 Architecture Design&lt;/h3>
&lt;p>Orpheus adopts an end-to-end design approach based on LLMs:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text Prompt + Emotion tokens] --&amp;gt; B[LLaMA 3B - finetune]
B --&amp;gt; C[Generate audio tokens - discretized speech representation]
C --&amp;gt; D[SNAC decoder]
D --&amp;gt; E[Reconstruct audio waveform]
&lt;/code>&lt;/pre>
&lt;p>Main modules and their functions:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>LLaMA 3B Structure&lt;/strong>: The foundation is Meta's Transformer architecture, with Orpheus performing SFT (Supervised Finetuning) to learn audio token prediction&lt;/li>
&lt;li>&lt;strong>Tokenization&lt;/strong>: Uses audio codec from the SoundStorm series to discretize audio (similar to VQVAE) forming training targets&lt;/li>
&lt;li>&lt;strong>Output Form&lt;/strong>: The model's final stage predicts multiple audio token sequences (token-class level autoregression), which can be concatenated to reconstruct speech&lt;/li>
&lt;li>&lt;strong>Decoder&lt;/strong>: Uses SNAC (Multi-Scale Neural Audio Codec) to decode audio tokens into the final waveform&lt;/li>
&lt;/ul>
&lt;h4 id="snac-decoder-in-detail">SNAC Decoder in Detail&lt;/h4>
&lt;p>SNAC (Multi-Scale Neural Audio Codec) is a neural network audio codec used in TTS models to convert audio codes into actual audio waveforms.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Orpheus audio codes] --&amp;gt; B[Code redistribution]
B --&amp;gt; C[SNAC three-layer decoding]
C --&amp;gt; D[PCM audio waveform]
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Basic Concept&lt;/strong>&lt;/p>
&lt;p>SNAC is a neural network audio decoder that works well with TTS models. It receives discrete audio codes generated by TTS models (such as Orpheus) and converts these codes into high-quality 24kHz audio waveforms. SNAC's main feature is its ability to efficiently process hierarchically encoded audio information and generate natural, fluent speech.&lt;/p>
&lt;p>&lt;strong>Technical Architecture&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Layered Structure&lt;/strong>: SNAC uses a 3-layer structure to process audio information, while the Orpheus model generates 7-layer audio codes. This requires code redistribution.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Code Redistribution Mapping&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>SNAC layer 0 receives Orpheus layer 0 codes&lt;/li>
&lt;li>SNAC layer 1 receives Orpheus layers 1 and 4 codes (interleaved)&lt;/li>
&lt;li>SNAC layer 2 receives Orpheus layers 2, 3, 5, and 6 codes (interleaved)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Decoding Process&lt;/strong>:&lt;/p>
&lt;pre>&lt;code>Orpheus audio codes → Code redistribution → SNAC three-layer decoding → PCM audio waveform
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ol>
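The layer grouping above (Orpheus layer 0 → SNAC 0; layers 1, 4 → SNAC 1; layers 2, 3, 5, 6 → SNAC 2) can be sketched directly. The exact interleaving order within each SNAC layer is an assumption; the grouping follows the mapping rules stated above:

```python
ORPHEUS_N_LAYERS = 7

def redistribute(codes):
    # Remap flat 7-codes-per-frame Orpheus output onto SNAC's 3 layers.
    assert len(codes) % ORPHEUS_N_LAYERS == 0, "pad to a multiple of 7 first"
    layer0, layer1, layer2 = [], [], []
    for i in range(0, len(codes), ORPHEUS_N_LAYERS):
        frame = codes[i:i + ORPHEUS_N_LAYERS]
        layer0.append(frame[0])                              # layer 0
        layer1.extend([frame[1], frame[4]])                  # layers 1, 4
        layer2.extend([frame[2], frame[3], frame[5], frame[6]])  # 2, 3, 5, 6
    return layer0, layer1, layer2

l0, l1, l2 = redistribute(list(range(14)))   # two 7-code frames
```

Note the 1:2:4 ratio of the three layers, which matches SNAC's coarse-to-fine temporal resolutions.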
&lt;p>&lt;strong>Implementation Methods&lt;/strong>&lt;/p>
&lt;p>SNAC has two main implementation methods:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>PyTorch Implementation&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Uses the original PyTorch model for decoding&lt;/li>
&lt;li>Suitable for environments without ONNX support&lt;/li>
&lt;li>Relatively slower decoding speed&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>ONNX Optimized Implementation&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Uses pre-trained models in ONNX (Open Neural Network Exchange) format&lt;/li>
&lt;li>Supports hardware acceleration (CUDA or CPU)&lt;/li>
&lt;li>Provides quantized versions, reducing model size and improving inference speed&lt;/li>
&lt;li>Better real-time performance (a more favorable Real Time Factor, RTF)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Code Processing Flow&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Code Validation&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Checks if codes are within valid range&lt;/li>
&lt;li>Ensures the number of codes is a multiple of ORPHEUS_N_LAYERS (7)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Code Padding&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>If the number of codes is not a multiple of 7, automatic padding is applied&lt;/li>
&lt;li>Uses the last valid code or default code for padding&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Code Redistribution&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Remaps 7-layer Orpheus codes to 3-layer SNAC codes&lt;/li>
&lt;li>Follows specific mapping rules&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Decoding&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Uses the SNAC model (PyTorch or ONNX) to convert redistributed codes into audio waveforms&lt;/li>
&lt;li>Outputs 24kHz sample rate mono PCM audio data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
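Steps 1 and 2 above (validation, then padding to a multiple of ORPHEUS_N_LAYERS with the last valid code) can be sketched as follows. The valid-range bound is a placeholder assumption, not the real codebook size:

```python
ORPHEUS_N_LAYERS = 7
CODE_RANGE = 4096   # assumed codebook size, for illustration only

def validate_and_pad(codes):
    # Step 1: every code must be a non-negative int inside the valid range.
    assert codes and all(c in range(CODE_RANGE) for c in codes)
    # Step 2: pad with the last valid code up to a multiple of 7.
    remainder = len(codes) % ORPHEUS_N_LAYERS
    if remainder != 0:
        codes = codes + [codes[-1]] * (ORPHEUS_N_LAYERS - remainder)
    return codes

padded = validate_and_pad([10, 20, 30, 40, 50])   # 5 codes, padded to 7
```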
&lt;p>&lt;strong>Role in TTS Models&lt;/strong>&lt;/p>
&lt;p>SNAC plays a key role in the entire TTS workflow:&lt;/p>
&lt;ol>
&lt;li>The TTS model (Orpheus) generates audio codes&lt;/li>
&lt;li>The SNAC decoder converts these codes into actual audio waveforms&lt;/li>
&lt;li>The audio waveform undergoes post-processing (such as fade in/out, gain adjustment, watermarking, etc.)&lt;/li>
&lt;li>The final audio is encoded in Opus format and transmitted to the client via HTTP or WebSocket&lt;/li>
&lt;/ol>
&lt;p>SNAC's efficient decoding capability is one of the key technologies for achieving low-latency, high-quality streaming TTS, enabling the model to respond to user requests in real time.&lt;/p>
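A minimal sketch of the real-time criterion behind low-latency streaming TTS, using one common RTF convention (seconds of audio produced per second of generation, where a value above 1.0 means faster than real time). Conventions differ between projects, so treat the definition as an assumption:

```python
def real_time_factor(audio_seconds, generation_seconds):
    # Audio produced per unit of wall-clock generation time.
    return audio_seconds / generation_seconds

# 10 s of speech generated in 4 s of compute: comfortably real-time capable.
rtf = real_time_factor(10.0, 4.0)
```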
&lt;h3 id="62-technical-advantages">6.2 Technical Advantages&lt;/h3>
&lt;p>Orpheus provides innovative solutions to multiple issues in TTS models:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Issue&lt;/th>
&lt;th>Solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Complex multi-module deployment&lt;/td>
&lt;td>Integrates TTS into LLM, builds single-model structure, directly generates audio tokens&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>High inference latency&lt;/td>
&lt;td>Uses low-bit quantization (Q4_K_M), combined with GGUF format, accelerating inference&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Uncontrollable emotions&lt;/td>
&lt;td>Introduces &amp;lt;laugh&amp;gt;, &amp;lt;sigh&amp;gt;, &amp;lt;giggle&amp;gt; and other prompt control tokens&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Cloud service dependency&lt;/td>
&lt;td>Can run locally on llama.cpp/LM Studio, no need for cloud inference&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Separation from LLM&lt;/td>
&lt;td>Compatible with LLM dialogue structure, can directly generate speech responses in multimodal dialogue&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="63-limitation-analysis">6.3 Limitation Analysis&lt;/h3>
&lt;p>Orpheus has innovations in end-to-end design but also faces some practical application challenges:&lt;/p>
&lt;h4 id="631-emotion-control-lacks-structural-modeling">6.3.1 Emotion Control Lacks Structural Modeling&lt;/h4>
&lt;ul>
&lt;li>Emotions are only controlled through &amp;ldquo;prompt token&amp;rdquo; insertion, lacking systematic emotion modeling modules&lt;/li>
&lt;li>May lead to the same &amp;lt;laugh&amp;gt; showing unstable, occasionally ineffective performance (prompt injection instability)&lt;/li>
&lt;/ul>
&lt;h4 id="632-strong-decoder-binding">6.3.2 Strong Decoder Binding&lt;/h4>
&lt;ul>
&lt;li>Using SNAC decoder means final sound quality is tightly bound to the audio codec, cannot be freely replaced with alternatives like HiFi-GAN&lt;/li>
&lt;li>If the codec produces artifacts, the entire model struggles to independently optimize the decoding module&lt;/li>
&lt;/ul>
&lt;h4 id="633-difficult-customization">6.3.3 Difficult Customization&lt;/h4>
&lt;ul>
&lt;li>Does not support zero-shot speaker cloning&lt;/li>
&lt;li>Generating user-customized voices still requires &amp;ldquo;fine-tuning,&amp;rdquo; creating a training threshold&lt;/li>
&lt;/ul>
&lt;h2 id="7-outetts-gguf-format-optimized-tts">7. OuteTTS: GGUF Format Optimized TTS&lt;/h2>
&lt;h3 id="71-architecture-design">7.1 Architecture Design&lt;/h3>
&lt;p>OuteTTS adopts an optimized design suitable for GGUF format deployment:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Prompt input - text and control information] --&amp;gt; B[Prompt Encoder - semantic modeling]
B --&amp;gt; C[Alignment module - automatic position alignment]
C --&amp;gt; D[Codebook Decoder - generates dual codebook tokens]
D --&amp;gt; E[HiFi-GAN Vocoder - restores to speech waveform]
E --&amp;gt; F[Output audio - wav or mp3]
subgraph Control Information
A1[Tone pause emotion tokens]
A2[Pitch duration speaker ID]
end
A1 --&amp;gt; A
A2 --&amp;gt; A
&lt;/code>&lt;/pre>
&lt;p>Main modules and their functions:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Module&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Prompt Encoder&lt;/td>
&lt;td>Input is natural language prompt (with context, speaker, timbre information), similar to instruction-guided model generating speech content&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Alignment Module (internal modeling)&lt;/td>
&lt;td>Embedded alignment capability, no need for external alignment tool, builds position-to-token mapping based on transformer&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Codebook Decoder&lt;/td>
&lt;td>Maps text to dual codebook tokens under DAC encoder (e.g., codec-C1, codec-C2), as latent representation of audio content&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Vocoder (HiFi-GAN)&lt;/td>
&lt;td>Maps DAC codebook or speech features to final playable audio (supports .wav), deployed on CPU/GPU&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="dac-decoder-in-detail">DAC Decoder in Detail&lt;/h4>
&lt;p>DAC (Descript Audio Codec) is a discrete audio codec used in TTS models primarily to convert audio codes generated by OuteTTS models into actual audio waveforms. DAC is an efficient neural network audio decoder designed for high-quality speech synthesis.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[OuteTTS audio codes] --&amp;gt; B[DAC decoding]
B --&amp;gt; C[PCM audio waveform]
A --&amp;gt; |c1_codes| B
A --&amp;gt; |c2_codes| B
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Technical Architecture&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Encoding Structure&lt;/strong>: DAC uses a 2-layer encoding structure (dual codebook), with each codebook having a size of 1024, which differs from SNAC's 3-layer structure.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Code Format&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>DAC uses two sets of codes: c1_codes and c2_codes&lt;/li>
&lt;li>These two sets of codes have the same length and correspond one-to-one&lt;/li>
&lt;li>Each code has a value range of 0-1023&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Decoding Process&lt;/strong>:&lt;/p>
&lt;pre>&lt;code>OuteTTS audio codes(c1_codes, c2_codes) → DAC decoding → PCM audio waveform
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Sample Rate&lt;/strong>: DAC generates 24kHz sample rate audio, the same as SNAC&lt;/p>
&lt;/li>
&lt;/ol>
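As a minimal sketch of the decoding input described above (two equal-length code streams with values in 0-1023), the following Python stacks and validates a pair of streams. The `stack_dac_codes` helper and its shapes are illustrative, not part of any DAC API:

```python
import numpy as np

def stack_dac_codes(c1_codes, c2_codes):
    """Validate and stack the two parallel DAC code streams into a
    single (2, T) array, the layout a DAC decoder would consume."""
    c1 = np.asarray(c1_codes)
    c2 = np.asarray(c2_codes)
    # The two streams must align one-to-one, frame by frame.
    assert c1.shape == c2.shape, "c1/c2 code streams must have equal length"
    # Every code indexes a 1024-entry codebook.
    assert ((0 <= c1) & (c1 < 1024)).all() and ((0 <= c2) & (c2 < 1024)).all()
    return np.stack([c1, c2], axis=0)

codes = stack_dac_codes([17, 512, 1023], [3, 999, 0])
print(codes.shape)  # (2, 3): 2 codebooks x 3 frames
```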
&lt;p>&lt;strong>Implementation Methods&lt;/strong>&lt;/p>
&lt;p>Similar to SNAC, DAC also has two implementation methods:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>PyTorch Implementation&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Uses the original PyTorch model for decoding&lt;/li>
&lt;li>Suitable for environments without ONNX support&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>ONNX Optimized Implementation&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Uses pre-trained models in ONNX format&lt;/li>
&lt;li>Supports hardware acceleration (CUDA or CPU)&lt;/li>
&lt;li>Provides quantized versions, reducing model size and improving inference speed&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>DAC's Advanced Features&lt;/strong>&lt;/p>
&lt;p>The DAC decoder implements several advanced features that make it particularly suitable for streaming TTS applications:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Batch Processing Optimization&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Adaptive batch size (8-64 frames)&lt;/li>
&lt;li>Dynamically adjusts batch size based on performance history&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Streaming Processing&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Supports batch decoding and streaming output&lt;/li>
&lt;li>Adaptively adjusts parameters based on network quality&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Audio Effect Processing&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Supports fade in/out effects&lt;/li>
&lt;li>Supports audio gain adjustment&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
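The adaptive batching idea above can be sketched as a small controller that grows or shrinks the batch within the 8-64 frame range based on recent decode times. The thresholds, step sizes, and `AdaptiveBatcher` class are hypothetical, not OuteTTS's actual implementation:

```python
class AdaptiveBatcher:
    """Illustrative batch-size controller in the spirit described above:
    grow the batch while decoding keeps up, shrink it when it falls behind.
    Thresholds and step sizes are hypothetical, not OuteTTS's values."""

    def __init__(self, lo=8, hi=64, target_ms=50.0):
        self.lo, self.hi = lo, hi
        self.target_ms = target_ms       # latency budget per batch
        self.batch = lo
        self.history = []                # recent decode times (ms)

    def record(self, elapsed_ms):
        self.history.append(elapsed_ms)
        recent = sum(self.history[-5:]) / len(self.history[-5:])
        if recent < 0.5 * self.target_ms:     # plenty of headroom: grow
            self.batch = min(self.hi, self.batch * 2)
        elif recent > self.target_ms:         # over budget: shrink
            self.batch = max(self.lo, self.batch // 2)
        return self.batch

b = AdaptiveBatcher()
print(b.record(10.0))  # fast decode -> batch doubles to 16
print(b.record(80.0))  # recent average 45 ms: within budget, stays at 16
```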
&lt;h4 id="comparison-between-snac-and-dac">Comparison Between SNAC and DAC&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Feature&lt;/th>
&lt;th>DAC&lt;/th>
&lt;th>SNAC&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Encoding Layers&lt;/td>
&lt;td>2 layers&lt;/td>
&lt;td>3 layers&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Code Organization&lt;/td>
&lt;td>Two parallel code sets&lt;/td>
&lt;td>Three hierarchical code layers&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Codebook Size&lt;/td>
&lt;td>1024&lt;/td>
&lt;td>4096&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Input Format&lt;/td>
&lt;td>c1_codes, c2_codes&lt;/td>
&lt;td>7-layer Orpheus codes redistributed to 3 layers&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>Applicable Models&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>DAC&lt;/strong>: Designed specifically for OuteTTS-type models, processes dual codebook format audio codes&lt;/li>
&lt;li>&lt;strong>SNAC&lt;/strong>: Designed specifically for Orpheus-type models, processes 7-layer encoded format audio codes&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Performance Characteristics&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>DAC&lt;/strong>: More focused on streaming processing and low latency, with more adaptive optimizations&lt;/li>
&lt;li>&lt;strong>SNAC&lt;/strong>: More focused on audio quality and accurate code redistribution&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Code Processing Methods&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>DAC&lt;/strong>: Directly processes two sets of codes, no complex redistribution needed&lt;/li>
&lt;li>&lt;strong>SNAC&lt;/strong>: Needs to remap 7-layer Orpheus codes to a 3-layer structure&lt;/li>
&lt;/ul>
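The 7-to-3 remapping that SNAC requires can be sketched as follows; the exact index assignment mirrors common open-source Orpheus decoders and should be treated as illustrative:

```python
def redistribute_orpheus_codes(codes):
    """Remap a flat Orpheus stream (7 codes per audio frame) into the
    3 hierarchical layers SNAC expects. The index assignment follows
    widely used open-source Orpheus decoders; treat it as illustrative."""
    assert len(codes) % 7 == 0, "Orpheus emits 7 codes per frame"
    layer_1, layer_2, layer_3 = [], [], []
    for i in range(0, len(codes), 7):
        frame = codes[i:i + 7]
        layer_1.append(frame[0])                # coarsest layer: 1 code/frame
        layer_2.extend([frame[1], frame[4]])    # middle layer: 2 codes/frame
        layer_3.extend([frame[2], frame[3], frame[5], frame[6]])  # finest: 4
    return layer_1, layer_2, layer_3

l1, l2, l3 = redistribute_orpheus_codes(list(range(14)))  # two frames
print(len(l1), len(l2), len(l3))  # 2 4 8
```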
&lt;p>&lt;strong>Why Different Models Use Different Decoders&lt;/strong>&lt;/p>
&lt;p>OuteTTS and Orpheus use different decoders primarily for the following reasons:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Model Design Differences&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>OuteTTS model was designed with DAC compatibility in mind, directly outputting DAC format dual codebook codes&lt;/li>
&lt;li>Orpheus model is based on a different architecture, outputting 7-layer encoding, requiring SNAC for decoding&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Encoding Format Incompatibility&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>DAC expects to receive two parallel code sets (c1_codes, c2_codes)&lt;/li>
&lt;li>SNAC expects to receive redistributed 3-layer codes, which come from Orpheus's 7-layer output&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Different Optimization Directions&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>OuteTTS+DAC combination focuses more on streaming processing and low latency&lt;/li>
&lt;li>Orpheus+SNAC combination focuses more on audio quality and multi-level encoding&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="72-technical-advantages">7.2 Technical Advantages&lt;/h3>
&lt;p>OuteTTS provides innovative solutions to multiple issues in TTS models:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target Issue&lt;/th>
&lt;th>Llama-OuteTTS's Strategy&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Multilingual TTS without preprocessing&lt;/td>
&lt;td>Directly supports Chinese, English, Japanese, Arabic and other languages, no need for pinyin conversion or forced spacing&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Difficult alignment, requires external CTC&lt;/td>
&lt;td>Model has built-in alignment mechanism, directly aligns text to generated tokens, no need for external alignment tools&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Audio quality vs. throughput conflict&lt;/td>
&lt;td>DAC + dual codebook improves audio quality; generates 150 tokens per second, speed significantly improved compared to similar diffusion models&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Complex model invocation&lt;/td>
&lt;td>GGUF format encapsulated structure + llama.cpp support, more streamlined local deployment&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
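The "150 tokens per second" figure above can be put in context with a quick real-time-factor estimate. The 75 Hz frame rate assumed below is typical of 24 kHz neural codecs but is an assumption, not a documented OuteTTS value:

```python
# Hypothetical real-time-factor estimate for a dual-codebook codec.
frame_rate_hz = 75          # assumed codec frame rate at 24 kHz
codebooks = 2               # c1_codes + c2_codes
tokens_per_audio_second = frame_rate_hz * codebooks   # 150 tokens/s of audio
generation_speed = 150      # tokens generated per wall-clock second
rtf = generation_speed / tokens_per_audio_second
print(rtf)  # 1.0 -> generation keeps pace with playback
```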
&lt;h3 id="73-limitation-analysis">7.3 Limitation Analysis&lt;/h3>
&lt;p>OuteTTS has innovations in GGUF format optimization but also faces some practical application challenges:&lt;/p>
&lt;h4 id="731-audio-encoding-bottleneck">7.3.1 Audio Encoding Bottleneck&lt;/h4>
&lt;ul>
&lt;li>Currently mainly uses DAC-based dual codebook expression, which improves audio quality, but:
&lt;ul>
&lt;li>Decoder (HiFi-GAN) remains a bottleneck, especially with inference latency on edge devices&lt;/li>
&lt;li>If more complex codec models (such as VQ-VAE variants) are adopted in the future, their parallelism and efficient inference will become even harder to achieve&lt;/li>
&lt;li>Current gguf-connector is C++-based, does not yet support native mobile deployment (like Android/iOS TensorDelegate)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="732-parallelism-and-context-dependency">7.3.2 Parallelism and Context Dependency&lt;/h4>
&lt;ul>
&lt;li>Model strongly depends on context memory (such as token temporal dependencies), during inference:
&lt;ul>
&lt;li>Cannot be parallelized extensively the way non-autoregressive diffusion models can; inference remains serially dominated&lt;/li>
&lt;li>Sampling stage requires setting repetition penalty window (default 64 tokens)&lt;/li>
&lt;li>High context length (e.g., 8192) is supported but significantly increases memory cost during deployment&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="733-insufficient-style-transfer-and-personality-control">7.3.3 Insufficient Style Transfer and Personality Control&lt;/h4>
&lt;ul>
&lt;li>Current version mainly optimized for &amp;ldquo;single person + tone control,&amp;rdquo; style transfer mechanism not sophisticated enough:
&lt;ul>
&lt;li>Lacks speaker embedding-based control mechanism&lt;/li>
&lt;li>Multi-emotion, multi-style still requires prompt fine-tuning rather than explicit token control&lt;/li>
&lt;li>Future needs to introduce speaker encoder or style/emotion vectors&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="8-f5tts-diffusion-model-optimized-tts">8. F5-TTS: Diffusion Model Optimized TTS&lt;/h2>
&lt;h3 id="81-architecture-design">8.1 Architecture Design&lt;/h3>
&lt;p>F5-TTS adopts an innovative design based on diffusion models:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text - character sequence] --&amp;gt; B[ConvNeXt text encoder]
B --&amp;gt; C[Flow Matching module]
C --&amp;gt; D[DiT diffusion Transformer - non-autoregressive generation]
D --&amp;gt; E[Speech Token]
E --&amp;gt; F[Vocoder - Vocos or BigVGAN]
F --&amp;gt; G[Waveform audio output]
&lt;/code>&lt;/pre>
&lt;p>Main modules and their functions:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Module&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>ConvNeXt text encoder&lt;/td>
&lt;td>Used to extract global features of text, with parallel convolution capability&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Flow Matching&lt;/td>
&lt;td>Used in training process to learn noise → speech token mapping path&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DiT (Diffusion Transformer)&lt;/td>
&lt;td>Core synthesizer, parallel speech token generator based on diffusion modeling&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Sway Sampling&lt;/td>
&lt;td>Optimizes sampling path during inference, reducing ineffective diffusion steps, improving speed and quality&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Vocoder&lt;/td>
&lt;td>Uses BigVGAN or Vocos to restore speech tokens to waveform audio&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
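The Flow Matching objective in the table can be sketched in a few lines: sample noise, pick a random time on the linear path from noise to data, and use the path's constant velocity as the regression target. This is a generic conditional-flow-matching sketch, not F5-TTS's actual training code:

```python
import numpy as np

def flow_matching_pair(x1, rng):
    """Build one conditional-flow-matching training example on the linear
    path x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0.
    A real model would regress v_pred(x_t, t, text) onto this target."""
    x0 = rng.standard_normal(x1.shape)   # noise sample
    t = rng.uniform()                    # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1        # point on the probability path
    v_target = x1 - x0                   # constant velocity of the path
    return x_t, t, v_target

rng = np.random.default_rng(0)
x1 = rng.standard_normal((80,))          # stand-in for a speech feature frame
x_t, t, v = flow_matching_pair(x1, rng)
# Sanity check: integrating the velocity from time t to 1 recovers x1.
print(np.allclose(x_t + (1.0 - t) * v, x1))  # True
```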
&lt;h3 id="82-technical-advantages">8.2 Technical Advantages&lt;/h3>
&lt;p>F5-TTS provides innovative solutions to multiple issues in TTS models:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Issue&lt;/th>
&lt;th>F5-TTS's Solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Phoneme alignment, duration dependency&lt;/td>
&lt;td>Input characters are simply padded to the target sequence length for alignment, removing dependence on a duration predictor or external aligner&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Unnatural speech quality, weak cloning ability&lt;/td>
&lt;td>Uses diffusion-based speech token synthesis, with sway sampling technology to enhance naturalness&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="83-limitation-analysis">8.3 Limitation Analysis&lt;/h3>
&lt;p>F5-TTS has innovations in diffusion model optimization but also faces some practical application challenges:&lt;/p>
&lt;h4 id="831-inference-requires-multistep-sampling">8.3.1 Inference Requires Multi-Step Sampling&lt;/h4>
&lt;p>Although sway sampling optimizes the path, inference must still run a multi-step diffusion sampling process (roughly 20 steps)&lt;/p>
&lt;h4 id="832-dependency-on-vocoder">8.3.2 Dependency on Vocoder&lt;/h4>
&lt;p>Final speech quality depends heavily on the vocoder (e.g., Vocos, BigVGAN), which must be deployed separately&lt;/p>
&lt;h4 id="833-weak-audio-length-control">8.3.3 Weak Audio Length Control&lt;/h4>
&lt;p>Without an explicit duration predictor, speech-rate and length control require additional prompts or sampling techniques&lt;/p>
&lt;h4 id="834-license-restrictions">8.3.4 License Restrictions&lt;/h4>
&lt;p>Released under the CC-BY-NC-4.0 license: it cannot be used commercially without separate authorization&lt;/p>
&lt;h2 id="9-indextts-multimodal-conditional-tts">9. Index-TTS: Multimodal Conditional TTS&lt;/h2>
&lt;h3 id="91-architecture-design">9.1 Architecture Design&lt;/h3>
&lt;p>Index-TTS adopts an innovative design with multimodal conditional control:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text input] --&amp;gt; B[Pinyin-enhanced text encoder]
B --&amp;gt; C[GPT-style language model - Decoder-only]
C --&amp;gt; D[Predict speech token sequence]
D --&amp;gt; E[BigVGAN2 - decode to waveform]
F[Reference speech] --&amp;gt; G[Conformer conditional encoder]
G --&amp;gt; C
&lt;/code>&lt;/pre>
&lt;p>Main modules and their functions:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Module Name&lt;/th>
&lt;th>Function Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Text encoder (character + pinyin)&lt;/td>
&lt;td>Chinese supports pinyin input, English directly models characters - Can accurately capture pronunciation features, solving complex reading problems like polyphonic characters and neutral tones&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Neural audio tokenizer&lt;/td>
&lt;td>Uses FSQ encoder to convert audio to discrete tokens - Each frame (25Hz) expressed with multiple codebooks, token utilization rate reaches 98%, far higher than VQ&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>LLM-style Decoder (GPT structure)&lt;/td>
&lt;td>Decoder-only Transformer architecture - Conditional inputs include text tokens and reference audio - Supports multi-speaker migration and zero-shot speech generation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Conditional Conformer encoder&lt;/td>
&lt;td>Encodes implicit features like timbre, rhythm, prosody in reference audio - Provides stable control vector input to GPT, enhancing stability and timbre restoration&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>BigVGAN2&lt;/td>
&lt;td>Decodes final audio waveform - Balances high fidelity and real-time synthesis performance&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
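The FSQ idea from the table (bounded per-dimension rounding instead of a learned VQ codebook) can be sketched as follows; the level counts and tanh bounding below are illustrative assumptions, not Index-TTS's exact configuration:

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite Scalar Quantization sketch: squash each latent dimension
    into a bounded range, then round it to one of levels[d] values.
    Because every rounded cell is reachable, codebook usage stays high,
    which is the property contrasted with plain VQ above."""
    levels = np.asarray(levels, dtype=float)
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half          # each dim lands in [-half_d, +half_d]
    return np.round(bounded)             # straight-through in real training

z = np.array([0.3, -2.0, 5.0])
print(fsq_quantize(z, levels=[8, 8, 5]))  # [ 1. -3.  2.]
```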
&lt;h3 id="92-technical-advantages">9.2 Technical Advantages&lt;/h3>
&lt;p>Index-TTS provides innovative solutions to multiple issues in TTS models:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Issue&lt;/th>
&lt;th>IndexTTS's Solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Polyphonic character control&lt;/td>
&lt;td>Character+pinyin joint modeling, can explicitly specify pronunciation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Poor speaker consistency&lt;/td>
&lt;td>Introduces Conformer conditional module, uses reference audio to enhance control capability&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Low audio token utilization&lt;/td>
&lt;td>Uses FSQ instead of VQ-VAE, effectively utilizes codebook, enhances expressiveness&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Poor model stability&lt;/td>
&lt;td>Phased training + conditional control, reduces divergence, ensures synthesis quality&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Poor English compatibility&lt;/td>
&lt;td>IndexTTS 1.5 strengthens English token learning, enhances cross-language adaptability&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Slow inference&lt;/td>
&lt;td>GPT decoder + BigVGAN2, balances naturalness and speed, can deploy industrial systems&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="93-limitation-analysis">9.3 Limitation Analysis&lt;/h3>
&lt;p>Index-TTS has innovations in multimodal conditional control but also faces some practical application challenges:&lt;/p>
&lt;h4 id="931-prosody-control-depends-on-reference-audio">9.3.1 Prosody Control Depends on Reference Audio&lt;/h4>
&lt;ul>
&lt;li>Current model's prosody generation mainly relies on implicit guidance from input reference audio
&lt;ul>
&lt;li>Lacks explicit prosody annotation or token control mechanism, cannot manually control pauses, stress, intonation, and other information&lt;/li>
&lt;li>When reference audio is not ideal or style differences are large, prosody transfer effects can easily become unnatural or inconsistent&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Not conducive to template-based large-scale application scenarios (such as customer service, reading) where controllability and stability are needed&lt;/li>
&lt;/ul>
&lt;h4 id="932-generation-uncertainty">9.3.2 Generation Uncertainty&lt;/h4>
&lt;ul>
&lt;li>Uses GPT-style autoregressive generation structure, although speech naturalness is high, there is some uncertainty:
&lt;ul>
&lt;li>The same input in different inference rounds may fluctuate in speech rate, prosody, and slight timbre&lt;/li>
&lt;li>Difficult to completely reproduce generation results, not conducive to audio caching and version management&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>In high-consistency requirement scenarios (such as film post-production, legal synthesis), may affect delivery stability&lt;/li>
&lt;/ul>
&lt;h4 id="933-speaker-migration-not-completely-endtoend">9.3.3 Speaker Migration Not Completely End-to-End&lt;/h4>
&lt;ul>
&lt;li>Current speaker control module still relies on explicit reference audio embedding (such as speaker encoder) as conditional vector input
&lt;ul>
&lt;li>Speaker vectors need external module extraction, not end-to-end integration&lt;/li>
&lt;li>When reference audio quality is low or speaking style varies greatly, cloning effect is unstable&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Does not support completely text-driven speaker specification (such as specifying speaker ID generation), limiting automated deployment flexibility&lt;/li>
&lt;/ul>
&lt;h2 id="10-megatts3-unified-modeling-tts">10. Mega-TTS3: Unified Modeling TTS&lt;/h2>
&lt;h3 id="101-architecture-design">10.1 Architecture Design&lt;/h3>
&lt;p>Mega-TTS3 adopts an innovative design with unified modeling:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text Token - BPE] --&amp;gt; B[Text Encoder - Transformer]
B --&amp;gt; C[Unified Acoustic Model - UAM]
C --&amp;gt; D[Latent Acoustic Token]
subgraph Control Branches
E1[Prosody embedding] --&amp;gt; C
E2[Speaker representation] --&amp;gt; C
E3[Language label] --&amp;gt; C
end
D --&amp;gt; F[Vocoder - HiFi-GAN or FreGAN]
F --&amp;gt; G[Audio output]
&lt;/code>&lt;/pre>
&lt;p>Main modules and their functions:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Module&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Text Encoder&lt;/td>
&lt;td>Encodes input text tokens into semantic vectors, supports multilingual tokens&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>UAM (Unified Acoustic Model)&lt;/td>
&lt;td>Core module, fuses Text, Prosody, Speaker, Language information, predicts acoustic latent&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Continuous Speaker Modeling&lt;/td>
&lt;td>Models speaker information across time sequence, reducing style drift issues&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Prosody Control Module&lt;/td>
&lt;td>Provides independent prosody controller, can precisely control pauses, rhythm, pitch, etc.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Vocoder&lt;/td>
&lt;td>Finally decodes latent tokens into audio waveforms, using HiFi-GAN / FreGAN&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
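The pluggable control branches above can be illustrated with a toy fusion function in which each optional condition embedding is added onto the text hidden states; the shapes and the additive fusion rule are assumptions for illustration, not Mega-TTS3's code:

```python
import numpy as np

def fuse_conditions(text_h, branches):
    """Toy version of the pluggable control branches: each optional
    condition (prosody / speaker / language) contributes an embedding
    that is broadcast and added onto the text hidden states."""
    h = text_h.copy()                    # (T, d) text hidden states
    for name, emb in branches.items():
        if emb is not None:              # a branch can be unplugged
            h = h + emb[None, :]         # broadcast (d,) over time
    return h

T, d = 4, 8
text_h = np.zeros((T, d))
out = fuse_conditions(text_h, {
    "prosody": np.ones(d),
    "speaker": 2 * np.ones(d),
    "language": None,                    # unplugged branch
})
print(out[0, 0])  # 3.0
```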
&lt;h3 id="102-technical-advantages">10.2 Technical Advantages&lt;/h3>
&lt;p>Mega-TTS3 provides innovative solutions to multiple issues in TTS models:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Issue&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Mega-TTS3's Solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Inconsistent modeling granularity&lt;/td>
&lt;td>Different modules (text, prosody, speech) have inconsistent modeling granularity, causing information fragmentation and style transfer distortion&lt;/td>
&lt;td>Introduces Unified Acoustic Model (UAM), fusing text encoding, prosody information, language labels and audio latent in unified modeling, avoiding staged information loss&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Difficult multi-speaker modeling&lt;/td>
&lt;td>Traditional embedding methods struggle to stably model large numbers of speakers, with insufficient generalization and synthesis consistency&lt;/td>
&lt;td>Proposes Continuous Speaker Embedding, embedding speaker representation as temporal vector into unified modeling process, improving style consistency and transfer stability&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Weak control granularity&lt;/td>
&lt;td>Lacks pluggable independent control mechanisms when controlling emotion, speed, prosody, and other styles&lt;/td>
&lt;td>Designs pluggable control branches (Prosody / Emotion / Language / Speaker Embedding), each control signal independently modeled, can be combined and flexibly plugged in, enhancing control precision&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Cross-language interference&lt;/td>
&lt;td>Sparse language label modeling, multi-language models often interfere with each other, affecting speech quality&lt;/td>
&lt;td>Introduces explicit language label embedding + multilingual shared Transformer parameter mechanism, enhancing language sharing while ensuring language identifiability, alleviating inter-language interference&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="103-limitation-analysis">10.3 Limitation Analysis&lt;/h3>
&lt;p>Mega-TTS3 has innovations in unified modeling but also faces some practical application challenges:&lt;/p>
&lt;h4 id="1031-limited-control-granularity--weak-interpretability">10.3.1 Limited Control Granularity &amp;amp; Weak Interpretability&lt;/h4>
&lt;ul>
&lt;li>Although control dimensions are many (emotion, speed, prosody, etc.), they still rely on end-to-end model implicit modeling:
&lt;ul>
&lt;li>Lacks pluggable independent control modules&lt;/li>
&lt;li>Strong coupling between control variables, difficult to precisely control single dimensions&lt;/li>
&lt;li>Not suitable for &amp;ldquo;controllable interpretable synthesis&amp;rdquo; scenarios oriented toward industrial deployment&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="1032-uneven-multilingual-speech-quality">10.3.2 Uneven Multilingual Speech Quality&lt;/h4>
&lt;ul>
&lt;li>Despite supporting multilingual modeling, actual generation still shows:
&lt;ul>
&lt;li>Heavy dependence on language labels, label errors directly lead to pronunciation disorder&lt;/li>
&lt;li>Inter-language interference issues (such as accent drift in Chinese-English mixed reading)&lt;/li>
&lt;li>Low-resource language generation effects significantly lower than high-resource languages&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="11-summary-and-outlook">11. Summary and Outlook&lt;/h2>
&lt;h3 id="111-modern-tts-model-architecture-trends">11.1 Modern TTS Model Architecture Trends&lt;/h3>
&lt;p>Through in-depth analysis of ten mainstream TTS models, we can observe the following clear technical trends:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Unified Architecture&lt;/strong>: From early multi-module cascades to today's end-to-end unified architectures, TTS models are developing toward more integrated directions&lt;/li>
&lt;li>&lt;strong>Discrete Token Representation&lt;/strong>: Using discrete tokens to represent audio has become mainstream, more suitable for fusion with models like LLMs&lt;/li>
&lt;li>&lt;strong>Coexistence of Diffusion and Autoregression&lt;/strong>: Diffusion models provide high-quality generation capabilities, while autoregressive models have advantages in context modeling&lt;/li>
&lt;li>&lt;strong>Multimodal Conditional Control&lt;/strong>: Controlling speech generation through multimodal inputs such as reference audio and emotion labels, enhancing personalization capabilities&lt;/li>
&lt;li>&lt;strong>Deployment Format Standardization&lt;/strong>: Popularization of formats like GGUF makes TTS models easier to deploy on different platforms&lt;/li>
&lt;/ol>
&lt;h3 id="112-technical-challenges-and-future-directions">11.2 Technical Challenges and Future Directions&lt;/h3>
&lt;p>Despite significant progress in modern TTS models, they still face some key challenges:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Inference Efficiency vs. Audio Quality Balance&lt;/strong>: How to improve inference speed while ensuring high audio quality, especially on edge devices&lt;/li>
&lt;li>&lt;strong>Controllability vs. Naturalness Trade-off&lt;/strong>: Enhancing control capabilities often sacrifices speech naturalness; balancing the two is an ongoing challenge&lt;/li>
&lt;li>&lt;strong>Multilingual Consistency&lt;/strong>: Building truly high-quality multilingual TTS models, ensuring consistency and quality across languages&lt;/li>
&lt;li>&lt;strong>Emotional Expression Depth&lt;/strong>: Current models still have limitations in nuanced emotional expression, requiring deeper emotion modeling in the future&lt;/li>
&lt;li>&lt;strong>Long Text Coherence&lt;/strong>: Improving coherence and consistency in long text generation, especially at paragraph and chapter levels of speech synthesis&lt;/li>
&lt;/ol>
&lt;h3 id="113-application-scenario-matching-recommendations">11.3 Application Scenario Matching Recommendations&lt;/h3>
&lt;p>Different TTS models are suitable for different application scenarios. Here are some matching recommendations:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Application Scenario&lt;/th>
&lt;th>Recommended Models&lt;/th>
&lt;th>Rationale&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Edge devices/Low-resource environments&lt;/td>
&lt;td>Kokoro, Dia&lt;/td>
&lt;td>Lightweight design, supports ONNX/GGUF format, low latency&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>High-quality audio content creation&lt;/td>
&lt;td>Index-TTS, F5-TTS&lt;/td>
&lt;td>High-quality output, supports reference audio cloning, suitable for professional content production&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Multilingual customer service systems&lt;/td>
&lt;td>Mega-TTS3&lt;/td>
&lt;td>Excellent multilingual support, unified modeling architecture, good stability&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Conversational voice assistants&lt;/td>
&lt;td>CosyVoice, Orpheus&lt;/td>
&lt;td>Good compatibility with LLMs, supports dialogue context, natural emotional expression&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Local deployment voice applications&lt;/td>
&lt;td>OuteTTS&lt;/td>
&lt;td>GGUF format optimization, supports CPU inference, no need for cloud services&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>With continued technological advancement, we can expect future TTS models to further break modal boundaries, achieving more natural, personalized, and emotionally rich voice interaction experiences.&lt;/p></description></item><item><title>Speech Synthesis Evolution: From Traditional TTS to Multimodal Voice Models</title><link>https://ziyanglin.netlify.app/en/post/tts-fundamentals/</link><pubDate>Fri, 27 Jun 2025 07:01:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/tts-fundamentals/</guid><description>&lt;h2 id="1-background">1. Background&lt;/h2>
&lt;h3 id="11-pain-points-of-traditional-tts-models">1.1 Pain Points of Traditional TTS Models&lt;/h3>
&lt;p>Traditional Text-to-Speech (TTS) models have excelled in voice cloning and speech synthesis, typically employing a two-stage process:&lt;/p>
&lt;ol>
&lt;li>Acoustic Model (e.g., Tacotron): Converts text into intermediate acoustic representations (such as spectrograms).&lt;/li>
&lt;li>Vocoder (e.g., WaveGlow, HiFi-GAN): Transforms acoustic representations into waveform audio.&lt;/li>
&lt;/ol>
&lt;p>Despite these models&amp;rsquo; ability to produce realistic sounds, their primary focus remains on replicating a speaker's voice, lacking the flexibility to adapt in dynamic, context-sensitive conversations.&lt;/p>
&lt;h3 id="12-initial-integration-of-llms-contextaware-conversational-voice-models">1.2 Initial Integration of LLMs: Context-Aware Conversational Voice Models&lt;/h3>
&lt;p>The emergence of Large Language Models (LLMs) has provided rich reasoning capabilities and contextual understanding. Integrating LLMs into the TTS workflow enables synthesis that goes beyond mere sound production to intelligent conversational responses within context.&lt;/p>
&lt;p>Typical cascade workflow (speech-to-speech model):&lt;/p>
&lt;ul>
&lt;li>STT (Speech-to-Text): e.g., Whisper&lt;/li>
&lt;li>LLM (Contextual Understanding and Generation): e.g., fine-tuned Llama&lt;/li>
&lt;li>TTS (Text-to-Speech): e.g., ElevenLabs&lt;/li>
&lt;/ul>
&lt;p>Example workflow:&lt;/p>
&lt;pre>&lt;code>Speech-to-Text (e.g., Whisper) : &amp;quot;Hello friend, how are you?&amp;quot;
Conversational LLM (e.g., Llama) : &amp;quot;Hi there! I am fine and you?&amp;quot;
Text-to-Speech (e.g., ElevenLabs) : Generates natural speech response
&lt;/code>&lt;/pre>
&lt;p>This pipeline approach integrates the strengths of specialized modules but has limitations:
The transcribed text received by the LLM loses rich prosodic and emotional cues from the original speech, resulting in responses that lack the nuanced expression of the original voice.&lt;/p>
&lt;h3 id="13-direct-speech-input-to-llms-audio-encoders-and-neural-codecs">1.3 Direct Speech Input to LLMs: Audio Encoders and Neural Codecs&lt;/h3>
&lt;p>To address the above bottlenecks, researchers have attempted to directly input speech representations into LLMs. Currently, there are two main approaches to converting continuous high-dimensional speech signals into formats that LLMs can process:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Audio Encoders&lt;/strong>: Convert continuous speech into discrete tokens, preserving key information such as rhythm and emotion.&lt;/p>
&lt;blockquote>
&lt;p>New Challenge: Audio encoders must balance between preserving critical information and the need for compact, discrete representations.&lt;/p>
&lt;/blockquote>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Neural Codecs&lt;/strong>: Such as DAC, Encodec, XCodec, which convert audio waveforms into discrete token sequences, bridging the gap between continuous audio and discrete token requirements.&lt;/p>
&lt;blockquote>
&lt;p>New Challenge: Audio tokens are far more numerous than text, and the quantization process may lead to loss of details.&lt;/p>
&lt;/blockquote>
&lt;/li>
&lt;/ul>
&lt;h2 id="2-tts-model-structure">2. TTS Model Structure&lt;/h2>
&lt;p>The basic structural flow of traditional TTS models is typically as follows:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text] --&amp;gt; B[Encoder]
B --&amp;gt; C[Intermediate Representation]
C --&amp;gt; D[Decoder]
D --&amp;gt; E[Mel Spectrogram]
E --&amp;gt; F[Vocoder]
F --&amp;gt; G[Waveform]
&lt;/code>&lt;/pre>
&lt;p>This workflow includes several key components:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Text Encoder&lt;/strong>: Responsible for converting input text into an intermediate representation, usually a deep learning model such as a Transformer or CNN. The encoder needs to understand the semantics, syntactic structure of the text, and extract pronunciation-related features.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Intermediate Representation&lt;/strong>: The bridge connecting the encoder and decoder, typically a set of vectors or feature maps containing the semantic information of the text and some preliminary acoustic features.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Decoder&lt;/strong>: Converts the intermediate representation into acoustic features, such as Mel spectrograms. The decoder needs to consider factors like prosody, rhythm, and pauses in speech.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Vocoder&lt;/strong>: Transforms acoustic features (such as Mel spectrograms) into final waveform audio. Modern vocoders like HiFi-GAN and WaveGlow can generate high-quality speech waveforms.&lt;/p>
&lt;/li>
&lt;/ol>
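The four components above compose into a single pipeline. A minimal sketch with stub stages (a real system would substitute trained models for the lambdas):

```python
def synthesize(text, encoder, decoder, vocoder):
    """The pipeline above as plain function composition: text ->
    intermediate representation -> mel spectrogram -> waveform."""
    hidden = encoder(text)        # e.g. Transformer text encoder
    mel = decoder(hidden)         # e.g. Tacotron-style decoder
    return vocoder(mel)           # e.g. HiFi-GAN

# Stub stages that just tag the data, to show the flow of control.
wave = synthesize(
    "hello",
    encoder=lambda t: f"hidden({t})",
    decoder=lambda h: f"mel({h})",
    vocoder=lambda m: f"wave({m})",
)
print(wave)  # wave(mel(hidden(hello)))
```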
&lt;h2 id="3-indepth-analysis-of-audio-encoder-technology">3. In-Depth Analysis of Audio Encoder Technology&lt;/h2>
&lt;p>Audio encoders are crucial bridges connecting continuous speech signals with discrete token representations. Below, we delve into several mainstream audio encoding technologies and their working principles.&lt;/p>
&lt;h3 id="31-vqvae-vector-quantized-variational-autoencoder">3.1 VQ-VAE (Vector Quantized Variational Autoencoder)&lt;/h3>
&lt;p>VQ-VAE is an effective method for converting continuous audio signals into discrete codes. Its working principle is as follows:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Encoding Stage&lt;/strong>: Uses an encoder network to convert input audio into continuous latent representations.&lt;/li>
&lt;li>&lt;strong>Quantization Stage&lt;/strong>: Maps continuous latent representations to the nearest discrete codebook vectors.&lt;/li>
&lt;li>&lt;strong>Decoding Stage&lt;/strong>: Uses a decoder network to reconstruct audio signals from quantized latent representations.&lt;/li>
&lt;/ol>
&lt;p>The advantage of VQ-VAE lies in its ability to learn compact discrete representations while preserving key information needed for audio reconstruction. However, it also faces challenges such as low codebook utilization (codebook collapse) and trade-offs between reconstruction quality and compression rate.&lt;/p>
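The quantization stage at the heart of VQ-VAE is a nearest-neighbor lookup into the codebook; a minimal numpy sketch of steps 1-3 above:

```python
import numpy as np

def vq_quantize(z, codebook):
    """Quantization stage of VQ-VAE: map each latent vector to the index
    of its nearest codebook entry (Euclidean distance), then look the
    vector back up for the decoder."""
    # (N, 1, d) - (1, K, d) -> pairwise squared distances of shape (N, K)
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)          # discrete codes
    return idx, codebook[idx]        # codes and their quantized vectors

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.0]])
z = np.array([[0.9, 1.2], [-0.8, 0.1]])
idx, zq = vq_quantize(z, codebook)
print(idx)  # [1 2]
```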
&lt;h3 id="32-encodec">3.2 Encodec&lt;/h3>
&lt;p>Encodec is an efficient neural audio codec proposed by Meta AI, combining the ideas of VQ-VAE with multi-level quantization techniques:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Multi-Resolution Encoding&lt;/strong>: Uses encoders with different time resolutions to capture audio features at different time scales.&lt;/li>
&lt;li>&lt;strong>Residual Quantization&lt;/strong>: Adopts a multi-level quantization strategy, with each level of quantizer processing the residual error from the previous level.&lt;/li>
&lt;li>&lt;strong>Variable Bit Rate&lt;/strong>: Supports different compression levels, allowing for adjustment of the balance between bit rate and audio quality according to needs.&lt;/li>
&lt;/ol>
&lt;p>A significant advantage of Encodec is its ability to maintain good audio quality at extremely low bit rates, making it particularly suitable for speech synthesis and audio transmission applications.&lt;/p>
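&lt;p>The residual quantization idea can be sketched as follows (toy random codebooks, not Encodec's trained ones): each level quantizes whatever error the previous levels left behind, so adding levels refines the reconstruction.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_levels, n_codes, dim = 4, 256, 32
codebooks = rng.normal(size=(n_levels, n_codes, dim))

def rvq_encode(z):
    residual = z.copy()
    codes, quantized = [], np.zeros_like(z)
    for cb in codebooks:
        # nearest codebook vector to the *current residual*, not to z itself
        d = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = d.argmin(axis=1)
        codes.append(idx)
        quantized += cb[idx]
        residual -= cb[idx]   # the next level quantizes what is left over
    return np.stack(codes), quantized

z = rng.normal(size=(6, 32))
codes, z_q = rvq_encode(z)   # codes has shape (levels, frames)
```

&lt;p>Variable bit rate falls out naturally: transmitting fewer levels of &lt;code>codes&lt;/code> lowers the bit rate at the cost of a coarser reconstruction.&lt;/p>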
&lt;h3 id="33-dac-discrete-autoencoder-for-audio-compression">3.3 DAC (Discrete Autoencoder for Audio Compression)&lt;/h3>
&lt;p>DAC is a discrete autoencoder designed specifically for audio compression, with features including:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Hierarchical Quantization&lt;/strong>: Uses a multi-level quantization structure, with different levels capturing different levels of audio detail.&lt;/li>
&lt;li>&lt;strong>Context Modeling&lt;/strong>: Utilizes autoregressive models to model quantized token sequences, capturing temporal dependencies.&lt;/li>
&lt;li>&lt;strong>Perceptual Loss Function&lt;/strong>: Combines spectral loss and adversarial loss to optimize audio quality as perceived by the human ear.&lt;/li>
&lt;/ol>
&lt;p>DAC maintains excellent audio quality even at high compression rates, making it particularly suitable for speech synthesis applications requiring efficient storage and transmission.&lt;/p>
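&lt;p>The spectral part of such a perceptual loss can be approximated by a multi-resolution spectral distance, sketched here in plain NumPy (the window sizes and the 1e-5 log floor are illustrative choices, not DAC's exact configuration, and the adversarial term is omitted):&lt;/p>

```python
import numpy as np

def stft_mag(x, win, hop):
    # frame the signal, apply a Hann window, and take magnitude spectra
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multiscale_spectral_loss(ref, est, wins=(256, 512, 1024)):
    # average linear- and log-magnitude distances over several resolutions
    loss = 0.0
    for win in wins:
        a = stft_mag(ref, win, win // 4)
        b = stft_mag(est, win, win // 4)
        loss += np.mean(np.abs(a - b))
        loss += np.mean(np.abs(np.log(a + 1e-5) - np.log(b + 1e-5)))
    return loss / len(wins)

x = np.sin(2 * np.pi * 440 * np.arange(24_000) / 24_000)  # 1s test tone
```

&lt;p>Measuring the error at several window sizes penalizes artifacts at both fine and coarse time scales, which correlates better with perceived quality than a single waveform-level distance.&lt;/p>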
&lt;h2 id="4-audio-data-formats-and-transmission-in-tts-systems">4. Audio Data Formats and Transmission in TTS Systems&lt;/h2>
&lt;p>In TTS systems, the choice of audio formats and transmission methods is crucial for practical applications. This chapter details the various audio formats, transmission protocols, and frontend processing techniques used in TTS systems.&lt;/p>
&lt;h3 id="41-common-audio-formats-and-their-characteristics">4.1 Common Audio Formats and Their Characteristics&lt;/h3>
&lt;p>TTS systems support multiple audio formats, each with specific use cases and trade-offs. Here are the most commonly used formats:&lt;/p>
&lt;h4 id="411-pcm-pulse-code-modulation">4.1.1 PCM (Pulse Code Modulation)&lt;/h4>
&lt;p>&lt;strong>Characteristics:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>No Compression&lt;/strong>: Raw audio data without any compression&lt;/li>
&lt;li>&lt;strong>Bit Depth&lt;/strong>: Typically 16-bit (also 8-bit, 24-bit, 32-bit, etc.)&lt;/li>
&lt;li>&lt;strong>Simple Format&lt;/strong>: Directly represents audio waveform as digital samples&lt;/li>
&lt;li>&lt;strong>File Size&lt;/strong>: Large, about 2.8MB for one minute of 24kHz/16-bit mono audio&lt;/li>
&lt;li>&lt;strong>Processing Overhead&lt;/strong>: Low, no decoding required&lt;/li>
&lt;li>&lt;strong>Quality&lt;/strong>: Lossless, preserves all original audio information&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Use Cases:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Internal audio processing pipelines&lt;/li>
&lt;li>Real-time applications requiring low latency&lt;/li>
&lt;li>Intermediate format for further processing&lt;/li>
&lt;/ul>
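&lt;p>The file-size figure quoted above follows directly from the PCM parameters:&lt;/p>

```python
# raw PCM size for one minute of 24 kHz / 16-bit mono audio
sample_rate = 24_000        # samples per second
bytes_per_sample = 2        # 16-bit samples
channels = 1                # mono
seconds = 60

size_bytes = sample_rate * bytes_per_sample * channels * seconds
print(size_bytes)           # 2880000 bytes, roughly 2.8 MB
```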
&lt;h4 id="412-opus">4.1.2 Opus&lt;/h4>
&lt;p>&lt;strong>Characteristics:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High Compression Ratio&lt;/strong>: Much smaller than PCM while maintaining high quality&lt;/li>
&lt;li>&lt;strong>Low Latency&lt;/strong>: Algorithmic delay of 26.5ms by default, configurable down to 5ms in restricted low-delay mode&lt;/li>
&lt;li>&lt;strong>Variable Bitrate&lt;/strong>: 6kbps to 510kbps&lt;/li>
&lt;li>&lt;strong>Adaptive&lt;/strong>: Can adjust based on network conditions&lt;/li>
&lt;li>&lt;strong>Designed for Network Transmission&lt;/strong>: Strong packet loss resistance&lt;/li>
&lt;li>&lt;strong>Open Standard&lt;/strong>: Royalty-free, widely supported&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Use Cases:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Network streaming&lt;/li>
&lt;li>WebRTC applications&lt;/li>
&lt;li>Real-time communication systems&lt;/li>
&lt;li>WebSocket audio transmission&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Opus Encoding Configuration:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Sample Rate&lt;/strong>: 24000 Hz&lt;/li>
&lt;li>&lt;strong>Channels&lt;/strong>: 1 (Mono)&lt;/li>
&lt;li>&lt;strong>Bitrate&lt;/strong>: 32000 bps (32 kbps)&lt;/li>
&lt;li>&lt;strong>Frame Size&lt;/strong>: 480 samples (corresponding to 20ms@24kHz)&lt;/li>
&lt;li>&lt;strong>Complexity&lt;/strong>: 5 (balanced setting)&lt;/li>
&lt;/ul>
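&lt;p>The configuration above implies a roughly 12:1 compression ratio per frame, which a few lines of arithmetic make concrete:&lt;/p>

```python
# 20 ms frames at 24 kHz: data per frame before and after Opus encoding
sample_rate = 24_000
frame_ms = 20
bitrate_bps = 32_000

samples_per_frame = sample_rate * frame_ms // 1000         # 480 samples
pcm_bytes_per_frame = samples_per_frame * 2                # 960 bytes of Int16 PCM
opus_bytes_per_frame = bitrate_bps * frame_ms // (1000 * 8)  # about 80 bytes
```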
&lt;h4 id="413-mp3">4.1.3 MP3&lt;/h4>
&lt;p>&lt;strong>Characteristics:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High Compression Ratio&lt;/strong>: Much smaller than PCM&lt;/li>
&lt;li>&lt;strong>Wide Compatibility&lt;/strong>: Supported by almost all devices and platforms&lt;/li>
&lt;li>&lt;strong>Variable Bitrate&lt;/strong>: Typically 32kbps to 320kbps&lt;/li>
&lt;li>&lt;strong>Lossy Compression&lt;/strong>: Loses some audio information&lt;/li>
&lt;li>&lt;strong>Encoding/Decoding Delay&lt;/strong>: Higher, not suitable for real-time applications&lt;/li>
&lt;li>&lt;strong>File Size&lt;/strong>: Medium, about 1MB for one minute of audio (128kbps)&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Use Cases:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Non-real-time applications&lt;/li>
&lt;li>Scenarios requiring wide compatibility&lt;/li>
&lt;li>Audio storage and distribution&lt;/li>
&lt;/ul>
&lt;h4 id="414-wav">4.1.4 WAV&lt;/h4>
&lt;p>&lt;strong>Characteristics:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Container Format&lt;/strong>: Typically contains PCM data&lt;/li>
&lt;li>&lt;strong>No Compression&lt;/strong>: Large files&lt;/li>
&lt;li>&lt;strong>Metadata Support&lt;/strong>: Contains information about sample rate, channels, etc.&lt;/li>
&lt;li>&lt;strong>Wide Compatibility&lt;/strong>: Supported by almost all audio software&lt;/li>
&lt;li>&lt;strong>Simple Structure&lt;/strong>: Easy to process&lt;/li>
&lt;li>&lt;strong>Quality&lt;/strong>: Typically lossless&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Use Cases:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Audio archiving&lt;/li>
&lt;li>Professional audio processing&lt;/li>
&lt;li>Testing and development environments&lt;/li>
&lt;/ul>
&lt;h3 id="42-tts-audio-transmission-and-processing">4.2 TTS Audio Transmission and Processing&lt;/h3>
&lt;h4 id="421-basic-audio-parameters">4.2.1 Basic Audio Parameters&lt;/h4>
&lt;p>In TTS systems, audio data typically has the following basic parameters:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Sample Rate&lt;/strong>: Typically 24000 Hz (24 kHz)&lt;/li>
&lt;li>&lt;strong>Channels&lt;/strong>: 1 (Mono)&lt;/li>
&lt;li>&lt;strong>Bit Depth&lt;/strong>: 16-bit (Int16)&lt;/li>
&lt;/ul>
&lt;h4 id="422-transmission-protocols">4.2.2 Transmission Protocols&lt;/h4>
&lt;p>&lt;strong>HTTP REST API&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Content-Type&lt;/strong>: &lt;code>audio/opus&lt;/code>&lt;/li>
&lt;li>&lt;strong>Custom Header&lt;/strong>: &lt;code>X-Sample-Rate: 24000&lt;/code>&lt;/li>
&lt;li>&lt;strong>Data Format&lt;/strong>: Raw Opus packets (not wrapped in an Ogg container)&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>WebSocket Protocol&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Subprotocol&lt;/strong>: &lt;code>tts-1.0&lt;/code>&lt;/li>
&lt;li>&lt;strong>Message Structure&lt;/strong>: 1 byte type + 4 bytes length (little-endian) + payload&lt;/li>
&lt;li>&lt;strong>Audio Message Type&lt;/strong>: &lt;code>AUDIO = 0x12&lt;/code>&lt;/li>
&lt;li>&lt;strong>Audio Data&lt;/strong>: Raw Opus encoded data&lt;/li>
&lt;/ul>
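&lt;p>A minimal sketch of packing and parsing this message layout (the payload here is a placeholder string, not a real Opus packet):&lt;/p>

```python
AUDIO = 0x12   # audio message type from the tts-1.0 subprotocol

def pack_message(msg_type, payload):
    # 1 byte type, then 4-byte little-endian payload length, then payload
    return bytes([msg_type]) + len(payload).to_bytes(4, "little") + payload

def unpack_message(data):
    msg_type = data[0]
    length = int.from_bytes(data[1:5], "little")
    return msg_type, data[5:5 + length]

frame = pack_message(AUDIO, b"opus-packet-bytes")
```

&lt;p>The explicit length field lets the receiver split a byte stream back into messages without relying on WebSocket frame boundaries.&lt;/p>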
&lt;h4 id="423-frontend-processing-techniques">4.2.3 Frontend Processing Techniques&lt;/h4>
&lt;p>The frontend of TTS systems needs to process received audio data, primarily in two ways:&lt;/p>
&lt;p>&lt;strong>WebCodecs API Decoding&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Uses browser hardware acceleration to decode Opus data&lt;/li>
&lt;li>Converts decoded data to Float32Array for Web Audio API&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>PCM Direct Processing&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Converts Int16 PCM data to Float32 audio data (mapping the -32768 to 32767 integer range onto -1.0 to 1.0)&lt;/li>
&lt;li>Creates AudioBuffer and plays through Web Audio API&lt;/li>
&lt;/ul>
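&lt;p>The Int16-to-Float32 conversion is a single scaling step; a NumPy sketch (the browser-side equivalent would operate on JavaScript typed arrays):&lt;/p>

```python
import numpy as np

def int16_to_float32(pcm):
    # map the -32768..32767 integer range onto -1.0..1.0 floats
    return pcm.astype(np.float32) / 32768.0
```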
&lt;h4 id="424-audio-processing-enhancements">4.2.4 Audio Processing Enhancements&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Fade In/Out Effects&lt;/strong>: Configurable audio fade in/out processing, default 10ms&lt;/li>
&lt;li>&lt;strong>Audio Gain Adjustment&lt;/strong>: Adjustable volume&lt;/li>
&lt;li>&lt;strong>Watermarking&lt;/strong>: Optional audio watermarking functionality&lt;/li>
&lt;li>&lt;strong>Adaptive Batch Processing&lt;/strong>: Dynamically adjusts audio processing batch size based on performance&lt;/li>
&lt;/ul>
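&lt;p>The fade in/out enhancement amounts to multiplying the clip's edges by a short gain ramp; a sketch using a linear ramp and the 10ms default (real implementations may prefer a cosine ramp):&lt;/p>

```python
import numpy as np

def apply_fade(audio, sample_rate=24_000, fade_ms=10):
    # linear ramps over the first and last fade_ms of the clip
    n = int(sample_rate * fade_ms / 1000)
    out = audio.astype(np.float32).copy()
    ramp = np.linspace(0.0, 1.0, n, endpoint=False)
    out[:n] *= ramp          # fade in
    out[-n:] *= ramp[::-1]   # fade out
    return out
```

&lt;p>Starting and ending each chunk at zero amplitude avoids the audible clicks that abrupt waveform discontinuities produce.&lt;/p>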
&lt;h3 id="43-audio-data-flow-in-tts-systems">4.3 Audio Data Flow in TTS Systems&lt;/h3>
&lt;p>In TTS models, audio data follows this flow from generation to playback:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[Text Input] --&amp;gt; B[TTS Engine]
B --&amp;gt; C[PCM Audio Data]
C --&amp;gt; D[Audio Encoding Opus or MP3]
D --&amp;gt; E[HTTP or WebSocket Transmission]
E --&amp;gt; F[Frontend Reception]
F --&amp;gt; G[Decoding]
G --&amp;gt; H[Web Audio API Playback]
&lt;/code>&lt;/pre>
&lt;h3 id="44-format-selection-in-practical-applications">4.4 Format Selection in Practical Applications&lt;/h3>
&lt;p>In practical TTS applications, format selection is primarily based on the use case:&lt;/p>
&lt;p>&lt;strong>Real-time Streaming TTS Applications&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Opus&lt;/strong> is preferred due to its low latency characteristics and high compression ratio&lt;/li>
&lt;li>Suitable for voice assistants, real-time dialogue systems, online customer service, etc.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Non-real-time TTS Applications&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>MP3&lt;/strong> is more commonly used because it's supported by almost all devices and platforms&lt;/li>
&lt;li>Suitable for audiobooks, pre-recorded announcements, content distribution, etc.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Internal System Processing&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>PCM&lt;/strong> format is commonly used for internal processing, providing the highest quality and lowest processing delay&lt;/li>
&lt;li>Suitable for intermediate stages in audio processing pipelines&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Archiving and Professional Applications&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>WAV&lt;/strong> format is suitable for scenarios requiring metadata preservation and highest quality&lt;/li>
&lt;li>Suitable for professional audio editing, archiving, and quality assessment&lt;/li>
&lt;/ul>
&lt;h2 id="5-integration-of-neural-codecs-with-llms">5. Integration of Neural Codecs with LLMs&lt;/h2>
&lt;p>The fusion of neural codecs with LLMs is a key step in achieving end-to-end speech understanding and generation. This fusion faces several technical challenges:&lt;/p>
&lt;h3 id="51-token-rate-mismatch-problem">5.1 Token Rate Mismatch Problem&lt;/h3>
&lt;p>Speech signals are encoded at a much higher temporal resolution than text, so the same utterance produces far more audio tokens than text tokens. For example, one second of speech might require hundreds of tokens to represent, while the corresponding text might only need a few tokens. This mismatch poses challenges for LLM processing.&lt;/p>
&lt;p>Solutions include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Hierarchical Encoding&lt;/strong>: Using multi-level encoding structures to capture information at different time scales&lt;/li>
&lt;li>&lt;strong>Downsampling Strategies&lt;/strong>: Downsampling in the time dimension to reduce the number of tokens&lt;/li>
&lt;li>&lt;strong>Attention Mechanism Optimization&lt;/strong>: Designing special attention mechanisms to effectively handle long token sequences&lt;/li>
&lt;/ul>
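&lt;p>The scale of the mismatch is easy to quantify; the numbers below assume an Encodec-style codec running at 75 frames per second with 8 residual codebooks, and a rough speaking rate of 4 text tokens per second (both figures are illustrative):&lt;/p>

```python
# token-rate comparison between audio and text streams (illustrative numbers)
frame_rate_hz = 75          # quantized audio frames per second
n_codebooks = 8             # residual quantizer levels per frame
audio_tokens_per_second = frame_rate_hz * n_codebooks   # 600 tokens

text_tokens_per_second = 4  # rough subword rate for conversational speech
ratio = audio_tokens_per_second / text_tokens_per_second  # about 150x
```

&lt;p>Downsampling strategies and hierarchical encoding attack exactly this ratio, shrinking the audio token stream before it reaches the LLM.&lt;/p>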
&lt;h3 id="52-crossmodal-representation-alignment">5.2 Cross-Modal Representation Alignment&lt;/h3>
&lt;p>Text and speech are information from two different modalities, with natural differences in their representation spaces. To achieve effective fusion, the representation alignment problem needs to be solved.&lt;/p>
&lt;p>Main methods include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Joint Training&lt;/strong>: Simultaneously training text encoders and audio encoders to align their representation spaces&lt;/li>
&lt;li>&lt;strong>Contrastive Learning&lt;/strong>: Using contrastive loss functions to bring related text and speech representations closer while pushing unrelated representations apart&lt;/li>
&lt;li>&lt;strong>Cross-Modal Transformers&lt;/strong>: Designing specialized Transformer architectures to handle multi-modal inputs and learn relationships between them&lt;/li>
&lt;/ul>
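&lt;p>The contrastive-learning approach is commonly implemented with an InfoNCE-style loss; a NumPy sketch in which matched text/audio pairs sit on the diagonal of the similarity matrix (the 0.07 temperature is a conventional but arbitrary choice):&lt;/p>

```python
import numpy as np

def info_nce(text_emb, audio_emb, temperature=0.07):
    # cosine-similarity logits between every text/audio pair in the batch
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = t @ a.T / temperature
    # softmax over each row; matched pairs on the diagonal are the positives
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

&lt;p>Minimizing this loss pulls each text embedding toward its paired audio embedding while pushing it away from the other audio clips in the batch, aligning the two representation spaces.&lt;/p>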
&lt;h3 id="53-contextaware-speech-synthesis">5.3 Context-Aware Speech Synthesis&lt;/h3>
&lt;p>Traditional TTS models often lack understanding of context, so the generated speech misses appropriate emotional and prosodic variation. After fusion with LLMs, models can generate more natural speech based on conversation context.&lt;/p>
&lt;p>Key technologies include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Context Encoding&lt;/strong>: Encoding conversation history into context vectors that influence speech generation&lt;/li>
&lt;li>&lt;strong>Emotion Control&lt;/strong>: Automatically adjusting the emotional color of speech based on context understanding&lt;/li>
&lt;li>&lt;strong>Prosody Modeling&lt;/strong>: Adjusting speech rhythm, pauses, and stress according to semantic importance and conversation state&lt;/li>
&lt;/ul>
&lt;h2 id="6-future-development-directions">6. Future Development Directions&lt;/h2>
&lt;p>As technology continues to advance, TTS models are developing in the following directions:&lt;/p>
&lt;h3 id="61-endtoend-multimodal-models">6.1 End-to-End Multimodal Models&lt;/h3>
&lt;p>Future voice models will break down barriers between modules, achieving true end-to-end training and inference. Such models will be able to generate natural speech outputs directly from raw inputs (text, speech, images, etc.) without explicit conversion of intermediate representations.&lt;/p>
&lt;h3 id="62-personalization-and-adaptability">6.2 Personalization and Adaptability&lt;/h3>
&lt;p>Next-generation TTS models will place greater emphasis on personalization and adaptability, automatically adjusting speech characteristics based on user preferences, conversation history, and environmental factors, providing a more natural and humanized interaction experience.&lt;/p>
&lt;h3 id="63-lowresource-scenario-optimization">6.3 Low-Resource Scenario Optimization&lt;/h3>
&lt;p>For low-resource languages and special application scenarios, researchers are exploring how to leverage transfer learning, meta-learning, and data augmentation techniques to build high-quality TTS models under limited data conditions.&lt;/p>
&lt;h3 id="64-realtime-interactive-speech-synthesis">6.4 Real-Time Interactive Speech Synthesis&lt;/h3>
&lt;p>With the advancement of algorithms and hardware, real-time interactive speech synthesis will become possible, supporting more natural and fluid human-machine dialogue, providing better user experiences for virtual assistants, customer service robots, and metaverse applications.&lt;/p>
&lt;h2 id="7-conclusion">7. Conclusion&lt;/h2>
&lt;p>Speech synthesis technology is undergoing a significant transformation from traditional TTS to multimodal voice models. Through the integration of large language models, neural codecs, and advanced audio processing technologies, modern TTS models can not only generate high-quality speech but also understand context, express emotions, and naturally adapt in dynamic conversations. Despite facing many challenges, with continuous technological advancement, we can expect more intelligent, natural, and personalized voice interaction experiences.&lt;/p></description></item></channel></rss>