<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Natural Language Processing | Ziyang Lin's Personal Website</title><link>https://ziyanglin.netlify.app/zh/tags/%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86/</link><atom:link href="https://ziyanglin.netlify.app/zh/tags/%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86/index.xml" rel="self" type="application/rss+xml"/><description>Natural Language Processing</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>zh-Hans</language><lastBuildDate>Sat, 28 Jun 2025 13:00:00 +0000</lastBuildDate><image><url>https://ziyanglin.netlify.app/img/icon-192.png</url><title>Natural Language Processing</title><link>https://ziyanglin.netlify.app/zh/tags/%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86/</link></image><item><title>Modern ASR Technology Explained: From Traditional Models to the New LLM-Driven Paradigm</title><link>https://ziyanglin.netlify.app/zh/post/asr-technology-overview/</link><pubDate>Sat, 28 Jun 2025 13:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/zh/post/asr-technology-overview/</guid><description>&lt;h2 id="1-">1. Background&lt;/h2>
&lt;h3 id="11-asr">1.1 传统ASR模型的痛点&lt;/h3>
&lt;p>传统的自动语音识别（ASR）模型，如基于隐马尔可夫模型-高斯混合模型（HMM-GMM）或深度神经网络（DNN）的模型，在特定领域和受控环境下表现良好，但面临诸多挑战：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Data sparsity&lt;/strong>: Heavy reliance on large, high-quality labeled datasets leads to poor generalization to low-resource languages and specific accents.&lt;/li>
&lt;li>&lt;strong>Insufficient robustness&lt;/strong>: Performance degrades sharply in real-world scenarios such as noisy environments, far-field pickup, and multi-speaker conversations.&lt;/li>
&lt;li>&lt;strong>Lack of contextual understanding&lt;/strong>: Models are usually limited to a direct mapping from acoustic features to text, with no grasp of long-range context, semantics, or speaker intent, which leads to recognition errors (e.g. homophone confusion).&lt;/li>
&lt;li>&lt;strong>Limited multi-task capability&lt;/strong>: Traditional models are typically single-task, supporting only transcription and unable to also perform speaker diarization, language identification, translation, and similar tasks.&lt;/li>
&lt;/ol>
&lt;h3 id="12-llm-asr-">1.2 大语言模型（LLM）驱动的 ASR 新范式&lt;/h3>
&lt;p>近年来，以 &lt;code>Whisper&lt;/code> 为代表的端到端大型 ASR 模型，通过在海量、多样化的无监督或弱监督数据上进行预训练，展现了前所未有的鲁棒性和泛化能力。这些模型通常采用 Encoder-Decoder 架构，将 ASR 任务视为一个序列到序列的翻译问题。&lt;/p>
&lt;p>&lt;strong>典型流程&lt;/strong>：&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Raw audio waveform&amp;quot;] --&amp;gt; B[&amp;quot;Feature extraction (e.g. log-Mel spectrogram)&amp;quot;]
B --&amp;gt; C[&amp;quot;Transformer Encoder&amp;quot;]
C --&amp;gt; D[&amp;quot;Latent Representation&amp;quot;]
D --&amp;gt; E[&amp;quot;Transformer Decoder&amp;quot;]
E --&amp;gt; F[&amp;quot;Output text sequence&amp;quot;]
&lt;/code>&lt;/pre>
&lt;p>This approach not only simplifies the complex pipeline of traditional ASR, but also learns rich acoustic and linguistic knowledge from large-scale data, achieving excellent performance even in zero-shot scenarios.&lt;/p>
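The sequence-to-sequence flow above can be sketched with a toy autoregressive loop. `toy_encoder`, `toy_decoder`, and `greedy_decode` are hypothetical stand-ins, not part of any real ASR library; an actual system would encode log-Mel frames with a Transformer and score a large token vocabulary.

```python
# Minimal sketch of the encoder-decoder ASR loop described above.
# All functions are illustrative stand-ins for the real Transformer modules.

def toy_encoder(audio_frames):
    """Map acoustic frames to a latent representation (here: a running mean)."""
    return sum(audio_frames) / len(audio_frames)

def toy_decoder(latent, tokens, vocab_size=4):
    """Predict the next token id from the latent and the tokens emitted so far."""
    return int(latent * 10 + len(tokens)) % vocab_size

def greedy_decode(audio_frames, eos=0, max_len=8):
    """Autoregressive greedy decoding: emit tokens until EOS or max_len."""
    latent = toy_encoder(audio_frames)
    tokens = []
    while len(tokens) != max_len:
        nxt = toy_decoder(latent, tokens)
        if nxt == eos and tokens:
            break  # stop on end-of-sequence once something has been emitted
        tokens.append(nxt)
    return tokens
```

Real decoders replace `toy_decoder` with a learned next-token distribution and often use beam search instead of greedy selection.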
&lt;h2 id="2-asr">2. ASR模型的解决方案分析&lt;/h2>
&lt;h3 id="21-whisperlargev3turbo">2.1 Whisper-large-v3-turbo&lt;/h3>
&lt;p>&lt;code>Whisper&lt;/code> is a pre-trained ASR model developed by OpenAI; its &lt;code>large-v3&lt;/code> and &lt;code>large-v3-turbo&lt;/code> versions are among the leading models in the industry.&lt;/p>
&lt;h4 id="211-whisper">2.1.1 The Design of Whisper&lt;/h4>
&lt;p>&lt;strong>Architecture modules&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Audio input (30s segment)&amp;quot;] --&amp;gt; B[&amp;quot;Log-Mel spectrogram&amp;quot;]
B --&amp;gt; C[&amp;quot;Transformer Encoder&amp;quot;]
C --&amp;gt; D[&amp;quot;Encoded Representation&amp;quot;]
D --&amp;gt; E[&amp;quot;Transformer Decoder&amp;quot;]
E --&amp;gt; F[&amp;quot;Predicted text tokens&amp;quot;]
subgraph &amp;quot;Multi-task processing&amp;quot;
E --&amp;gt; G[&amp;quot;Transcription&amp;quot;]
E --&amp;gt; H[&amp;quot;Translation&amp;quot;]
E --&amp;gt; I[&amp;quot;Language identification&amp;quot;]
end
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Features&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Large-scale weakly supervised training&lt;/strong>: Trained on 680,000 hours of multilingual, multi-task data covering a wide range of accents, background noise, and technical terminology.&lt;/li>
&lt;li>&lt;strong>End-to-end architecture&lt;/strong>: A single unified Transformer model maps audio directly to text, without an external language model or alignment module.&lt;/li>
&lt;li>&lt;strong>Multi-task capability&lt;/strong>: The model simultaneously handles multilingual transcription, speech translation, and language identification.&lt;/li>
&lt;li>&lt;strong>Robustness&lt;/strong>: Carefully designed data augmentation and data mixing make the model perform well under a variety of challenging conditions.&lt;/li>
&lt;li>&lt;strong>Turbo version&lt;/strong>: &lt;code>large-v3-turbo&lt;/code> is an optimized variant of &lt;code>large-v3&lt;/code> that improves inference speed and compute efficiency, with approximately 809M parameters.&lt;/li>
&lt;/ul>
&lt;h4 id="212-">2.1.2 解决的问题&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target problem&lt;/th>
&lt;th>Whisper's solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Poor generalization&lt;/td>
&lt;td>Large-scale pre-training on massive, diverse data covering nearly 100 languages.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Insufficient robustness&lt;/td>
&lt;td>Training data includes varied background noise, accents, and speaking styles, improving real-world performance.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Weak context modeling&lt;/td>
&lt;td>The Transformer architecture captures long-range dependencies in the audio signal.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Complex deployment&lt;/td>
&lt;td>Multiple model sizes (from &lt;code>tiny&lt;/code> to &lt;code>large&lt;/code>) plus open-source code and weights make community use and deployment easy.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="213-">2.1.3 生产缺陷分析&lt;/h4>
&lt;h5 id="2131-hallucination">2.1.3.1 &amp;ldquo;幻觉&amp;rdquo;（Hallucination）问题&lt;/h5>
&lt;ul>
&lt;li>On silent or noisy segments, the model sometimes generates meaningless or repetitive text, a common failure mode of large autoregressive models.&lt;/li>
&lt;li>The phenomenon is especially pronounced when processing long audio and may require additional post-processing logic to detect and filter it.&lt;/li>
&lt;/ul>
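One common mitigation is a lightweight post-processing filter over the transcript. The sketch below flags long runs of an identical token, a crude heuristic for the repetition loops just described; `looks_hallucinated` and its threshold are illustrative, not from any ASR toolkit, and production filters often also use no-speech probability and average log-probability thresholds.

```python
def looks_hallucinated(text, max_repeat=3):
    """Flag transcripts where one word repeats many times in a row.
    Illustrative heuristic only: catches adjacent-token repetition loops,
    not phrase-level loops or semantically implausible output."""
    words = text.split()
    run = 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_repeat:
            return True
    return False
```

A segment flagged this way can be dropped or re-decoded with different sampling settings.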
&lt;h5 id="2132-">2.1.3.2 时间戳精度有限&lt;/h5>
&lt;ul>
&lt;li>The model natively predicts segment-level timestamps (word-level timestamps require additional alignment), and their precision may not satisfy the strict requirements of applications such as subtitle alignment or speech editing.&lt;/li>
&lt;li>Timestamp accuracy degrades over long stretches of silence and in rapid speech.&lt;/li>
&lt;/ul>
&lt;h5 id="2133-">2.1.3.3 计算资源要求高&lt;/h5>
&lt;ul>
&lt;li>The &lt;code>large-v3&lt;/code> model has 1.55 billion parameters, and the &lt;code>turbo&lt;/code> version still has roughly 800 million; both demand substantial compute (especially GPU memory) and are unsuitable for running directly on edge devices.&lt;/li>
&lt;li>Optimizations such as quantization exist, but lowering resource consumption while preserving accuracy remains a challenge.&lt;/li>
&lt;/ul>
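The memory figures follow from a simple rule of thumb: weight memory is parameter count times bytes per weight. The helper below is only a back-of-the-envelope sketch; activations, the KV cache, and framework overhead add to this floor.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Approximate GiB needed just to hold the weights (fp16 = 2 bytes each).
    Activations, KV cache and framework overhead come on top of this."""
    return n_params * bytes_per_param / 1024**3

large_v3 = weight_memory_gb(1_550_000_000)  # ~2.9 GiB in fp16
```

Int8 quantization (`bytes_per_param=1`) halves this, which is why quantized variants are attractive for constrained deployments.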
&lt;h5 id="2134-">2.1.3.4 实时性瓶颈&lt;/h5>
&lt;ul>
&lt;li>The model processes audio in 30-second windows, so real-time streaming ASR requires complex sliding-window and caching mechanisms, which introduce extra latency.&lt;/li>
&lt;/ul>
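A minimal sketch of the sliding-window scheme just mentioned, assuming 30-second windows with 5 seconds of overlap so adjacent transcripts can be stitched and deduplicated; `window_offsets` and its defaults are illustrative, not an API of Whisper.

```python
def window_offsets(total_s, win_s=30.0, hop_s=25.0):
    """Start/end times (seconds) of sliding windows over a long recording.
    The overlap (win_s - hop_s) lets adjacent transcripts be stitched."""
    offsets, start = [], 0.0
    while True:
        end = min(start + win_s, total_s)
        offsets.append((start, end))
        if end >= total_s:
            break
        start += hop_s
    return offsets
```

Each window would be transcribed independently; the overlap region is where the stitching logic resolves disagreements between consecutive windows.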
&lt;h3 id="22-sensevoice">2.2 SenseVoice&lt;/h3>
&lt;p>&lt;code>SenseVoice&lt;/code> is a next-generation industrial-grade ASR model developed by the speech team at Alibaba DAMO Academy. Unlike &lt;code>Whisper&lt;/code>, which focuses on robust general-purpose transcription, &lt;code>SenseVoice&lt;/code> is designed with a stronger emphasis on versatility, real-time performance, and integration with downstream tasks.&lt;/p>
&lt;h4 id="221-sensevoice">2.2.1 The Design of SenseVoice&lt;/h4>
&lt;p>&lt;strong>Architecture modules&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Audio stream&amp;quot;] --&amp;gt; B[&amp;quot;FSMN-VAD (voice activity detection)&amp;quot;]
B --&amp;gt; C[&amp;quot;Encoder (e.g. SAN-M)&amp;quot;]
C --&amp;gt; D[&amp;quot;Latent Representation&amp;quot;]
D --&amp;gt; E[&amp;quot;Decoder&amp;quot;]
E --&amp;gt; F[&amp;quot;Text output&amp;quot;]
subgraph &amp;quot;Multi-task and control&amp;quot;
G[&amp;quot;Speaker diarization&amp;quot;] --&amp;gt; C
H[&amp;quot;Emotion recognition&amp;quot;] --&amp;gt; C
I[&amp;quot;Zero-shot TTS prompt&amp;quot;] --&amp;gt; E
end
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Features&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Unified end-to-end model&lt;/strong>: Integrates the acoustic model, the language model, and punctuation prediction, producing punctuated text from speech end to end.&lt;/li>
&lt;li>&lt;strong>Multi-task learning&lt;/strong>: Beyond speech recognition, the model simultaneously outputs speaker diarization and emotion information, and can even generate acoustic prompts for zero-shot TTS.&lt;/li>
&lt;li>&lt;strong>Unified streaming and non-streaming&lt;/strong>: A single architecture supports both streaming and non-streaming modes, covering real-time and offline scenarios.&lt;/li>
&lt;li>&lt;strong>TTS integration&lt;/strong>: A notable innovation of &lt;code>SenseVoice&lt;/code> is that its output can serve as a prompt for TTS models such as &lt;code>CosyVoice&lt;/code>, enabling voice cloning and transfer and closing the loop between ASR and TTS.&lt;/li>
&lt;/ul>
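The VAD stage in the architecture above can be illustrated with a toy energy gate. This is only a sketch: FSMN-VAD itself is a trained neural model, and `energy_vad` with its fixed threshold is a hypothetical simplification.

```python
def energy_vad(frames, threshold=0.01):
    """Toy voice activity detection: mark frames whose mean squared
    amplitude exceeds a fixed threshold. Real systems (e.g. FSMN-VAD)
    use a trained neural model instead of a fixed energy gate."""
    flags = []
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)
        flags.append(energy > threshold)
    return flags
```

Gating the encoder on active frames is what keeps a streaming pipeline from wasting compute, and from hallucinating text, on silence.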
&lt;h4 id="222-">2.2.2 解决的问题&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target problem&lt;/th>
&lt;th>SenseVoice's solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Single-task models are hard to integrate&lt;/td>
&lt;td>Designed as a multi-task model with native support for speaker diarization, emotion recognition, etc., simplifying the construction of dialogue systems.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Poor real-time performance&lt;/td>
&lt;td>Uses an efficient streaming architecture (e.g. SAN-M) combined with VAD to achieve low-latency real-time recognition.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Lack of synergy with downstream tasks&lt;/td>
&lt;td>The output carries rich metadata (speaker, emotion) and can generate TTS prompts, deeply coupling ASR with TTS.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Punctuation restoration relies on post-processing&lt;/td>
&lt;td>Punctuation prediction is a built-in task of the model, jointly modeling text and punctuation.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="223-">2.2.3 生产缺陷分析&lt;/h4>
&lt;h5 id="2231-">2.2.3.1 模型复杂度与维护&lt;/h5>
&lt;ul>
&lt;li>As a complex model integrating multiple functions, it has relatively high training and maintenance costs.&lt;/li>
&lt;li>Balancing the multiple tasks may require careful tuning to avoid degrading any single task.&lt;/li>
&lt;/ul>
&lt;h5 id="2232-">2.2.3.2 零样本能力的泛化性&lt;/h5>
&lt;ul>
&lt;li>Although it supports zero-shot TTS prompt generation, the quality and stability of its voice cloning on unseen speakers or in complex acoustic environments may fall short of dedicated voice-cloning models.&lt;/li>
&lt;/ul>
&lt;h5 id="2233-">2.2.3.3 开源生态与社区&lt;/h5>
&lt;ul>
&lt;li>Compared with &lt;code>Whisper&lt;/code>'s strong open-source community and rich ecosystem of tools, &lt;code>SenseVoice&lt;/code>, as an industrial-grade model, may offer more limited open-source availability and community support, which can hinder its adoption in academia and among developers.&lt;/li>
&lt;/ul>
&lt;h2 id="3-">3. 总结&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Whisper&lt;/strong>: Through large-scale weakly supervised learning, it has pushed the robustness and generalization of ASR to new heights. It is a powerful &lt;strong>general-purpose speech recognizer&lt;/strong>, particularly suited to diverse, uncontrolled audio data. Its design philosophy is &amp;ldquo;trading scale for performance&amp;rdquo;, and it excels in zero-shot and multilingual scenarios.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>SenseVoice&lt;/strong>: Represents the trend of ASR toward &lt;strong>multi-functional, integrated&lt;/strong> systems. It is not merely a recognizer but a &lt;strong>perception front end for conversational intelligence&lt;/strong>, designed to feed downstream tasks (such as dialogue systems and TTS) richer, more real-time input. Its design philosophy is &amp;ldquo;fusion and synergy&amp;rdquo;, emphasizing ASR's pivotal role in the overall interaction pipeline.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Overall, &lt;code>Whisper&lt;/code> defines the performance baseline of modern ASR, while &lt;code>SenseVoice&lt;/code> explores broader possibilities for ASR in industrial applications. Future ASR technology will likely combine the two: &lt;code>Whisper&lt;/code>'s robustness and generalization together with &lt;code>SenseVoice&lt;/code>'s multi-task synergy and real-time processing.&lt;/p></description></item><item><title>2020: Assessing the Funniness of Edited News Headlines</title><link>https://ziyanglin.netlify.app/zh/project/my2020_nlp_funniness_estimation/</link><pubDate>Mon, 27 Jul 2020 04:56:23 +0100</pubDate><guid>https://ziyanglin.netlify.app/zh/project/my2020_nlp_funniness_estimation/</guid><description>&lt;p>This project develops potential solutions for the tasks posed by the competition
&lt;a href="https://competitions.codalab.org/competitions/20970#learn_the_details" title="competition">&lt;code>Assessing the Funniness of Edited News Headlines (SemEval-2020)&lt;/code>&lt;/a> on the platform &lt;a href="https://competitions.codalab.org" title="competition">CodaLab&lt;/a>.&lt;/p>
&lt;p>As of July 26, 2020, my trained model (&amp;lsquo;bert-base-uncased&amp;rsquo; from the &lt;a href="https://huggingface.co/transformers/index.html" title="huggingface">Huggingface transformers&lt;/a>) &lt;code>ranked third&lt;/code> on the
Post Evaluation Task 1 leaderboard on CodaLab.&lt;/p>
&lt;p align="center">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task1_ranking.png" width="600" />
&lt;/p>
&lt;hr>
&lt;h2 id="contents">Contents&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="#tasks-description">Tasks Description&lt;/a>&lt;/li>
&lt;li>&lt;a href="#data-preprocessing">Data Preprocessing&lt;/a>&lt;/li>
&lt;li>&lt;a href="#models-choices--design">Models Choices &amp;amp; Design&lt;/a>&lt;/li>
&lt;li>&lt;a href="#design-of-training-processes-for-task-two-only">Design of Training Processes (for task two only)&lt;/a>&lt;/li>
&lt;li>&lt;a href="#optimizer--learning-rate-scheduler">Optimizer &amp;amp; Learning Rate Scheduler&lt;/a>&lt;/li>
&lt;li>&lt;a href="#prime-hyperparameters">Prime Hyperparameters&lt;/a>&lt;/li>
&lt;li>&lt;a href="#results">Results&lt;/a>&lt;/li>
&lt;li>&lt;a href="#discussion">Discussion&lt;/a>&lt;/li>
&lt;li>&lt;a href="#prospective">Prospective&lt;/a>&lt;/li>
&lt;li>&lt;a href="#License">License&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="tasks-description">Tasks Description&lt;/h2>
&lt;ul>
&lt;li>&lt;code>Task one&lt;/code> - Given one edited headline, design a regression model to predict how funny it is&lt;/li>
&lt;li>&lt;code>Task two&lt;/code> - Given the original headline and two manually edited versions, design a model to predict which edited version is the funnier of the two&lt;/li>
&lt;/ul>
&lt;h2 id="data-preprocessing">Data Preprocessing&lt;/h2>
&lt;h3 id="task-one">Task One&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Convert original headlines into normal sentences (Remove &lt;code>&amp;lt;&lt;/code> and &lt;code>/&amp;gt;&lt;/code> by applying RE)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Get the edited version of headlines by doing word substitution using RE&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Do tokenization and lowercasing for each edited-original headline pair&lt;/p>
&lt;p>Data preprocessing for pre-trained LMs (BERT-like LMs):&lt;/p>
&lt;ul>
&lt;li>Version 1 - Concatenate original headlines and new headlines&lt;/li>
&lt;li>Version 2 - Concatenate new headlines and new words&lt;/li>
&lt;li>Version 3 – Contain only new headlines&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
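The three input layouts above can be sketched as plain string packing. `build_lm_input` and the explicit `[CLS]`/`[SEP]` strings are illustrative only; in practice the Huggingface tokenizer inserts the special tokens itself when given one or two sequences.

```python
def build_lm_input(version, original, edited, new_word=None):
    """Assemble the three BERT-style input layouts described above.
    Illustrative helper; real code lets the tokenizer add [CLS]/[SEP]."""
    if version == 1:   # Version 1: original headline + edited headline
        return "[CLS] " + original + " [SEP] " + edited + " [SEP]"
    if version == 2:   # Version 2: edited headline + the substituted word
        return "[CLS] " + edited + " [SEP] " + new_word + " [SEP]"
    return "[CLS] " + edited + " [SEP]"  # Version 3: edited headline only
```

Packing the pair into one sequence lets the LM's self-attention compare the original and edited wording directly.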
&lt;h3 id="task-two">Task Two&lt;/h3>
&lt;p>There are 3 versions of data preprocessing:&lt;/p>
&lt;ul>
&lt;li>The normal version&lt;/li>
&lt;li>The headlines truncated version&lt;/li>
&lt;li>The punctuation removal version&lt;/li>
&lt;/ul>
&lt;h2 id="models-choices--design">Models Choices &amp;amp; Design&lt;/h2>
&lt;h3 id="task-one1">Task One&lt;/h3>
&lt;ul>
&lt;li>Two Inputs FFNN&lt;/li>
&lt;li>Two Inputs CNN&lt;/li>
&lt;li>Two Inputs RNN&lt;/li>
&lt;li>Two Inputs Concatenated RNN&lt;/li>
&lt;li>Pre-trained LM + a regression layer (LMs applied: BERT, ALBERT, XLNet, ELECTRA)&lt;/li>
&lt;/ul>
&lt;h4 id="two-inputs-ffnn">Two Inputs FFNN&lt;/h4>
&lt;p>This model is a two-input feed-forward neural network. Two input matrices, representing all the original headlines and their corresponding edited headlines, are passed simultaneously to the model's first (embedding) layer, which produces a fixed-dimension word embedding for each word in a headline. The model then averages the word embeddings of each headline to obtain a per-headline ‘document representation’ vector. These vector representations are passed through three stacked fully connected layers that encode information about how humorous the headlines are. A ReLU activation follows each of the first two hidden layers to mitigate vanishing and exploding gradients. Finally, the row-wise weighted sums (vector products) between the n-th row of the original matrix and the n-th row of the edited matrix are computed, returning a vector of size (origin_headlines_num, 1).&lt;/p>
&lt;p align="center">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/two_inputs_FFNN.png" width="700" />
&lt;/p>
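The row-wise output layer shared by these two-input models can be sketched in a few lines. This is a pure-Python stand-in for the batched tensor operation; `rowwise_dot` is an illustrative name, not from the project code.

```python
def rowwise_dot(original_mat, edited_mat):
    """Output layer described above: dot the n-th row of the
    original-headline matrix with the n-th row of the edited-headline
    matrix, giving one funniness score per headline pair."""
    scores = []
    for o_row, e_row in zip(original_mat, edited_mat):
        scores.append(sum(o * e for o, e in zip(o_row, e_row)))
    return scores
```

Unlike a full matrix multiplication, this pairs each original headline only with its own edited version, which is exactly the supervision structure of the task.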
&lt;h4 id="two-inputs-cnn">Two Inputs CNN&lt;/h4>
&lt;p>This model uses a text-CNN architecture with a single window size, instead of an FFNN, for the regression task. The original-headlines tensor and the edited-headlines tensor are taken as the two inputs. In the output layer, unlike a normal matrix multiplication, the row-wise weighted sums (vector products) between the n-th row of the original matrix and the n-th row of the edited matrix are computed, returning a vector of size (origin_headlines_num, 1).&lt;/p>
&lt;p align="center">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/two_inputs_cnn.png" width="500" />
&lt;/p>
&lt;h4 id="two-inputs-rnn">Two Inputs RNN&lt;/h4>
&lt;p>This model uses a single-layer bidirectional RNN architecture for the regression task. Like the Two Inputs CNN, it takes two tensors as inputs and performs a row-wise weighted summation in the output layer.&lt;/p>
&lt;p align="center">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/two_inputs_rnn.png" width="500" />
&lt;/p>
&lt;h4 id="two-inputs-concatenated-rnn">Two Inputs Concatenated RNN&lt;/h4>
&lt;p>This model is identical to the Two Inputs RNN except that it concatenates the two last hidden states, one for the original headlines and one for the edited headlines, into a single representation and performs a normal matrix multiplication in the output layer.&lt;/p>
&lt;h4 id="pretrained-lm--a-regression-layer-lms-applied-bert-albert-xlnet-electra">Pre-trained LM + a regression layer (LMs applied: BERT, ALBERT, XLNet, ELECTRA)&lt;/h4>
&lt;h5 id="version-1--concatenate-original-headlines-and-new-headlines">Version 1 - Concatenate original headlines and new headlines&lt;/h5>
&lt;p align="left">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/lm_inputs_version_1.png" width="600" />
&lt;/p>
&lt;h5 id="version-2--concatenate-new-headlines-and-new-words">Version 2 - Concatenate new headlines and new words&lt;/h5>
&lt;p align="left">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/lm_inputs_version_2.png" width="600" />
&lt;/p>
&lt;h5 id="version-3--contain-only-new-headlines">Version 3 – Contain only new headlines&lt;/h5>
&lt;p align="left">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/lm_inputs_version_3.png" width="600" />
&lt;/p>
&lt;h3 id="task-two1">Task Two&lt;/h3>
&lt;h4 id="pretrained-lm--a-classification-layer">Pre-trained LM + a classification layer&lt;/h4>
&lt;h5 id="concatenate-edited-headline-1-and-edited-headline-2">Concatenate edited headline 1 and edited headline 2&lt;/h5>
&lt;p align="left">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/2_seq_inputs_lm.png" width="650" />
&lt;/p>
&lt;h2 id="design-of-training-processes-for-task-two-only">Design of Training Processes (for task two only)&lt;/h2>
&lt;h3 id="version-1">Version 1:&lt;/h3>
&lt;ul>
&lt;li>Training the model “Pre-trained LM + a classification layer”
directly on the real classification task&lt;/li>
&lt;/ul>
&lt;h3 id="version-2-fake-task--real-task">Version 2 (Fake Task + Real Task):&lt;/h3>
&lt;ul>
&lt;li>First, train the model “Pre-trained LM + a regression layer” on a fake regression task over the training dataset&lt;/li>
&lt;li>Once trained, remove the regression layer and add a freshly initialized classification layer on top of the pre-trained LM&lt;/li>
&lt;li>Finally, train the model on the real classification task&lt;/li>
&lt;/ul>
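The head swap in Version 2 can be sketched abstractly: the fine-tuned body is kept while only the task head is replaced. The `LMWithHead` class and its fields are purely illustrative, not the project's actual PyTorch modules.

```python
class LMWithHead:
    """Toy stand-in for 'pre-trained LM + task head'. swap_head() mirrors
    the Version 2 procedure: keep the trained body, re-initialise the head."""
    def __init__(self, head):
        self.body_weights = [0.5, 0.5]  # pretend these were fine-tuned
        self.head = head

    def swap_head(self, new_head):
        self.head = new_head            # body_weights are kept as-is
        return self

model = LMWithHead(head="regression")
model.swap_head("classification")
```

The hope behind this design is that the fake regression task leaves funniness-related knowledge in the body before the classification head is trained.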
&lt;h2 id="optimizer--learning-rate-scheduler">Optimizer &amp;amp; Learning Rate Scheduler&lt;/h2>
&lt;h3 id="for-ffnn-cnn-rnn">For FFNN, CNN, RNN:&lt;/h3>
&lt;ul>
&lt;li>The optimizer &lt;code>AdamW&lt;/code> and the scheduler &lt;code>CosineAnnealingLR&lt;/code> provided by PyTorch&lt;/li>
&lt;/ul>
&lt;h3 id="for-pretrained-lms-bertliked-lms">For pre-trained LMs (BERT-liked LMs):&lt;/h3>
&lt;ul>
&lt;li>The optimizer &lt;code>AdamW&lt;/code> and the scheduler &lt;code>get_linear_schedule_with_warmup&lt;/code> from &lt;a href="https://huggingface.co/transformers/index.html" title="huggingface">Huggingface transformers&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="prime-hyperparameters">Prime Hyperparameters&lt;/h2>
&lt;ul>
&lt;li>Learning Rate&lt;/li>
&lt;li>Fine-tuning Rate&lt;/li>
&lt;li>Adam Epsilon&lt;/li>
&lt;li>Weight Decay&lt;/li>
&lt;li>Warmup Ratio&lt;/li>
&lt;li>Number of Steps&lt;/li>
&lt;/ul>
&lt;h2 id="results">Results&lt;/h2>
&lt;h3 id="task-one2">Task One&lt;/h3>
&lt;h4 id="best-performance-achieved-by-two-inputs-ffnn">Best performance achieved by Two Inputs FFNN&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>EPOCHS&lt;/th>
&lt;th>LRATE&lt;/th>
&lt;th>EMBEDDING_DIM&lt;/th>
&lt;th>HIDDEN_DIM_1&lt;/th>
&lt;th>HIDDEN_DIM_2&lt;/th>
&lt;th>HIDDEN_DIM_3&lt;/th>
&lt;th>Train Loss&lt;/th>
&lt;th>Val. Loss&lt;/th>
&lt;th>Test Loss&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>100&lt;/td>
&lt;td>0.145&lt;/td>
&lt;td>300&lt;/td>
&lt;td>100&lt;/td>
&lt;td>50&lt;/td>
&lt;td>10&lt;/td>
&lt;td>0.575&lt;/td>
&lt;td>0.581&lt;/td>
&lt;td>0.576&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="best-performance-achieved-by-two-inputs-cnn">Best performance achieved by Two Inputs CNN&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>EPOCHS&lt;/th>
&lt;th>LRATE&lt;/th>
&lt;th>EMBEDDING_DIM&lt;/th>
&lt;th>FC_OUT_DIM&lt;/th>
&lt;th>N_OUT_CHANNELS&lt;/th>
&lt;th>WINDOW_SIZE&lt;/th>
&lt;th>DROPOUT&lt;/th>
&lt;th>Train Loss&lt;/th>
&lt;th>Val. Loss&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>500&lt;/td>
&lt;td>5e-3&lt;/td>
&lt;td>50&lt;/td>
&lt;td>25&lt;/td>
&lt;td>100&lt;/td>
&lt;td>3&lt;/td>
&lt;td>0.7&lt;/td>
&lt;td>0.624&lt;/td>
&lt;td>0.661&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="best-performance-achieved-by-two-inputs-rnn">Best performance achieved by Two Inputs RNN&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>EPOCHS&lt;/th>
&lt;th>LRATE&lt;/th>
&lt;th>EMBEDDING_DIM&lt;/th>
&lt;th>HIDDEN_DIM&lt;/th>
&lt;th>FC_OUTPUT_DIM&lt;/th>
&lt;th>BIDIRECTIONAL&lt;/th>
&lt;th>DROPOUT&lt;/th>
&lt;th>Train Loss&lt;/th>
&lt;th>Val. Loss&lt;/th>
&lt;th>Test Loss&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>30&lt;/td>
&lt;td>1e-4&lt;/td>
&lt;td>50&lt;/td>
&lt;td>128&lt;/td>
&lt;td>32&lt;/td>
&lt;td>True&lt;/td>
&lt;td>0.3&lt;/td>
&lt;td>0.586&lt;/td>
&lt;td>0.576&lt;/td>
&lt;td>0.571&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="best-performance-achieved-by-pretrained-lms">Best performance achieved by Pre-trained LMs&lt;/h4>
&lt;ul>
&lt;li>Without Data Augmentation
&lt;ul>
&lt;li>Model: bert_base_uncased&lt;/li>
&lt;li>Inputs structure: new headlines + new words&lt;/li>
&lt;li>Test loss: 0.52937&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>With Data Augmentation (add “funlines” training dataset)
&lt;ul>
&lt;li>Model: bert_base_uncased&lt;/li>
&lt;li>Inputs structure: new headlines + new words&lt;/li>
&lt;li>&lt;code>Test loss: 0.52054 (Best performance achieved among all trials)&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task1_log.png" alt="task1_log">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T1 Pre-trained LMs Log&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="task-two2">Task Two&lt;/h3>
&lt;h4 id="version-1-straightly-training-the-model-for-the-real-task">Version 1: Straightly training the model for the real task&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task2_v1_log1.png" alt="task2_v1_log1">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T2 Log 1&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task2_v1_log2.png" alt="task2_v1_log2">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T2 Log 2&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="version-2-fake-task-training--real-task-training">Version 2: Fake Task Training + Real Task Training&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task2_v2_f_log.png" alt="task2_v2_f_log">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T2 Fake Task Log&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task2_v2_r_log.png" alt="task2_v2_r_log">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T2 Real Task Log&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;h3 id="task-one3">Task One&lt;/h3>
&lt;ul>
&lt;li>The Two Inputs RNN performs only slightly better than the Two Inputs FFNN (0.5759702196 vs. 0.5751694002), while its time complexity is much higher; in that sense the current version of the Two Inputs RNN wastes resources.&lt;/li>
&lt;li>The Two Inputs CNN with a single window size performs worse than the Two Inputs FFNN and the Two Inputs RNN; one possible reason is that it looks at only one n-gram size and hence ignores n-grams of other lengths.&lt;/li>
&lt;/ul>
&lt;h3 id="task-two3">Task Two&lt;/h3>
&lt;ul>
&lt;li>Among the preprocessing methods, the truncated-headlines version and the punctuation-removal version perform the same as the normal one, except that truncating headlines reduces the training time per epoch.&lt;/li>
&lt;li>Overfitting on the training dataset is hard to overcome when applying BERT-like pre-trained LMs (although several methods, such as data augmentation, weight decay, and increased dropout, were tried to mitigate the problem).&lt;/li>
&lt;li>Surprisingly, the fake-task training for pre-trained LMs does not improve the model's performance on the real task at all.&lt;/li>
&lt;li>With the same hyperparameter settings for a given task, the most recently proposed pre-trained LM is not necessarily the best performer.&lt;/li>
&lt;/ul>
&lt;h2 id="prospective">Prospective&lt;/h2>
&lt;ul>
&lt;li>Construct a pre-trained LM for a binary classification task in which the model learns to decide whether a word in an edited headline is original or edited. Take the embeddings from this pre-trained model and use them to initialize the model for the real regression task, so that the embeddings carry some knowledge about the relationship between original and edited headlines.&lt;/li>
&lt;li>Build a pre-trained LM for a text translation task on the training dataset and use its embeddings to initialize the model for the real regression task (aiming to learn the semantics of funniness).&lt;/li>
&lt;li>Intuitively, the performance of the Two Inputs CNN might be improved by increasing the number of window sizes (different n-gram filters).&lt;/li>
&lt;li>Apply the pre-trained LM Longformer rather than other BERT-like models to task two; Longformer's ‘global attention mask’ can probably better model the relationship between the edited word and the other words in a headline (e.g. &lt;code>How important is the edited word for the whole headline in order to make it funnier?&lt;/code> / &lt;code>How does the edited word contribute to the meaning of the whole sentence?&lt;/code>).&lt;/li>
&lt;/ul>
&lt;h2 id="license">License&lt;/h2>
&lt;p>This project follows the MIT License, as written in the &lt;a href="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/LICENSE">LICENSE&lt;/a> file.&lt;/p>
&lt;hr></description></item></channel></rss>