<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Natural Language Processing | Ziyang Lin's Personal Website</title><link>https://ziyanglin.netlify.app/zh/tags/%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86/</link><atom:link href="https://ziyanglin.netlify.app/zh/tags/%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86/index.xml" rel="self" type="application/rss+xml"/><description>Natural Language Processing</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>zh-Hans</language><lastBuildDate>Sat, 28 Jun 2025 13:00:00 +0000</lastBuildDate><image><url>https://ziyanglin.netlify.app/img/icon-192.png</url><title>Natural Language Processing</title><link>https://ziyanglin.netlify.app/zh/tags/%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86/</link></image><item><title>Modern ASR Technology Explained: From Traditional Models to the New LLM-Driven Paradigm</title><link>https://ziyanglin.netlify.app/zh/post/asr-technology-overview/</link><pubDate>Sat, 28 Jun 2025 13:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/zh/post/asr-technology-overview/</guid><description>&lt;h2 id="1-">1. Background&lt;/h2>
&lt;h3 id="11-asr">1.1 传统ASR模型的痛点&lt;/h3>
&lt;p>传统的自动语音识别（ASR）模型，如基于隐马尔可夫模型-高斯混合模型（HMM-GMM）或深度神经网络（DNN）的模型，在特定领域和受控环境下表现良好，但面临诸多挑战：&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Data sparsity&lt;/strong>: Heavy reliance on large, high-quality labeled datasets leads to poor generalization to low-resource languages and specific accents.&lt;/li>
&lt;li>&lt;strong>Insufficient robustness&lt;/strong>: Performance degrades sharply in real-world scenarios such as noisy environments, far-field pickup, and multi-speaker conversations.&lt;/li>
&lt;li>&lt;strong>Lack of contextual understanding&lt;/strong>: Models are usually limited to a direct mapping from acoustic features to text, with no grasp of long-range context, semantics, or speaker intent, which leads to recognition errors (e.g. homophone confusion).&lt;/li>
&lt;li>&lt;strong>Limited multi-task capability&lt;/strong>: Traditional models are typically single-task, supporting only transcription and unable to also perform speaker diarization, language identification, translation, and similar tasks.&lt;/li>
&lt;/ol>
&lt;h3 id="12-llm-asr-">1.2 大语言模型（LLM）驱动的 ASR 新范式&lt;/h3>
&lt;p>近年来，以 &lt;code>Whisper&lt;/code> 为代表的端到端大型 ASR 模型，通过在海量、多样化的无监督或弱监督数据上进行预训练，展现了前所未有的鲁棒性和泛化能力。这些模型通常采用 Encoder-Decoder 架构，将 ASR 任务视为一个序列到序列的翻译问题。&lt;/p>
&lt;p>&lt;strong>典型流程&lt;/strong>：&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Raw audio waveform&amp;quot;] --&amp;gt; B[&amp;quot;Feature extraction (e.g. log-Mel spectrogram)&amp;quot;]
B --&amp;gt; C[&amp;quot;Transformer Encoder&amp;quot;]
C --&amp;gt; D[&amp;quot;Latent Representation&amp;quot;]
D --&amp;gt; E[&amp;quot;Transformer Decoder&amp;quot;]
E --&amp;gt; F[&amp;quot;Output text sequence&amp;quot;]
&lt;/code>&lt;/pre>
&lt;p>This approach not only simplifies the complex pipeline of traditional ASR, but also learns rich acoustic and linguistic knowledge from large-scale data, achieving excellent performance even in zero-shot scenarios.&lt;/p>
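The sequence-to-sequence flow above can be sketched with a toy autoregressive loop. `toy_encoder`, `toy_decoder`, and `greedy_decode` are hypothetical stand-ins, not part of any real ASR library; an actual system would encode log-Mel frames with a Transformer and score a large token vocabulary.

```python
# Minimal sketch of the encoder-decoder ASR loop described above.
# All functions are illustrative stand-ins for the real Transformer modules.

def toy_encoder(audio_frames):
    """Map acoustic frames to a latent representation (here: a running mean)."""
    return sum(audio_frames) / len(audio_frames)

def toy_decoder(latent, tokens, vocab_size=4):
    """Predict the next token id from the latent and the tokens emitted so far."""
    return int(latent * 10 + len(tokens)) % vocab_size

def greedy_decode(audio_frames, eos=0, max_len=8):
    """Autoregressive greedy decoding: emit tokens until EOS or max_len."""
    latent = toy_encoder(audio_frames)
    tokens = []
    while len(tokens) != max_len:
        nxt = toy_decoder(latent, tokens)
        if nxt == eos and tokens:
            break  # stop on end-of-sequence once something has been emitted
        tokens.append(nxt)
    return tokens
```

Real decoders replace `toy_decoder` with a learned next-token distribution and often use beam search instead of greedy selection.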
&lt;h2 id="2-asr">2. ASR模型的解决方案分析&lt;/h2>
&lt;h3 id="21-whisperlargev3turbo">2.1 Whisper-large-v3-turbo&lt;/h3>
&lt;p>&lt;code>Whisper&lt;/code> is a pre-trained ASR model developed by OpenAI; its &lt;code>large-v3&lt;/code> and &lt;code>large-v3-turbo&lt;/code> versions are among the leading models in the industry.&lt;/p>
&lt;h4 id="211-whisper">2.1.1 The Design of Whisper&lt;/h4>
&lt;p>&lt;strong>Architecture modules&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Audio input (30s segment)&amp;quot;] --&amp;gt; B[&amp;quot;Log-Mel spectrogram&amp;quot;]
B --&amp;gt; C[&amp;quot;Transformer Encoder&amp;quot;]
C --&amp;gt; D[&amp;quot;Encoded Representation&amp;quot;]
D --&amp;gt; E[&amp;quot;Transformer Decoder&amp;quot;]
E --&amp;gt; F[&amp;quot;Predicted text tokens&amp;quot;]
subgraph &amp;quot;Multi-task processing&amp;quot;
E --&amp;gt; G[&amp;quot;Transcription&amp;quot;]
E --&amp;gt; H[&amp;quot;Translation&amp;quot;]
E --&amp;gt; I[&amp;quot;Language identification&amp;quot;]
end
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Features&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Large-scale weakly supervised training&lt;/strong>: Trained on 680,000 hours of multilingual, multi-task data covering a wide range of accents, background noise, and technical terminology.&lt;/li>
&lt;li>&lt;strong>End-to-end architecture&lt;/strong>: A single unified Transformer model maps audio directly to text, without an external language model or alignment module.&lt;/li>
&lt;li>&lt;strong>Multi-task capability&lt;/strong>: The model simultaneously handles multilingual transcription, speech translation, and language identification.&lt;/li>
&lt;li>&lt;strong>Robustness&lt;/strong>: Carefully designed data augmentation and data mixing make the model perform well under a variety of challenging conditions.&lt;/li>
&lt;li>&lt;strong>Turbo version&lt;/strong>: &lt;code>large-v3-turbo&lt;/code> is an optimized variant of &lt;code>large-v3&lt;/code> that improves inference speed and compute efficiency, with approximately 809M parameters.&lt;/li>
&lt;/ul>
&lt;h4 id="212-">2.1.2 解决的问题&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target problem&lt;/th>
&lt;th>Whisper's solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Poor generalization&lt;/td>
&lt;td>Large-scale pre-training on massive, diverse data covering nearly 100 languages.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Insufficient robustness&lt;/td>
&lt;td>Training data includes varied background noise, accents, and speaking styles, improving real-world performance.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Weak context modeling&lt;/td>
&lt;td>The Transformer architecture captures long-range dependencies in the audio signal.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Complex deployment&lt;/td>
&lt;td>Multiple model sizes (from &lt;code>tiny&lt;/code> to &lt;code>large&lt;/code>) plus open-source code and weights make community use and deployment easy.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="213-">2.1.3 生产缺陷分析&lt;/h4>
&lt;h5 id="2131-hallucination">2.1.3.1 &amp;ldquo;幻觉&amp;rdquo;（Hallucination）问题&lt;/h5>
&lt;ul>
&lt;li>On silent or noisy segments, the model sometimes generates meaningless or repetitive text, a common failure mode of large autoregressive models.&lt;/li>
&lt;li>The phenomenon is especially pronounced when processing long audio and may require additional post-processing logic to detect and filter it.&lt;/li>
&lt;/ul>
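One common mitigation is a lightweight post-processing filter over the transcript. The sketch below flags long runs of an identical token, a crude heuristic for the repetition loops just described; `looks_hallucinated` and its threshold are illustrative, not from any ASR toolkit, and production filters often also use no-speech probability and average log-probability thresholds.

```python
def looks_hallucinated(text, max_repeat=3):
    """Flag transcripts where one word repeats many times in a row.
    Illustrative heuristic only: catches adjacent-token repetition loops,
    not phrase-level loops or semantically implausible output."""
    words = text.split()
    run = 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_repeat:
            return True
    return False
```

A segment flagged this way can be dropped or re-decoded with different sampling settings.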
&lt;h5 id="2132-">2.1.3.2 时间戳精度有限&lt;/h5>
&lt;ul>
&lt;li>The model natively predicts segment-level timestamps (word-level timestamps require additional alignment), and their precision may not satisfy the strict requirements of applications such as subtitle alignment or speech editing.&lt;/li>
&lt;li>Timestamp accuracy degrades over long stretches of silence and in rapid speech.&lt;/li>
&lt;/ul>
&lt;h5 id="2133-">2.1.3.3 计算资源要求高&lt;/h5>
&lt;ul>
&lt;li>The &lt;code>large-v3&lt;/code> model has 1.55 billion parameters, and the &lt;code>turbo&lt;/code> version still has roughly 800 million; both demand substantial compute (especially GPU memory) and are unsuitable for running directly on edge devices.&lt;/li>
&lt;li>Optimizations such as quantization exist, but lowering resource consumption while preserving accuracy remains a challenge.&lt;/li>
&lt;/ul>
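The memory figures follow from a simple rule of thumb: weight memory is parameter count times bytes per weight. The helper below is only a back-of-the-envelope sketch; activations, the KV cache, and framework overhead add to this floor.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Approximate GiB needed just to hold the weights (fp16 = 2 bytes each).
    Activations, KV cache and framework overhead come on top of this."""
    return n_params * bytes_per_param / 1024**3

large_v3 = weight_memory_gb(1_550_000_000)  # ~2.9 GiB in fp16
```

Int8 quantization (`bytes_per_param=1`) halves this, which is why quantized variants are attractive for constrained deployments.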
&lt;h5 id="2134-">2.1.3.4 实时性瓶颈&lt;/h5>
&lt;ul>
&lt;li>The model processes audio in 30-second windows, so real-time streaming ASR requires complex sliding-window and caching mechanisms, which introduce extra latency.&lt;/li>
&lt;/ul>
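A minimal sketch of the sliding-window scheme just mentioned, assuming 30-second windows with 5 seconds of overlap so adjacent transcripts can be stitched and deduplicated; `window_offsets` and its defaults are illustrative, not an API of Whisper.

```python
def window_offsets(total_s, win_s=30.0, hop_s=25.0):
    """Start/end times (seconds) of sliding windows over a long recording.
    The overlap (win_s - hop_s) lets adjacent transcripts be stitched."""
    offsets, start = [], 0.0
    while True:
        end = min(start + win_s, total_s)
        offsets.append((start, end))
        if end >= total_s:
            break
        start += hop_s
    return offsets
```

Each window would be transcribed independently; the overlap region is where the stitching logic resolves disagreements between consecutive windows.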
&lt;h3 id="22-sensevoice">2.2 SenseVoice&lt;/h3>
&lt;p>&lt;code>SenseVoice&lt;/code> is a next-generation industrial-grade ASR model developed by the speech team at Alibaba DAMO Academy. Unlike &lt;code>Whisper&lt;/code>, which focuses on robust general-purpose transcription, &lt;code>SenseVoice&lt;/code> is designed with a stronger emphasis on versatility, real-time performance, and integration with downstream tasks.&lt;/p>
&lt;h4 id="221-sensevoice">2.2.1 The Design of SenseVoice&lt;/h4>
&lt;p>&lt;strong>Architecture modules&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Audio stream&amp;quot;] --&amp;gt; B[&amp;quot;FSMN-VAD (voice activity detection)&amp;quot;]
B --&amp;gt; C[&amp;quot;Encoder (e.g. SAN-M)&amp;quot;]
C --&amp;gt; D[&amp;quot;Latent Representation&amp;quot;]
D --&amp;gt; E[&amp;quot;Decoder&amp;quot;]
E --&amp;gt; F[&amp;quot;Text output&amp;quot;]
subgraph &amp;quot;Multi-task and control&amp;quot;
G[&amp;quot;Speaker diarization&amp;quot;] --&amp;gt; C
H[&amp;quot;Emotion recognition&amp;quot;] --&amp;gt; C
I[&amp;quot;Zero-shot TTS prompt&amp;quot;] --&amp;gt; E
end
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Features&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Unified end-to-end model&lt;/strong>: Integrates the acoustic model, the language model, and punctuation prediction, producing punctuated text from speech end to end.&lt;/li>
&lt;li>&lt;strong>Multi-task learning&lt;/strong>: Beyond speech recognition, the model simultaneously outputs speaker diarization and emotion information, and can even generate acoustic prompts for zero-shot TTS.&lt;/li>
&lt;li>&lt;strong>Unified streaming and non-streaming&lt;/strong>: A single architecture supports both streaming and non-streaming modes, covering real-time and offline scenarios.&lt;/li>
&lt;li>&lt;strong>TTS integration&lt;/strong>: A notable innovation of &lt;code>SenseVoice&lt;/code> is that its output can serve as a prompt for TTS models such as &lt;code>CosyVoice&lt;/code>, enabling voice cloning and transfer and closing the loop between ASR and TTS.&lt;/li>
&lt;/ul>
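The VAD stage in the architecture above can be illustrated with a toy energy gate. This is only a sketch: FSMN-VAD itself is a trained neural model, and `energy_vad` with its fixed threshold is a hypothetical simplification.

```python
def energy_vad(frames, threshold=0.01):
    """Toy voice activity detection: mark frames whose mean squared
    amplitude exceeds a fixed threshold. Real systems (e.g. FSMN-VAD)
    use a trained neural model instead of a fixed energy gate."""
    flags = []
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)
        flags.append(energy > threshold)
    return flags
```

Gating the encoder on active frames is what keeps a streaming pipeline from wasting compute, and from hallucinating text, on silence.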
&lt;h4 id="222-">2.2.2 解决的问题&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target problem&lt;/th>
&lt;th>SenseVoice's solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Single-task models are hard to integrate&lt;/td>
&lt;td>Designed as a multi-task model with native support for speaker diarization, emotion recognition, etc., simplifying the construction of dialogue systems.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Poor real-time performance&lt;/td>
&lt;td>Uses an efficient streaming architecture (e.g. SAN-M) combined with VAD to achieve low-latency real-time recognition.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Lack of synergy with downstream tasks&lt;/td>
&lt;td>The output carries rich metadata (speaker, emotion) and can generate TTS prompts, deeply coupling ASR with TTS.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Punctuation restoration relies on post-processing&lt;/td>
&lt;td>Punctuation prediction is a built-in task of the model, jointly modeling text and punctuation.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="223-">2.2.3 生产缺陷分析&lt;/h4>
&lt;h5 id="2231-">2.2.3.1 模型复杂度与维护&lt;/h5>
&lt;ul>
&lt;li>As a complex model integrating multiple functions, it has relatively high training and maintenance costs.&lt;/li>
&lt;li>Balancing the multiple tasks may require careful tuning to avoid degrading any single task.&lt;/li>
&lt;/ul>
&lt;h5 id="2232-">2.2.3.2 零样本能力的泛化性&lt;/h5>
&lt;ul>
&lt;li>Although it supports zero-shot TTS prompt generation, the quality and stability of its voice cloning on unseen speakers or in complex acoustic environments may fall short of dedicated voice-cloning models.&lt;/li>
&lt;/ul>
&lt;h5 id="2233-">2.2.3.3 开源生态与社区&lt;/h5>
&lt;ul>
&lt;li>Compared with &lt;code>Whisper&lt;/code>'s strong open-source community and rich ecosystem of tools, &lt;code>SenseVoice&lt;/code>, as an industrial-grade model, may offer more limited open-source availability and community support, which can hinder its adoption in academia and among developers.&lt;/li>
&lt;/ul>
&lt;h2 id="3-">3. 总结&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Whisper&lt;/strong>: Through large-scale weakly supervised learning, it has pushed the robustness and generalization of ASR to new heights. It is a powerful &lt;strong>general-purpose speech recognizer&lt;/strong>, particularly suited to diverse, uncontrolled audio data. Its design philosophy is &amp;ldquo;trading scale for performance&amp;rdquo;, and it excels in zero-shot and multilingual scenarios.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>SenseVoice&lt;/strong>: Represents the trend of ASR toward &lt;strong>multi-functional, integrated&lt;/strong> systems. It is not merely a recognizer but a &lt;strong>perception front end for conversational intelligence&lt;/strong>, designed to feed downstream tasks (such as dialogue systems and TTS) richer, more real-time input. Its design philosophy is &amp;ldquo;fusion and synergy&amp;rdquo;, emphasizing ASR's pivotal role in the overall interaction pipeline.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Overall, &lt;code>Whisper&lt;/code> defines the performance baseline of modern ASR, while &lt;code>SenseVoice&lt;/code> explores broader possibilities for ASR in industrial applications. Future ASR technology will likely combine the two: &lt;code>Whisper&lt;/code>'s robustness and generalization together with &lt;code>SenseVoice&lt;/code>'s multi-task synergy and real-time processing.&lt;/p></description></item><item><title>2020: Assessing the Funniness of Edited News Headlines</title><link>https://ziyanglin.netlify.app/zh/project/my2020_nlp_funniness_estimation/</link><pubDate>Mon, 27 Jul 2020 04:56:23 +0100</pubDate><guid>https://ziyanglin.netlify.app/zh/project/my2020_nlp_funniness_estimation/</guid><description>&lt;p>This project develops potential solutions for the tasks posed by the competition
&lt;a href="https://competitions.codalab.org/competitions/20970#learn_the_details" title="competition">&lt;code>Assessing the Funniness of Edited News Headlines (SemEval-2020)&lt;/code>&lt;/a> on the platform &lt;a href="https://competitions.codalab.org" title="competition">CodaLab&lt;/a>.&lt;/p>
&lt;p>As of July 26, 2020, my trained model (&amp;lsquo;bert-base-uncased&amp;rsquo; from the &lt;a href="https://huggingface.co/transformers/index.html" title="huggingface">Huggingface transformers&lt;/a>) &lt;code>ranked third&lt;/code> on the
Post Evaluation Task 1 leaderboard on CodaLab.&lt;/p>
&lt;p align="center">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task1_ranking.png" width="600" />
&lt;/p>
&lt;hr>
&lt;h2 id="contents">Contents&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="#tasks-description">Tasks Description&lt;/a>&lt;/li>
&lt;li>&lt;a href="#data-preprocessing">Data Preprocessing&lt;/a>&lt;/li>
&lt;li>&lt;a href="#models-choices--design">Models Choices &amp;amp; Design&lt;/a>&lt;/li>
&lt;li>&lt;a href="#design-of-training-processes-for-task-two-only">Design of Training Processes (for task two only)&lt;/a>&lt;/li>
&lt;li>&lt;a href="#optimizer--learning-rate-scheduler">Optimizer &amp;amp; Learning Rate Scheduler&lt;/a>&lt;/li>
&lt;li>&lt;a href="#prime-hyperparameters">Prime Hyperparameters&lt;/a>&lt;/li>
&lt;li>&lt;a href="#results">Results&lt;/a>&lt;/li>
&lt;li>&lt;a href="#discussion">Discussion&lt;/a>&lt;/li>
&lt;li>&lt;a href="#prospective">Prospective&lt;/a>&lt;/li>
&lt;li>&lt;a href="#License">License&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="tasks-description">Tasks Description&lt;/h2>
&lt;ul>
&lt;li>&lt;code>Task one&lt;/code> - Given one edited headline, design a regression model to predict how funny it is&lt;/li>
&lt;li>&lt;code>Task two&lt;/code> - Given the original headline and two manually edited versions, design a model to predict which edited version is the funnier of the two&lt;/li>
&lt;/ul>
&lt;h2 id="data-preprocessing">Data Preprocessing&lt;/h2>
&lt;h3 id="task-one">Task One&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Convert original headlines into normal sentences (Remove &lt;code>&amp;lt;&lt;/code> and &lt;code>/&amp;gt;&lt;/code> by applying RE)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Get the edited version of headlines by doing word substitution using RE&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Do tokenization and lowercasing for each edited-original headline pair&lt;/p>
&lt;p>Data preprocessing for pre-trained LMs (BERT-like LMs):&lt;/p>
&lt;ul>
&lt;li>Version 1 - Concatenate original headlines and new headlines&lt;/li>
&lt;li>Version 2 - Concatenate new headlines and new words&lt;/li>
&lt;li>Version 3 – Contain only new headlines&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
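The three input layouts above can be sketched as plain string packing. `build_lm_input` and the explicit `[CLS]`/`[SEP]` strings are illustrative only; in practice the Huggingface tokenizer inserts the special tokens itself when given one or two sequences.

```python
def build_lm_input(version, original, edited, new_word=None):
    """Assemble the three BERT-style input layouts described above.
    Illustrative helper; real code lets the tokenizer add [CLS]/[SEP]."""
    if version == 1:   # Version 1: original headline + edited headline
        return "[CLS] " + original + " [SEP] " + edited + " [SEP]"
    if version == 2:   # Version 2: edited headline + the substituted word
        return "[CLS] " + edited + " [SEP] " + new_word + " [SEP]"
    return "[CLS] " + edited + " [SEP]"  # Version 3: edited headline only
```

Packing the pair into one sequence lets the LM's self-attention compare the original and edited wording directly.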
&lt;h3 id="task-two">Task Two&lt;/h3>
&lt;p>There are 3 versions of data preprocessing:&lt;/p>
&lt;ul>
&lt;li>The normal version&lt;/li>
&lt;li>The headlines truncated version&lt;/li>
&lt;li>The punctuation removal version&lt;/li>
&lt;/ul>
&lt;h2 id="models-choices--design">Models Choices &amp;amp; Design&lt;/h2>
&lt;h3 id="task-one1">Task One&lt;/h3>
&lt;ul>
&lt;li>Two Inputs FFNN&lt;/li>
&lt;li>Two Inputs CNN&lt;/li>
&lt;li>Two Inputs RNN&lt;/li>
&lt;li>Two Inputs Concatenated RNN&lt;/li>
&lt;li>Pre-trained LM + a regression layer (LMs applied: BERT, ALBERT, XLNet, ELECTRA)&lt;/li>
&lt;/ul>
&lt;h4 id="two-inputs-ffnn">Two Inputs FFNN&lt;/h4>
&lt;p>This model is a two-input feed-forward neural network. Two input matrices, representing all the original headlines and their corresponding edited headlines, are passed simultaneously to the model's first (embedding) layer, which produces a fixed-dimension word embedding for each word in a headline. The model then averages the word embeddings of each headline to obtain a per-headline ‘document representation’ vector. These vector representations are passed through three stacked fully connected layers that encode information about how humorous the headlines are. A ReLU activation follows each of the first two hidden layers to mitigate vanishing and exploding gradients. Finally, the row-wise weighted sums (vector products) between the n-th row of the original matrix and the n-th row of the edited matrix are computed, returning a vector of size (origin_headlines_num, 1).&lt;/p>
&lt;p align="center">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/two_inputs_FFNN.png" width="700" />
&lt;/p>
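The row-wise output layer shared by these two-input models can be sketched in a few lines. This is a pure-Python stand-in for the batched tensor operation; `rowwise_dot` is an illustrative name, not from the project code.

```python
def rowwise_dot(original_mat, edited_mat):
    """Output layer described above: dot the n-th row of the
    original-headline matrix with the n-th row of the edited-headline
    matrix, giving one funniness score per headline pair."""
    scores = []
    for o_row, e_row in zip(original_mat, edited_mat):
        scores.append(sum(o * e for o, e in zip(o_row, e_row)))
    return scores
```

Unlike a full matrix multiplication, this pairs each original headline only with its own edited version, which is exactly the supervision structure of the task.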
&lt;h4 id="two-inputs-cnn">Two Inputs CNN&lt;/h4>
&lt;p>This model uses a text-CNN architecture with a single window size, instead of an FFNN, for the regression task. The original-headlines tensor and the edited-headlines tensor are taken as the two inputs. In the output layer, unlike a normal matrix multiplication, the row-wise weighted sums (vector products) between the n-th row of the original matrix and the n-th row of the edited matrix are computed, returning a vector of size (origin_headlines_num, 1).&lt;/p>
&lt;p align="center">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/two_inputs_cnn.png" width="500" />
&lt;/p>
&lt;h4 id="two-inputs-rnn">Two Inputs RNN&lt;/h4>
&lt;p>This model uses a single-layer bidirectional RNN architecture for the regression task. Like the Two Inputs CNN, it takes two tensors as inputs and performs a row-wise weighted summation in the output layer.&lt;/p>
&lt;p align="center">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/two_inputs_rnn.png" width="500" />
&lt;/p>
&lt;h4 id="two-inputs-concatenated-rnn">Two Inputs Concatenated RNN&lt;/h4>
&lt;p>This model is identical to the Two Inputs RNN except that it concatenates the two last hidden states, one for the original headlines and one for the edited headlines, into a single representation and performs a normal matrix multiplication in the output layer.&lt;/p>
&lt;h4 id="pretrained-lm--a-regression-layer-lms-applied-bert-albert-xlnet-electra">Pre-trained LM + a regression layer (LMs applied: BERT, ALBERT, XLNet, ELECTRA)&lt;/h4>
&lt;h5 id="version-1--concatenate-original-headlines-and-new-headlines">Version 1 - Concatenate original headlines and new headlines&lt;/h5>
&lt;p align="left">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/lm_inputs_version_1.png" width="600" />
&lt;/p>
&lt;h5 id="version-2--concatenate-new-headlines-and-new-words">Version 2 - Concatenate new headlines and new words&lt;/h5>
&lt;p align="left">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/lm_inputs_version_2.png" width="600" />
&lt;/p>
&lt;h5 id="version-3--contain-only-new-headlines">Version 3 – Contain only new headlines&lt;/h5>
&lt;p align="left">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/lm_inputs_version_3.png" width="600" />
&lt;/p>
&lt;h3 id="task-two1">Task Two&lt;/h3>
&lt;h4 id="pretrained-lm--a-classification-layer">Pre-trained LM + a classification layer&lt;/h4>
&lt;h5 id="concatenate-edited-headline-1-and-edited-headline-2">Concatenate edited headline 1 and edited headline 2&lt;/h5>
&lt;p align="left">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/2_seq_inputs_lm.png" width="650" />
&lt;/p>
&lt;h2 id="design-of-training-processes-for-task-two-only">Design of Training Processes (for task two only)&lt;/h2>
&lt;h3 id="version-1">Version 1:&lt;/h3>
&lt;ul>
&lt;li>Training the model “Pre-trained LM + a classification layer”
directly on the real classification task&lt;/li>
&lt;/ul>
&lt;h3 id="version-2-fake-task--real-task">Version 2 (Fake Task + Real Task):&lt;/h3>
&lt;ul>
&lt;li>First, train the model “Pre-trained LM + a regression layer” on a fake regression task over the training dataset&lt;/li>
&lt;li>Once trained, remove the regression layer and add a freshly initialized classification layer on top of the pre-trained LM&lt;/li>
&lt;li>Finally, train the model on the real classification task&lt;/li>
&lt;/ul>
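The head swap in Version 2 can be sketched abstractly: the fine-tuned body is kept while only the task head is replaced. The `LMWithHead` class and its fields are purely illustrative, not the project's actual PyTorch modules.

```python
class LMWithHead:
    """Toy stand-in for 'pre-trained LM + task head'. swap_head() mirrors
    the Version 2 procedure: keep the trained body, re-initialise the head."""
    def __init__(self, head):
        self.body_weights = [0.5, 0.5]  # pretend these were fine-tuned
        self.head = head

    def swap_head(self, new_head):
        self.head = new_head            # body_weights are kept as-is
        return self

model = LMWithHead(head="regression")
model.swap_head("classification")
```

The hope behind this design is that the fake regression task leaves funniness-related knowledge in the body before the classification head is trained.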
&lt;h2 id="optimizer--learning-rate-scheduler">Optimizer &amp;amp; Learning Rate Scheduler&lt;/h2>
&lt;h3 id="for-ffnn-cnn-rnn">For FFNN, CNN, RNN:&lt;/h3>
&lt;ul>
&lt;li>The optimizer &lt;code>AdamW&lt;/code> and the scheduler &lt;code>CosineAnnealingLR&lt;/code> provided by PyTorch&lt;/li>
&lt;/ul>
&lt;h3 id="for-pretrained-lms-bertliked-lms">For pre-trained LMs (BERT-liked LMs):&lt;/h3>
&lt;ul>
&lt;li>The optimizer &lt;code>AdamW&lt;/code> and the scheduler &lt;code>get_linear_schedule_with_warmup&lt;/code> from &lt;a href="https://huggingface.co/transformers/index.html" title="huggingface">Huggingface transformers&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="prime-hyperparameters">Prime Hyperparameters&lt;/h2>
&lt;ul>
&lt;li>Learning Rate&lt;/li>
&lt;li>Fine-tuning Rate&lt;/li>
&lt;li>Adam Epsilon&lt;/li>
&lt;li>Weight Decay&lt;/li>
&lt;li>Warmup Ratio&lt;/li>
&lt;li>Number of Steps&lt;/li>
&lt;/ul>
&lt;h2 id="results">Results&lt;/h2>
&lt;h3 id="task-one2">Task One&lt;/h3>
&lt;h4 id="best-performance-achieved-by-two-inputs-ffnn">Best performance achieved by Two Inputs FFNN&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>EPOCHS&lt;/th>
&lt;th>LRATE&lt;/th>
&lt;th>EMBEDDING_DIM&lt;/th>
&lt;th>HIDDEN_DIM_1&lt;/th>
&lt;th>HIDDEN_DIM_2&lt;/th>
&lt;th>HIDDEN_DIM_3&lt;/th>
&lt;th>Train Loss&lt;/th>
&lt;th>Val. Loss&lt;/th>
&lt;th>Test Loss&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>100&lt;/td>
&lt;td>0.145&lt;/td>
&lt;td>300&lt;/td>
&lt;td>100&lt;/td>
&lt;td>50&lt;/td>
&lt;td>10&lt;/td>
&lt;td>0.575&lt;/td>
&lt;td>0.581&lt;/td>
&lt;td>0.576&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="best-performance-achieved-by-two-inputs-cnn">Best performance achieved by Two Inputs CNN&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>EPOCHS&lt;/th>
&lt;th>LRATE&lt;/th>
&lt;th>EMBEDDING_DIM&lt;/th>
&lt;th>FC_OUT_DIM&lt;/th>
&lt;th>N_OUT_CHANNELS&lt;/th>
&lt;th>WINDOW_SIZE&lt;/th>
&lt;th>DROPOUT&lt;/th>
&lt;th>Train Loss&lt;/th>
&lt;th>Val. Loss&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>500&lt;/td>
&lt;td>5e-3&lt;/td>
&lt;td>50&lt;/td>
&lt;td>25&lt;/td>
&lt;td>100&lt;/td>
&lt;td>3&lt;/td>
&lt;td>0.7&lt;/td>
&lt;td>0.624&lt;/td>
&lt;td>0.661&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="best-performance-achieved-by-two-inputs-rnn">Best performance achieved by Two Inputs RNN&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>EPOCHS&lt;/th>
&lt;th>LRATE&lt;/th>
&lt;th>EMBEDDING_DIM&lt;/th>
&lt;th>HIDDEN_DIM&lt;/th>
&lt;th>FC_OUTPUT_DIM&lt;/th>
&lt;th>BIDIRECTIONAL&lt;/th>
&lt;th>DROPOUT&lt;/th>
&lt;th>Train Loss&lt;/th>
&lt;th>Val. Loss&lt;/th>
&lt;th>Test Loss&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>30&lt;/td>
&lt;td>1e-4&lt;/td>
&lt;td>50&lt;/td>
&lt;td>128&lt;/td>
&lt;td>32&lt;/td>
&lt;td>True&lt;/td>
&lt;td>0.3&lt;/td>
&lt;td>0.586&lt;/td>
&lt;td>0.576&lt;/td>
&lt;td>0.571&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="best-performance-achieved-by-pretrained-lms">Best performance achieved by Pre-trained LMs&lt;/h4>
&lt;ul>
&lt;li>Without Data Augmentation
&lt;ul>
&lt;li>Model: bert_base_uncased&lt;/li>
&lt;li>Inputs structure: new headlines + new words&lt;/li>
&lt;li>Test loss: 0.52937&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>With Data Augmentation (add “funlines” training dataset)
&lt;ul>
&lt;li>Model: bert_base_uncased&lt;/li>
&lt;li>Inputs structure: new headlines + new words&lt;/li>
&lt;li>&lt;code>Test loss: 0.52054 (Best performance achieved among all trials)&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task1_log.png" alt="task1_log">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T1 Pre-trained LMs Log&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="task-two2">Task Two&lt;/h3>
&lt;h4 id="version-1-straightly-training-the-model-for-the-real-task">Version 1: Straightly training the model for the real task&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task2_v1_log1.png" alt="task2_v1_log1">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T2 Log 1&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task2_v1_log2.png" alt="task2_v1_log2">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T2 Log 2&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="version-2-fake-task-training--real-task-training">Version 2: Fake Task Training + Real Task Training&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task2_v2_f_log.png" alt="task2_v2_f_log">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T2 Fake Task Log&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task2_v2_r_log.png" alt="task2_v2_r_log">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T2 Real Task Log&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;h3 id="task-one3">Task One&lt;/h3>
&lt;ul>
&lt;li>The Two Inputs RNN performs only slightly better than the Two Inputs FFNN (0.5759702196 vs. 0.5751694002), while its time complexity is much higher; in that sense the current version of the Two Inputs RNN wastes resources.&lt;/li>
&lt;li>The Two Inputs CNN with a single window size performs worse than the Two Inputs FFNN and the Two Inputs RNN; one possible reason is that it looks at only one n-gram size and hence ignores n-grams of other lengths.&lt;/li>
&lt;/ul>
&lt;h3 id="task-two3">Task Two&lt;/h3>
&lt;ul>
&lt;li>Among the preprocessing methods, the truncated-headlines version and the punctuation-removal version perform the same as the normal one, except that truncating headlines reduces the training time per epoch.&lt;/li>
&lt;li>Overfitting on the training dataset is hard to overcome when applying BERT-like pre-trained LMs (although several methods, such as data augmentation, weight decay, and increased dropout, were tried to mitigate the problem).&lt;/li>
&lt;li>Surprisingly, the fake-task training for pre-trained LMs does not improve the model's performance on the real task at all.&lt;/li>
&lt;li>With the same hyperparameter settings for a given task, the most recently proposed pre-trained LM is not necessarily the best performer.&lt;/li>
&lt;/ul>
&lt;h2 id="prospective">Prospective&lt;/h2>
&lt;ul>
&lt;li>Construct a pre-trained LM for a binary classification task in which the model learns to decide whether a word in an edited headline is original or edited. Take the embeddings from this pre-trained model and use them to initialize the model for the real regression task, so that the embeddings carry some knowledge about the relationship between original and edited headlines.&lt;/li>
&lt;li>Build a pre-trained LM for a text translation task on the training dataset and use its embeddings to initialize the model for the real regression task (aiming to learn the semantics of funniness).&lt;/li>
&lt;li>Intuitively, the performance of the Two Inputs CNN might be improved by increasing the number of window sizes (different n-gram filters).&lt;/li>
&lt;li>Apply the pre-trained LM Longformer rather than other BERT-like models to task two; Longformer's ‘global attention mask’ can probably better model the relationship between the edited word and the other words in a headline (e.g. &lt;code>How important is the edited word for the whole headline in order to make it funnier?&lt;/code> / &lt;code>How does the edited word contribute to the meaning of the whole sentence?&lt;/code>).&lt;/li>
&lt;/ul>
&lt;h2 id="license">License&lt;/h2>
&lt;p>This project follows the MIT License, as written in the &lt;a href="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/LICENSE">LICENSE&lt;/a> file.&lt;/p>
&lt;hr></description></item></channel></rss>