<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Natural Language Processing | Ziyang Lin</title><link>https://ziyanglin.netlify.app/en/tags/natural-language-processing/</link><atom:link href="https://ziyanglin.netlify.app/en/tags/natural-language-processing/index.xml" rel="self" type="application/rss+xml"/><description>Natural Language Processing</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Sat, 28 Jun 2025 13:00:00 +0000</lastBuildDate><image><url>https://ziyanglin.netlify.app/img/icon-192.png</url><title>Natural Language Processing</title><link>https://ziyanglin.netlify.app/en/tags/natural-language-processing/</link></image><item><title>Modern ASR Technology Analysis: From Traditional Models to LLM-Driven New Paradigms</title><link>https://ziyanglin.netlify.app/en/post/asr-technology-overview/</link><pubDate>Sat, 28 Jun 2025 13:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/asr-technology-overview/</guid><description>&lt;h2 id="1-background">1. Background&lt;/h2>
&lt;h3 id="11-pain-points-of-traditional-asr-models">1.1 Pain Points of Traditional ASR Models&lt;/h3>
&lt;p>Traditional Automatic Speech Recognition (ASR) models, such as those based on Hidden Markov Models-Gaussian Mixture Models (HMM-GMM) or Deep Neural Networks (DNN), perform well in specific domains and controlled environments but face numerous challenges:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Data Sparsity&lt;/strong>: Heavy dependence on large-scale, high-quality labeled datasets, resulting in poor generalization to low-resource languages or specific accents.&lt;/li>
&lt;li>&lt;strong>Insufficient Robustness&lt;/strong>: Performance drops dramatically in noisy environments, far-field audio capture, multi-person conversations, and other real-world scenarios.&lt;/li>
&lt;li>&lt;strong>Lack of Contextual Understanding&lt;/strong>: Models are typically limited to direct mapping from acoustic features to text, lacking understanding of long-range context, semantics, and speaker intent, leading to recognition errors (such as homophone confusion).&lt;/li>
&lt;li>&lt;strong>Limited Multi-task Capabilities&lt;/strong>: Traditional models are usually single-task oriented, supporting only speech transcription without simultaneously handling speaker diarization, language identification, translation, and other tasks.&lt;/li>
&lt;/ol>
&lt;h3 id="12-large-language-model-llm-driven-asr-new-paradigm">1.2 Large Language Model (LLM) Driven ASR New Paradigm&lt;/h3>
&lt;p>In recent years, end-to-end large ASR models represented by &lt;code>Whisper&lt;/code> have demonstrated unprecedented robustness and generalization capabilities through pretraining on massive, diverse unsupervised or weakly supervised data. These models typically adopt an Encoder-Decoder architecture, treating ASR as a sequence-to-sequence translation problem.&lt;/p>
&lt;p>&lt;strong>Typical Process&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Raw Audio Waveform&amp;quot;] --&amp;gt; B[&amp;quot;Feature Extraction (e.g., Log-Mel Spectrogram)&amp;quot;]
B --&amp;gt; C[&amp;quot;Transformer Encoder&amp;quot;]
C --&amp;gt; D[&amp;quot;Latent Representation&amp;quot;]
D --&amp;gt; E[&amp;quot;Transformer Decoder&amp;quot;]
E --&amp;gt; F[&amp;quot;Text Sequence Output&amp;quot;]
&lt;/code>&lt;/pre>
&lt;p>This approach not only simplifies the complex pipeline of traditional ASR but also learns rich acoustic and linguistic knowledge through large-scale data, enabling excellent performance even in zero-shot scenarios.&lt;/p>
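&lt;p>As a rough illustration of the feature-extraction step above, the following numpy sketch computes a log-Mel spectrogram from raw audio. It is not Whisper's exact implementation (Whisper uses its own filterbank, normalization, and padding); the window, hop, and mel parameters below merely mirror common 16 kHz ASR settings.&lt;/p>

```python
import numpy as np

def hz_to_mel(hz):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Toy log-Mel extractor: frame -> window -> |FFT|^2 -> mel filterbank -> log."""
    # Frame the signal with a Hann window (25 ms window, 10 ms hop at 16 kHz).
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, n_fft//2+1)

    # Triangular mel filterbank spanning 0 .. sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    mel = power @ fbank.T                              # (n_frames, n_mels)
    return np.log10(np.maximum(mel, 1e-10))

# One second of a 440 Hz tone at 16 kHz -> 98 frames of 80 mel bins.
audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = log_mel_spectrogram(audio)
print(feats.shape)  # (98, 80)
```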
&lt;h2 id="2-analysis-of-asr-model-solutions">2. Analysis of ASR Model Solutions&lt;/h2>
&lt;h3 id="21-whisperlargev3turbo">2.1 Whisper-large-v3-turbo&lt;/h3>
&lt;p>&lt;code>Whisper&lt;/code> is a pretrained ASR model developed by OpenAI, with its &lt;code>large-v3&lt;/code> and &lt;code>large-v3-turbo&lt;/code> versions being among the industry-leading models.&lt;/p>
&lt;h4 id="211-whisper-design">2.1.1 Whisper Design&lt;/h4>
&lt;p>&lt;strong>Structural Modules&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Audio Input (30s segment)&amp;quot;] --&amp;gt; B[&amp;quot;Log-Mel Spectrogram&amp;quot;]
B --&amp;gt; C[&amp;quot;Transformer Encoder&amp;quot;]
C --&amp;gt; D[&amp;quot;Encoded Representation&amp;quot;]
D --&amp;gt; E[&amp;quot;Transformer Decoder&amp;quot;]
E --&amp;gt; F[&amp;quot;Predicted Text Tokens&amp;quot;]
subgraph &amp;quot;Multi-task Processing&amp;quot;
E --&amp;gt; G[&amp;quot;Transcription&amp;quot;]
E --&amp;gt; H[&amp;quot;Translation&amp;quot;]
E --&amp;gt; I[&amp;quot;Language Identification&amp;quot;]
end
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Features&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Large-scale Weakly Supervised Training&lt;/strong>: Trained on 680,000 hours of multilingual, multi-task data, covering a wide range of accents, background noise, and technical terminology.&lt;/li>
&lt;li>&lt;strong>End-to-end Architecture&lt;/strong>: A unified Transformer model directly maps audio to text, without requiring external language models or alignment modules.&lt;/li>
&lt;li>&lt;strong>Multi-task Capability&lt;/strong>: The model can simultaneously handle multilingual speech transcription, speech translation, and language identification.&lt;/li>
&lt;li>&lt;strong>Robustness&lt;/strong>: Through carefully designed data augmentation and mixing, the model performs excellently under various challenging conditions.&lt;/li>
&lt;li>&lt;strong>Turbo Version&lt;/strong>: &lt;code>large-v3-turbo&lt;/code> is a pruned and fine-tuned variant of &lt;code>large-v3&lt;/code> that keeps the full encoder but shrinks the decoder from 32 layers to 4, trading a small amount of accuracy for substantially faster inference; it has roughly 800M parameters.&lt;/li>
&lt;/ul>
&lt;h4 id="212-problems-solved">2.1.2 Problems Solved&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target Problem&lt;/th>
&lt;th>Whisper's Solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Poor Generalization&lt;/td>
&lt;td>Large-scale pretraining on massive, diverse datasets covering nearly a hundred languages.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Insufficient Robustness&lt;/td>
&lt;td>Training data includes various background noise, accents, and speaking styles, enhancing performance in real-world scenarios.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Weak Contextual Modeling&lt;/td>
&lt;td>Transformer architecture captures long-range dependencies in audio signals.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Complex Deployment&lt;/td>
&lt;td>Provides multiple model sizes (from &lt;code>tiny&lt;/code> to &lt;code>large&lt;/code>), with open-sourced code and model weights, facilitating community use and deployment.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="213-production-defect-analysis">2.1.3 Production Defect Analysis&lt;/h4>
&lt;h5 id="2131-hallucination-issues">2.1.3.1 Hallucination Issues&lt;/h5>
&lt;ul>
&lt;li>In segments with no speech or noise, the model sometimes generates meaningless or repetitive text, a common issue with large autoregressive models.&lt;/li>
&lt;li>This phenomenon is particularly noticeable in long audio processing and may require additional post-processing logic for detection and filtering.&lt;/li>
&lt;/ul>
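&lt;p>One lightweight filter for such degenerate output is the compression-ratio heuristic used in OpenAI's reference Whisper decoder: highly repetitive text compresses unusually well. A minimal stand-alone sketch (the 2.4 threshold mirrors the decoder's default &lt;code>compression_ratio_threshold&lt;/code>; tune it for your own data):&lt;/p>

```python
import zlib

def compression_ratio(text: str) -> float:
    """Repetitive text compresses extremely well, so a high ratio is suspicious."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

def looks_hallucinated(text: str, threshold: float = 2.4) -> bool:
    """Flag a segment whose transcript is implausibly repetitive.

    2.4 mirrors the default compression_ratio_threshold in OpenAI's
    reference Whisper decoder; adjust it for your own data.
    """
    return compression_ratio(text) > threshold

print(looks_hallucinated("Thanks for watching! " * 40))  # True (degenerate loop)
print(looks_hallucinated("The quick brown fox jumps over the lazy dog."))  # False
```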
&lt;h5 id="2132-limited-timestamp-precision">2.1.3.2 Limited Timestamp Precision&lt;/h5>
&lt;ul>
&lt;li>Whisper natively predicts segment-level timestamps; word-level timestamps require additional alignment tooling, and even then the precision may not meet the stringent requirements of certain applications (such as subtitle alignment and speech editing).&lt;/li>
&lt;li>Timestamp accuracy decreases during long periods of silence or rapid speech flow.&lt;/li>
&lt;/ul>
&lt;h5 id="2133-high-computational-resource-requirements">2.1.3.3 High Computational Resource Requirements&lt;/h5>
&lt;ul>
&lt;li>The &lt;code>large-v3&lt;/code> model contains 1.55 billion parameters, and the &lt;code>turbo&lt;/code> version has nearly 800 million parameters, demanding significant computational resources (especially GPU memory), making it unsuitable for direct execution on edge devices.&lt;/li>
&lt;li>Although optimization techniques like quantization exist, balancing performance while reducing resource consumption remains a challenge.&lt;/li>
&lt;/ul>
&lt;h5 id="2134-realtime-processing-bottlenecks">2.1.3.4 Real-time Processing Bottlenecks&lt;/h5>
&lt;ul>
&lt;li>The model processes 30-second audio windows, requiring complex sliding window and caching mechanisms for real-time streaming ASR scenarios, which introduces additional latency.&lt;/li>
&lt;/ul>
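&lt;p>A pseudo-streaming setup can be sketched as an overlapping window planner; the overlap gives a downstream merger room to stitch transcripts across chunk boundaries. A toy version (the window and overlap sizes here are illustrative choices, not Whisper requirements):&lt;/p>

```python
def chunk_audio(n_samples: int, sr: int = 16000,
                window_s: float = 30.0, overlap_s: float = 5.0):
    """Return (start, end) sample ranges of overlapping fixed-size windows.

    Overlap lets a downstream merger stitch transcripts across boundaries,
    e.g. by deduplicating text recognized in the shared region.
    """
    window = int(window_s * sr)
    step = int((window_s - overlap_s) * sr)
    spans = []
    for start in range(0, n_samples, step):
        end = min(start + window, n_samples)
        spans.append((start, end))
        if end == n_samples:
            break
    return spans

# 70 s of 16 kHz audio -> three windows: 0-30 s, 25-55 s, 50-70 s.
spans = chunk_audio(70 * 16000)
print([(s // 16000, e // 16000) for s, e in spans])  # [(0, 30), (25, 55), (50, 70)]
```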
&lt;h3 id="22-sensevoice">2.2 SenseVoice&lt;/h3>
&lt;p>&lt;code>SenseVoice&lt;/code> is a next-generation industrial-grade ASR model developed by Alibaba DAMO Academy's speech team. Unlike &lt;code>Whisper&lt;/code>, which focuses on robust general transcription, &lt;code>SenseVoice&lt;/code> emphasizes multi-functionality, real-time processing, and integration with downstream tasks.&lt;/p>
&lt;h4 id="221-sensevoice-design">2.2.1 SenseVoice Design&lt;/h4>
&lt;p>&lt;strong>Structural Modules&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Audio Stream&amp;quot;] --&amp;gt; B[&amp;quot;FSMN-VAD (Voice Activity Detection)&amp;quot;]
B --&amp;gt; C[&amp;quot;Encoder (e.g., SAN-M)&amp;quot;]
C --&amp;gt; D[&amp;quot;Latent Representation&amp;quot;]
D --&amp;gt; E[&amp;quot;Decoder&amp;quot;]
E --&amp;gt; F[&amp;quot;Text Output&amp;quot;]
subgraph &amp;quot;Multi-task and Control&amp;quot;
G[&amp;quot;Speaker Diarization&amp;quot;] --&amp;gt; C
H[&amp;quot;Emotion Recognition&amp;quot;] --&amp;gt; C
I[&amp;quot;Zero-shot TTS Prompt&amp;quot;] --&amp;gt; E
end
&lt;/code>&lt;/pre>
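&lt;p>The first stage in the diagram gates the audio stream through VAD. FSMN-VAD itself is a learned model; purely as a stand-in for that role, a fixed-threshold frame-energy detector looks like this (illustrative only — a hard RMS threshold fails on noisy or quiet recordings):&lt;/p>

```python
import numpy as np

def energy_vad(audio, sr=16000, frame_ms=25, hop_ms=10, threshold=0.02):
    """Toy VAD: mark a frame as speech when its RMS energy exceeds a threshold.

    Real systems (e.g. FSMN-VAD) learn this decision; a fixed RMS
    threshold only works for clean, well-levelled audio.
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(len(audio) - frame, 0) // hop
    rms = np.array([np.sqrt(np.mean(audio[i * hop:i * hop + frame] ** 2))
                    for i in range(n_frames)])
    return rms > threshold

# Half a second of silence followed by half a second of a tone.
sr = 16000
silence = np.zeros(sr // 2)
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr // 2) / sr)
flags = energy_vad(np.concatenate([silence, tone]), sr)
print(flags[:5].any(), flags[-5:].all())  # False True
```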
&lt;p>&lt;strong>Features&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Unified End-to-end Model&lt;/strong>: Integrates acoustic model, language model, and punctuation prediction, achieving end-to-end output from speech to punctuated text.&lt;/li>
&lt;li>&lt;strong>Multi-task Learning&lt;/strong>: The model not only performs speech recognition but also simultaneously outputs speaker diarization, emotional information, and can even generate acoustic prompts for zero-shot TTS.&lt;/li>
&lt;li>&lt;strong>Streaming and Non-streaming Integration&lt;/strong>: Supports both streaming and non-streaming modes through a unified architecture, meeting the needs of real-time and offline scenarios.&lt;/li>
&lt;li>&lt;strong>TTS Integration&lt;/strong>: One innovation of &lt;code>SenseVoice&lt;/code> is that its output can serve as a prompt for TTS models like &lt;code>CosyVoice&lt;/code>, enabling voice cloning and transfer, closing the loop between ASR and TTS.&lt;/li>
&lt;/ul>
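&lt;p>The rich output described above is emitted by SenseVoice-style models as inline special tokens prefixed to the transcript — language, emotion, and audio-event tags. Assuming the &lt;code>&amp;lt;|...|&amp;gt;&lt;/code> framing (the exact token inventory is model- and version-specific, so treat the tag names below as examples), a minimal parser:&lt;/p>

```python
import re

# SenseVoice-style transcripts prefix the text with inline metadata tokens,
# e.g. "<|en|><|HAPPY|><|Speech|><|woitn|>hello there".
# Only the <|...|> framing is assumed here; the tag vocabulary varies.
TAG = re.compile(r"<\|([^|]+)\|>")

def parse_rich_transcript(raw: str) -> dict:
    """Split a raw model output into its metadata tags and the plain text."""
    tags = TAG.findall(raw)
    text = TAG.sub("", raw).strip()
    return {"tags": tags, "text": text}

out = parse_rich_transcript("<|en|><|HAPPY|><|Speech|><|woitn|>hello there")
print(out)
```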
&lt;h4 id="222-problems-solved">2.2.2 Problems Solved&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target Problem&lt;/th>
&lt;th>SenseVoice's Solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Single-task Limitation, Integration Difficulties&lt;/td>
&lt;td>Designed as a multi-task model, natively supporting speaker diarization, emotion recognition, etc., simplifying dialogue system construction.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Poor Real-time Performance&lt;/td>
&lt;td>Adopts efficient streaming architecture (such as SAN-M), combined with VAD, achieving low-latency real-time recognition.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Lack of Coordination with Downstream Tasks&lt;/td>
&lt;td>Output includes rich meta-information (such as speaker, emotion) and can generate TTS prompts, achieving deep integration between ASR and TTS.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Punctuation Restoration Dependent on Post-processing&lt;/td>
&lt;td>Incorporates punctuation prediction as a built-in task, achieving joint modeling of text and punctuation.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="223-production-defect-analysis">2.2.3 Production Defect Analysis&lt;/h4>
&lt;h5 id="2231-model-complexity-and-maintenance">2.2.3.1 Model Complexity and Maintenance&lt;/h5>
&lt;ul>
&lt;li>As a complex model integrating multiple functions, its training and maintenance costs are relatively high.&lt;/li>
&lt;li>Balancing multiple tasks may require fine-tuning to avoid performance degradation in any single task.&lt;/li>
&lt;/ul>
&lt;h5 id="2232-generalization-of-zeroshot-capabilities">2.2.3.2 Generalization of Zero-shot Capabilities&lt;/h5>
&lt;ul>
&lt;li>Although it supports zero-shot TTS prompt generation, its voice cloning effect and stability when facing unseen speakers or complex acoustic environments may not match specialized voice cloning models.&lt;/li>
&lt;/ul>
&lt;h5 id="2233-opensource-ecosystem-and-community">2.2.3.3 Open-source Ecosystem and Community&lt;/h5>
&lt;ul>
&lt;li>Compared to &lt;code>Whisper&lt;/code>'s strong open-source community and rich ecosystem tools, &lt;code>SenseVoice&lt;/code>, as an industrial-grade model, may have limited open-source availability and community support, affecting its popularity in academic and developer communities.&lt;/li>
&lt;/ul>
&lt;h2 id="3-conclusion">3. Conclusion&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Whisper&lt;/strong>: Through large-scale weakly supervised learning, it has pushed the robustness and generalization capabilities of ASR to new heights. It is a powerful &lt;strong>general-purpose speech recognizer&lt;/strong>, particularly suitable for processing diverse, uncontrolled audio data. Its design philosophy is &amp;ldquo;trading scale for performance,&amp;rdquo; excelling in zero-shot and multilingual scenarios.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>SenseVoice&lt;/strong>: Represents the trend of ASR technology developing towards &lt;strong>multi-functionality and integration&lt;/strong>. It is not just a recognizer but a &lt;strong>perceptual frontend for conversational intelligence&lt;/strong>, aimed at providing richer, more real-time input for downstream tasks (such as dialogue systems, TTS). Its design philosophy is &amp;ldquo;fusion and collaboration,&amp;rdquo; emphasizing ASR's pivotal role in the entire intelligent interaction chain.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>In summary, &lt;code>Whisper&lt;/code> defines the performance baseline for modern ASR, while &lt;code>SenseVoice&lt;/code> explores broader possibilities for ASR in industrial applications. Future ASR technology may develop towards combining the strengths of both: having both the robustness and generalization capabilities of &lt;code>Whisper&lt;/code> and the multi-task collaboration and real-time processing capabilities of &lt;code>SenseVoice&lt;/code>.&lt;/p></description></item><item><title>2020 Assessing the Funniness of Edited News Headlines</title><link>https://ziyanglin.netlify.app/en/project/my2020_nlp_funniness_estimation/</link><pubDate>Mon, 27 Jul 2020 04:56:23 +0100</pubDate><guid>https://ziyanglin.netlify.app/en/project/my2020_nlp_funniness_estimation/</guid><description>&lt;p>This project aims to develop potential solutions for the tasks posed by the competition
&lt;a href="https://competitions.codalab.org/competitions/20970#learn_the_details" title="competition">&lt;code>Assessing the Funniness of Edited News Headlines (SemEval-2020)&lt;/code>&lt;/a> on the platform &lt;a href="https://competitions.codalab.org" title="competition">CodaLab&lt;/a>&lt;/p>
&lt;p>As of July 26, 2020, my trained model (&amp;lsquo;bert-base-uncased&amp;rsquo; from the &lt;a href="https://huggingface.co/transformers/index.html" title="huggingface">Huggingface transformers&lt;/a>) &lt;code>ranked third&lt;/code> on the Post Evaluation Task 1 leaderboard on CodaLab&lt;/p>
&lt;p align="center">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task1_ranking.png" width="600" />
&lt;/p>
&lt;hr>
&lt;h2 id="contents">Contents&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="#tasks-description">Tasks Description&lt;/a>&lt;/li>
&lt;li>&lt;a href="#data-preprocessing">Data Preprocessing&lt;/a>&lt;/li>
&lt;li>&lt;a href="#models-choices--design">Models Choices &amp;amp; Design&lt;/a>&lt;/li>
&lt;li>&lt;a href="#design-of-training-processes-for-task-two-only">Design of Training Processes (for task two only)&lt;/a>&lt;/li>
&lt;li>&lt;a href="#optimizer--learning-rate-scheduler">Optimizer &amp;amp; Learning Rate Scheduler&lt;/a>&lt;/li>
&lt;li>&lt;a href="#prime-hyperparameters">Prime Hyperparameters&lt;/a>&lt;/li>
&lt;li>&lt;a href="#results">Results&lt;/a>&lt;/li>
&lt;li>&lt;a href="#discussion">Discussion&lt;/a>&lt;/li>
&lt;li>&lt;a href="#prospective">Prospective&lt;/a>&lt;/li>
&lt;li>&lt;a href="#License">License&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="tasks-description">Tasks Description&lt;/h2>
&lt;ul>
&lt;li>&lt;code>Task one&lt;/code> - Given one edited headline, design a regression model to predict how funny it is&lt;/li>
&lt;li>&lt;code>Task two&lt;/code> - Given the original headline and two manually edited versions, design a model to predict which edited version is the funnier of the two&lt;/li>
&lt;/ul>
&lt;h2 id="data-preprocessing">Data Preprocessing&lt;/h2>
&lt;h3 id="task-one">Task One&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Convert original headlines into normal sentences (Remove &lt;code>&amp;lt;&lt;/code> and &lt;code>/&amp;gt;&lt;/code> by applying RE)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Get the edited version of headlines by doing word substitution using RE&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Tokenize and lowercase each edited-original headline pair&lt;/p>
&lt;p>Data preprocessing for pre-trained LMs (BERT-like LMs):&lt;/p>
&lt;ul>
&lt;li>Version 1 - Concatenate original headlines and new headlines&lt;/li>
&lt;li>Version 2 - Concatenate new headlines and new words&lt;/li>
&lt;li>Version 3 – Contain only new headlines&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
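&lt;p>Assuming the &lt;code>&amp;lt;word/&amp;gt;&lt;/code> markup described above (the dataset wraps the word to be replaced in &lt;code>&amp;lt;&lt;/code> and &lt;code>/&amp;gt;&lt;/code>, with the substitute word in a separate column), the RE-based steps can be sketched as:&lt;/p>

```python
import re

# The dataset marks the word to be replaced as "<word/>" inside the
# original headline; the substitute word comes from the edit column.
MARK = re.compile(r"<([^/>]+)/>")

def to_original(headline: str) -> str:
    """Strip the < and /> markers, keeping the original word."""
    return MARK.sub(r"\1", headline)

def to_edited(headline: str, edit: str) -> str:
    """Replace the marked word with the edit word."""
    return MARK.sub(edit, headline)

h = "France is <hunting/> down its citizens"
print(to_original(h))           # France is hunting down its citizens
print(to_edited(h, "dancing"))  # France is dancing down its citizens
```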
&lt;h3 id="task-two">Task Two&lt;/h3>
&lt;p>There are 3 versions of data preprocessing:&lt;/p>
&lt;ul>
&lt;li>The normal version&lt;/li>
&lt;li>The headlines truncated version&lt;/li>
&lt;li>The punctuation removal version&lt;/li>
&lt;/ul>
&lt;h2 id="models-choices--design">Models Choices &amp;amp; Design&lt;/h2>
&lt;h3 id="task-one1">Task One&lt;/h3>
&lt;ul>
&lt;li>Two Inputs FFNN&lt;/li>
&lt;li>Two Inputs CNN&lt;/li>
&lt;li>Two Inputs RNN&lt;/li>
&lt;li>Two Inputs Concatenated RNN&lt;/li>
&lt;li>Pre-trained LM + a regression layer (LMs applied: BERT, ALBERT, XLNet, ELECTRA)&lt;/li>
&lt;/ul>
&lt;h4 id="two-inputs-ffnn">Two Inputs FFNN&lt;/h4>
&lt;p>This model is a two-input feed-forward neural network. Two input matrices, representing all the original headlines and their corresponding edited headlines, are passed simultaneously through the embedding layer of the model to obtain a fixed-dimension word embedding for each word in a headline. The model then averages the word embeddings of each headline to form its document representation (vector). These headline vectors are passed through three stacked fully connected layers that encode information about how humorous the headlines are; a ReLU activation follows each of the first two hidden layers to mitigate vanishing gradients. Finally, the row-wise weighted sum (vector product) between the n-th row of the original matrix and the n-th row of the edited matrix is computed, returning a vector of size (origin_headlines_num, 1).&lt;/p>
&lt;p align="center">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/two_inputs_FFNN.png" width="700" />
&lt;/p>
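&lt;p>The forward pass just described can be sketched in numpy (random weights and no training loop; the layer sizes follow the best-performing configuration reported below, 300/100/50/10):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class TwoInputsFFNN:
    """Sketch of the forward pass only: embed -> average -> three FC layers
    (ReLU after the first two) -> row-wise dot product of the two branches."""

    def __init__(self, vocab=1000, emb=300, h1=100, h2=50, h3=10):
        self.E = rng.normal(0, 0.1, (vocab, emb))    # shared embedding table
        self.W1 = rng.normal(0, 0.1, (emb, h1))
        self.W2 = rng.normal(0, 0.1, (h1, h2))
        self.W3 = rng.normal(0, 0.1, (h2, h3))

    def encode(self, token_ids):
        # (N, T) token ids -> average word embeddings -> document vector.
        doc = self.E[token_ids].mean(axis=1)
        return relu(relu(doc @ self.W1) @ self.W2) @ self.W3

    def forward(self, original_ids, edited_ids):
        a = self.encode(original_ids)
        b = self.encode(edited_ids)
        # Row-wise dot product: one funniness score per headline pair.
        return np.sum(a * b, axis=1, keepdims=True)  # (N, 1)

model = TwoInputsFFNN()
orig = rng.integers(0, 1000, (8, 12))  # 8 headline pairs, 12 tokens each
edit = rng.integers(0, 1000, (8, 12))
print(model.forward(orig, edit).shape)  # (8, 1)
```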
&lt;h4 id="two-inputs-cnn">Two Inputs CNN&lt;/h4>
&lt;p>This model replaces the FFNN with a text-CNN architecture using a single window size for the regression task. The original-headlines tensor and the edited-headlines tensor are taken as the two inputs. In the output layer, instead of a normal matrix multiplication, the row-wise weighted sum (vector product) between the n-th row of the original matrix and the n-th row of the edited matrix is computed, returning a vector of size (origin_headlines_num, 1).&lt;/p>
&lt;p align="center">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/two_inputs_cnn.png" width="500" />
&lt;/p>
&lt;h4 id="two-inputs-rnn">Two Inputs RNN&lt;/h4>
&lt;p>This model uses a single-layer bidirectional RNN architecture for the regression task. Like the Two Inputs CNN, it takes two tensors as its inputs and performs a row-wise weighted summation in the output layer.&lt;/p>
&lt;p align="center">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/two_inputs_rnn.png" width="500" />
&lt;/p>
&lt;h4 id="two-inputs-concatenated-rnn">Two Inputs Concatenated RNN&lt;/h4>
&lt;p>This model is identical to the Two Inputs RNN, except that it concatenates the last hidden states of the original and edited headlines into a single representation and performs a normal matrix multiplication in the output layer.&lt;/p>
&lt;h4 id="pretrained-lm--a-regression-layer-lms-applied-bert-albert-xlnet-electra">Pre-trained LM + a regression layer (LMs applied: BERT, ALBERT, XLNet, ELECTRA)&lt;/h4>
&lt;h5 id="version-1--concatenate-original-headlines-and-new-headlines">Version 1 - Concatenate original headlines and new headlines&lt;/h5>
&lt;p align="left">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/lm_inputs_version_1.png" width="600" />
&lt;/p>
&lt;h5 id="version-2--concatenate-new-headlines-and-new-words">Version 2 - Concatenate new headlines and new words&lt;/h5>
&lt;p align="left">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/lm_inputs_version_2.png" width="600" />
&lt;/p>
&lt;h5 id="version-3--contain-only-new-headlines">Version 3 – Contain only new headlines&lt;/h5>
&lt;p align="left">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/lm_inputs_version_3.png" width="600" />
&lt;/p>
&lt;h3 id="task-two1">Task Two&lt;/h3>
&lt;h4 id="pretrained-lm--a-classification-layer">Pre-trained LM + a classification layer&lt;/h4>
&lt;h5 id="concatenate-edited-headline-1-and-edited-headline-2">Concatenate edited headline 1 and edited headline 2&lt;/h5>
&lt;p align="left">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/2_seq_inputs_lm.png" width="650" />
&lt;/p>
&lt;h2 id="design-of-training-processes-for-task-two-only">Design of Training Processes (for task two only)&lt;/h2>
&lt;h3 id="version-1">Version 1:&lt;/h3>
&lt;ul>
&lt;li>Training the model “Pre-trained LM + a classification layer” directly on the real classification task&lt;/li>
&lt;/ul>
&lt;h3 id="version-2-fake-task--real-task">Version 2 (Fake Task + Real Task):&lt;/h3>
&lt;ul>
&lt;li>First, train the model “Pre-trained LM + a regression layer” on a fake (proxy) regression task using the training dataset&lt;/li>
&lt;li>Once it is trained well, remove the regression layer and add a freshly initialized classification layer on top of the pre-trained LM&lt;/li>
&lt;li>Finally, train the model on the real classification task&lt;/li>
&lt;/ul>
&lt;h2 id="optimizer--learning-rate-scheduler">Optimizer &amp;amp; Learning Rate Scheduler&lt;/h2>
&lt;h3 id="for-ffnn-cnn-rnn">For FFNN, CNN, RNN:&lt;/h3>
&lt;ul>
&lt;li>The optimizer &lt;code>AdamW&lt;/code> and the scheduler &lt;code>CosineAnnealingLR&lt;/code> provided by PyTorch&lt;/li>
&lt;/ul>
&lt;h3 id="for-pretrained-lms-bertliked-lms">For pre-trained LMs (BERT-liked LMs):&lt;/h3>
&lt;ul>
&lt;li>The optimizer &lt;code>AdamW&lt;/code> and the scheduler &lt;code>get_linear_schedule_with_warmup&lt;/code> from &lt;a href="https://huggingface.co/transformers/index.html" title="huggingface">Huggingface transformers&lt;/a>&lt;/li>
&lt;/ul>
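&lt;p>For intuition, the shape of &lt;code>get_linear_schedule_with_warmup&lt;/code> — a linear ramp from zero to the peak learning rate over the warmup steps, then a linear decay back to zero — can be reproduced as a plain multiplier function:&lt;/p>

```python
def linear_warmup_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Learning-rate multiplier matching the shape of Huggingface's
    get_linear_schedule_with_warmup: ramp 0 -> 1 over the warmup steps,
    then decay 1 -> 0 over the remaining steps."""
    if warmup_steps > step:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total, warmup = 100, 10
print(linear_warmup_multiplier(0, warmup, total))    # 0.0  (start of warmup)
print(linear_warmup_multiplier(10, warmup, total))   # 1.0  (peak)
print(linear_warmup_multiplier(55, warmup, total))   # 0.5  (halfway through decay)
print(linear_warmup_multiplier(100, warmup, total))  # 0.0  (end of training)
```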
&lt;h2 id="prime-hyperparameters">Prime Hyperparameters&lt;/h2>
&lt;ul>
&lt;li>Learning Rate&lt;/li>
&lt;li>Fine-tuning Rate&lt;/li>
&lt;li>Adam Epsilon&lt;/li>
&lt;li>Weight Decay&lt;/li>
&lt;li>Warmup Ratio&lt;/li>
&lt;li>Number of Steps&lt;/li>
&lt;/ul>
&lt;h2 id="results">Results&lt;/h2>
&lt;h3 id="task-one2">Task One&lt;/h3>
&lt;h4 id="best-performance-achieved-by-two-inputs-ffnn">Best performance achieved by Two Inputs FFNN&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>EPOCHS&lt;/th>
&lt;th>LRATE&lt;/th>
&lt;th>EMBEDDING_DIM&lt;/th>
&lt;th>HIDDEN_DIM_1&lt;/th>
&lt;th>HIDDEN_DIM_2&lt;/th>
&lt;th>HIDDEN_DIM_3&lt;/th>
&lt;th>Train Loss&lt;/th>
&lt;th>Val. Loss&lt;/th>
&lt;th>Test Loss&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>100&lt;/td>
&lt;td>0.145&lt;/td>
&lt;td>300&lt;/td>
&lt;td>100&lt;/td>
&lt;td>50&lt;/td>
&lt;td>10&lt;/td>
&lt;td>0.575&lt;/td>
&lt;td>0.581&lt;/td>
&lt;td>0.576&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="best-performance-achieved-by-two-inputs-cnn">Best performance achieved by Two Inputs CNN&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>EPOCHS&lt;/th>
&lt;th>LRATE&lt;/th>
&lt;th>EMBEDDING_DIM&lt;/th>
&lt;th>FC_OUT_DIM&lt;/th>
&lt;th>N_OUT_CHANNELS&lt;/th>
&lt;th>WINDOW_SIZE&lt;/th>
&lt;th>DROPOUT&lt;/th>
&lt;th>Train Loss&lt;/th>
&lt;th>Val. Loss&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>500&lt;/td>
&lt;td>5e-3&lt;/td>
&lt;td>50&lt;/td>
&lt;td>25&lt;/td>
&lt;td>100&lt;/td>
&lt;td>3&lt;/td>
&lt;td>0.7&lt;/td>
&lt;td>0.624&lt;/td>
&lt;td>0.661&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="best-performance-achieved-by-two-inputs-rnn">Best performance achieved by Two Inputs RNN&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>EPOCHS&lt;/th>
&lt;th>LRATE&lt;/th>
&lt;th>EMBEDDING_DIM&lt;/th>
&lt;th>HIDDEN_DIM&lt;/th>
&lt;th>FC_OUTPUT_DIM&lt;/th>
&lt;th>BIDIRECTIONAL&lt;/th>
&lt;th>DROPOUT&lt;/th>
&lt;th>Train Loss&lt;/th>
&lt;th>Val. Loss&lt;/th>
&lt;th>Test Loss&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>30&lt;/td>
&lt;td>1e-4&lt;/td>
&lt;td>50&lt;/td>
&lt;td>128&lt;/td>
&lt;td>32&lt;/td>
&lt;td>True&lt;/td>
&lt;td>0.3&lt;/td>
&lt;td>0.586&lt;/td>
&lt;td>0.576&lt;/td>
&lt;td>0.571&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="best-performance-achieved-by-pretrained-lms">Best performance achieved by Pre-trained LMs&lt;/h4>
&lt;ul>
&lt;li>Without Data Augmentation
&lt;ul>
&lt;li>Model: bert_base_uncased&lt;/li>
&lt;li>Inputs structure: new headlines + new words&lt;/li>
&lt;li>Test loss: 0.52937&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>With Data Augmentation (add “funlines” training dataset)
&lt;ul>
&lt;li>Model: bert_base_uncased&lt;/li>
&lt;li>Inputs structure: new headlines + new words&lt;/li>
&lt;li>&lt;code>Test loss: 0.52054 (Best performance achieved among all trials)&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task1_log.png" alt="task1_log">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T1 Pre-trained LMs Log&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="task-two2">Task Two&lt;/h3>
&lt;h4 id="version-1-straightly-training-the-model-for-the-real-task">Version 1: Straightly training the model for the real task&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task2_v1_log1.png" alt="task2_v1_log1">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T2 Log 1&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task2_v1_log2.png" alt="task2_v1_log2">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T2 Log 2&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="version-2-fake-task-training--real-task-training">Version 2: Fake Task Training + Real Task Training&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task2_v2_f_log.png" alt="task2_v2_f_log">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T2 Fake Task Log&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task2_v2_r_log.png" alt="task2_v2_r_log">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T2 Real Task Log&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;h3 id="task-one3">Task One&lt;/h3>
&lt;ul>
&lt;li>The Two Inputs RNN performs only slightly better than the Two Inputs FFNN (0.5759702196 vs. 0.5751694002), while its time complexity is much higher, so the current version of the Two Inputs RNN is arguably a waste of resources.&lt;/li>
&lt;li>The Two Inputs CNN with a single window size performs worse than both the Two Inputs FFNN and the Two Inputs RNN; one possible reason is that it only looks at a single n-gram size and hence ignores the information carried by n-grams of other lengths.&lt;/li>
&lt;/ul>
&lt;h3 id="task-two3">Task Two&lt;/h3>
&lt;ul>
&lt;li>Among the preprocessing methods, the headline-truncated version and the punctuation-removal version perform the same as the normal one, except that truncating headlines reduces the training time per epoch.&lt;/li>
&lt;li>Overfitting on the training dataset is hard to overcome when applying BERT-like pre-trained LMs, even though several mitigations (data augmentation, weight decay, and increased dropout) were tried.&lt;/li>
&lt;li>Surprisingly, the fake-task training does not improve the performance of the pre-trained LMs on the real task at all.&lt;/li>
&lt;li>With the same hyperparameter settings for a given task, the most recently proposed pre-trained LM is not necessarily the best performer.&lt;/li>
&lt;/ul>
&lt;h2 id="prospective">Prospective&lt;/h2>
&lt;ul>
&lt;li>Construct a pre-training task in which an LM learns a binary classification: deciding whether a word in an edited headline is original or edited. Then take the embeddings out of this pre-trained model and use them to initialize the model for the real regression task, so that the embeddings encode knowledge about the relationship between original and edited headlines.&lt;/li>
&lt;li>Build a pre-trained LM on a text translation task over the training dataset and use its embeddings to initialize the model for the real regression task (aiming to learn the semantics of funniness).&lt;/li>
&lt;li>Intuitively, the performance of the Two Inputs CNN might be improved by using multiple window sizes (different n-gram filters).&lt;/li>
&lt;li>Apply the pre-trained LM Longformer rather than other BERT-like models to task two: the ‘global attention mask’ of Longformer can probably better model the relationship between the edited word and the other words in a headline (e.g. &lt;code>How important is the edited word for the whole headline in order to make it funnier?&lt;/code> / &lt;code>How does the edited word contribute to the meaning of the whole sentence?&lt;/code>).&lt;/li>
&lt;/ul>
&lt;h2 id="license">License&lt;/h2>
&lt;p>This project follows the MIT License, as written in the &lt;a href="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/LICENSE">LICENSE&lt;/a> file.&lt;/p>
&lt;hr></description></item></channel></rss>