<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Deep Learning | Ziyang Lin</title><link>https://ziyanglin.netlify.app/en/tags/deep-learning/</link><atom:link href="https://ziyanglin.netlify.app/en/tags/deep-learning/index.xml" rel="self" type="application/rss+xml"/><description>Deep Learning</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Mon, 30 Jun 2025 06:00:00 +0000</lastBuildDate><image><url>https://ziyanglin.netlify.app/img/icon-192.png</url><title>Deep Learning</title><link>https://ziyanglin.netlify.app/en/tags/deep-learning/</link></image><item><title>TensorRT In-Depth: High-Performance Deep Learning Inference Engine</title><link>https://ziyanglin.netlify.app/en/post/tensorrt-documentation/</link><pubDate>Mon, 30 Jun 2025 06:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/tensorrt-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>NVIDIA® TensorRT™ is a software development kit (SDK) for high-performance deep learning inference on NVIDIA GPUs. It is designed to optimize and accelerate trained neural networks, enabling them to run in production environments with low latency and high throughput. TensorRT takes models from mainstream deep learning frameworks (such as TensorFlow and PyTorch, typically via the ONNX exchange format), applies a series of sophisticated optimization techniques, and generates a highly optimized runtime engine.&lt;/p>
&lt;p>This document will provide an in-depth yet accessible introduction to TensorRT's core concepts, key features, workflow, and latest functionalities (including TensorRT-LLM specifically designed for accelerating large language models), helping developers fully leverage its powerful performance advantages.&lt;/p>
&lt;h2 id="2-core-concepts">2. Core Concepts&lt;/h2>
&lt;p>Understanding TensorRT's core components is the first step to using it effectively.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Engine&lt;/strong>: The core of TensorRT. It is an optimized model representation that includes a computation graph and weights generated for a specific GPU architecture and configuration (such as batch size, precision). The Engine is immutable and is the final product for deployment.&lt;/li>
&lt;li>&lt;strong>Builder (&lt;code>IBuilder&lt;/code>)&lt;/strong>: This is the main interface for creating an Engine. The Builder takes a network definition and applies various optimizations, ultimately generating an optimized plan for the target GPU, which can be serialized into an Engine.&lt;/li>
&lt;li>&lt;strong>Network Definition (&lt;code>INetworkDefinition&lt;/code>)&lt;/strong>: This is where you define the model structure. You can build the network manually from scratch or import it from a model file using a Parser.&lt;/li>
&lt;li>&lt;strong>Parser&lt;/strong>: Used to parse models from different frameworks (primarily ONNX format) and convert them into TensorRT's network definition. TensorRT provides a powerful ONNX parser.&lt;/li>
&lt;li>&lt;strong>Profiler (&lt;code>IProfiler&lt;/code>)&lt;/strong>: An optional interface that reports per-layer execution times while inference runs. Attaching a profiler to an Execution Context helps with debugging and identifying which layers are performance bottlenecks.&lt;/li>
&lt;li>&lt;strong>Execution Context (&lt;code>IExecutionContext&lt;/code>)&lt;/strong>: This is the main interface for executing inference. An Engine can have multiple Execution Contexts, allowing concurrent execution of inference tasks. Each context maintains its own inputs, outputs, and state.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Model Building Offline&amp;quot;
A[Original Model&amp;lt;br&amp;gt;TensorFlow/PyTorch] --&amp;gt; B{ONNX Parser};
B --&amp;gt; C[Network Definition];
C --&amp;gt; D[Builder];
D -- Optimization Config --&amp;gt; E[Optimized Plan];
E --&amp;gt; F((Engine));
end
subgraph &amp;quot;Inference Deployment Online&amp;quot;
F --&amp;gt; G[Execution Context];
H[Input Data] --&amp;gt; G;
G --&amp;gt; I[Output Results];
end
style F fill:#f9f,stroke:#333,stroke-width:2px
style G fill:#ccf,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;h2 id="3-key-features-and-optimization-techniques">3. Key Features and Optimization Techniques&lt;/h2>
&lt;p>TensorRT's high performance stems from its advanced optimization techniques.&lt;/p>
&lt;h3 id="31-precision-calibration--quantization">3.1. Precision Calibration &amp;amp; Quantization&lt;/h3>
&lt;p>TensorRT supports multiple precisions for inference, including FP32, FP16, INT8, and the latest FP8. Among these, INT8 quantization is a key technology for improving performance and reducing memory usage.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Post-Training Quantization (PTQ)&lt;/strong>: Determines the scaling factors needed to convert FP32 weights and activation values to INT8 through a calibration dataset, without retraining the model.&lt;/li>
&lt;li>&lt;strong>Quantization-Aware Training (QAT)&lt;/strong>: Simulates quantization operations during training, making the model more robust to quantization errors, thus achieving higher accuracy when converted to INT8.&lt;/li>
&lt;/ul>
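&lt;p>The core idea behind PTQ can be sketched with a toy symmetric quantizer: a calibration set determines a scale factor mapping the observed FP32 range onto INT8. This stdlib-only sketch uses the simple max-abs method; TensorRT's actual calibrators (entropy, percentile) are more sophisticated.&lt;/p>

```python
# Toy post-training quantization sketch: max-abs calibration to INT8.
# Illustrative only -- TensorRT's calibrators (entropy, percentile) are
# more sophisticated than this simple max-abs scheme.

def calibrate_scale(calibration_values):
    """Pick a scale so the largest observed magnitude maps to 127."""
    amax = max(abs(v) for v in calibration_values)
    return amax / 127.0

def quantize(x, scale):
    """FP32 -> INT8 with clamping to [-128, 127]."""
    q = round(x / scale)
    return max(-128, min(127, q))

def dequantize(q, scale):
    return q * scale

activations = [0.02, -1.5, 0.7, 3.9, -2.2]   # pretend calibration data
scale = calibrate_scale(activations)

# Round-trip error stays within half a quantization step for in-range values
for x in activations:
    x_hat = dequantize(quantize(x, scale), scale)
    assert abs(x - x_hat) <= scale / 2 + 1e-9
```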
&lt;p>You can use &lt;code>QuantizationSpec&lt;/code> to precisely control which layers or types of layers need to be quantized.&lt;/p>
&lt;pre>&lt;code class="language-python"># Example: Only quantize 'Conv2D' type layers
q_spec = QuantizationSpec()
q_spec.add(name='Conv2D', is_keras_class=True)
q_model = quantize_model(model, quantization_mode='partial', quantization_spec=q_spec)
&lt;/code>&lt;/pre>
&lt;h3 id="32-layer--tensor-fusion">3.2. Layer &amp;amp; Tensor Fusion&lt;/h3>
&lt;p>TensorRT intelligently merges multiple independent layers into a single, more complex layer. This reduces the number of CUDA kernel launches and memory reads/writes, significantly lowering latency.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Vertical Fusion&lt;/strong>: Merges consecutive layers that form a sequential chain (such as Conv, Bias, ReLU) into a single CBR layer.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Before Fusion&amp;quot;
A[Input] --&amp;gt; B(Conv);
B --&amp;gt; C(Bias);
C --&amp;gt; D(ReLU);
D --&amp;gt; E[Output];
end
subgraph &amp;quot;After Fusion&amp;quot;
A2[Input] --&amp;gt; F((Conv + Bias + ReLU));
F --&amp;gt; E2[Output];
end
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Horizontal Fusion&lt;/strong>: Merges parallel layers that have the same input but perform different operations.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Before Fusion&amp;quot;
A[Input] --&amp;gt; B(Conv A);
A --&amp;gt; C(Conv B);
B --&amp;gt; D[Output A];
C --&amp;gt; E[Output B];
end
subgraph &amp;quot;After Fusion&amp;quot;
A2[Input] --&amp;gt; F((Conv A + Conv B));
F --&amp;gt; D2[Output A];
F --&amp;gt; E2[Output B];
end
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
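&lt;p>A toy sketch of what vertical fusion does to a computation graph, assuming a minimal list-of-ops representation (TensorRT's real graph rewriting operates on a full dataflow graph with many fusion patterns):&lt;/p>

```python
# Toy vertical fusion: collapse Conv -> Bias -> ReLU chains into one "CBR" op.
# A graph here is just an ordered list of op names; real fusion works on a
# dataflow graph with tensors and many more fusion patterns.

FUSABLE_CHAIN = ('Conv', 'Bias', 'ReLU')

def fuse_vertical(ops):
    fused = []
    i = 0
    while i < len(ops):
        if tuple(ops[i:i + 3]) == FUSABLE_CHAIN:
            fused.append('CBR')      # one kernel launch instead of three
            i += 3
        else:
            fused.append(ops[i])
            i += 1
    return fused

graph = ['Conv', 'Bias', 'ReLU', 'Pool', 'Conv', 'Bias', 'ReLU']
print(fuse_vertical(graph))   # ['CBR', 'Pool', 'CBR']
```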
&lt;h3 id="33-kernel-autotuning">3.3. Kernel Auto-Tuning&lt;/h3>
&lt;p>For specific target GPU architectures, TensorRT selects the optimal CUDA kernel for each layer from a library containing multiple implementations. It tests different algorithms and implementations based on the current batch size, input dimensions, and parameters to find the fastest one.&lt;/p>
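&lt;p>The selection process can be mimicked in miniature: benchmark interchangeable implementations of the same operation and keep the fastest. A stdlib-only sketch (the candidate &amp;ldquo;kernels&amp;rdquo; here are ordinary Python functions, not CUDA kernels):&lt;/p>

```python
# Toy kernel auto-tuning: time several implementations of the same op
# and keep the fastest, as TensorRT does per layer during engine build.
import timeit

def sum_loop(xs):
    total = 0.0
    for x in xs:
        total += x
    return total

def sum_builtin(xs):
    return sum(xs)

def autotune(candidates, sample_input, repeats=50):
    """Return the name of the fastest candidate on the sample input."""
    timings = {f.__name__: timeit.timeit(lambda: f(sample_input), number=repeats)
               for f in candidates}
    return min(timings, key=timings.get)

data = list(range(10_000))
best = autotune([sum_loop, sum_builtin], data)
print('selected kernel:', best)
```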
&lt;h3 id="34-dynamic-shapes">3.4. Dynamic Shapes&lt;/h3>
&lt;p>TensorRT can handle models with input tensor dimensions that vary at runtime. When building an Engine, you can specify an optimization profile that includes minimum, optimal, and maximum dimensions for inputs. TensorRT will generate an Engine that can efficiently handle any input dimensions within the specified range.&lt;/p>
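&lt;p>The contract of an optimization profile can be illustrated with a small validity check. In the real API this is expressed via &lt;code>builder.create_optimization_profile()&lt;/code> and &lt;code>IOptimizationProfile.set_shape(name, min, opt, max)&lt;/code>; the stand-in below only models the per-dimension min/max constraint that TensorRT enforces at runtime:&lt;/p>

```python
# Sketch of the optimization-profile contract for dynamic shapes.
# This stand-in only models the "min <= runtime <= max" constraint
# per dimension; it is not TensorRT's actual IOptimizationProfile API.

def shape_in_profile(runtime_shape, min_shape, max_shape):
    """True if every runtime dimension lies within [min, max]."""
    return all(lo <= d <= hi
               for d, lo, hi in zip(runtime_shape, min_shape, max_shape))

# Profile for an NCHW image input: batch 1..32, fixed 3x224x224
MIN, OPT, MAX = (1, 3, 224, 224), (8, 3, 224, 224), (32, 3, 224, 224)

assert shape_in_profile(OPT, MIN, MAX)                    # the "opt" shape
assert shape_in_profile((17, 3, 224, 224), MIN, MAX)      # any batch in range
assert not shape_in_profile((64, 3, 224, 224), MIN, MAX)  # batch too large
```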
&lt;h3 id="35-plugins">3.5. Plugins&lt;/h3>
&lt;p>For custom or special layers not natively supported by TensorRT, you can implement your own logic through the plugin API (&lt;code>IPluginV2&lt;/code>). This provides great extensibility for TensorRT.&lt;/p>
&lt;p>The latest versions of TensorRT have greatly simplified the plugin registration process through decorators, especially for the Python API.&lt;/p>
&lt;pre>&lt;code class="language-python"># Example: Register a simple element-wise addition plugin
import tensorrt.plugin as trtp
@trtp.register(&amp;quot;sample::elemwise_add_plugin&amp;quot;)
def add_plugin_desc(inp0: trtp.TensorDesc, block_size: int) -&amp;gt; trtp.TensorDesc:
    return inp0.like()
&lt;/code>&lt;/pre>
&lt;h3 id="36-sparsity">3.6. Sparsity&lt;/h3>
&lt;p>TensorRT supports structured sparsity on NVIDIA Ampere and later GPU architectures. If your model weights follow a 2:4 sparsity pattern, TensorRT can use the sparse tensor cores to accelerate computation, providing up to roughly 2x math throughput on the affected layers (end-to-end speedups are typically smaller).&lt;/p>
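&lt;p>The 2:4 pattern means that every contiguous group of four weights contains at least two zeros. A quick stdlib-only checker (illustrative; tools such as NVIDIA's ASP library are what actually prune models into this pattern):&lt;/p>

```python
# Check whether a flat weight vector satisfies 2:4 structured sparsity:
# in every contiguous group of 4 values, at least 2 must be exactly zero.

def is_2_4_sparse(weights):
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        if len(group) == 4 and sum(1 for w in group if w == 0.0) < 2:
            return False
    return True

dense  = [0.5, -1.2, 0.3, 0.9,  0.1, 0.4, -0.7, 0.2]
pruned = [0.5,  0.0, 0.0, 0.9,  0.0, 0.4,  0.0, 0.2]

assert not is_2_4_sparse(dense)
assert is_2_4_sparse(pruned)
```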
&lt;h2 id="4-workflow">4. Workflow&lt;/h2>
&lt;p>A typical TensorRT deployment workflow is as follows:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant D as Developer
participant TF as TensorFlow/PyTorch
participant ONNX
participant Poly as Polygraphy
participant TRT as TensorRT (trtexec/API)
participant App as Application
D-&amp;gt;&amp;gt;TF: Train Model
TF--&amp;gt;&amp;gt;D: Generate Trained Model
D-&amp;gt;&amp;gt;ONNX: Export to ONNX Format
ONNX--&amp;gt;&amp;gt;D: .onnx File
D-&amp;gt;&amp;gt;Poly: Use Polygraphy to Check and Optimize
Poly--&amp;gt;&amp;gt;D: Optimized .onnx File
D-&amp;gt;&amp;gt;TRT: Build Engine (FP16/INT8)
TRT--&amp;gt;&amp;gt;D: Generate .engine File
D-&amp;gt;&amp;gt;App: Deploy Engine
App-&amp;gt;&amp;gt;App: Load Engine and Create Execution Context
loop Inference Loop
App-&amp;gt;&amp;gt;App: Prepare Input Data
App-&amp;gt;&amp;gt;App: Execute Inference
App-&amp;gt;&amp;gt;App: Get Output Results
end
&lt;/code>&lt;/pre>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Model Export&lt;/strong>: Export your trained model from your training framework (such as PyTorch or TensorFlow) to ONNX format. ONNX is an open model exchange format that serves as a bridge between training and inference.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Model Inspection and Optimization (Polygraphy)&lt;/strong>: Before building an Engine, it is strongly recommended to use the &lt;strong>Polygraphy&lt;/strong> toolkit to inspect, modify, and optimize your ONNX model. Polygraphy is a powerful tool that can:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Inspect Models&lt;/strong>: Display information about the model's layers, inputs, outputs, etc.&lt;/li>
&lt;li>&lt;strong>Constant Folding&lt;/strong>: Pre-compute constant expressions in the model, simplifying the computation graph.
&lt;pre>&lt;code class="language-bash">polygraphy surgeon sanitize model.onnx -o folded.onnx --fold-constants
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Compare Outputs from Different Frameworks&lt;/strong>: Verify that TensorRT's output is consistent with the original framework (such as ONNX Runtime) to troubleshoot precision issues.
&lt;pre>&lt;code class="language-bash">polygraphy run model.onnx --trt --onnxrt
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Handle Data-Dependent Shapes (DDS)&lt;/strong>: Identify and set upper bounds for tensors with data-dependent shapes.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Build Engine&lt;/strong>: Use the &lt;code>trtexec&lt;/code> command-line tool or TensorRT's C++/Python API to build an Engine.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>trtexec&lt;/code>&lt;/strong>: A convenient command-line tool for quickly building an Engine from an ONNX file and conducting performance benchmarking.
&lt;pre>&lt;code class="language-bash">trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>API&lt;/strong>: Provides more flexible control, such as defining optimization profiles for dynamic shapes, configuring plugins, etc.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Deployment and Inference&lt;/strong>: Load the serialized Engine file into your application and use an Execution Context to perform inference.&lt;/p>
&lt;pre>&lt;code class="language-python"># Using Polygraphy's TrtRunner for inference
from polygraphy.backend.trt import TrtRunner, EngineFromBytes
# Load Engine
engine = EngineFromBytes(open(&amp;quot;model.engine&amp;quot;, &amp;quot;rb&amp;quot;).read())
with TrtRunner(engine) as runner:
    # Prepare input data
    feed_dict = {&amp;quot;input_name&amp;quot;: input_data}
    # Execute inference
    outputs = runner.infer(feed_dict=feed_dict)
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ol>
&lt;h2 id="5-latest-feature-highlights">5. Latest Feature Highlights&lt;/h2>
&lt;p>TensorRT is rapidly iterating, and here are some of the latest important features:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Polygraphy Tool Enhancements&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Simplified CLI Syntax&lt;/strong>: Allows specifying both script and function name in a single parameter (&lt;code>my_script.py:my_func&lt;/code>).&lt;/li>
&lt;li>&lt;strong>Improved Input Specification&lt;/strong>: Uses a new list-style syntax (&lt;code>--input-shapes input0:[x,y,z]&lt;/code>) to avoid ambiguity.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Quickly Deployable Plugins&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>The Python API has introduced the &lt;code>@trtp.register&lt;/code> and &lt;code>@trt.plugin.autotune&lt;/code> decorators, making it unprecedentedly simple to define, register, and auto-tune plugins without writing C++ code.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>CUDA Graphs&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Through the &lt;code>--use-cuda-graph&lt;/code> flag, TensorRT can leverage CUDA Graphs to capture the entire inference process, further reducing CPU overhead and kernel launch latency, particularly suitable for scenarios with fixed model structures.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>FP8 Support&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>On Hopper and higher architecture GPUs, TensorRT supports FP8 inference, providing higher performance and lower memory usage for large language models and other applications.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="6-appendix-common-commands">6. Appendix: Common Commands&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Install Polygraphy&lt;/strong>:
&lt;pre>&lt;code class="language-bash">python3 -m pip install polygraphy --extra-index-url https://pypi.ngc.nvidia.com
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Build and Install TensorRT Open Source Components&lt;/strong>:
&lt;pre>&lt;code class="language-bash"># From source directory
make install
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Run pytest Tests&lt;/strong>:
&lt;pre>&lt;code class="language-bash">pytest --verbose
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h2 id="7-tensorrtllm-born-for-large-language-model-inference">7. TensorRT-LLM: Born for Large Language Model Inference&lt;/h2>
&lt;p>As the scale and complexity of large language models (LLMs) grow exponentially, traditional inference optimization methods face unprecedented challenges. To address these challenges, NVIDIA has introduced TensorRT-LLM, an open-source library specifically designed to accelerate and optimize LLM inference. It is built on top of TensorRT and encapsulates a series of cutting-edge optimization techniques for LLMs.&lt;/p>
&lt;h3 id="71-what-is-tensorrtllm">7.1. What is TensorRT-LLM?&lt;/h3>
&lt;p>TensorRT-LLM can be thought of as an &amp;ldquo;LLM expert version&amp;rdquo; of TensorRT. It provides a Python API that allows developers to easily define LLM models and automatically apply various state-of-the-art optimizations. Ultimately, it generates a high-performance TensorRT engine that can be directly deployed.&lt;/p>
&lt;p>Unlike general TensorRT which mainly handles static graphs, TensorRT-LLM specifically addresses the dynamic characteristics in LLM inference, such as:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Autoregressive Generation&lt;/strong>: Each newly generated token depends on the previous tokens, resulting in dynamically changing input sequence lengths.&lt;/li>
&lt;li>&lt;strong>Enormous Model Scale&lt;/strong>: Model parameters often number in the billions or even hundreds of billions, making it impossible to deploy on a single GPU.&lt;/li>
&lt;li>&lt;strong>Massive KV Cache&lt;/strong>: The inference process requires storing a large number of key-value pairs (Key-Value Cache), placing extremely high demands on memory bandwidth and capacity.&lt;/li>
&lt;/ul>
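&lt;p>The memory pressure from the KV cache is easy to quantify: per token it is 2 (K and V) × layers × KV heads × head dimension × bytes per element. A back-of-the-envelope calculation for a Llama-2-7B-like configuration (32 layers, 32 heads, head dimension 128, FP16):&lt;/p>

```python
# Back-of-the-envelope KV cache size for a Llama-2-7B-like model.
# Per token: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem.

def kv_cache_bytes(seq_len, layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len

per_token = kv_cache_bytes(1)        # 524288 bytes = 0.5 MiB per token
full_ctx  = kv_cache_bytes(4096)     # 2 GiB for a single 4096-token sequence

print(per_token / 2**20, 'MiB per token')
print(full_ctx / 2**30, 'GiB at 4096 tokens')
```

Half a mebibyte per token means a single full-context request already consumes 2 GiB, which is why naive per-request preallocation wastes so much memory.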
&lt;h3 id="72-core-architecture-and-components">7.2. Core Architecture and Components&lt;/h3>
&lt;p>TensorRT-LLM's architecture is divided into frontend and backend:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Python API (&lt;code>tensorrt_llm&lt;/code>)&lt;/strong>: This is the main interface for user interaction. It defines models in a declarative way (similar to PyTorch), allowing developers to avoid dealing with the complex underlying TensorRT C++ API.&lt;/li>
&lt;li>&lt;strong>C++ Backend&lt;/strong>: This is the core that actually performs the optimization, containing pre-written, highly optimized CUDA kernels, LLM-specific optimization passes, and a runtime that can efficiently handle LLM tasks.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Frontend (Python API)&amp;quot;
A[Hugging Face / Custom Model] --&amp;gt;|Weights| B(Model Definition&amp;lt;br&amp;gt;tensorrt_llm.Module);
B --&amp;gt; C{Builder};
C -- Generate Network and Config --&amp;gt; D[Network Definition];
end
subgraph &amp;quot;Backend (C++ Runtime)&amp;quot;
D --&amp;gt; E[TensorRT-LLM Optimization];
E --&amp;gt; F((LLM Optimized Engine));
end
subgraph &amp;quot;Inference&amp;quot;
F --&amp;gt; G[C++/Python Runtime];
H[Input Prompts] --&amp;gt; G;
G --&amp;gt; I[Output Tokens];
end
style F fill:#c9f,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;h3 id="73-key-optimization-techniques-llmspecific">7.3. Key Optimization Techniques (LLM-Specific)&lt;/h3>
&lt;p>The magic of TensorRT-LLM lies in its optimization techniques specifically designed for LLMs.&lt;/p>
&lt;h4 id="731-inflight-batching-also-known-as-continuous-batching">7.3.1. In-Flight Batching (also known as Continuous Batching)&lt;/h4>
&lt;p>&lt;strong>Problem&lt;/strong>: Traditional static batching requires all requests to wait until a batch is formed before processing them together. Due to the varying generation lengths of each request, this leads to significant GPU idle time (&amp;ldquo;bubbles&amp;rdquo;), as the batch must wait for the slowest request to complete.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: In-Flight Batching allows the server to dynamically add new requests while the GPU is running. Once a request completes, its computational resources are immediately released and allocated to new requests in the waiting queue. This greatly improves GPU utilization and overall system throughput.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">gantt
title GPU Utilization Comparison
dateFormat X
axisFormat %S
section Static Batching
Request A: 0, 6
Request B: 0, 3
Request C: 0, 5
GPU Waiting : 3, 3
GPU Waiting : 5, 1
section In-Flight Batching
Request A : 0, 6
Request B : 0, 3
Request C : 0, 5
New Request D : 3, 4
&lt;/code>&lt;/pre>
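&lt;p>A toy simulation makes the throughput argument concrete. Assume a GPU with a fixed number of concurrent slots and requests with known generation lengths (real schedulers operate token by token and are far more involved):&lt;/p>

```python
# Toy comparison of static vs. in-flight (continuous) batching.
# Each request needs `length` time steps; the GPU runs `slots` requests at once.
import heapq

def static_batching_makespan(lengths, slots):
    """Batches form up front; each batch ends when its slowest request ends."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total

def inflight_batching_makespan(lengths, slots):
    """A finishing request immediately frees its slot for the next in queue."""
    finish = [0] * slots              # per-slot next-free time
    heapq.heapify(finish)
    for length in lengths:
        start = heapq.heappop(finish)
        heapq.heappush(finish, start + length)
    return max(finish)

lengths = [6, 3, 5, 4]                # generation lengths of requests A..D
print(static_batching_makespan(lengths, slots=3))    # 10 (6 for A..C, then 4 for D)
print(inflight_batching_makespan(lengths, slots=3))  # 7  (D reuses B's slot at t=3)
```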
&lt;h4 id="732-paged-kv-cache--attention">7.3.2. Paged KV Cache &amp;amp; Attention&lt;/h4>
&lt;p>&lt;strong>Problem&lt;/strong>: In the autoregressive generation process, the KV cache grows linearly with sequence length, consuming large amounts of GPU memory. The traditional approach is to pre-allocate a continuous memory block for each request that can accommodate the maximum sequence length, leading to severe memory fragmentation and waste.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: Inspired by operating system virtual memory paging, TensorRT-LLM introduced Paged KV Cache. It divides the KV cache into fixed-size &amp;ldquo;blocks&amp;rdquo; and allocates them as needed.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Non-contiguous Storage&lt;/strong>: KV caches for logically continuous tokens can be stored in physically non-contiguous blocks.&lt;/li>
&lt;li>&lt;strong>Memory Sharing&lt;/strong>: For complex scenarios (such as parallel sampling, Beam Search), different sequences can share the same KV cache blocks (e.g., sharing the cache for the prompt portion), significantly saving memory.&lt;/li>
&lt;li>&lt;strong>Optimized Attention Kernels&lt;/strong>: TensorRT-LLM uses specially optimized Attention kernels such as FlashAttention and MQA/GQA that can directly operate on these non-contiguous cache blocks, avoiding data copy overhead.&lt;/li>
&lt;/ul>
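&lt;p>The paging idea can be sketched with a minimal block allocator. This is hypothetical and much simpler than TensorRT-LLM's actual KV cache manager, but it shows the two key properties: blocks are allocated on demand at page boundaries, and a finished request returns its blocks to the pool immediately:&lt;/p>

```python
# Minimal paged KV-cache allocator sketch: fixed-size blocks, a free list,
# and a per-sequence block table mapping logical pages to physical blocks.

class PagedKVCache:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating at page boundaries."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # crossed a page boundary
            if not self.free_blocks:
                raise MemoryError('KV cache exhausted')
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1

    def free(self, seq_id):
        """Request finished: return its blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token('seq-A')              # 3 tokens -> 2 blocks
print(cache.block_tables['seq-A'])           # physically non-contiguous is fine
cache.free('seq-A')
print(len(cache.free_blocks))                # all 4 blocks back in the pool
```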
&lt;h4 id="733-tensor--pipeline-parallelism">7.3.3. Tensor &amp;amp; Pipeline Parallelism&lt;/h4>
&lt;p>For large models that cannot fit on a single GPU, TensorRT-LLM has built-in seamless support for tensor parallelism and pipeline parallelism. Developers only need to specify the parallelism degree (&lt;code>tp_size&lt;/code>, &lt;code>pp_size&lt;/code>) during building, and TensorRT-LLM will automatically handle model splitting and cross-GPU communication.&lt;/p>
&lt;pre>&lt;code class="language-bash"># Example: Build a Llama model with 2-way tensor parallelism
python3 examples/llama/convert_checkpoint.py \
--model_dir ./llama-7b-hf \
--output_dir ./tllm_checkpoint_tp2 \
--dtype float16 \
--tp_size 2
&lt;/code>&lt;/pre>
&lt;h4 id="734-advanced-quantization-support-fp8int4int8">7.3.4. Advanced Quantization Support (FP8/INT4/INT8)&lt;/h4>
&lt;p>The enormous parameter count of LLMs makes them ideal candidates for quantization. TensorRT-LLM supports various advanced quantization schemes:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>FP8&lt;/strong>: On NVIDIA Hopper and higher architecture GPUs, FP8 provides precision close to FP16 while significantly improving performance and reducing memory usage.&lt;/li>
&lt;li>&lt;strong>INT8 SmoothQuant&lt;/strong>: A technique that quantizes both activations and weights, achieving INT8 acceleration while maintaining high precision.&lt;/li>
&lt;li>&lt;strong>INT4/INT8 Weight-Only Quantization (W4A16/W8A16)&lt;/strong>: This is a very popular technique that only quantizes model weights (the largest part of parameters) to INT4 or INT8, while keeping activations in FP16. This greatly reduces memory usage with minimal impact on accuracy.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-bash"># Example: Build a model with INT4 weight-only quantization
python convert_checkpoint.py --model_dir ./gpt-j-6b \
--dtype float16 \
--use_weight_only \
--weight_only_precision int4 \
--output_dir ./trt_ckpt/gptj_int4wo_tp1/
&lt;/code>&lt;/pre>
&lt;h3 id="74-tensorrtllm-workflow">7.4. TensorRT-LLM Workflow&lt;/h3>
&lt;p>A typical TensorRT-LLM workflow is as follows:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant D as Developer
participant HF as Hugging Face Hub
participant Conv as convert_checkpoint.py
participant Build as trtllm-build
participant App as Inference Application (Python/C++)
D-&amp;gt;&amp;gt;HF: Download Model Weights
HF--&amp;gt;&amp;gt;D: model_dir
D-&amp;gt;&amp;gt;Conv: Run Conversion Script (Specify Precision, Parallelism, etc.)
Conv--&amp;gt;&amp;gt;D: Generate TensorRT-LLM Checkpoint
D-&amp;gt;&amp;gt;Build: Run Build Command (Specify Plugins, BatchSize, etc.)
Build--&amp;gt;&amp;gt;D: Generate Optimized .engine File
D-&amp;gt;&amp;gt;App: Load Engine and Run Inference
App--&amp;gt;&amp;gt;D: Return Generation Results
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>End-to-End Example (Using Llama-7B)&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Convert Weights&lt;/strong>:
&lt;pre>&lt;code class="language-bash">git clone https://huggingface.co/meta-llama/Llama-2-7b-hf
python3 examples/llama/convert_checkpoint.py \
--model_dir ./Llama-2-7b-hf \
--output_dir ./tllm_checkpoint_1gpu \
--dtype float16
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Build Engine&lt;/strong>:
&lt;pre>&lt;code class="language-bash">trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu \
--output_dir ./trt_engines/llama_7b \
--gpt_attention_plugin float16 \
--gemm_plugin float16
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Run Inference&lt;/strong>:
&lt;pre>&lt;code class="language-bash">python3 examples/run.py --max_output_len=100 \
--tokenizer_dir ./Llama-2-7b-hf \
--engine_dir=./trt_engines/llama_7b
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ol>
&lt;h3 id="75-convenient-highlevel-api-llm">7.5. Convenient High-Level API (&lt;code>LLM&lt;/code>)&lt;/h3>
&lt;p>To further simplify the development process, TensorRT-LLM provides a high-level API called &lt;code>LLM&lt;/code>. This interface encapsulates model loading, building, saving, and inference into a simple class, allowing developers to complete all operations in just a few lines of code.&lt;/p>
&lt;pre>&lt;code class="language-python">from tensorrt_llm import LLM
# 1. Initialize LLM object, if the engine doesn't exist, it will automatically build from HuggingFace model
# All optimizations like In-Flight Batching, Paged KV-Cache will be applied here
llm = LLM(
model=&amp;quot;meta-llama/Llama-2-7b-hf&amp;quot;,
tensor_parallel_size=1,
)
# 2. (Optional) Save the built engine for later use
llm.save(&amp;quot;llama_engine_dir&amp;quot;)
# 3. Run inference
prompt = &amp;quot;NVIDIA TensorRT-LLM is&amp;quot;
for output in llm.generate([prompt], max_new_tokens=50):
print(output)
&lt;/code>&lt;/pre>
&lt;p>This high-level API is ideal for rapid prototyping and deployment.&lt;/p>
&lt;h3 id="76-conclusion">7.6. Conclusion&lt;/h3>
&lt;p>TensorRT-LLM is not simply applying TensorRT to LLMs, but a comprehensive solution fundamentally redesigned for LLM inference, containing multiple state-of-the-art optimizations. Through In-Flight Batching, Paged KV-Cache, native parallel support, and advanced quantization schemes, it can maximize the hardware performance of NVIDIA GPUs, providing a solid foundation for deploying high-performance, high-throughput LLM services.&lt;/p></description></item><item><title>Modern ASR Technology Analysis: From Traditional Models to LLM-Driven New Paradigms</title><link>https://ziyanglin.netlify.app/en/post/asr-technology-overview/</link><pubDate>Sat, 28 Jun 2025 13:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/asr-technology-overview/</guid><description>&lt;h2 id="1-background">1. Background&lt;/h2>
&lt;h3 id="11-pain-points-of-traditional-asr-models">1.1 Pain Points of Traditional ASR Models&lt;/h3>
&lt;p>Traditional Automatic Speech Recognition (ASR) models, such as those based on Hidden Markov Models-Gaussian Mixture Models (HMM-GMM) or Deep Neural Networks (DNN), perform well in specific domains and controlled environments but face numerous challenges:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Data Sparsity&lt;/strong>: Heavy dependence on large-scale, high-quality labeled datasets, resulting in poor generalization to low-resource languages or specific accents.&lt;/li>
&lt;li>&lt;strong>Insufficient Robustness&lt;/strong>: Performance drops dramatically in noisy environments, far-field audio capture, multi-person conversations, and other real-world scenarios.&lt;/li>
&lt;li>&lt;strong>Lack of Contextual Understanding&lt;/strong>: Models are typically limited to direct mapping from acoustic features to text, lacking understanding of long-range context, semantics, and speaker intent, leading to recognition errors (such as homophone confusion).&lt;/li>
&lt;li>&lt;strong>Limited Multi-task Capabilities&lt;/strong>: Traditional models are usually single-task oriented, supporting only speech transcription without simultaneously handling speaker diarization, language identification, translation, and other tasks.&lt;/li>
&lt;/ol>
&lt;h3 id="12-large-language-model-llm-driven-asr-new-paradigm">1.2 Large Language Model (LLM) Driven ASR New Paradigm&lt;/h3>
&lt;p>In recent years, end-to-end large ASR models represented by &lt;code>Whisper&lt;/code> have demonstrated unprecedented robustness and generalization capabilities through pretraining on massive, diverse unsupervised or weakly supervised data. These models typically adopt an Encoder-Decoder architecture, treating ASR as a sequence-to-sequence translation problem.&lt;/p>
&lt;p>&lt;strong>Typical Process&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Raw Audio Waveform&amp;quot;] --&amp;gt; B[&amp;quot;Feature Extraction (e.g., Log-Mel Spectrogram)&amp;quot;]
B --&amp;gt; C[&amp;quot;Transformer Encoder&amp;quot;]
C --&amp;gt; D[&amp;quot;Latent Representation&amp;quot;]
D --&amp;gt; E[&amp;quot;Transformer Decoder&amp;quot;]
E --&amp;gt; F[&amp;quot;Text Sequence Output&amp;quot;]
&lt;/code>&lt;/pre>
&lt;p>This approach not only simplifies the complex pipeline of traditional ASR but also learns rich acoustic and linguistic knowledge through large-scale data, enabling excellent performance even in zero-shot scenarios.&lt;/p>
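&lt;p>The decoder side of this sequence-to-sequence formulation is an autoregressive loop: each step feeds the tokens generated so far back into the model to pick the next one. A stub-model sketch of greedy decoding (the &lt;code>score&lt;/code> function here is a hypothetical stand-in for a real decoder's next-token prediction):&lt;/p>

```python
# Greedy autoregressive decoding sketch for an encoder-decoder ASR model.
# `score` is a toy stand-in for the real decoder's next-token distribution.

BOS, EOS = '<bos>', '<eos>'

def score(encoded_audio, prefix):
    """Toy 'decoder': spells out a canned transcript, then emits EOS."""
    transcript = encoded_audio          # pretend the encoder output IS the words
    step = len(prefix) - 1              # tokens generated so far (minus BOS)
    return transcript[step] if step < len(transcript) else EOS

def greedy_decode(encoded_audio, max_len=20):
    tokens = [BOS]
    while len(tokens) < max_len:
        nxt = score(encoded_audio, tokens)
        if nxt == EOS:
            break
        tokens.append(nxt)
    return tokens[1:]

print(greedy_decode(['hello', 'world']))   # ['hello', 'world']
```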
&lt;h2 id="2-analysis-of-asr-model-solutions">2. Analysis of ASR Model Solutions&lt;/h2>
&lt;h3 id="21-whisperlargev3turbo">2.1 Whisper-large-v3-turbo&lt;/h3>
&lt;p>&lt;code>Whisper&lt;/code> is a pretrained ASR model developed by OpenAI, with its &lt;code>large-v3&lt;/code> and &lt;code>large-v3-turbo&lt;/code> versions being among the industry-leading models.&lt;/p>
&lt;h4 id="211-whisper-design">2.1.1 Whisper Design&lt;/h4>
&lt;p>&lt;strong>Structural Modules&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Audio Input (30s segment)&amp;quot;] --&amp;gt; B[&amp;quot;Log-Mel Spectrogram&amp;quot;]
B --&amp;gt; C[&amp;quot;Transformer Encoder&amp;quot;]
C --&amp;gt; D[&amp;quot;Encoded Representation&amp;quot;]
D --&amp;gt; E[&amp;quot;Transformer Decoder&amp;quot;]
E --&amp;gt; F[&amp;quot;Predicted Text Tokens&amp;quot;]
subgraph &amp;quot;Multi-task Processing&amp;quot;
E --&amp;gt; G[&amp;quot;Transcription&amp;quot;]
E --&amp;gt; H[&amp;quot;Translation&amp;quot;]
E --&amp;gt; I[&amp;quot;Language Identification&amp;quot;]
end
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Features&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Large-scale Weakly Supervised Training&lt;/strong>: Trained on 680,000 hours of multilingual, multi-task data, covering a wide range of accents, background noise, and technical terminology.&lt;/li>
&lt;li>&lt;strong>End-to-end Architecture&lt;/strong>: A unified Transformer model directly maps audio to text, without requiring external language models or alignment modules.&lt;/li>
&lt;li>&lt;strong>Multi-task Capability&lt;/strong>: The model can simultaneously handle multilingual speech transcription, speech translation, and language identification.&lt;/li>
&lt;li>&lt;strong>Robustness&lt;/strong>: Through carefully designed data augmentation and mixing, the model performs excellently under various challenging conditions.&lt;/li>
&lt;li>&lt;strong>Turbo Version&lt;/strong>: &lt;code>large-v3-turbo&lt;/code> is a reduced-decoder variant of &lt;code>large-v3&lt;/code> with roughly 800M parameters (versus 1.55B), trading a small amount of accuracy for substantially faster inference.&lt;/li>
&lt;/ul>
&lt;h4 id="212-problems-solved">2.1.2 Problems Solved&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target Problem&lt;/th>
&lt;th>Whisper's Solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Poor Generalization&lt;/td>
&lt;td>Large-scale pretraining on massive, diverse datasets covering nearly a hundred languages.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Insufficient Robustness&lt;/td>
&lt;td>Training data includes various background noise, accents, and speaking styles, enhancing performance in real-world scenarios.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Weak Contextual Modeling&lt;/td>
&lt;td>Transformer architecture captures long-range dependencies in audio signals.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Complex Deployment&lt;/td>
&lt;td>Provides multiple model sizes (from &lt;code>tiny&lt;/code> to &lt;code>large&lt;/code>), with open-sourced code and model weights, facilitating community use and deployment.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="213-production-defect-analysis">2.1.3 Production Defect Analysis&lt;/h4>
&lt;h5 id="2131-hallucination-issues">2.1.3.1 Hallucination Issues&lt;/h5>
&lt;ul>
&lt;li>In segments with no speech or noise, the model sometimes generates meaningless or repetitive text, a common issue with large autoregressive models.&lt;/li>
&lt;li>This phenomenon is particularly noticeable in long audio processing and may require additional post-processing logic for detection and filtering.&lt;/li>
&lt;/ul>
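&lt;p>A common filtering heuristic exploits the fact that repetitive hallucinated text compresses far better than genuine transcripts. The open-source Whisper decoder itself rejects candidates using a gzip compression-ratio threshold (default 2.4); the standalone sketch below applies the same idea as a post-processing check:&lt;/p>

```python
import zlib

def compression_ratio(text: str) -> float:
    """Ratio of raw UTF-8 length to zlib-compressed length.
    Repetitive hallucinated text compresses far better than a
    normal speech transcript, so a high ratio is a red flag."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

def looks_hallucinated(text: str, threshold: float = 2.4) -> bool:
    # 2.4 mirrors the default compression_ratio_threshold in the
    # open-source Whisper decoder; tune it for your own data.
    return compression_ratio(text) > threshold

print(looks_hallucinated("Thanks for watching! " * 20))              # True
print(looks_hallucinated("The meeting is scheduled for Tuesday."))   # False
```

&lt;p>In practice this check is usually combined with the model's average log-probability and a VAD gate, since short hallucinations can still slip below any single threshold.&lt;/p>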
&lt;h5 id="2132-limited-timestamp-precision">2.1.3.2 Limited Timestamp Precision&lt;/h5>
&lt;ul>
&lt;li>The model natively predicts segment-level timestamps; word-level timestamps must be derived separately (e.g., from cross-attention alignment), and their precision may not meet the stringent requirements of applications such as subtitle alignment or speech editing.&lt;/li>
&lt;li>Timestamp accuracy decreases during long periods of silence or rapid speech flow.&lt;/li>
&lt;/ul>
&lt;h5 id="2133-high-computational-resource-requirements">2.1.3.3 High Computational Resource Requirements&lt;/h5>
&lt;ul>
&lt;li>The &lt;code>large-v3&lt;/code> model contains 1.55 billion parameters, and the &lt;code>turbo&lt;/code> version has nearly 800 million parameters, demanding significant computational resources (especially GPU memory), making it unsuitable for direct execution on edge devices.&lt;/li>
&lt;li>Although optimization techniques like quantization exist, balancing performance while reducing resource consumption remains a challenge.&lt;/li>
&lt;/ul>
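&lt;p>A quick back-of-the-envelope calculation makes the memory pressure concrete: weights alone need parameter count times bytes per parameter, before activations, KV-cache, and framework overhead. Using the parameter counts cited above:&lt;/p>

```python
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Rough GPU memory needed just for the model weights.
    Activations, KV-cache, and framework overhead come on top."""
    return n_params * bytes_per_param / 1024**3

for name, params in [("large-v3", 1.55e9), ("large-v3-turbo", 0.798e9)]:
    for dtype, nbytes in [("fp16", 2), ("int8", 1)]:
        print(f"{name} {dtype}: ~{weight_memory_gib(params, nbytes):.1f} GiB")
```

&lt;p>So &lt;code>large-v3&lt;/code> needs roughly 2.9 GiB for fp16 weights alone, which int8 quantization halves; this is why edge deployment typically relies on the smaller checkpoints or aggressive quantization.&lt;/p>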
&lt;h5 id="2134-realtime-processing-bottlenecks">2.1.3.4 Real-time Processing Bottlenecks&lt;/h5>
&lt;ul>
&lt;li>The model processes 30-second audio windows, requiring complex sliding window and caching mechanisms for real-time streaming ASR scenarios, which introduces additional latency.&lt;/li>
&lt;/ul>
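&lt;p>The fixed 30-second context (480,000 samples at Whisper's 16 kHz input rate) is documented model behavior; the overlap-and-merge strategy below is a simplified sketch of what streaming wrappers typically build around it:&lt;/p>

```python
SAMPLE_RATE = 16_000          # Whisper operates on 16 kHz mono audio
WINDOW = 30 * SAMPLE_RATE     # fixed 30-second context (480,000 samples)

def sliding_windows(n_samples: int, overlap_s: float = 5.0):
    """Yield (start, end) sample offsets covering the audio with
    overlapping 30 s windows, as a streaming wrapper might.
    The overlap lets transcripts of adjacent windows be merged
    and deduplicated, at the cost of redundant computation."""
    hop = WINDOW - int(overlap_s * SAMPLE_RATE)
    start = 0
    while True:
        end = min(start + WINDOW, n_samples)
        yield start, end
        if end == n_samples:
            break
        start += hop

# 70 s of audio -> three overlapping windows
for s, e in sliding_windows(70 * SAMPLE_RATE):
    print(s / SAMPLE_RATE, e / SAMPLE_RATE)
```

&lt;p>Each window still incurs a full 30-second forward pass, which is the latency bottleneck the bullet above refers to: true streaming requires either chunk-wise encoders or caching schemes the base model was not trained for.&lt;/p>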
&lt;h3 id="22-sensevoice">2.2 SenseVoice&lt;/h3>
&lt;p>&lt;code>SenseVoice&lt;/code> is a next-generation industrial-grade ASR model developed by Alibaba DAMO Academy's speech team. Unlike &lt;code>Whisper&lt;/code>, which focuses on robust general transcription, &lt;code>SenseVoice&lt;/code> emphasizes multi-functionality, real-time processing, and integration with downstream tasks.&lt;/p>
&lt;h4 id="221-sensevoice-design">2.2.1 SenseVoice Design&lt;/h4>
&lt;p>&lt;strong>Structural Modules&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Audio Stream&amp;quot;] --&amp;gt; B[&amp;quot;FSMN-VAD (Voice Activity Detection)&amp;quot;]
B --&amp;gt; C[&amp;quot;Encoder (e.g., SAN-M)&amp;quot;]
C --&amp;gt; D[&amp;quot;Latent Representation&amp;quot;]
D --&amp;gt; E[&amp;quot;Decoder&amp;quot;]
E --&amp;gt; F[&amp;quot;Text Output&amp;quot;]
subgraph &amp;quot;Multi-task and Control&amp;quot;
G[&amp;quot;Speaker Diarization&amp;quot;] --&amp;gt; C
H[&amp;quot;Emotion Recognition&amp;quot;] --&amp;gt; C
I[&amp;quot;Zero-shot TTS Prompt&amp;quot;] --&amp;gt; E
end
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Features&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Unified End-to-end Model&lt;/strong>: Integrates acoustic model, language model, and punctuation prediction, achieving end-to-end output from speech to punctuated text.&lt;/li>
&lt;li>&lt;strong>Multi-task Learning&lt;/strong>: The model not only performs speech recognition but also simultaneously outputs speaker diarization, emotional information, and can even generate acoustic prompts for zero-shot TTS.&lt;/li>
&lt;li>&lt;strong>Streaming and Non-streaming Integration&lt;/strong>: Supports both streaming and non-streaming modes through a unified architecture, meeting the needs of real-time and offline scenarios.&lt;/li>
&lt;li>&lt;strong>TTS Integration&lt;/strong>: One innovation of &lt;code>SenseVoice&lt;/code> is that its output can serve as a prompt for TTS models like &lt;code>CosyVoice&lt;/code>, enabling voice cloning and transfer, closing the loop between ASR and TTS.&lt;/li>
&lt;/ul>
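&lt;p>To make the VAD's gating role in the diagram concrete: production SenseVoice pipelines use the trained FSMN-VAD model, but a toy energy-based gate is enough to illustrate the idea that silent frames are dropped before they ever reach the encoder, which is what keeps streaming latency and compute low:&lt;/p>

```python
def energy_vad(frames, threshold=0.01):
    """Toy energy-based voice-activity gate.

    Stand-in for a trained VAD such as FSMN-VAD, purely to illustrate
    the gating role VAD plays in front of the ASR encoder: frames whose
    mean energy falls below the threshold are discarded."""
    voiced = []
    for frame in frames:
        energy = sum(x * x for x in frame) / len(frame)
        if energy >= threshold:
            voiced.append(frame)
    return voiced

speech = [0.5, -0.4, 0.6, -0.5]       # loud frame, kept
silence = [0.001, -0.002, 0.001, 0.0] # near-silent frame, dropped
print(len(energy_vad([silence, speech, silence])))  # 1
```

&lt;p>A fixed energy threshold breaks down under noise, which is exactly why SenseVoice pairs the encoder with a learned VAD rather than a heuristic like this one.&lt;/p>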
&lt;h4 id="222-problems-solved">2.2.2 Problems Solved&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target Problem&lt;/th>
&lt;th>SenseVoice's Solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Single-task Limitation, Integration Difficulties&lt;/td>
&lt;td>Designed as a multi-task model, natively supporting speaker diarization, emotion recognition, etc., simplifying dialogue system construction.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Poor Real-time Performance&lt;/td>
&lt;td>Adopts efficient streaming architecture (such as SAN-M), combined with VAD, achieving low-latency real-time recognition.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Lack of Coordination with Downstream Tasks&lt;/td>
&lt;td>Output includes rich meta-information (such as speaker, emotion) and can generate TTS prompts, achieving deep integration between ASR and TTS.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Punctuation Restoration Dependent on Post-processing&lt;/td>
&lt;td>Incorporates punctuation prediction as a built-in task, achieving joint modeling of text and punctuation.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
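&lt;p>SenseVoice-style models deliver the rich meta-information from the table above as special tags emitted inline with the transcript, which downstream consumers must separate from the text. The tag names in this sketch (language, emotion, audio-event markers) are illustrative examples of the format, not an exhaustive or authoritative inventory:&lt;/p>

```python
import re

TAG = re.compile(r"<\|([^|]+)\|>")

def split_rich_transcript(raw: str):
    """Separate inline meta tags from the transcript text.

    SenseVoice-style models emit meta-information (language, emotion,
    audio events) as special tokens interleaved with the text; this
    helper collects the tags and returns the clean transcript."""
    tags = TAG.findall(raw)
    text = TAG.sub("", raw).strip()
    return tags, text

raw = "<|en|><|HAPPY|><|Speech|>nice to finally meet you"
tags, text = split_rich_transcript(raw)
print(tags)   # ['en', 'HAPPY', 'Speech']
print(text)   # nice to finally meet you
```

&lt;p>Keeping the tags machine-readable like this is what lets a dialogue system route on emotion or speaker without a second model pass.&lt;/p>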
&lt;h4 id="223-production-defect-analysis">2.2.3 Production Defect Analysis&lt;/h4>
&lt;h5 id="2231-model-complexity-and-maintenance">2.2.3.1 Model Complexity and Maintenance&lt;/h5>
&lt;ul>
&lt;li>As a complex model integrating multiple functions, its training and maintenance costs are relatively high.&lt;/li>
&lt;li>Balancing multiple tasks may require fine-tuning to avoid performance degradation in any single task.&lt;/li>
&lt;/ul>
&lt;h5 id="2232-generalization-of-zeroshot-capabilities">2.2.3.2 Generalization of Zero-shot Capabilities&lt;/h5>
&lt;ul>
&lt;li>Although it supports zero-shot TTS prompt generation, its voice cloning effect and stability when facing unseen speakers or complex acoustic environments may not match specialized voice cloning models.&lt;/li>
&lt;/ul>
&lt;h5 id="2233-opensource-ecosystem-and-community">2.2.3.3 Open-source Ecosystem and Community&lt;/h5>
&lt;ul>
&lt;li>Compared to &lt;code>Whisper&lt;/code>'s strong open-source community and rich ecosystem tools, &lt;code>SenseVoice&lt;/code>, as an industrial-grade model, may have limited open-source availability and community support, affecting its popularity in academic and developer communities.&lt;/li>
&lt;/ul>
&lt;h2 id="3-conclusion">3. Conclusion&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Whisper&lt;/strong>: Through large-scale weakly supervised learning, it has pushed the robustness and generalization capabilities of ASR to new heights. It is a powerful &lt;strong>general-purpose speech recognizer&lt;/strong>, particularly suitable for processing diverse, uncontrolled audio data. Its design philosophy is &amp;ldquo;trading scale for performance,&amp;rdquo; excelling in zero-shot and multilingual scenarios.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>SenseVoice&lt;/strong>: Represents the trend of ASR technology developing towards &lt;strong>multi-functionality and integration&lt;/strong>. It is not just a recognizer but a &lt;strong>perceptual frontend for conversational intelligence&lt;/strong>, aimed at providing richer, more real-time input for downstream tasks (such as dialogue systems, TTS). Its design philosophy is &amp;ldquo;fusion and collaboration,&amp;rdquo; emphasizing ASR's pivotal role in the entire intelligent interaction chain.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>In summary, &lt;code>Whisper&lt;/code> defines the performance baseline for modern ASR, while &lt;code>SenseVoice&lt;/code> explores broader possibilities for ASR in industrial applications. Future ASR technology may develop towards combining the strengths of both: having both the robustness and generalization capabilities of &lt;code>Whisper&lt;/code> and the multi-task collaboration and real-time processing capabilities of &lt;code>SenseVoice&lt;/code>.&lt;/p></description></item></channel></rss>