<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Model Inference | Ziyang Lin</title><link>https://ziyanglin.netlify.app/en/tags/model-inference/</link><atom:link href="https://ziyanglin.netlify.app/en/tags/model-inference/index.xml" rel="self" type="application/rss+xml"/><description>Model Inference</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Mon, 30 Jun 2025 06:00:00 +0000</lastBuildDate><image><url>https://ziyanglin.netlify.app/img/icon-192.png</url><title>Model Inference</title><link>https://ziyanglin.netlify.app/en/tags/model-inference/</link></image><item><title>TensorRT In-Depth: High-Performance Deep Learning Inference Engine</title><link>https://ziyanglin.netlify.app/en/post/tensorrt-documentation/</link><pubDate>Mon, 30 Jun 2025 06:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/tensorrt-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>NVIDIA® TensorRT™ is a software development kit (SDK) for high-performance deep learning inference on NVIDIA GPUs. It is designed to optimize and accelerate trained neural networks, enabling them to run in production environments with low latency and high throughput. TensorRT takes models from mainstream deep learning frameworks (such as TensorFlow and PyTorch, typically imported via the ONNX format), applies a series of sophisticated optimization techniques, and generates a highly optimized runtime engine.&lt;/p>
&lt;p>This document will provide an in-depth yet accessible introduction to TensorRT's core concepts, key features, workflow, and latest functionalities (including TensorRT-LLM specifically designed for accelerating large language models), helping developers fully leverage its powerful performance advantages.&lt;/p>
&lt;h2 id="2-core-concepts">2. Core Concepts&lt;/h2>
&lt;p>Understanding TensorRT's core components is the first step to using it effectively.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Engine&lt;/strong>: The core of TensorRT. It is an optimized model representation that includes a computation graph and weights generated for a specific GPU architecture and configuration (such as batch size, precision). The Engine is immutable and is the final product for deployment.&lt;/li>
&lt;li>&lt;strong>Builder (&lt;code>IBuilder&lt;/code>)&lt;/strong>: This is the main interface for creating an Engine. The Builder takes a network definition and applies various optimizations, ultimately generating an optimized plan for the target GPU, which can be serialized into an Engine.&lt;/li>
&lt;li>&lt;strong>Network Definition (&lt;code>INetworkDefinition&lt;/code>)&lt;/strong>: This is where you define the model structure. You can build the network manually from scratch or import it from a model file using a Parser.&lt;/li>
&lt;li>&lt;strong>Parser&lt;/strong>: Used to parse models from different frameworks (primarily ONNX format) and convert them into TensorRT's network definition. TensorRT provides a powerful ONNX parser.&lt;/li>
&lt;li>&lt;strong>Profiler (&lt;code>IProfiler&lt;/code>)&lt;/strong>: An optional interface that collects per-layer timing information as an execution context runs inference. This helps with debugging and understanding which layers are performance bottlenecks.&lt;/li>
&lt;li>&lt;strong>Execution Context (&lt;code>IExecutionContext&lt;/code>)&lt;/strong>: This is the main interface for executing inference. An Engine can have multiple Execution Contexts, allowing concurrent execution of inference tasks. Each context maintains its own inputs, outputs, and state.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Model Building Offline&amp;quot;
A[Original Model&amp;lt;br&amp;gt;TensorFlow/PyTorch] --&amp;gt; B{ONNX Parser};
B --&amp;gt; C[Network Definition];
C --&amp;gt; D[Builder];
D -- Optimization Config --&amp;gt; E[Optimized Plan];
E --&amp;gt; F((Engine));
end
subgraph &amp;quot;Inference Deployment Online&amp;quot;
F --&amp;gt; G[Execution Context];
H[Input Data] --&amp;gt; G;
G --&amp;gt; I[Output Results];
end
style F fill:#f9f,stroke:#333,stroke-width:2px
style G fill:#ccf,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;h2 id="3-key-features-and-optimization-techniques">3. Key Features and Optimization Techniques&lt;/h2>
&lt;p>TensorRT's high performance stems from its advanced optimization techniques.&lt;/p>
&lt;h3 id="31-precision-calibration--quantization">3.1. Precision Calibration &amp;amp; Quantization&lt;/h3>
&lt;p>TensorRT supports multiple precisions for inference, including FP32, FP16, INT8, and the latest FP8. Among these, INT8 quantization is a key technology for improving performance and reducing memory usage.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Post-Training Quantization (PTQ)&lt;/strong>: Determines the scaling factors needed to convert FP32 weights and activation values to INT8 through a calibration dataset, without retraining the model.&lt;/li>
&lt;li>&lt;strong>Quantization-Aware Training (QAT)&lt;/strong>: Simulates quantization operations during training, making the model more robust to quantization errors, thus achieving higher accuracy when converted to INT8.&lt;/li>
&lt;/ul>
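&lt;p>The core idea behind PTQ can be sketched in a few lines: a calibration pass records the maximum absolute value observed for a tensor, and that maximum defines a single scale factor mapping FP32 values onto the signed INT8 range. The pure-Python sketch below shows only this basic max-calibration idea; TensorRT's real calibrators (e.g. entropy calibration) are considerably more sophisticated.&lt;/p>
&lt;pre>&lt;code class="language-python"># Illustrative sketch of symmetric INT8 post-training quantization.
# This is NOT TensorRT's actual calibrator, just the underlying idea.

def calibrate_scale(calibration_values):
    # One scale for the whole tensor: amax / 127
    amax = max(abs(v) for v in calibration_values)
    return amax / 127.0

def quantize(x, scale):
    q = round(x / scale)
    return max(-127, min(127, q))   # clamp to the INT8 range

def dequantize(q, scale):
    return q * scale

acts = [0.5, -1.2, 3.0, -2.7]       # pretend calibration data
scale = calibrate_scale(acts)
deq = [dequantize(quantize(v, scale), scale) for v in acts]
&lt;/code>&lt;/pre>
&lt;p>The quantization error per value is bounded by half the scale, which is why the calibration data must be representative of real inputs.&lt;/p>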
&lt;p>You can use &lt;code>QuantizationSpec&lt;/code> to precisely control which layers or types of layers need to be quantized.&lt;/p>
&lt;pre>&lt;code class="language-python"># Example: Only quantize 'Conv2D' type layers
q_spec = QuantizationSpec()
q_spec.add(name='Conv2D', is_keras_class=True)
q_model = quantize_model(model, quantization_mode='partial', quantization_spec=q_spec)
&lt;/code>&lt;/pre>
&lt;h3 id="32-layer--tensor-fusion">3.2. Layer &amp;amp; Tensor Fusion&lt;/h3>
&lt;p>TensorRT intelligently merges multiple independent layers into a single, more complex layer. This reduces the number of CUDA kernel launches and memory reads/writes, significantly lowering latency.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Vertical Fusion&lt;/strong>: Merges consecutive layers with the same data dependencies (such as Conv, Bias, ReLU) into a single CBR layer.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Before Fusion&amp;quot;
A[Input] --&amp;gt; B(Conv);
B --&amp;gt; C(Bias);
C --&amp;gt; D(ReLU);
D --&amp;gt; E[Output];
end
subgraph &amp;quot;After Fusion&amp;quot;
A2[Input] --&amp;gt; F((Conv + Bias + ReLU));
F --&amp;gt; E2[Output];
end
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Horizontal Fusion&lt;/strong>: Merges parallel layers that have the same input but perform different operations.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Before Fusion&amp;quot;
A[Input] --&amp;gt; B(Conv A);
A --&amp;gt; C(Conv B);
B --&amp;gt; D[Output A];
C --&amp;gt; E[Output B];
end
subgraph &amp;quot;After Fusion&amp;quot;
A2[Input] --&amp;gt; F((Conv A + Conv B));
F --&amp;gt; D2[Output A];
F --&amp;gt; E2[Output B];
end
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
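&lt;p>The point of vertical fusion is that the fused kernel is mathematically identical to the separate layers but touches memory once instead of three times. A toy elementwise sketch (scalar stand-ins for the real Conv/Bias/ReLU kernels) makes this concrete:&lt;/p>
&lt;pre>&lt;code class="language-python"># Toy illustration of vertical fusion: three separate passes vs one fused
# pass. In TensorRT this happens at the CUDA-kernel level; here we only show
# that the fused computation is identical with fewer passes over the data.

def unfused(xs, w, b):
    conv = [w * x for x in xs]              # pass 1: "conv" (toy multiply)
    biased = [c + b for c in conv]          # pass 2: bias add
    return [max(0.0, y) for y in biased]    # pass 3: ReLU

def fused(xs, w, b):
    # single pass: conv + bias + ReLU per element (one read, one write)
    return [max(0.0, w * x + b) for x in xs]
&lt;/code>&lt;/pre>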
&lt;h3 id="33-kernel-autotuning">3.3. Kernel Auto-Tuning&lt;/h3>
&lt;p>For specific target GPU architectures, TensorRT selects the optimal CUDA kernel for each layer from a library containing multiple implementations. It tests different algorithms and implementations based on the current batch size, input dimensions, and parameters to find the fastest one.&lt;/p>
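&lt;p>Conceptually, auto-tuning reduces to benchmarking every candidate implementation ("tactic") for the layer's actual shapes and keeping the fastest. The sketch below uses synthetic latencies in place of real CUDA kernel benchmarks:&lt;/p>
&lt;pre>&lt;code class="language-python"># Sketch of tactic selection as TensorRT's builder does conceptually:
# time every candidate kernel for the given shape, keep the fastest.
# The timings below are synthetic stand-ins for measured kernel latencies.

def pick_best_tactic(tactic_times):
    # tactic_times: dict mapping tactic name to measured latency (ms)
    return min(tactic_times, key=tactic_times.get)

measured = {
    'implicit_gemm': 0.42,
    'winograd': 0.31,
    'fft_conv': 0.77,
}
best = pick_best_tactic(measured)   # 'winograd' for these timings
&lt;/code>&lt;/pre>
&lt;p>Because the winner depends on batch size and input dimensions, engines built on one GPU model generally should not be reused on another.&lt;/p>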
&lt;h3 id="34-dynamic-shapes">3.4. Dynamic Shapes&lt;/h3>
&lt;p>TensorRT can handle models with input tensor dimensions that vary at runtime. When building an Engine, you can specify an optimization profile that includes minimum, optimal, and maximum dimensions for inputs. TensorRT will generate an Engine that can efficiently handle any input dimensions within the specified range.&lt;/p>
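&lt;p>At its core, an optimization profile is a per-dimension [min, max] range that runtime shapes must fall within. The helper below is illustrative only, not the real API (the actual calls are &lt;code>IOptimizationProfile.set_shape&lt;/code> on the builder side and &lt;code>set_input_shape&lt;/code> on the execution context):&lt;/p>
&lt;pre>&lt;code class="language-python"># Sketch of the min/opt/max range check behind an optimization profile.
# Function and variable names here are illustrative, not TensorRT API.

def shape_in_profile(shape, min_dims, max_dims):
    return all(
        lo &amp;lt;= d &amp;lt;= hi
        for d, lo, hi in zip(shape, min_dims, max_dims)
    )

profile = {'min': (1, 3, 224, 224), 'opt': (8, 3, 224, 224), 'max': (32, 3, 224, 224)}
ok = shape_in_profile((16, 3, 224, 224), profile['min'], profile['max'])
&lt;/code>&lt;/pre>
&lt;p>The 'opt' shape is the one TensorRT tunes kernels for, so it should match the shape you expect most often in production.&lt;/p>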
&lt;h3 id="35-plugins">3.5. Plugins&lt;/h3>
&lt;p>For custom or special layers not natively supported by TensorRT, you can implement your own logic through the plugin API (&lt;code>IPluginV2&lt;/code>). This provides great extensibility for TensorRT.&lt;/p>
&lt;p>The latest versions of TensorRT have greatly simplified the plugin registration process through decorators, especially for the Python API.&lt;/p>
&lt;pre>&lt;code class="language-python"># Example: Register a simple element-wise addition plugin
import tensorrt.plugin as trtp
@trtp.register(&amp;quot;sample::elemwise_add_plugin&amp;quot;)
def add_plugin_desc(inp0: trtp.TensorDesc, block_size: int) -&amp;gt; trtp.TensorDesc:
    return inp0.like()
&lt;/code>&lt;/pre>
&lt;h3 id="36-sparsity">3.6. Sparsity&lt;/h3>
&lt;p>TensorRT supports leveraging structured sparsity features on NVIDIA Ampere and higher architecture GPUs. If your model weights have a 2:4 sparsity pattern, TensorRT can utilize sparse tensor cores to further accelerate computation, nearly doubling performance.&lt;/p>
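&lt;p>The 2:4 pattern means that in every contiguous group of four weights, at least two must be zero. A small checker makes the constraint concrete (pruning a model into this pattern is normally done with NVIDIA's ASP tooling during training):&lt;/p>
&lt;pre>&lt;code class="language-python"># Check the 2:4 structured sparsity pattern that sparse tensor cores
# require: every contiguous group of 4 weights must contain at least 2 zeros.

def is_2_4_sparse(weights):
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        zeros = sum(1 for w in group if w == 0.0)
        if len(group) == 4 and zeros &amp;lt; 2:
            return False
    return True

dense = [0.3, 0.1, 0.7, 0.2, 0.5, 0.4, 0.6, 0.8]
pruned = [0.3, 0.0, 0.7, 0.0, 0.0, 0.4, 0.0, 0.8]   # valid 2:4 pattern
&lt;/code>&lt;/pre>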
&lt;h2 id="4-workflow">4. Workflow&lt;/h2>
&lt;p>A typical TensorRT deployment workflow is as follows:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant D as Developer
participant TF as TensorFlow/PyTorch
participant ONNX
participant Poly as Polygraphy
participant TRT as TensorRT (trtexec/API)
participant App as Application
D-&amp;gt;&amp;gt;TF: Train Model
TF--&amp;gt;&amp;gt;D: Generate Trained Model
D-&amp;gt;&amp;gt;ONNX: Export to ONNX Format
ONNX--&amp;gt;&amp;gt;D: .onnx File
D-&amp;gt;&amp;gt;Poly: Use Polygraphy to Check and Optimize
Poly--&amp;gt;&amp;gt;D: Optimized .onnx File
D-&amp;gt;&amp;gt;TRT: Build Engine (FP16/INT8)
TRT--&amp;gt;&amp;gt;D: Generate .engine File
D-&amp;gt;&amp;gt;App: Deploy Engine
App-&amp;gt;&amp;gt;App: Load Engine and Create Execution Context
loop Inference Loop
App-&amp;gt;&amp;gt;App: Prepare Input Data
App-&amp;gt;&amp;gt;App: Execute Inference
App-&amp;gt;&amp;gt;App: Get Output Results
end
&lt;/code>&lt;/pre>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Model Export&lt;/strong>: Export your trained model from your training framework (such as PyTorch or TensorFlow) to ONNX format. ONNX is an open model exchange format that serves as a bridge between training and inference.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Model Inspection and Optimization (Polygraphy)&lt;/strong>: Before building an Engine, it is strongly recommended to use the &lt;strong>Polygraphy&lt;/strong> toolkit to inspect, modify, and optimize your ONNX model. Polygraphy is a powerful tool that can:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Inspect Models&lt;/strong>: Display information about the model's layers, inputs, outputs, etc.&lt;/li>
&lt;li>&lt;strong>Constant Folding&lt;/strong>: Pre-compute constant expressions in the model, simplifying the computation graph.
&lt;pre>&lt;code class="language-bash">polygraphy surgeon sanitize model.onnx -o folded.onnx --fold-constants
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Compare Outputs from Different Frameworks&lt;/strong>: Verify that TensorRT's output is consistent with the original framework (such as ONNX Runtime) to troubleshoot precision issues.
&lt;pre>&lt;code class="language-bash">polygraphy run model.onnx --trt --onnxrt
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Handle Data-Dependent Shapes (DDS)&lt;/strong>: Identify and set upper bounds for tensors with data-dependent shapes.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Build Engine&lt;/strong>: Use the &lt;code>trtexec&lt;/code> command-line tool or TensorRT's C++/Python API to build an Engine.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>trtexec&lt;/code>&lt;/strong>: A convenient command-line tool for quickly building an Engine from an ONNX file and conducting performance benchmarking.
&lt;pre>&lt;code class="language-bash">trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>API&lt;/strong>: Provides more flexible control, such as defining optimization profiles for dynamic shapes, configuring plugins, etc.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Deployment and Inference&lt;/strong>: Load the serialized Engine file into your application and use an Execution Context to perform inference.&lt;/p>
&lt;pre>&lt;code class="language-python"># Using Polygraphy's TrtRunner for inference
from polygraphy.backend.trt import TrtRunner, EngineFromBytes
# Load Engine
engine = EngineFromBytes(open(&amp;quot;model.engine&amp;quot;, &amp;quot;rb&amp;quot;).read())
with TrtRunner(engine) as runner:
# Prepare input data
feed_dict = {&amp;quot;input_name&amp;quot;: input_data}
# Execute inference
outputs = runner.infer(feed_dict=feed_dict)
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ol>
&lt;h2 id="5-latest-feature-highlights">5. Latest Feature Highlights&lt;/h2>
&lt;p>TensorRT is rapidly iterating, and here are some of the latest important features:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Polygraphy Tool Enhancements&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Simplified CLI Syntax&lt;/strong>: Allows specifying both script and function name in a single parameter (&lt;code>my_script.py:my_func&lt;/code>).&lt;/li>
&lt;li>&lt;strong>Improved Input Specification&lt;/strong>: Uses a new list-style syntax (&lt;code>--input-shapes input0:[x,y,z]&lt;/code>) to avoid ambiguity.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Quickly Deployable Plugins&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>The Python API has introduced the &lt;code>@trtp.register&lt;/code> and &lt;code>@trt.plugin.autotune&lt;/code> decorators, making it remarkably simple to define, register, and auto-tune plugins without writing C++ code.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>CUDA Graphs&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Through the &lt;code>--use-cuda-graph&lt;/code> flag, TensorRT can leverage CUDA Graphs to capture the entire inference process, further reducing CPU overhead and kernel launch latency, particularly suitable for scenarios with fixed model structures.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>FP8 Support&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>On Hopper and higher architecture GPUs, TensorRT supports FP8 inference, providing higher performance and lower memory usage for large language models and other applications.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="6-appendix-common-commands">6. Appendix: Common Commands&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Install Polygraphy&lt;/strong>:
&lt;pre>&lt;code class="language-bash">python3 -m pip install polygraphy --extra-index-url https://pypi.ngc.nvidia.com
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Build and Install TensorRT Open Source Components&lt;/strong>:
&lt;pre>&lt;code class="language-bash"># From source directory
make install
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Run pytest Tests&lt;/strong>:
&lt;pre>&lt;code class="language-bash">pytest --verbose
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h2 id="7-tensorrtllm-born-for-large-language-model-inference">7. TensorRT-LLM: Born for Large Language Model Inference&lt;/h2>
&lt;p>As the scale and complexity of large language models (LLMs) grow exponentially, traditional inference optimization methods face unprecedented challenges. To address these challenges, NVIDIA has introduced TensorRT-LLM, an open-source library specifically designed to accelerate and optimize LLM inference. It is built on top of TensorRT and encapsulates a series of cutting-edge optimization techniques for LLMs.&lt;/p>
&lt;h3 id="71-what-is-tensorrtllm">7.1. What is TensorRT-LLM?&lt;/h3>
&lt;p>TensorRT-LLM can be thought of as an &amp;ldquo;LLM expert version&amp;rdquo; of TensorRT. It provides a Python API that allows developers to easily define LLM models and automatically apply various state-of-the-art optimizations. Ultimately, it generates a high-performance TensorRT engine that can be directly deployed.&lt;/p>
&lt;p>Unlike general TensorRT which mainly handles static graphs, TensorRT-LLM specifically addresses the dynamic characteristics in LLM inference, such as:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Autoregressive Generation&lt;/strong>: Each newly generated token depends on the previous tokens, resulting in dynamically changing input sequence lengths.&lt;/li>
&lt;li>&lt;strong>Enormous Model Scale&lt;/strong>: Model parameters often number in the billions or even hundreds of billions, making it impossible to deploy on a single GPU.&lt;/li>
&lt;li>&lt;strong>Massive KV Cache&lt;/strong>: The inference process requires storing a large number of key-value pairs (Key-Value Cache), placing extremely high demands on memory bandwidth and capacity.&lt;/li>
&lt;/ul>
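&lt;p>The KV cache pressure is easy to quantify: per sequence it is 2 tensors (K and V) per layer, each of size kv_heads × seq_len × head_dim. Plugging in the commonly cited Llama-2-7B configuration (32 layers, 32 heads, head dimension 128, FP16) shows how fast this grows:&lt;/p>
&lt;pre>&lt;code class="language-python"># Back-of-the-envelope KV cache size: 2 tensors (K and V) per layer,
# each [kv_heads, seq_len, head_dim], stored in FP16 (2 bytes).
# Dimensions below are the commonly cited Llama-2-7B configuration.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

llama7b = dict(layers=32, kv_heads=32, head_dim=128)
per_seq = kv_cache_bytes(seq_len=4096, **llama7b)
gib = per_seq / 2**30    # 2 GiB for a single 4096-token sequence
&lt;/code>&lt;/pre>
&lt;p>At batch size 32 that is 64 GiB of cache alone, which is exactly the pressure that Paged KV Cache (below) and quantization are designed to relieve.&lt;/p>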
&lt;h3 id="72-core-architecture-and-components">7.2. Core Architecture and Components&lt;/h3>
&lt;p>TensorRT-LLM's architecture is divided into frontend and backend:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Python API (&lt;code>tensorrt_llm&lt;/code>)&lt;/strong>: This is the main interface for user interaction. It defines models in a declarative way (similar to PyTorch), allowing developers to avoid dealing with the complex underlying TensorRT C++ API.&lt;/li>
&lt;li>&lt;strong>C++ Backend&lt;/strong>: This is the core that actually performs the optimization, containing pre-written, highly optimized CUDA kernels, LLM-specific optimization passes, and a runtime that can efficiently handle LLM tasks.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Frontend (Python API)&amp;quot;
A[Hugging Face / Custom Model] --&amp;gt;|Weights| B(Model Definition&amp;lt;br&amp;gt;tensorrt_llm.Module);
B --&amp;gt; C{Builder};
C -- Generate Network and Config --&amp;gt; D[Network Definition];
end
subgraph &amp;quot;Backend (C++ Runtime)&amp;quot;
D --&amp;gt; E[TensorRT-LLM Optimization];
E --&amp;gt; F((LLM Optimized Engine));
end
subgraph &amp;quot;Inference&amp;quot;
F --&amp;gt; G[C++/Python Runtime];
H[Input Prompts] --&amp;gt; G;
G --&amp;gt; I[Output Tokens];
end
style F fill:#c9f,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;h3 id="73-key-optimization-techniques-llmspecific">7.3. Key Optimization Techniques (LLM-Specific)&lt;/h3>
&lt;p>The magic of TensorRT-LLM lies in its optimization techniques specifically designed for LLMs.&lt;/p>
&lt;h4 id="731-inflight-batching-also-known-as-continuous-batching">7.3.1. In-Flight Batching (also known as Continuous Batching)&lt;/h4>
&lt;p>&lt;strong>Problem&lt;/strong>: Traditional static batching requires all requests to wait until a batch is formed before processing them together. Due to the varying generation lengths of each request, this leads to significant GPU idle time (&amp;ldquo;bubbles&amp;rdquo;), as the batch must wait for the slowest request to complete.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: In-Flight Batching allows the server to dynamically add new requests while the GPU is running. Once a request completes, its computational resources are immediately released and allocated to new requests in the waiting queue. This greatly improves GPU utilization and overall system throughput.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">gantt
title GPU Utilization Comparison
dateFormat X
axisFormat %S
section Static Batching
Request A: 0, 6
Request B: 0, 3
Request C: 0, 5
GPU Waiting : 3, 3
GPU Waiting : 5, 1
section In-Flight Batching
Request A : 0, 6
Request B : 0, 3
Request C : 0, 5
New Request D : 3, 4
&lt;/code>&lt;/pre>
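&lt;p>The gain sketched in the chart above can be reproduced with a toy scheduling model: static batching makes each batch wait for its slowest member, while in-flight batching hands a freed slot straight to the next queued request. This is a deliberately simplified model, not TensorRT-LLM's actual scheduler:&lt;/p>
&lt;pre>&lt;code class="language-python"># Toy scheduling model comparing static vs in-flight batching.
# Each request needs some number of decode steps; the GPU runs
# batch_size slots per step. Fewer total steps means higher throughput.

def static_batching_steps(lengths, batch_size):
    # fixed batches; each batch takes as long as its longest request
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def inflight_batching_steps(lengths, batch_size):
    # a freed slot immediately picks up the next queued request,
    # i.e. greedy assignment of work to the least-loaded slot
    slots = [0] * batch_size
    for need in lengths:
        idx = slots.index(min(slots))
        slots[idx] += need
    return max(slots)

reqs = [6, 3, 5, 4]   # decode lengths of four queued requests
&lt;/code>&lt;/pre>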
&lt;h4 id="732-paged-kv-cache--attention">7.3.2. Paged KV Cache &amp;amp; Attention&lt;/h4>
&lt;p>&lt;strong>Problem&lt;/strong>: In the autoregressive generation process, the KV cache grows linearly with sequence length, consuming large amounts of GPU memory. The traditional approach is to pre-allocate a continuous memory block for each request that can accommodate the maximum sequence length, leading to severe memory fragmentation and waste.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: Inspired by operating system virtual memory paging, TensorRT-LLM introduced Paged KV Cache. It divides the KV cache into fixed-size &amp;ldquo;blocks&amp;rdquo; and allocates them as needed.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Non-contiguous Storage&lt;/strong>: KV caches for logically continuous tokens can be stored in physically non-contiguous blocks.&lt;/li>
&lt;li>&lt;strong>Memory Sharing&lt;/strong>: For complex scenarios (such as parallel sampling, Beam Search), different sequences can share the same KV cache blocks (e.g., sharing the cache for the prompt portion), significantly saving memory.&lt;/li>
&lt;li>&lt;strong>Optimized Attention Kernels&lt;/strong>: TensorRT-LLM uses specially optimized Attention kernels such as FlashAttention and MQA/GQA that can directly operate on these non-contiguous cache blocks, avoiding data copy overhead.&lt;/li>
&lt;/ul>
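&lt;p>The mechanics of Paged KV Cache can be sketched as a block-table allocator: fixed-size blocks handed out on demand, with reference counting so multiple sequences can share the same physical prompt blocks. The class below is a minimal illustration, far simpler than TensorRT-LLM's real block manager:&lt;/p>
&lt;pre>&lt;code class="language-python"># Minimal sketch of a paged KV cache: fixed-size blocks allocated on
# demand, with reference counting so sequences can share prompt blocks.

BLOCK_SIZE = 16   # tokens per block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self, num_tokens):
        # ceil-divide tokens into blocks; block ids need not be contiguous
        needed = -(-num_tokens // BLOCK_SIZE)
        blocks = [self.free.pop() for _ in range(needed)]
        for b in blocks:
            self.refcount[b] = 1
        return blocks

    def share(self, blocks):
        # another sequence reuses the same physical blocks (e.g. a shared prompt)
        for b in blocks:
            self.refcount[b] += 1
        return list(blocks)

    def release(self, blocks):
        for b in blocks:
            self.refcount[b] -= 1
            if self.refcount[b] == 0:
                self.free.append(b)
&lt;/code>&lt;/pre>
&lt;p>Because a block is only reclaimed when its last user releases it, parallel sampling and beam search can fan out from a shared prompt without copying its cache.&lt;/p>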
&lt;h4 id="733-tensor--pipeline-parallelism">7.3.3. Tensor &amp;amp; Pipeline Parallelism&lt;/h4>
&lt;p>For large models that cannot fit on a single GPU, TensorRT-LLM has built-in seamless support for tensor parallelism and pipeline parallelism. Developers only need to specify the parallelism degree (&lt;code>tp_size&lt;/code>, &lt;code>pp_size&lt;/code>) during building, and TensorRT-LLM will automatically handle model splitting and cross-GPU communication.&lt;/p>
&lt;pre>&lt;code class="language-bash"># Example: Build a Llama model with 2-way tensor parallelism
python3 examples/llama/convert_checkpoint.py \
--model_dir ./llama-7b-hf \
--output_dir ./tllm_checkpoint_tp2 \
--dtype float16 \
--tp_size 2
&lt;/code>&lt;/pre>
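&lt;p>What &lt;code>tp_size&lt;/code> does under the hood can be illustrated with column-wise tensor parallelism: each rank holds a slice of the weight matrix's columns, computes its partial matmul, and the partial outputs are concatenated (the role of the all-gather in a real multi-GPU run). A single-process sketch:&lt;/p>
&lt;pre>&lt;code class="language-python"># Sketch of column-wise tensor parallelism. Each of tp_size ranks holds a
# column slice of the weight matrix; concatenating the partial results
# reproduces the full single-GPU matmul output.

def split_columns(weight_rows, tp_size):
    cols = len(weight_rows[0])
    per_rank = cols // tp_size
    return [
        [row[r * per_rank:(r + 1) * per_rank] for row in weight_rows]
        for r in range(tp_size)
    ]

def matmul_vec(x, weight_rows):
    # y_j = sum_i x_i * W[i][j]
    cols = len(weight_rows[0])
    return [sum(x[i] * weight_rows[i][j] for i in range(len(x)))
            for j in range(cols)]

W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
x = [1.0, 1.0]
full = matmul_vec(x, W)                                  # single-GPU reference
shards = split_columns(W, tp_size=2)
parallel = sum((matmul_vec(x, s) for s in shards), [])   # concat partial outputs
&lt;/code>&lt;/pre>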
&lt;h4 id="734-advanced-quantization-support-fp8int4int8">7.3.4. Advanced Quantization Support (FP8/INT4/INT8)&lt;/h4>
&lt;p>The enormous parameter count of LLMs makes them ideal candidates for quantization. TensorRT-LLM supports various advanced quantization schemes:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>FP8&lt;/strong>: On NVIDIA Hopper and higher architecture GPUs, FP8 provides precision close to FP16 while significantly improving performance and reducing memory usage.&lt;/li>
&lt;li>&lt;strong>INT8 SmoothQuant&lt;/strong>: A technique that quantizes both activations and weights, achieving INT8 acceleration while maintaining high precision.&lt;/li>
&lt;li>&lt;strong>INT4/INT8 Weight-Only Quantization (W4A16/W8A16)&lt;/strong>: This is a very popular technique that only quantizes model weights (the largest part of parameters) to INT4 or INT8, while keeping activations in FP16. This greatly reduces memory usage with minimal impact on accuracy.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-bash"># Example: Build a model with INT4 weight-only quantization
python convert_checkpoint.py --model_dir ./gpt-j-6b \
--dtype float16 \
--use_weight_only \
--weight_only_precision int4 \
--output_dir ./trt_ckpt/gptj_int4wo_tp1/
&lt;/code>&lt;/pre>
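&lt;p>The idea behind W4A16 can be sketched in a few lines: each output channel's weights are mapped onto the symmetric INT4 range with a single FP16 scale, and dequantized on the fly at matmul time while activations stay FP16. This toy version omits the packing of two 4-bit values per byte and group-wise scaling that real implementations use:&lt;/p>
&lt;pre>&lt;code class="language-python"># Sketch of INT4 weight-only quantization (W4A16): weights become 4-bit
# integers plus one scale per output channel; activations stay FP16 and
# weights are dequantized on the fly at matmul time.

def quantize_channel_int4(weights):
    amax = max(abs(w) for w in weights)
    scale = amax / 7.0                 # symmetric INT4 range is [-7, 7]
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_channel(q, scale):
    return [v * scale for v in q]

channel = [0.42, -0.7, 0.13, 0.35]
q, scale = quantize_channel_int4(channel)
restored = dequantize_channel(q, scale)
# storage: 4 bits per weight + one scale, vs 16 bits per weight (~4x smaller)
&lt;/code>&lt;/pre>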
&lt;h3 id="74-tensorrtllm-workflow">7.4. TensorRT-LLM Workflow&lt;/h3>
&lt;p>A typical TensorRT-LLM workflow is as follows:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant D as Developer
participant HF as Hugging Face Hub
participant Conv as convert_checkpoint.py
participant Build as trtllm-build
participant App as Inference Application (Python/C++)
D-&amp;gt;&amp;gt;HF: Download Model Weights
HF--&amp;gt;&amp;gt;D: model_dir
D-&amp;gt;&amp;gt;Conv: Run Conversion Script (Specify Precision, Parallelism, etc.)
Conv--&amp;gt;&amp;gt;D: Generate TensorRT-LLM Checkpoint
D-&amp;gt;&amp;gt;Build: Run Build Command (Specify Plugins, BatchSize, etc.)
Build--&amp;gt;&amp;gt;D: Generate Optimized .engine File
D-&amp;gt;&amp;gt;App: Load Engine and Run Inference
App--&amp;gt;&amp;gt;D: Return Generation Results
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>End-to-End Example (Using Llama-7B)&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Convert Weights&lt;/strong>:
&lt;pre>&lt;code class="language-bash">git clone https://huggingface.co/meta-llama/Llama-2-7b-hf
python3 examples/llama/convert_checkpoint.py \
--model_dir ./Llama-2-7b-hf \
--output_dir ./tllm_checkpoint_1gpu \
--dtype float16
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Build Engine&lt;/strong>:
&lt;pre>&lt;code class="language-bash">trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu \
--output_dir ./trt_engines/llama_7b \
--gpt_attention_plugin float16 \
--gemm_plugin float16
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Run Inference&lt;/strong>:
&lt;pre>&lt;code class="language-bash">python3 examples/run.py --max_output_len=100 \
--tokenizer_dir ./Llama-2-7b-hf \
--engine_dir=./trt_engines/llama_7b
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ol>
&lt;h3 id="75-convenient-highlevel-api-llm">7.5. Convenient High-Level API (&lt;code>LLM&lt;/code>)&lt;/h3>
&lt;p>To further simplify the development process, TensorRT-LLM provides a high-level API called &lt;code>LLM&lt;/code>. This interface encapsulates model loading, building, saving, and inference into a simple class, allowing developers to complete all operations in just a few lines of code.&lt;/p>
&lt;pre>&lt;code class="language-python">from tensorrt_llm import LLM
# 1. Initialize LLM object, if the engine doesn't exist, it will automatically build from HuggingFace model
# All optimizations like In-Flight Batching, Paged KV-Cache will be applied here
llm = LLM(
model=&amp;quot;meta-llama/Llama-2-7b-hf&amp;quot;,
tensor_parallel_size=1,
)
# 2. (Optional) Save the built engine for later use
llm.save(&amp;quot;llama_engine_dir&amp;quot;)
# 3. Run inference
prompt = &amp;quot;NVIDIA TensorRT-LLM is&amp;quot;
for output in llm.generate([prompt], max_new_tokens=50):
print(output)
&lt;/code>&lt;/pre>
&lt;p>This high-level API is ideal for rapid prototyping and deployment.&lt;/p>
&lt;h3 id="76-conclusion">7.6. Conclusion&lt;/h3>
&lt;p>TensorRT-LLM is not simply applying TensorRT to LLMs, but a comprehensive solution fundamentally redesigned for LLM inference, containing multiple state-of-the-art optimizations. Through In-Flight Batching, Paged KV-Cache, native parallel support, and advanced quantization schemes, it can maximize the hardware performance of NVIDIA GPUs, providing a solid foundation for deploying high-performance, high-throughput LLM services.&lt;/p></description></item><item><title>SGLang Technical Guide: High-Performance Structured Generation Framework</title><link>https://ziyanglin.netlify.app/en/post/sglang-documentation/</link><pubDate>Thu, 26 Jun 2025 01:07:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/sglang-documentation/</guid><description>&lt;h2 id="1-sglang-introduction">1. SGLang Introduction&lt;/h2>
&lt;p>SGLang (Structured Generation Language) is a high-performance service framework designed for large language models (LLMs) and vision language models (VLMs). Its core goal is to address the challenges faced by complex LLM programs in real-world applications, maximizing inference performance while maintaining flexibility.&lt;/p>
&lt;p>Traditional LLM service frameworks (like vLLM) excel at handling simple, one-shot prompting but face limitations in complex scenarios requiring multi-turn interactions, structured outputs, function calls, or control flow. SGLang effectively bridges this gap by introducing a novel frontend language and an efficient backend runtime.&lt;/p>
&lt;p>&lt;strong>Core advantages of SGLang include:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Exceptional Performance:&lt;/strong> SGLang introduces &lt;strong>RadixAttention&lt;/strong>, an innovative attention mechanism that automatically and losslessly reuses key-value caches (KV Cache), significantly improving inference speed in scenarios with complex prompts (like CoT, ReAct) or multi-turn conversations. Compared to leading frameworks like vLLM, SGLang can achieve several times higher throughput in these scenarios.&lt;/li>
&lt;li>&lt;strong>Powerful Programming Capabilities:&lt;/strong> SGLang provides an intuitive domain-specific language (DSL) that allows developers to orchestrate complex generation tasks in a Pythonic way. You can easily define variables, use loops and conditional statements, call external tools, and seamlessly integrate these logic elements with the LLM's generation process. This makes building complex AI agents, multi-turn dialogue systems, and structured data extraction tasks unprecedentedly simple.&lt;/li>
&lt;li>&lt;strong>Unified Frontend-Backend Interface:&lt;/strong> SGLang decouples frontend programming logic from backend inference services. The frontend defines &amp;ldquo;what to generate,&amp;rdquo; while the backend handles &amp;ldquo;how to efficiently generate it.&amp;rdquo; This design not only simplifies the development process but also makes SGLang compatible with OpenAI's API standards, allowing users to easily migrate existing applications to SGLang and immediately benefit from performance gains.&lt;/li>
&lt;li>&lt;strong>Flexible Structured Output:&lt;/strong> SGLang provides powerful structured output constraint capabilities. Whether through regular expressions, EBNF grammar, or JSON Schema, you can precisely control the output format of the LLM, ensuring that the generated content conforms to the expected structure, which is crucial for applications requiring reliable data formats.&lt;/li>
&lt;/ul>
&lt;p>In summary, SGLang is not just an LLM inference acceleration engine but a complete programming and execution framework for complex generation tasks. It aims to enable developers to fully unleash the potential of large language models in an efficient and intuitive way.&lt;/p>
&lt;h2 id="2-core-features">2. Core Features&lt;/h2>
&lt;p>The power of SGLang lies in its unique design, which combines an intuitive frontend programming model with an efficient backend execution engine. Below are detailed introductions to several of its core features.&lt;/p>
&lt;h3 id="21-radixattention-kv-cache-optimization-for-complex-prompts">2.1 RadixAttention: KV Cache Optimization for Complex Prompts&lt;/h3>
&lt;p>When processing complex LLM programs, such as Chain-of-Thought, multi-turn dialogues, or agents that need to call tools, prompts often contain large shared prefixes. Traditional attention mechanisms produce redundant computation and storage when handling these shared prefixes.&lt;/p>
&lt;p>SGLang introduces &lt;strong>RadixAttention&lt;/strong>, a novel KV cache optimization technique. Its core idea is to organize prompts into a radix tree and perform attention calculations on this tree.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Automatic Sharing and Reuse&lt;/strong>: RadixAttention can automatically identify and share common prefixes between different requests, avoiding duplicate computation and storage. For example, in multi-turn dialogues, the conversation history of each turn can be losslessly reused by subsequent turns.&lt;/li>
&lt;li>&lt;strong>Performance Improvement&lt;/strong>: By maximizing KV cache reuse, RadixAttention significantly reduces memory usage and computational load, increasing throughput by 2 to 5 times, especially when handling long prompts or high-concurrency requests.&lt;/li>
&lt;/ul>
&lt;p>Below is a Mermaid diagram that visually demonstrates how RadixAttention handles requests with shared prefixes:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;Traditional Method (No Sharing)&amp;quot;
req1[&amp;quot;Request 1: 'A B C D'&amp;quot;]
req2[&amp;quot;Request 2: 'A B E F'&amp;quot;]
kv1[&amp;quot;KV Cache: [A, B, C, D]&amp;quot;]
kv2[&amp;quot;KV Cache: [A, B, E, F]&amp;quot;]
req1 --&amp;gt; kv1
req2 --&amp;gt; kv2
end
subgraph &amp;quot;SGLang RadixAttention&amp;quot;
Root(&amp;quot;Root&amp;quot;) --&amp;gt; A(&amp;quot;Token 'A'&amp;quot;);
A --&amp;gt; B(&amp;quot;Token 'B'&amp;quot;);
B --&amp;gt; C(&amp;quot;Token 'C'&amp;quot;);
B --&amp;gt; E(&amp;quot;Token 'E'&amp;quot;);
C --&amp;gt; D(&amp;quot;Token 'D'&amp;quot;);
E --&amp;gt; F(&amp;quot;Token 'F'&amp;quot;);
style A fill:#9f9
style B fill:#9f9
end
&lt;/code>&lt;/pre>
&lt;p>In the diagram above, for two requests &lt;code>'A B C D'&lt;/code> and &lt;code>'A B E F'&lt;/code>, the traditional method creates two independent KV caches. RadixAttention, however, organizes them into a tree, sharing the computation and storage of the common prefix &lt;code>'A B'&lt;/code> (green nodes), creating new branches only for the different parts (C, D, E, F). This greatly improves memory and computational efficiency.&lt;/p>
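&lt;p>The sharing in the diagram can be sketched as a prefix tree: inserting a request walks existing nodes (cache hits, no recomputation) and only creates new nodes for the unshared suffix. Real RadixAttention manages actual KV tensors plus eviction; this minimal version only counts sharing:&lt;/p>
&lt;pre>&lt;code class="language-python"># Minimal prefix tree in the spirit of RadixAttention: returns how many
# leading tokens of a new request were already cached by earlier requests.

class PrefixTree:
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        hits = 0
        for t in tokens:
            if t in node:
                hits += 1            # KV cache reused for this token
            else:
                node[t] = {}         # new KV entries must be computed
            node = node[t]
        return hits

tree = PrefixTree()
first = tree.insert(['A', 'B', 'C', 'D'])   # cold start: nothing reused
second = tree.insert(['A', 'B', 'E', 'F'])  # reuses the shared 'A B' prefix
&lt;/code>&lt;/pre>
&lt;p>In a multi-turn conversation the shared prefix is the entire dialogue history, which is why the savings compound with every turn.&lt;/p>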
&lt;h3 id="22-unified-frontend-programming-language-dsl">2.2 Unified Frontend Programming Language (DSL)&lt;/h3>
&lt;p>SGLang provides an expressive domain-specific language (DSL) deeply integrated with Python, allowing developers to build complex generation logic in a natural and intuitive way.&lt;/p>
&lt;h3 id="sglang-architecture-overview">SGLang Architecture Overview&lt;/h3>
&lt;p>To better understand how SGLang works, we can observe its core architecture through the following flowchart:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph User Side
A[Developer defines SGLang program&amp;lt;br&amp;gt;using function decorator] --&amp;gt; B{Call run method};
end
subgraph SGLang Frontend
B --&amp;gt; C[1. Parse Python AST&amp;lt;br&amp;gt;Separate deterministic logic and generation instructions];
C --&amp;gt; D[2. Build portable&amp;lt;br&amp;gt;SGLang IR intermediate representation];
end
subgraph Network Communication
D -- HTTP Request --&amp;gt; E[SGLang backend service SRT];
end
subgraph SGLang Backend SRT
E --&amp;gt; F[3. Receive IR and schedule];
F --&amp;gt; G{RadixAttention engine};
G --&amp;gt; H[4. Efficient execution&amp;lt;br&amp;gt;KV cache reuse];
H --&amp;gt; I[LLM/VLM model];
I --&amp;gt; J[5. Generate results];
end
subgraph Return Path
J -- HTTP Response --&amp;gt; K[Return results to frontend];
K --&amp;gt; L[6. Fill state object `s`];
L --&amp;gt; M[User gets final results];
end
style B fill:#f9f,stroke:#333,stroke-width:2px
style E fill:#ccf,stroke:#333,stroke-width:2px
style G fill:#9cf,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;p>This diagram clearly shows how SGLang decouples and combines the programming convenience of the frontend with the high-performance execution engine of the backend.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Pythonic Control Flow&lt;/strong>: You can directly use standard Python control flow statements like &lt;code>if/else&lt;/code> and &lt;code>for&lt;/code> loops in SGLang functions to dynamically build prompts.&lt;/li>
&lt;li>&lt;strong>Integration of Generation and Logic&lt;/strong>: Through the &lt;code>@function&lt;/code> decorator and &lt;code>gen()&lt;/code> instruction, SGLang seamlessly combines the LLM's generation process (the &amp;ldquo;non-deterministic&amp;rdquo; part) with the program's deterministic logic.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example: Generating Different Content Based on Conditions&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from sglang import function, system, user, assistant, gen
@function
def tool_use(s, question):
s += system(&amp;quot;You are a helpful assistant.&amp;quot;)
s += user(question)
s += assistant(
&amp;quot;To answer this question, I need to use a &amp;quot;
+ gen(&amp;quot;tool&amp;quot;, choices=[&amp;quot;calculator&amp;quot;, &amp;quot;search engine&amp;quot;])
+ &amp;quot;. &amp;quot;
)
if s[&amp;quot;tool&amp;quot;] == &amp;quot;calculator&amp;quot;:
s += assistant(&amp;quot;The math expression is: &amp;quot; + gen(&amp;quot;expression&amp;quot;))
elif s[&amp;quot;tool&amp;quot;] == &amp;quot;search engine&amp;quot;:
s += assistant(&amp;quot;The key word to search is: &amp;quot; + gen(&amp;quot;word&amp;quot;))
state = tool_use.run(&amp;quot;What is the population of London?&amp;quot;)
print(state[&amp;quot;tool&amp;quot;])
# Output: search engine
print(state[&amp;quot;word&amp;quot;])
# Output: population of London
&lt;/code>&lt;/pre>
&lt;p>In this example, the program first asks the LLM to choose between &amp;ldquo;calculator&amp;rdquo; and &amp;ldquo;search engine&amp;rdquo; as a tool, then executes different logic branches based on the LLM's choice, guiding the LLM to generate the next step of content.&lt;/p>

&lt;h3 id="23-powerful-structured-output">2.3 Powerful Structured Output&lt;/h3>
&lt;p>To ensure that content generated by the LLM can be reliably parsed and used by downstream programs, SGLang provides multiple powerful structured output constraint mechanisms.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Regular Expressions (Regex)&lt;/strong>: You can provide a regular expression to force the model's output to strictly match that pattern. This is useful for generating identifiers, numbers, or simple text fragments in specific formats.&lt;/p>
&lt;pre>&lt;code class="language-python">response = client.chat.completions.create(
model=&amp;quot;deepseek-ai/DeepSeek-R1-Distill-Qwen-7B&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;What is the capital of France?&amp;quot;}],
extra_body={&amp;quot;regex&amp;quot;: &amp;quot;(Paris|London)&amp;quot;},
)
# response.choices[0].message.content will necessarily be &amp;quot;Paris&amp;quot; or &amp;quot;London&amp;quot;
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>EBNF Grammar&lt;/strong>: For more complex grammatical structures, you can use Extended Backus-Naur Form (EBNF) to define a complete grammar. This allows you to generate code, DSLs, or other structured text that strictly adheres to specific syntax.&lt;/p>
&lt;pre>&lt;code class="language-python">ebnf_grammar = &amp;quot;&amp;quot;&amp;quot;
root ::= city &amp;quot; is the capital of &amp;quot; country
city ::= &amp;quot;London&amp;quot; | &amp;quot;Paris&amp;quot; | &amp;quot;Berlin&amp;quot; | &amp;quot;Rome&amp;quot;
country ::= &amp;quot;England&amp;quot; | &amp;quot;France&amp;quot; | &amp;quot;Germany&amp;quot; | &amp;quot;Italy&amp;quot;
&amp;quot;&amp;quot;&amp;quot;
response = client.chat.completions.create(
model=&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Give me the information of the capital of France.&amp;quot;}],
extra_body={&amp;quot;ebnf&amp;quot;: ebnf_grammar},
)
# response.choices[0].message.content will be &amp;quot;Paris is the capital of France&amp;quot;
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>JSON Schema&lt;/strong>: SGLang supports using JSON Schema to constrain the model to generate structured JSON objects. You can directly define a JSON Schema or use a Pydantic model to automatically generate one. This is crucial for APIs and data processing tasks that require reliable, verifiable JSON output.&lt;/p>
&lt;pre>&lt;code class="language-python">from pydantic import BaseModel
class CapitalInfo(BaseModel):
name: str
population: int
response = client.chat.completions.create(
model=&amp;quot;deepseek-ai/DeepSeek-R1-Distill-Qwen-7B&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Give me the information and population of the capital of France in the JSON format.&amp;quot;}],
response_format={
&amp;quot;type&amp;quot;: &amp;quot;json_schema&amp;quot;,
&amp;quot;json_schema&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;capital_info&amp;quot;,
&amp;quot;schema&amp;quot;: CapitalInfo.model_json_schema(),
},
},
)
# response.choices[0].message.content will be a JSON string conforming to the CapitalInfo structure
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
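Constrained decoding guarantees syntactic validity, but it is still good practice to parse and validate the response string before using it downstream. A small sketch with manual field checks (with Pydantic you would instead call `CapitalInfo.model_validate_json(...)`; the sample JSON below is illustrative):

```python
import json

def parse_capital_info(raw: str) -> dict:
    """Parse and lightly validate JSON produced under the CapitalInfo schema."""
    data = json.loads(raw)
    assert isinstance(data.get("name"), str), "missing string field 'name'"
    assert isinstance(data.get("population"), int), "missing int field 'population'"
    return data

info = parse_capital_info('{"name": "Paris", "population": 2102650}')
print(info["name"])  # Paris
```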
&lt;h2 id="3-quick-start">3. Quick Start&lt;/h2>
&lt;p>This section will guide you through installing SGLang, starting the service, and basic usage, allowing you to experience SGLang's powerful features in just a few minutes.&lt;/p>
&lt;h3 id="31-installation">3.1 Installation&lt;/h3>
&lt;p>SGLang can be installed via &lt;code>pip&lt;/code> or the faster &lt;code>uv&lt;/code>. For the best experience and full functionality, it's recommended to install the &lt;code>all&lt;/code> version.&lt;/p>
&lt;p>&lt;strong>Using pip:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install --upgrade pip
pip install &amp;quot;sglang[all]&amp;quot;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Using uv (recommended, faster):&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install uv
uv pip install &amp;quot;sglang[all]&amp;quot;
&lt;/code>&lt;/pre>
&lt;blockquote>
&lt;p>&lt;strong>Note&lt;/strong>: The installation process may require compiling CUDA kernels (such as &lt;code>flashinfer&lt;/code>). Please ensure that the &lt;code>CUDA_HOME&lt;/code> environment variable is correctly configured in your environment and that the CUDA version is compatible with your PyTorch version.&lt;/p>
&lt;/blockquote>
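Before installing, a quick sanity check of the environment can save a failed kernel compile. The sketch below only inspects `CUDA_HOME` and the `nvcc` binary; the matching-major-version rule of thumb is an assumption, so consult the PyTorch/flashinfer compatibility notes for exact requirements:

```python
import os
import shutil

def cuda_versions_compatible(nvcc_version: str, torch_cuda_version: str) -> bool:
    """Rough heuristic: treat toolkit and PyTorch CUDA builds as compatible
    when their major versions match (e.g. 12.4 vs 12.1)."""
    return nvcc_version.split(".")[0] == torch_cuda_version.split(".")[0]

cuda_home = os.environ.get("CUDA_HOME")
print("CUDA_HOME:", cuda_home or "not set")
print("nvcc on PATH:", bool(shutil.which("nvcc")))

# If PyTorch is installed, compare its CUDA build against your toolkit:
# import torch
# print(cuda_versions_compatible("12.4", torch.version.cuda))
```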
&lt;h3 id="32-starting-the-backend-service-srt">3.2 Starting the Backend Service (SRT)&lt;/h3>
&lt;p>After installation, the next step is to start SGLang's backend service (SRT, SGLang Runtime). This service will load the specified language model and provide an interface compatible with the OpenAI API.&lt;/p>
&lt;p>Run the following command in your terminal:&lt;/p>
&lt;pre>&lt;code class="language-bash">python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Parameter Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;code>--model-path&lt;/code>: Specifies the path to the model to load. This can be a model name on the Hugging Face Hub (as shown in this example) or a local model path.&lt;/li>
&lt;li>&lt;code>--host&lt;/code>: The host address the service listens on. &lt;code>0.0.0.0&lt;/code> means allowing access from any network interface.&lt;/li>
&lt;li>&lt;code>--port&lt;/code>: The port number the service listens on.&lt;/li>
&lt;/ul>
&lt;p>When the service starts successfully, you'll see output similar to the following, indicating that the model has been loaded and is ready to receive requests.&lt;/p>
&lt;pre>&lt;code>INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
INFO: Started server process [12345]
INFO: Waiting for application startup.
INFO: Application startup complete.
&lt;/code>&lt;/pre>
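Before sending traffic, you can poll the server until it answers. A minimal readiness check using only the standard library; `/v1/models` is the standard OpenAI-compatible model-listing endpoint, assumed here to be served by SGLang:

```python
import urllib.request
import urllib.error

def server_ready(base_url: str, path: str = "/v1/models", timeout: float = 2.0) -> bool:
    """Return True once the server responds with HTTP 200 on the given endpoint."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(server_ready("http://127.0.0.1:30000"))
```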
&lt;h3 id="33-sending-your-first-request">3.3 Sending Your First Request&lt;/h3>
&lt;p>With the service running, we can now interact with it using OpenAI's Python client library.&lt;/p>
&lt;p>Create a Python file named &lt;code>test_sglang.py&lt;/code> and fill it with the following content:&lt;/p>
&lt;pre>&lt;code class="language-python">import openai
# Initialize the client, pointing to our locally started SGLang service
client = openai.Client(
base_url=&amp;quot;http://127.0.0.1:30000/v1&amp;quot;,
api_key=&amp;quot;EMPTY&amp;quot; # SGLang service doesn't require an API Key
)
# Create a chat completion request
response = client.chat.completions.create(
model=&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;, # Must match the model loaded by the service
messages=[
{&amp;quot;role&amp;quot;: &amp;quot;system&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;You are a helpful assistant.&amp;quot;},
{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;What is the capital of France and why is it famous?&amp;quot;},
],
temperature=0.7,
max_tokens=150,
)
# Print the model's response
print(response.choices[0].message.content)
&lt;/code>&lt;/pre>
&lt;p>Run this script:&lt;/p>
&lt;pre>&lt;code class="language-bash">python test_sglang.py
&lt;/code>&lt;/pre>
&lt;p>You'll see the model's detailed answer about Paris. At this point, you've successfully completed the entire process from service deployment to inference request using SGLang!&lt;/p>
&lt;h2 id="4-frontend-language-sglang-dsl">4. Frontend Language (SGLang DSL)&lt;/h2>
&lt;p>SGLang's frontend language (DSL) is the core of its usability. It allows you to define complex generation processes in a declarative way, perfectly combining Python's flexibility with the generative capabilities of LLMs.&lt;/p>
&lt;h3 id="41-function-decorator">4.1 &lt;code>@function&lt;/code> Decorator&lt;/h3>
&lt;p>All SGLang programs begin with a Python function decorated by &lt;code>@function&lt;/code>. This decorator transforms an ordinary Python function into an executable SGLang program template.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>State Management&lt;/strong>: The first parameter of the function (typically named &lt;code>s&lt;/code>) represents the current generation state. It's a dictionary-like object used to store and pass all variables produced during the generation process.&lt;/li>
&lt;li>&lt;strong>Delayed Execution&lt;/strong>: Functions decorated with &lt;code>@function&lt;/code> are not executed immediately when defined. Instead, they create a reusable template. The program only executes when the &lt;code>.run()&lt;/code> or &lt;code>.run_batch()&lt;/code> method is called.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Interaction Flow&lt;/strong>&lt;/p>
&lt;p>The sequence diagram below illustrates a complete end-to-end interaction, using the tool-calling scenario detailed in Section 6 as a concrete example:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant User
participant App as Application (Python)
participant SGLang as SGLang Service
participant Tool as External Tool (e.g., Weather API)
User-&amp;gt;&amp;gt;+App: &amp;quot;What's the weather like in Boston?&amp;quot;
App-&amp;gt;&amp;gt;+SGLang: Send request with messages and tools
SGLang-&amp;gt;&amp;gt;SGLang: Model decides to call get_current_weather
SGLang--&amp;gt;&amp;gt;-App: Return tool_calls with function name and parameters
App-&amp;gt;&amp;gt;App: Parse tool_calls
App-&amp;gt;&amp;gt;+Tool: Call get_current_weather(city=&amp;quot;Boston&amp;quot;, unit=&amp;quot;fahrenheit&amp;quot;)
Tool--&amp;gt;&amp;gt;-App: Return weather result: &amp;quot;68°F&amp;quot;
App-&amp;gt;&amp;gt;+SGLang: Send new request with weather result
SGLang-&amp;gt;&amp;gt;SGLang: Model generates final reply based on weather result
SGLang--&amp;gt;&amp;gt;-App: Return final natural language reply
App--&amp;gt;&amp;gt;-User: &amp;quot;It's currently 68°F in Boston.&amp;quot;
&lt;/code>&lt;/pre>
&lt;p>This sequence diagram clearly shows the complete loop from user question to model decision, tool call, result integration, and final response.&lt;/p>
&lt;h3 id="42-core-instructions">4.2 Core Instructions&lt;/h3>
&lt;p>Within SGLang functions, you use a series of instructions to build prompts and control the generation flow.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Role Instructions&lt;/strong>: &lt;code>system()&lt;/code>, &lt;code>user()&lt;/code>, &lt;code>assistant()&lt;/code>
These instructions are used to define different parts of a conversation, conforming to the standard multi-turn dialogue format. You can pass strings directly to them.&lt;/li>
&lt;li>&lt;strong>Generation Instruction&lt;/strong>: &lt;code>gen()&lt;/code>
This is the most important instruction in SGLang. It tells the LLM to generate text at the current position.
&lt;ul>
&lt;li>&lt;code>s += gen(&amp;quot;variable_name&amp;quot;, ...)&lt;/code>: The first parameter of &lt;code>gen()&lt;/code> is required and specifies the variable name in which the generation result will be stored in the state &lt;code>s&lt;/code>.&lt;/li>
&lt;li>&lt;code>max_tokens&lt;/code>: Limits the maximum number of tokens to generate.&lt;/li>
&lt;li>&lt;code>stop&lt;/code>: Defines one or more stop strings. When the model generates these strings, the generation process ends early.&lt;/li>
&lt;li>&lt;code>choices&lt;/code>: Provides a list of strings, forcing the model to choose one of these options for generation.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example: A Complete Frontend Function&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from sglang import function, system, user, assistant, gen, set_default_backend, OpenAI
# Set the backend to the OpenAI-compatible service provided by SGLang
set_default_backend(OpenAI(&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;))
@function
def multi_turn_qa(s, question1, question2):
s += system(&amp;quot;You are a helpful assistant.&amp;quot;)
s += user(question1)
s += assistant(gen(&amp;quot;answer1&amp;quot;, max_tokens=128))
s += user(question2)
s += assistant(gen(&amp;quot;answer2&amp;quot;, max_tokens=128))
# Execute the SGLang program
state = multi_turn_qa.run(
question1=&amp;quot;What is the capital of the UK?&amp;quot;,
question2=&amp;quot;What is its population?&amp;quot;,
temperature=0.1
)
print(&amp;quot;Answer 1:&amp;quot;, state[&amp;quot;answer1&amp;quot;])
print(&amp;quot;Answer 2:&amp;quot;, state[&amp;quot;answer2&amp;quot;])
&lt;/code>&lt;/pre>
&lt;h3 id="43-streaming-output">4.3 Streaming Output&lt;/h3>
&lt;p>For applications requiring real-time feedback, SGLang supports streaming output. Simply set &lt;code>stream=True&lt;/code> in the &lt;code>.run()&lt;/code> method and iterate over the &lt;code>.text_iter()&lt;/code> method of the returned state object.&lt;/p>
&lt;pre>&lt;code class="language-python">state = multi_turn_qa.run(
question1=&amp;quot;Write a short story about a robot.&amp;quot;,
question2=&amp;quot;Continue the story.&amp;quot;,
stream=True
)
for out in state.text_iter(&amp;quot;answer2&amp;quot;):
print(out, end=&amp;quot;&amp;quot;, flush=True)
&lt;/code>&lt;/pre>
&lt;h2 id="5-backend-service-srt-and-api-reference">5. Backend Service (SRT) and API Reference&lt;/h2>
&lt;p>SGLang's backend, the SGLang Runtime (SRT), is a high-performance inference server implemented in Python. It's responsible for loading models, managing KV caches (through RadixAttention), and handling requests from clients. SRT provides two main API endpoints.&lt;/p>
&lt;h3 id="51-native-api-generate">5.1 Native API: &lt;code>/generate&lt;/code>&lt;/h3>
&lt;p>This is a lower-level API that provides the finest control over the generation process.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Endpoint&lt;/strong>: &lt;code>POST /generate&lt;/code>&lt;/li>
&lt;li>&lt;strong>Description&lt;/strong>: Generate text starting from a given text prompt.&lt;/li>
&lt;li>&lt;strong>Core Parameters&lt;/strong>:
&lt;ul>
&lt;li>&lt;code>text&lt;/code> (string, required): The input text prompt.&lt;/li>
&lt;li>&lt;code>sampling_params&lt;/code> (object, optional): A JSON object containing sampling parameters.
&lt;ul>
&lt;li>&lt;code>temperature&lt;/code> (float): Sampling temperature.&lt;/li>
&lt;li>&lt;code>max_new_tokens&lt;/code> (int): Maximum number of new tokens to generate.&lt;/li>
&lt;li>&lt;code>stop&lt;/code> (string or list[string]): Stop tokens.&lt;/li>
&lt;li>&lt;code>json_schema&lt;/code> (string): JSON Schema string for constraining output.&lt;/li>
&lt;li>&lt;code>regex&lt;/code> (string): Regular expression for constraining output.&lt;/li>
&lt;li>&lt;code>ebnf&lt;/code> (string): EBNF grammar for constraining output.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;code>stream&lt;/code> (boolean, optional): Whether to use streaming.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example (using &lt;code>requests&lt;/code>)&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-python">import requests
import json
url = &amp;quot;http://127.0.0.1:30000/generate&amp;quot;
data = {
&amp;quot;text&amp;quot;: &amp;quot;The capital of France is&amp;quot;,
&amp;quot;sampling_params&amp;quot;: {
&amp;quot;temperature&amp;quot;: 0,
&amp;quot;max_new_tokens&amp;quot;: 16,
}
}
response = requests.post(url, json=data)
print(response.json())
# {'text': ' Paris.\n\nThe capital of France is Paris. It is the most populous city in', 'meta': ...}
&lt;/code>&lt;/pre>
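When `stream` is set to true, the `/generate` endpoint emits incremental results as server-sent-event lines. Below is a hedged sketch of a parser for such a stream; the exact wire format can differ between SGLang versions, and the `data: [DONE]` terminator is an assumption borrowed from the OpenAI streaming convention:

```python
import json

def parse_sse_chunks(lines):
    """Yield decoded JSON payloads from 'data: ...' lines of an event stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments, keep-alives, and blank separators
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)

# Example lines as they might arrive over the wire (illustrative content):
sample = [
    'data: {"text": " Paris"}',
    'data: {"text": " Paris."}',
    'data: [DONE]',
]
for chunk in parse_sse_chunks(sample):
    print(chunk["text"])
```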
&lt;h3 id="52-openai-compatible-api-v1chatcompletions">5.2 OpenAI Compatible API: &lt;code>/v1/chat/completions&lt;/code>&lt;/h3>
&lt;p>For easy migration and integration, SGLang provides a chat completion API fully compatible with OpenAI. You can seamlessly use OpenAI's official client library.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Endpoint&lt;/strong>: &lt;code>POST /v1/chat/completions&lt;/code>&lt;/li>
&lt;li>&lt;strong>Description&lt;/strong>: Perform chat-style text generation.&lt;/li>
&lt;li>&lt;strong>Core Parameters&lt;/strong>:
&lt;ul>
&lt;li>&lt;code>model&lt;/code> (string, required): The name of the model.&lt;/li>
&lt;li>&lt;code>messages&lt;/code> (list[object], required): List of conversation messages.&lt;/li>
&lt;li>&lt;code>temperature&lt;/code>, &lt;code>max_tokens&lt;/code>, &lt;code>stream&lt;/code>, etc.&lt;/li>
&lt;li>&lt;code>response_format&lt;/code> (object, optional): For specifying structured output, such as &lt;code>{&amp;quot;type&amp;quot;: &amp;quot;json_schema&amp;quot;, &amp;quot;json_schema&amp;quot;: ...}&lt;/code>.&lt;/li>
&lt;li>&lt;code>extra_body&lt;/code> (object, optional): SGLang-specific extension parameters, such as &lt;code>{&amp;quot;regex&amp;quot;: &amp;quot;...&amp;quot;}&lt;/code> or &lt;code>{&amp;quot;ebnf&amp;quot;: &amp;quot;...&amp;quot;}&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example (using the &lt;code>openai&lt;/code> library)&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-python">import openai
client = openai.Client(base_url=&amp;quot;http://127.0.0.1:30000/v1&amp;quot;, api_key=&amp;quot;EMPTY&amp;quot;)
response = client.chat.completions.create(
model=&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;List 3 countries and their capitals.&amp;quot;}],
temperature=0,
max_tokens=64,
)
print(response.choices[0].message.content)
&lt;/code>&lt;/pre>
&lt;h2 id="6-advanced-usage-function-callingtool-usage">6. Advanced Usage: Function Calling/Tool Usage&lt;/h2>
&lt;p>SGLang's powerful programming model makes it very suitable for building AI agents capable of calling external tools. This is typically achieved through structured output, where the model is guided to generate text in a specific format (usually JSON) describing a function call.&lt;/p>
&lt;p>Here are the steps to build a simple weather query agent:&lt;/p>
&lt;p>&lt;strong>1. Define Tool Schema&lt;/strong>&lt;/p>
&lt;p>First, use JSON Schema to define your tool. This tells the model the name of the tool, its purpose, and what parameters it needs.&lt;/p>
&lt;pre>&lt;code class="language-python">tools = [
{
&amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;,
&amp;quot;function&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;get_current_weather&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;Get the current weather in a given location&amp;quot;,
&amp;quot;parameters&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
&amp;quot;properties&amp;quot;: {
&amp;quot;city&amp;quot;: {&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;, &amp;quot;description&amp;quot;: &amp;quot;The city name&amp;quot;},
&amp;quot;unit&amp;quot;: {&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;, &amp;quot;enum&amp;quot;: [&amp;quot;celsius&amp;quot;, &amp;quot;fahrenheit&amp;quot;]},
},
&amp;quot;required&amp;quot;: [&amp;quot;city&amp;quot;, &amp;quot;unit&amp;quot;],
},
},
}
]
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>2. Guide the Model to Make Function Calls&lt;/strong>&lt;/p>
&lt;p>In the &lt;code>messages&lt;/code> sent to the model, include a system prompt indicating that the model can use these tools. Then, pass &lt;code>tools&lt;/code> and &lt;code>tool_choice=&amp;quot;auto&amp;quot;&lt;/code> in the API call.&lt;/p>
&lt;pre>&lt;code class="language-python">import json
messages = [
{&amp;quot;role&amp;quot;: &amp;quot;system&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;You are a helpful assistant that can access external tools.&amp;quot;},
{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;What's the weather like in Boston in fahrenheit?&amp;quot;}
]
response = client.chat.completions.create(
model=&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;,
messages=messages,
tools=tools,
tool_choice=&amp;quot;auto&amp;quot;,
)
# Check if the model decided to call a tool
response_message = response.choices[0].message
tool_calls = response_message.tool_calls
if tool_calls:
# Model decided to call a tool
for tool_call in tool_calls:
function_name = tool_call.function.name
function_args = json.loads(tool_call.function.arguments)
print(f&amp;quot;Function Call: {function_name}&amp;quot;)
print(f&amp;quot;Arguments: {function_args}&amp;quot;)
# Here, you could actually execute the function call
# e.g., result = get_current_weather(**function_args)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Output:&lt;/strong>&lt;/p>
&lt;pre>&lt;code>Function Call: get_current_weather
Arguments: {'city': 'Boston', 'unit': 'fahrenheit'}
&lt;/code>&lt;/pre>
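To close the loop shown in the sequence diagram, the application executes the function locally and sends the result back in a `tool` role message so the model can compose the final natural-language reply. A sketch of that step; the `get_current_weather` body is a stub, and in the full OpenAI protocol the assistant message containing `tool_calls` is also appended before the tool result (omitted here for brevity):

```python
import json

def get_current_weather(city: str, unit: str) -> str:
    """Stub implementation; replace with a real weather API call."""
    return json.dumps({"city": city, "temperature": "68", "unit": unit})

def append_tool_result(messages, tool_call_id, function_name, result):
    """Build the follow-up message list for the second chat completion request."""
    return messages + [
        {"role": "tool", "tool_call_id": tool_call_id,
         "name": function_name, "content": result},
    ]

messages = [{"role": "user", "content": "What's the weather like in Boston in fahrenheit?"}]
result = get_current_weather(city="Boston", unit="fahrenheit")
followup = append_tool_result(messages, "call_0", "get_current_weather", result)

# `followup` is then passed as `messages` in a second
# client.chat.completions.create(...) call to obtain the final reply.
print(followup[-1]["role"])  # tool
```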
&lt;p>In this way, you can build powerful AI applications capable of interacting with the external world.&lt;/p></description></item><item><title>Llama.cpp Technical Guide: Lightweight LLM Inference Engine</title><link>https://ziyanglin.netlify.app/en/post/llama-cpp-documentation/</link><pubDate>Thu, 26 Jun 2025 01:06:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/llama-cpp-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>Llama.cpp is a high-performance, lightweight inference framework for large language models (LLMs) written in C/C++. It focuses on efficiently running LLMs on consumer-grade hardware, making local inference possible on ordinary laptops and even smartphones.&lt;/p>
&lt;p>&lt;strong>Core Advantages:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High Performance:&lt;/strong> Achieves extremely fast inference speeds through optimized C/C++ code, quantization techniques, and hardware acceleration support (such as Apple Metal, CUDA, OpenCL, SYCL).&lt;/li>
&lt;li>&lt;strong>Lightweight:&lt;/strong> Extremely low memory and computational resource consumption, eliminating the need for expensive GPUs.&lt;/li>
&lt;li>&lt;strong>Cross-Platform:&lt;/strong> Supports multiple platforms including macOS, Linux, Windows, Docker, Android, and iOS.&lt;/li>
&lt;li>&lt;strong>Open Ecosystem:&lt;/strong> Features an active community and rich ecosystem, including Python bindings, UI tools, and OpenAI-compatible servers.&lt;/li>
&lt;li>&lt;strong>Continuous Innovation:&lt;/strong> Quickly follows and implements the latest model architectures and inference optimization techniques.&lt;/li>
&lt;/ul>
&lt;h2 id="2-core-concepts">2. Core Concepts&lt;/h2>
&lt;h3 id="21-gguf-model-format">2.1. GGUF Model Format&lt;/h3>
&lt;p>GGUF (Georgi Gerganov Universal Format) is the core model file format used by &lt;code>llama.cpp&lt;/code>, an evolution of its predecessor GGML. GGUF is a binary format designed for fast loading and memory mapping.&lt;/p>
&lt;p>&lt;strong>Key Features:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Unified File:&lt;/strong> Packages model metadata, vocabulary, and all tensors (weights) in a single file.&lt;/li>
&lt;li>&lt;strong>Extensibility:&lt;/strong> Allows adding new metadata without breaking compatibility.&lt;/li>
&lt;li>&lt;strong>Backward Compatibility:&lt;/strong> Guarantees compatibility with older versions of GGUF models.&lt;/li>
&lt;li>&lt;strong>Memory Efficiency:&lt;/strong> Supports memory mapping (mmap), allowing multiple processes to share the same model weights, thereby saving memory.&lt;/li>
&lt;/ul>
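The fixed-size header is simple enough to inspect by hand. Per the GGUF specification, a file begins with the 4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata-entry counts, all little-endian. A minimal header reader over a synthetic header (real files continue with metadata key-value pairs, which are not parsed here):

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed 24-byte GGUF header: magic, version, counts."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}

# Synthetic header: version 3, 291 tensors, 24 metadata entries.
header = b"GGUF" + struct.pack("<IQQ", 3, 291, 24)
print(read_gguf_header(header))
```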
&lt;h3 id="22-quantization">2.2. Quantization&lt;/h3>
&lt;p>Quantization is one of the core advantages of &lt;code>llama.cpp&lt;/code>. It is a technique that converts model weights from high-precision floating-point numbers (such as 32-bit or 16-bit) to low-precision integers (such as 4-bit, 5-bit, or 8-bit).&lt;/p>
&lt;p>&lt;strong>Main Benefits:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Reduced Model Size:&lt;/strong> Significantly reduces the size of model files, making them easier to distribute and store.&lt;/li>
&lt;li>&lt;strong>Lower Memory Usage:&lt;/strong> Reduces the RAM required to load the model into memory.&lt;/li>
&lt;li>&lt;strong>Faster Inference:&lt;/strong> Low-precision calculations are typically faster than high-precision ones, especially on CPUs.&lt;/li>
&lt;/ul>
&lt;p>&lt;code>llama.cpp&lt;/code> supports various quantization methods, particularly &lt;strong>k-quants&lt;/strong>, an advanced quantization technique that achieves extremely high compression rates while maintaining high model performance.&lt;/p>
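The savings are easy to estimate from the GGML block layouts: an FP16 weight costs 16 bits, while a Q4_0 block packs 32 weights into 18 bytes (16 bytes of 4-bit values plus a 2-byte FP16 scale), i.e. 4.5 bits per weight, and Q8_0 uses 34 bytes per 32 weights (8.5 bits per weight). For a 7B-parameter model:

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9
print(f"FP16 : {model_size_gb(n, 16):.1f} GB")   # ~14.0 GB
print(f"Q8_0 : {model_size_gb(n, 8.5):.1f} GB")  # ~7.4 GB
print(f"Q4_0 : {model_size_gb(n, 4.5):.1f} GB")  # ~3.9 GB
```

These figures cover weights only; the KV cache and activations add further runtime memory on top.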
&lt;h3 id="23-multimodal-support">2.3. Multimodal Support&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> is not limited to text models; it has evolved into a powerful multimodal inference engine that supports processing text, images, and even audio simultaneously.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Supported Models:&lt;/strong> Supports various mainstream multimodal models such as LLaVA, MobileVLM, Granite, Qwen2.5 Omni, InternVL, SmolVLM, etc.&lt;/li>
&lt;li>&lt;strong>Working Principle:&lt;/strong> Typically converts images into embedding vectors through a vision encoder (such as CLIP), and then inputs these vectors along with text embedding vectors into the LLM.&lt;/li>
&lt;li>&lt;strong>Tools:&lt;/strong> &lt;code>llama-mtmd-cli&lt;/code> and &lt;code>llama-server&lt;/code> provide native support for multimodal models.&lt;/li>
&lt;/ul>
&lt;h2 id="3-usage-methods">3. Usage Methods&lt;/h2>
&lt;h3 id="31-compilation">3.1. Compilation&lt;/h3>
&lt;p>Compiling &lt;code>llama.cpp&lt;/code> from source is very simple.&lt;/p>
&lt;pre>&lt;code class="language-bash">git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
make
&lt;/code>&lt;/pre>
&lt;p>For specific hardware acceleration (such as CUDA or Metal), use the corresponding compilation options:&lt;/p>
&lt;pre>&lt;code class="language-bash"># For CUDA
make LLAMA_CUDA=1
# For Metal (on macOS)
make LLAMA_METAL=1
&lt;/code>&lt;/pre>
&lt;h3 id="32-basic-inference">3.2. Basic Inference&lt;/h3>
&lt;p>After compilation, you can use the &lt;code>llama-cli&lt;/code> tool for inference.&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m ./models/7B/ggml-model-q4_0.gguf -p &amp;quot;Building a website can be done in 10 simple steps:&amp;quot; -n 400
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;code>-m&lt;/code>: Specifies the path to the GGUF model file.&lt;/li>
&lt;li>&lt;code>-p&lt;/code>: Specifies the prompt.&lt;/li>
&lt;li>&lt;code>-n&lt;/code>: Specifies the maximum number of tokens to generate.&lt;/li>
&lt;/ul>
&lt;h3 id="33-openai-compatible-server">3.3. OpenAI Compatible Server&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> provides a built-in HTTP server with an API compatible with OpenAI's API. This makes it easy to integrate with existing tools like LangChain and LlamaIndex.&lt;/p>
&lt;p>Starting the server:&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-server -m models/7B/ggml-model-q4_0.gguf -c 4096
&lt;/code>&lt;/pre>
&lt;p>You can then send requests to &lt;code>http://localhost:8080/v1/chat/completions&lt;/code> just like you would with the OpenAI API.&lt;/p>
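As a hedged sketch, such a request can be built with only the standard library. The model name is a placeholder (a single-model `llama-server` serves whatever GGUF it loaded); adjust the host and port to your setup:

```python
import json
import urllib.request

payload = {
    "model": "local-model",  # placeholder; the server uses its loaded GGUF
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.7,
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the server running, send it and read the reply:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.get_method())  # POST (urllib infers POST when a body is attached)
```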
&lt;h2 id="4-advanced-features">4. Advanced Features&lt;/h2>
&lt;h3 id="41-speculative-decoding">4.1. Speculative Decoding&lt;/h3>
&lt;p>This is an advanced inference optimization technique that significantly accelerates generation speed by using a small &amp;ldquo;draft&amp;rdquo; model to predict the output of the main model.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Working Principle:&lt;/strong> The draft model quickly generates a draft token sequence, which is then validated all at once by the main model. If validated, it saves the time of generating tokens one by one.&lt;/li>
&lt;li>&lt;strong>Usage:&lt;/strong> Use the &lt;code>--model-draft&lt;/code> (&lt;code>-md&lt;/code>) parameter in &lt;code>llama-cli&lt;/code> or &lt;code>llama-server&lt;/code> to specify a small, fast draft model.&lt;/li>
&lt;/ul>
&lt;h3 id="42-lora-support">4.2. LoRA Support&lt;/h3>
&lt;p>LoRA (Low-Rank Adaptation) allows fine-tuning a model's behavior by training a small adapter without modifying the original model weights. &lt;code>llama.cpp&lt;/code> supports loading one or more LoRA adapters during inference.&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m base-model.gguf --lora lora-adapter.gguf
&lt;/code>&lt;/pre>
&lt;p>You can even set different weights for different LoRA adapters:&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m base.gguf --lora-scaled lora_A.gguf 0.5 --lora-scaled lora_B.gguf 0.5
&lt;/code>&lt;/pre>
&lt;h3 id="43-grammars">4.3. Grammars&lt;/h3>
&lt;p>Grammars are a very powerful feature that allows you to force the model's output to follow a specific format, such as a strict JSON schema.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Format:&lt;/strong> Uses a format called GBNF (GGML BNF) to define grammar rules.&lt;/li>
&lt;li>&lt;strong>Application:&lt;/strong> By providing GBNF rules through the &lt;code>grammar&lt;/code> parameter in API requests, you can ensure that the model returns correctly formatted, directly parsable JSON data, avoiding output format errors and tedious post-processing.&lt;/li>
&lt;/ul>
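&lt;p>As a minimal sketch, a request body carrying a tiny GBNF grammar might look like this (the grammar and field values are illustrative; the &lt;code>grammar&lt;/code> field is the one the server reads):&lt;/p>

```python
import json

# A tiny GBNF grammar: the model may only answer "yes" or "no".
grammar = 'root ::= "yes" | "no"'

# Request body for the llama.cpp server's completion endpoint.
payload = {
    "prompt": "Is water wet? Answer yes or no.",
    "grammar": grammar,
    "n_predict": 4,
}
body = json.dumps(payload)
print(body)
```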
&lt;p>&lt;strong>Example:&lt;/strong> Using a Pydantic model to generate a JSON Schema, then converting it to GBNF to ensure the model output conforms to the expected Python object structure.&lt;/p>
&lt;pre>&lt;code class="language-python">import json
from typing import List
from pydantic import BaseModel
class QAPair(BaseModel):
question: str
answer: str
class Summary(BaseModel):
key_facts: List[str]
qa_pairs: List[QAPair]
# Generate JSON Schema and print
schema = Summary.model_json_schema()
print(json.dumps(schema, indent=2))
&lt;/code>&lt;/pre>
&lt;h2 id="5-ecosystem">5. Ecosystem&lt;/h2>
&lt;p>The success of &lt;code>llama.cpp&lt;/code> has spawned a vibrant ecosystem:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://github.com/abetlen/llama-cpp-python">llama-cpp-python&lt;/a>:&lt;/strong> The most popular Python binding, providing interfaces to almost all features of &lt;code>llama.cpp&lt;/code> and deeply integrated with frameworks like LangChain and LlamaIndex.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://ollama.com/">Ollama&lt;/a>:&lt;/strong> A tool for packaging, distributing, and running models, using &lt;code>llama.cpp&lt;/code> under the hood, greatly simplifying the process of running LLMs locally.&lt;/li>
&lt;li>&lt;strong>Numerous UI Tools:&lt;/strong> The community has developed a large number of graphical interface tools, allowing non-technical users to easily interact with local models.&lt;/li>
&lt;/ul>
&lt;h2 id="6-conclusion">6. Conclusion&lt;/h2>
&lt;p>&lt;code>llama.cpp&lt;/code> is not just an inference engine; it has become a key force in driving the localization and popularization of LLMs. Through its excellent performance, highly optimized resource usage, and continuously expanding feature set (such as multimodality and grammar constraints), &lt;code>llama.cpp&lt;/code> provides developers and researchers with a powerful and flexible platform, enabling them to explore and deploy AI applications on various devices, ushering in a new era of low-cost, privacy-protecting local AI.&lt;/p></description></item><item><title>vLLM Technical Guide: High-Performance LLM Inference Engine</title><link>https://ziyanglin.netlify.app/en/post/vllm-documentation/</link><pubDate>Thu, 26 Jun 2025 01:05:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/vllm-documentation/</guid><description>&lt;h2 id="1-introduction-to-vllm">1. Introduction to vLLM&lt;/h2>
&lt;p>vLLM is an open-source inference and serving engine designed for large language models (LLMs), renowned for its high throughput and memory efficiency. In the field of LLM serving, vLLM addresses a core pain point: traditional inference systems are inefficient when handling the key-value cache (KV Cache) in Transformer models&amp;rsquo; attention mechanism, resulting in significant memory waste and limited inference speed.&lt;/p>
&lt;p>The memory bottleneck in LLM inference primarily stems from the KV Cache. This cache stores the attention keys and values of every previous token in a sequence to accelerate the generation of subsequent tokens. However, the size of the KV Cache is dynamic and hard to predict, which makes memory management challenging. Traditional systems (like HuggingFace Transformers) typically pre-allocate a large contiguous memory region for the KV Cache, leading to severe memory fragmentation and waste.&lt;/p>
&lt;p>vLLM fundamentally solves this problem by introducing its core innovation: the &lt;strong>PagedAttention&lt;/strong> mechanism.&lt;/p>
&lt;h2 id="2-core-features-and-advantages">2. Core Features and Advantages&lt;/h2>
&lt;p>vLLM stands out among numerous LLM inference frameworks thanks to several key features:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Extremely High Throughput&lt;/strong>: Through PagedAttention and Continuous Batching, vLLM significantly improves GPU utilization. Its throughput is several times higher than HuggingFace Transformers and outperforms other mainstream inference libraries.&lt;/li>
&lt;li>&lt;strong>Efficient Memory Management&lt;/strong>: The PagedAttention mechanism divides the KV Cache into fixed-size blocks stored in non-contiguous memory, greatly reducing internal and external memory fragmentation. According to official data, it can save up to 55% of memory, meaning you can load larger models or serve more concurrent requests on the same hardware.&lt;/li>
&lt;li>&lt;strong>Flexible Decoding Strategies&lt;/strong>: vLLM supports various complex decoding algorithms, including Parallel Sampling, Beam Search, and Top-K/Top-P sampling, meeting the needs of different application scenarios.&lt;/li>
&lt;li>&lt;strong>OpenAI API Compatibility&lt;/strong>: vLLM provides a service endpoint that is fully compatible with the OpenAI API. This means you can seamlessly integrate vLLM into existing application ecosystems built on the OpenAI API with just a few configuration changes.&lt;/li>
&lt;li>&lt;strong>Distributed Inference&lt;/strong>: For ultra-large models that cannot fit on a single GPU, vLLM supports Tensor Parallelism, distributing model weights and computational load across multiple GPUs for efficient distributed inference.&lt;/li>
&lt;li>&lt;strong>Streaming and Structured Output&lt;/strong>: Supports streaming of generated tokens and can produce structured outputs in specific formats (such as JSON Schema or regular expressions) through Guided Generation.&lt;/li>
&lt;/ul>
&lt;h2 id="3-core-architecture-deep-dive-into-pagedattention">3. Core Architecture: Deep Dive into PagedAttention&lt;/h2>
&lt;p>PagedAttention is the soul of vLLM, with its design inspiration coming from the paging technique used in modern operating systems to manage virtual memory.&lt;/p>
&lt;h3 id="31-working-principle">3.1 Working Principle&lt;/h3>
&lt;p>In traditional methods, the KV Cache for each sequence is stored in contiguous memory. While this approach seems simple, the vast differences in sequence lengths lead to severe memory fragmentation.&lt;/p>
&lt;p>PagedAttention divides each sequence's KV Cache into fixed-size &lt;strong>blocks&lt;/strong>. Each block can store keys and values for a fixed number of tokens. During inference, vLLM's core scheduler dynamically allocates these blocks to sequences as needed.&lt;/p>
&lt;p>The advantages of this design include:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Minimal Internal Fragmentation&lt;/strong>: Since blocks are of fixed size, only a sequence's last block may contain unused space, and this waste is far less than that caused by reserving contiguous memory for the entire sequence up front.&lt;/li>
&lt;li>&lt;strong>Flexible Memory Allocation&lt;/strong>: Blocks are stored in non-contiguous memory, making memory management more flexible, similar to how operating systems manage physical memory pages.&lt;/li>
&lt;li>&lt;strong>Efficient Memory Sharing&lt;/strong>: PagedAttention makes sharing KV Cache between different sequences exceptionally simple and efficient. For example, in parallel sampling or beam search, multiple candidate sequences originate from the same prompt. vLLM allows these sequences to share KV blocks storing the prompt portion, only needing to allocate new, independent blocks for each sequence when generating new tokens. This &amp;ldquo;Copy-on-Write&amp;rdquo; mechanism greatly reduces the memory overhead of complex decoding algorithms.&lt;/li>
&lt;/ol>
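&lt;p>The bookkeeping behind this sharing can be sketched in a few lines of Python (a simplified model of block tables and reference counts, not vLLM's actual implementation):&lt;/p>

```python
# Simplified PagedAttention bookkeeping: each sequence maps logical block
# indices to physical blocks; shared prompt blocks are copied on write.

class BlockManager:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # free physical block ids
        self.ref_count = {}                  # physical block -> #owners

    def allocate(self):
        block = self.free.pop(0)
        self.ref_count[block] = 1
        return block

    def fork(self, block_table):
        # Share all blocks of a parent sequence (e.g. parallel sampling).
        for b in block_table:
            self.ref_count[b] += 1
        return list(block_table)

    def copy_on_write(self, block_table, i):
        # Before a shared block is written, give the writer a private copy.
        b = block_table[i]
        if self.ref_count[b] > 1:
            self.ref_count[b] -= 1
            block_table[i] = self.allocate()
        return block_table[i]

mgr = BlockManager(num_blocks=8)
parent = [mgr.allocate(), mgr.allocate()]  # prompt occupies blocks 0 and 1
child = mgr.fork(parent)                   # a sampling branch shares both
mgr.copy_on_write(child, 1)                # child writes -> private copy
print(parent, child)
```

&lt;p>The fork shares the prompt blocks for free; only when a branch writes does it receive a private block, which is exactly the copy-on-write behavior illustrated for the shared block B3 in the diagram below.&lt;/p>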
&lt;p>Below is a Mermaid diagram that more intuitively illustrates PagedAttention's memory management approach:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph Physical_Memory [KV Cache Physical Memory]
direction LR
B1(Block 1)
B2(Block 2)
B3(Block 3)
B4(Block 4)
B5(Block 5)
B6(Block 6)
B7(Block 7)
B8(Block 8)
end
subgraph Logical_View [Sequence Logical View]
direction TB
subgraph Seq1 [Sequence 1]
P1(Prompt) --&amp;gt; T1(Token 1)
end
subgraph Seq2 [Sequence 2]
P2(Prompt) --&amp;gt; T2(Token 1) --&amp;gt; T3(Token 2)
end
subgraph Seq3 [Parallel Sampling]
P3(Prompt) --&amp;gt; T4(Token 1a)
P3 --&amp;gt; T5(Token 1b)
end
end
subgraph Block_Table [Block Table]
direction TB
Map1[&amp;quot;Seq 1: [B1, B5]&amp;quot;]
Map2[&amp;quot;Seq 2: [B2, B6, B8]&amp;quot;]
Map3[&amp;quot;Seq 3a: [B3, B7]&amp;quot;]
Map4[&amp;quot;Seq 3b: [B3, B4]&amp;quot;]
end
Seq1 --&amp;gt; Map1
Seq2 --&amp;gt; Map2
Seq3 --&amp;gt; Map3
Seq3 --&amp;gt; Map4
Map1 --&amp;gt; B1
Map1 --&amp;gt; B5
Map2 --&amp;gt; B2
Map2 --&amp;gt; B6
Map2 --&amp;gt; B8
Map3 --&amp;gt; B3
Map3 --&amp;gt; B7
Map4 --&amp;gt; B3
Map4 --&amp;gt; B4
style B3 fill:#f9f,stroke:#333,stroke-width:2px
linkStyle 8 stroke-width:2px,stroke:green,fill:none;
linkStyle 11 stroke-width:2px,stroke:green,fill:none;
linkStyle 12 stroke-width:2px,stroke:green,fill:none;
&lt;/code>&lt;/pre>
&lt;p>&lt;em>Diagram explanation:&lt;/em>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>KV Cache Physical Memory&lt;/strong>: Represents non-contiguous physical memory blocks on the GPU.&lt;/li>
&lt;li>&lt;strong>Sequence Logical View&lt;/strong>: Represents multiple requests (sequences) being processed.&lt;/li>
&lt;li>&lt;strong>Block Table&lt;/strong>: vLLM's core component that maps logical token positions to physical memory blocks.&lt;/li>
&lt;li>&lt;strong>Memory Sharing&lt;/strong>: Note that the two branches in &amp;ldquo;Parallel Sampling&amp;rdquo; (3a and 3b) share the same Prompt block (B3), demonstrating PagedAttention's efficient memory sharing.&lt;/li>
&lt;/ul>
&lt;h3 id="32-continuous-batching">3.2 Continuous Batching&lt;/h3>
&lt;p>Based on PagedAttention, vLLM implements a more advanced batching strategy: continuous batching. Traditional static batching must wait for every sequence in a batch to finish before the next batch starts. Continuous batching instead admits new requests the moment a sequence in the batch completes, avoiding idle GPU time and further improving throughput.&lt;/p>
&lt;p>Below is a comparison of the two batching methods using a Mermaid sequence diagram:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant C as Client
participant S as Server
participant G as GPU
note over C, G: --- Static Batching ---
C-&amp;gt;&amp;gt;S: Request [R1, R2, R3, R4]
S-&amp;gt;&amp;gt;G: Process Batch 1 [R1, R2, R3, R4]
note right of G: All requests process in parallel
G--&amp;gt;&amp;gt;S: Batch 1 Finished
note right of S: Wait for the entire batch to complete
S--&amp;gt;&amp;gt;C: Response [O1, O2, O3, O4]
C-&amp;gt;&amp;gt;S: Request [R5, R6]
S-&amp;gt;&amp;gt;G: Process Batch 2 [R5, R6]
note over C, G: --- Continuous Batching ---
C-&amp;gt;&amp;gt;S: Request [R1, R2, R3, R4]
S-&amp;gt;&amp;gt;G: Process [R1, R2, R3, R4]
G--&amp;gt;&amp;gt;S: R2 Finished
S--&amp;gt;&amp;gt;C: Response O2
C-&amp;gt;&amp;gt;S: New Request R5
S-&amp;gt;&amp;gt;G: Add R5 to queue (GPU is not idle)
note right of G: R1, R3, R4, R5 are now processing
G--&amp;gt;&amp;gt;S: R4 Finished
S--&amp;gt;&amp;gt;C: Response O4
&lt;/code>&lt;/pre>
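&lt;p>The difference can also be quantified with a toy scheduler (illustrative only; a real scheduler must also respect KV-cache memory limits):&lt;/p>

```python
# Toy comparison: total GPU steps needed to finish all requests.
# Each request needs `length` decode steps; at most `batch` sequences
# run per step.

def static_batching_steps(lengths, batch):
    steps = 0
    for i in range(0, len(lengths), batch):
        steps += max(lengths[i:i + batch])  # whole batch waits for the longest
    return steps

def continuous_batching_steps(lengths, batch):
    pending, running, steps = list(lengths), [], 0
    while pending or running:
        while pending and len(running) != batch:
            running.append(pending.pop(0))  # join as soon as a slot frees up
        running = [r - 1 for r in running]
        running = [r for r in running if r]  # finished sequences leave at once
        steps += 1
    return steps

lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print(static_batching_steps(lengths, batch=4))      # -> 16
print(continuous_batching_steps(lengths, batch=4))  # -> 10
```

&lt;p>For the same workload, continuous batching finishes in 10 GPU steps instead of 16, because short requests never hold a slot while waiting for long ones to complete.&lt;/p>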
&lt;h2 id="4-quick-start-guide">4. Quick Start Guide&lt;/h2>
&lt;p>Below, we'll demonstrate how to install and use vLLM through a few simple steps.&lt;/p>
&lt;h3 id="41-installation">4.1 Installation&lt;/h3>
&lt;p>You can install vLLM using either &lt;code>pip&lt;/code> or &lt;code>uv&lt;/code> (a faster package installation tool). Using &lt;code>uv&lt;/code> is recommended as it can automatically detect your CUDA version and install the matching PyTorch backend.&lt;/p>
&lt;p>&lt;strong>Using uv (recommended):&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash"># Create and activate a virtual environment
uv venv
source .venv/bin/activate
# Install vLLM
uv pip install vllm --torch-backend=auto
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Using pip:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install vllm
&lt;/code>&lt;/pre>
&lt;h3 id="42-offline-inference">4.2 Offline Inference&lt;/h3>
&lt;p>The &lt;code>vllm.LLM&lt;/code> class makes offline inference very convenient.&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM, SamplingParams
# Define input prompts
prompts = [
&amp;quot;Hello, my name is&amp;quot;,
&amp;quot;The capital of France is&amp;quot;,
&amp;quot;The future of AI is&amp;quot;,
]
# Define sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Initialize the LLM engine (model will be automatically downloaded from Hugging Face)
llm = LLM(model=&amp;quot;facebook/opt-125m&amp;quot;)
# Generate text
outputs = llm.generate(prompts, sampling_params)
# Print results
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f&amp;quot;Prompt: {prompt!r}, Generated text: {generated_text!r}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h3 id="43-launching-an-openaicompatible-server">4.3 Launching an OpenAI-Compatible Server&lt;/h3>
&lt;p>One of vLLM's most powerful features is its built-in API server. With just one command, you can start a service compatible with the OpenAI API.&lt;/p>
&lt;pre>&lt;code class="language-bash">vllm serve Qwen/Qwen2.5-1.5B-Instruct
&lt;/code>&lt;/pre>
&lt;p>By default, the server will run on &lt;code>http://localhost:8000&lt;/code>.&lt;/p>
&lt;h3 id="44-interacting-with-the-server">4.4 Interacting with the Server&lt;/h3>
&lt;p>You can interact with the server using &lt;code>curl&lt;/code> or the &lt;code>openai&lt;/code> Python client.&lt;/p>
&lt;p>&lt;strong>Using curl:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">curl http://localhost:8000/v1/completions \
-H &amp;quot;Content-Type: application/json&amp;quot; \
-d '{
&amp;quot;model&amp;quot;: &amp;quot;Qwen/Qwen2.5-1.5B-Instruct&amp;quot;,
&amp;quot;prompt&amp;quot;: &amp;quot;San Francisco is a&amp;quot;,
&amp;quot;max_tokens&amp;quot;: 7,
&amp;quot;temperature&amp;quot;: 0
}'
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Using the OpenAI Python client:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from openai import OpenAI
client = OpenAI(
base_url=&amp;quot;http://localhost:8000/v1&amp;quot;,
api_key=&amp;quot;not-used&amp;quot; # API key is not required
)
completion = client.chat.completions.create(
model=&amp;quot;Qwen/Qwen2.5-1.5B-Instruct&amp;quot;,
messages=[
{&amp;quot;role&amp;quot;: &amp;quot;system&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;You are a helpful assistant.&amp;quot;},
{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Who won the world series in 2020?&amp;quot;}
]
)
print(completion.choices[0].message)
&lt;/code>&lt;/pre>
&lt;h2 id="5-model-serving">5. Model Serving&lt;/h2>
&lt;h3 id="51-distributed-serving">5.1 Distributed Serving&lt;/h3>
&lt;p>If a model is too large to fit on a single GPU, you can distribute it across multiple GPUs using tensor parallelism.&lt;/p>
&lt;pre>&lt;code class="language-bash"># Start a service on 4 GPUs
vllm serve facebook/opt-13b --tensor-parallel-size 4
&lt;/code>&lt;/pre>
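&lt;p>The mechanism behind tensor parallelism can be sketched without GPU code: each &amp;ldquo;device&amp;rdquo; holds a column slice of a weight matrix, computes its partial output, and the slices are gathered back together (a toy model of the idea, not vLLM's implementation):&lt;/p>

```python
def matmul(x, w_cols):
    # x: input vector; w_cols: weight matrix stored column by column.
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in w_cols]

# A 3x4 weight matrix stored as 4 columns of length 3.
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
x = [2.0, 3.0, 4.0]

full = matmul(x, W)  # single-device reference result

# Tensor parallelism: two "devices" each hold half of the columns,
# compute their partial outputs, and the results are concatenated
# (the all-gather step in a real multi-GPU setup).
dev0, dev1 = W[:2], W[2:]
parallel = matmul(x, dev0) + matmul(x, dev1)
print(full, parallel)
```

&lt;p>Each device only needs to store its own column slice, which is why a model too large for one GPU's memory can still be served across several.&lt;/p>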
&lt;h3 id="52-docker-deployment">5.2 Docker Deployment&lt;/h3>
&lt;p>vLLM provides official Docker images for convenient containerized deployment.&lt;/p>
&lt;pre>&lt;code class="language-bash">docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env &amp;quot;HUGGING_FACE_HUB_TOKEN=&amp;lt;your-hf-token&amp;gt;&amp;quot; \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-v0.1
&lt;/code>&lt;/pre>
&lt;h2 id="6-advanced-features">6. Advanced Features&lt;/h2>
&lt;h3 id="61-structured-outputs">6.1 Structured Outputs&lt;/h3>
&lt;p>vLLM supports various ways to constrain the model's output format, which is crucial for applications requiring reliable, parsable outputs.&lt;/p>
&lt;p>&lt;strong>Generating JSON using Pydantic models:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from pydantic import BaseModel
from openai import OpenAI
client = OpenAI(base_url=&amp;quot;http://localhost:8000/v1&amp;quot;, api_key=&amp;quot;dummy&amp;quot;)
model = client.models.list().data[0].id
class People(BaseModel):
name: str
age: int
completion = client.chat.completions.create(
model=model,
messages=[
{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Generate a JSON with the name and age of one random person.&amp;quot;}
],
response_format={
&amp;quot;type&amp;quot;: &amp;quot;json_schema&amp;quot;,
&amp;quot;json_schema&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;people&amp;quot;,
&amp;quot;schema&amp;quot;: People.model_json_schema()
}
},
)
print(completion.choices[0].message.content)
&lt;/code>&lt;/pre>
&lt;h3 id="62-lora-support">6.2 LoRA Support&lt;/h3>
&lt;p>vLLM can efficiently serve multiple LoRA adapters on the same base model. This is particularly useful for scenarios requiring customized models for different customers or tasks.&lt;/p>
&lt;p>&lt;strong>Enabling LoRA support (shown here for the offline engine; &lt;code>vllm serve&lt;/code> accepts the equivalent &lt;code>--enable-lora&lt;/code> flag):&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM
llm = LLM(model=&amp;quot;meta-llama/Llama-2-7b-hf&amp;quot;, enable_lora=True)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Specifying a LoRA adapter in a request:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">curl http://localhost:8000/v1/completions \
-H &amp;quot;Content-Type: application/json&amp;quot; \
-d '{
&amp;quot;model&amp;quot;: &amp;quot;sql-lora&amp;quot;, # Specify the LoRA model ID
&amp;quot;prompt&amp;quot;: &amp;quot;San Francisco is a&amp;quot;,
&amp;quot;max_tokens&amp;quot;: 7
}'
&lt;/code>&lt;/pre>
&lt;h3 id="63-quantization">6.3 Quantization&lt;/h3>
&lt;p>Quantization is a technique to reduce model size and memory usage by lowering the precision of model weights. vLLM supports various quantization schemes, such as AWQ and FP8 KV cache.&lt;/p>
&lt;p>&lt;strong>Enabling FP8 KV cache:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM
llm = LLM(
model=&amp;quot;meta-llama/Llama-2-7b-chat-hf&amp;quot;,
kv_cache_dtype=&amp;quot;fp8&amp;quot;,
calculate_kv_scales=True # Dynamically calculate quantization scales
)
&lt;/code>&lt;/pre>
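&lt;p>The principle is easy to demonstrate with a symmetric int8 quantization of a small weight vector (illustrative only; vLLM's FP8 KV-cache quantization uses a different number format and granularity, but the scale-and-round idea is the same):&lt;/p>

```python
# Symmetric int8 quantization: store values as small integers plus one
# floating-point scale; dequantize by multiplying the scale back in.

def quantize(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.12, -0.50, 0.33, 1.27]
q, scale = quantize(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(max_err, 6))
```

&lt;p>Each value now occupies one byte instead of four (or two), at the cost of a bounded rounding error of at most half a quantization step.&lt;/p>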
&lt;h2 id="7-framework-integration">7. Framework Integration&lt;/h2>
&lt;p>vLLM can be easily integrated with popular LLM application frameworks like Langchain and LlamaIndex for building complex systems such as Retrieval-Augmented Generation (RAG). Typically, vLLM serves as a backend providing fast LLM inference and embedding generation services.&lt;/p>
&lt;p>&lt;strong>Installing related dependencies:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install -U vllm langchain_openai langchain_community
&lt;/code>&lt;/pre>
&lt;p>Afterward, in Langchain, you can point the &lt;code>base_url&lt;/code> of &lt;code>ChatOpenAI&lt;/code> or &lt;code>OpenAIEmbeddings&lt;/code> to your vLLM server's address to complete the integration.&lt;/p>
&lt;h2 id="8-conclusion">8. Conclusion&lt;/h2>
&lt;p>Through its innovative PagedAttention architecture, vLLM successfully addresses memory management and performance bottlenecks in LLM inference, providing developers with an extremely efficient, flexible, and easy-to-use inference serving engine. Whether conducting quick offline experiments or deploying production-grade, high-concurrency LLM services, vLLM demonstrates excellent performance and powerful functionality. As the community continues to develop, vLLM is becoming one of the standard tools in the field of LLM serving.&lt;/p></description></item></channel></rss>