<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Machine Learning | Ziyang Lin</title><link>https://ziyanglin.netlify.app/en/categories/machine-learning/</link><atom:link href="https://ziyanglin.netlify.app/en/categories/machine-learning/index.xml" rel="self" type="application/rss+xml"/><description>Machine Learning</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Fri, 27 Jun 2025 05:00:00 +0000</lastBuildDate><image><url>https://ziyanglin.netlify.app/img/icon-192.png</url><title>Machine Learning</title><link>https://ziyanglin.netlify.app/en/categories/machine-learning/</link></image><item><title>CLIP Technology Analysis: Unified Representation Through Image-Text Contrastive Learning</title><link>https://ziyanglin.netlify.app/en/post/clip-documentation/</link><pubDate>Fri, 27 Jun 2025 05:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/clip-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>CLIP (Contrastive Language-Image Pre-training) is an advanced deep learning model developed by OpenAI, designed to understand the relationship between images and the text that describes them. Through pre-training on hundreds of millions of (image, text) pairs, CLIP learns a shared multimodal embedding space that maps both images and text to vectors within this space.&lt;/p>
&lt;p>The revolutionary aspect of CLIP lies in its powerful &lt;strong>Zero-Shot Learning&lt;/strong> capabilities. Traditional image classification models typically require training for specific tasks and labels, whereas CLIP can classify images into categories it has never explicitly seen during training, greatly enhancing the model's generalization ability and flexibility.&lt;/p>
&lt;h2 id="2-core-concepts">2. Core Concepts&lt;/h2>
&lt;p>To understand CLIP, we first need to grasp several core concepts:&lt;/p>
&lt;h3 id="21-multimodal-learning">2.1 Multimodal Learning&lt;/h3>
&lt;p>Multimodal learning refers to the ability of models to process and associate information from different modalities (such as text, images, audio). Humans understand the world by combining visual, auditory, and linguistic information, and multimodal learning aims to give AI similar capabilities. CLIP is an outstanding example of multimodal learning in the domains of images and text.&lt;/p>
&lt;h3 id="22-contrastive-learning">2.2 Contrastive Learning&lt;/h3>
&lt;p>Contrastive learning is a self-supervised learning method. Its core idea is to &lt;strong>bring similar samples closer together in the representation space while pushing dissimilar samples apart&lt;/strong>.&lt;/p>
&lt;p>Imagine a large collection of &amp;ldquo;image-text&amp;rdquo; pairs. For a given image (e.g., a picture of a cat), its corresponding text description (&amp;ldquo;a photo of a cat&amp;rdquo;) is a positive sample, while all other text descriptions (e.g., &amp;ldquo;a photo of a dog&amp;rdquo;, &amp;ldquo;a photo of a car&amp;rdquo;) are negative samples. CLIP's goal is to learn an encoder that makes the representation of &amp;ldquo;a cat picture&amp;rdquo; and &amp;ldquo;a photo of a cat&amp;rdquo; very close in the vector space, while keeping representations of unrelated text descriptions far apart.&lt;/p>
&lt;h3 id="23-zeroshot-learning">2.3 Zero-Shot Learning&lt;/h3>
&lt;p>Zero-shot learning refers to a model's ability to recognize and classify categories it has never seen during training. CLIP achieves this by transforming image classification into an image-text matching problem.&lt;/p>
&lt;p>For example, to determine if an image is a &amp;ldquo;dog,&amp;rdquo; we don't need a model specifically trained to recognize &amp;ldquo;dogs.&amp;rdquo; We simply encode the image into a vector, encode the text &amp;ldquo;a photo of a dog&amp;rdquo; into another vector, and then calculate the similarity between these two vectors. If the similarity is high, we can consider the image to be a &amp;ldquo;dog.&amp;rdquo; This approach allows CLIP to identify objects of any category, as long as we can describe them in text.&lt;/p>
&lt;h2 id="3-model-architecture">3. Model Architecture&lt;/h2>
&lt;p>The CLIP model consists of two main components: an image encoder and a text encoder.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Image Encoder&lt;/strong>: Responsible for converting input images into feature vectors. CLIP uses two mainstream architectures:
&lt;ul>
&lt;li>&lt;strong>ResNet&lt;/strong>: A classic convolutional neural network.&lt;/li>
&lt;li>&lt;strong>Vision Transformer (ViT)&lt;/strong>: A model that applies the Transformer architecture to image recognition.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Text Encoder&lt;/strong>: Responsible for converting input text into feature vectors. CLIP uses the standard &lt;strong>Transformer&lt;/strong> architecture.&lt;/li>
&lt;/ul>
&lt;p>These two encoders map images and text to the same multi-dimensional embedding space, allowing their vector representations to be directly compared.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;CLIP Model&amp;quot;
direction LR
subgraph &amp;quot;Image Encoder (ViT or ResNet)&amp;quot;
I[Image] --&amp;gt; IE(Encoder) --&amp;gt; IV[Image Feature Vector]
end
subgraph &amp;quot;Text Encoder (Transformer)&amp;quot;
T[Text] --&amp;gt; TE(Encoder) --&amp;gt; TV[Text Feature Vector]
end
end
IV -- &amp;quot;Cosine Similarity&amp;quot; --&amp;gt; S(Similarity Score)
TV -- &amp;quot;Cosine Similarity&amp;quot; --&amp;gt; S
&lt;/code>&lt;/pre>
&lt;h2 id="4-workflow">4. Workflow&lt;/h2>
&lt;p>CLIP's workflow is divided into training and inference phases.&lt;/p>
&lt;h3 id="41-training-phase">4.1 Training Phase&lt;/h3>
&lt;p>During the training phase, CLIP learns from a dataset containing hundreds of millions of (image, text) pairs. For a batch of data containing N (image, text) pairs, CLIP performs the following operations:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Encoding&lt;/strong>: Pass N images through the image encoder to get N image feature vectors, and pass N texts through the text encoder to get N text feature vectors.&lt;/li>
&lt;li>&lt;strong>Calculate Similarity&lt;/strong>: Compute the cosine similarity between each of the N image feature vectors and each of the N text feature vectors, resulting in an N x N similarity matrix.&lt;/li>
&lt;li>&lt;strong>Contrastive Learning&lt;/strong>: In this matrix, the elements on the diagonal correspond to the correct (image, text) pairs, which we want to have high similarity. Elements off the diagonal represent mismatched pairs, which we want to have low similarity. The model is optimized through a contrastive loss function to achieve this goal.&lt;/li>
&lt;/ol>
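&lt;p>The three training steps above can be sketched in a few lines of PyTorch. This is an illustrative reconstruction of the symmetric contrastive objective, not OpenAI's actual implementation; the feature tensors and the temperature value are stand-ins.&lt;/p>

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of N aligned (image, text) pairs."""
    # L2-normalize so that dot products equal cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # N x N similarity matrix; entry (i, j) compares image i with text j
    logits = image_features @ text_features.t() / temperature

    # The correct pairing for row i is column i (the diagonal)
    targets = torch.arange(logits.size(0))

    # Cross-entropy in both directions: image-to-text and text-to-image
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# Toy example: random stand-in features for a batch of 4 pairs
img = torch.randn(4, 512)
txt = torch.randn(4, 512)
loss = clip_contrastive_loss(img, txt)
```

&lt;p>When image and text features for matched pairs coincide, the diagonal dominates the similarity matrix and the loss approaches zero; mismatched features drive it toward the uniform-guessing baseline of log N.&lt;/p>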
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Input a batch of (image, text) pairs&amp;quot;] --&amp;gt; B{&amp;quot;Encoding&amp;quot;};
B --&amp;gt; C[&amp;quot;Image Encoder&amp;quot;] --&amp;gt; D[&amp;quot;Image Feature Vectors&amp;quot;];
B --&amp;gt; E[&amp;quot;Text Encoder&amp;quot;] --&amp;gt; F[&amp;quot;Text Feature Vectors&amp;quot;];
D &amp;amp; F --&amp;gt; G{&amp;quot;Calculate Cosine Similarity Matrix&amp;quot;};
G --&amp;gt; H[&amp;quot;Contrastive Loss Function&amp;quot;];
H --&amp;gt; I[&amp;quot;Optimize Model Parameters&amp;quot;];
&lt;/code>&lt;/pre>
&lt;h3 id="42-inference-phase-zeroshot-classification">4.2 Inference Phase (Zero-Shot Classification)&lt;/h3>
&lt;p>During the inference phase, CLIP can perform zero-shot image classification tasks:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Prepare Text Prompts&lt;/strong>: For all categories you want to classify (e.g., &amp;ldquo;cat&amp;rdquo;, &amp;ldquo;dog&amp;rdquo;, &amp;ldquo;car&amp;rdquo;), create a series of text prompts such as &amp;ldquo;a photo of a cat&amp;rdquo;, &amp;ldquo;a photo of a dog&amp;rdquo;, &amp;ldquo;a photo of a car&amp;rdquo;.&lt;/li>
&lt;li>&lt;strong>Encode Text&lt;/strong>: Convert these text prompts into a series of text feature vectors using the text encoder.&lt;/li>
&lt;li>&lt;strong>Encode Image&lt;/strong>: Convert the image to be classified into an image feature vector using the image encoder.&lt;/li>
&lt;li>&lt;strong>Calculate Similarity&lt;/strong>: Compute the cosine similarity between the image feature vector and all text feature vectors.&lt;/li>
&lt;li>&lt;strong>Prediction&lt;/strong>: The category corresponding to the text prompt with the highest similarity is CLIP's prediction result.&lt;/li>
&lt;/ol>
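&lt;p>The five steps above reduce to a nearest-neighbor search in the embedding space. The sketch below assumes the feature vectors have already been produced by the two encoders; small hand-made vectors stand in for real CLIP embeddings here.&lt;/p>

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(image_feature, text_features, class_names):
    """Pick the class whose text embedding is most similar to the image embedding."""
    image_feature = F.normalize(image_feature, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    sims = text_features @ image_feature   # one cosine similarity per class
    return class_names[sims.argmax().item()]

classes = ["cat", "dog", "car"]
# Stand-in embeddings; in practice these come from CLIP's text and image encoders
text_feats = torch.eye(3)                   # one vector per text prompt
image_feat = torch.tensor([0.1, 0.9, 0.2])  # closest to the "dog" prompt
print(zero_shot_predict(image_feat, text_feats, classes))  # -> dog
```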
&lt;h2 id="5-applications">5. Applications&lt;/h2>
&lt;p>CLIP's powerful capabilities make it widely applicable in many fields:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Zero-Shot Image Classification&lt;/strong>: Classify images into arbitrary categories without additional training.&lt;/li>
&lt;li>&lt;strong>Image Retrieval&lt;/strong>: Search for matching images using natural language descriptions.&lt;/li>
&lt;li>&lt;strong>Content Moderation&lt;/strong>: Automatically identify and filter inappropriate image content.&lt;/li>
&lt;li>&lt;strong>Guiding Generative Models&lt;/strong>: CLIP's multimodal understanding ability can guide generative models (like DALL-E 2) to create images that match text descriptions.&lt;/li>
&lt;/ul>
&lt;h2 id="6-code-example">6. Code Example&lt;/h2>
&lt;p>Here's a simple Python code example demonstrating how to use the &lt;code>clip&lt;/code> library to load the model and obtain image feature vectors.&lt;/p>
&lt;p>First, install the necessary libraries:&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install torch clip
&lt;/code>&lt;/pre>
&lt;p>Then, you can use the following code:&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
import clip
from PIL import Image
# Load the model, can run on either CPU or GPU
device = &amp;quot;cuda&amp;quot; if torch.cuda.is_available() else &amp;quot;cpu&amp;quot;
model, preprocess = clip.load(&amp;quot;ViT-B/32&amp;quot;, device=device)
# Load and preprocess the image
image_path = &amp;quot;cat.jpg&amp;quot; # Replace with your image path
image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
# Prepare text descriptions
text_descriptions = [&amp;quot;a photo of a cat&amp;quot;, &amp;quot;a photo of a dog&amp;quot;]
text_tokens = clip.tokenize(text_descriptions).to(device)
with torch.no_grad():
# Encode images and text
image_features = model.encode_image(image)
text_features = model.encode_text(text_tokens)
# Calculate similarity
logits_per_image, logits_per_text = model(image, text_tokens)
probs = logits_per_image.softmax(dim=-1).cpu().numpy()
print(&amp;quot;Label probs:&amp;quot;, probs) # Output the matching probability between the image and each text description
&lt;/code>&lt;/pre>
&lt;h2 id="7-conclusion">7. Conclusion&lt;/h2>
&lt;p>Through its innovative contrastive learning method, CLIP successfully connects text and images in a shared representation space, demonstrating powerful zero-shot learning capabilities. It has not only achieved excellent results in multiple benchmark tests but has also opened new paths for the development of multimodal artificial intelligence.&lt;/p>
&lt;p>&lt;strong>Advantages&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Strong generalization ability and zero-shot performance.&lt;/li>
&lt;li>No need for fine-tuning for specific tasks, saving significant annotation costs.&lt;/li>
&lt;li>Can understand complex and abstract text descriptions.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Limitations&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>May perform poorly on very fine-grained classification tasks (such as identifying specific bird species).&lt;/li>
&lt;li>Limited understanding of abstract or systematic concepts (such as counting).&lt;/li>
&lt;li>The model's performance is highly dependent on the quality and scale of pre-training data.&lt;/li>
&lt;/ul>
&lt;p>Despite some limitations, CLIP remains one of the most important breakthroughs in artificial intelligence in recent years and continues to push the boundaries of multimodal research.&lt;/p></description></item><item><title>Mixture of Experts (MoE): Sparse Activation Architecture for Large-Scale Neural Networks</title><link>https://ziyanglin.netlify.app/en/post/moe-documentation/</link><pubDate>Fri, 27 Jun 2025 04:02:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/moe-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>Mixture of Experts (MoE) is a neural network architecture that dramatically expands model capacity without a matching increase in computational cost. It does this by decomposing a large model into multiple smaller &amp;ldquo;expert&amp;rdquo; networks and using a &amp;ldquo;gating&amp;rdquo; network to dynamically select the most appropriate subset of experts for each input.&lt;/p>
&lt;p>This approach draws inspiration from expert systems in human society, where specific problems are directed to relevant specialists. In deep learning, this means the model can learn to route different inputs to expert networks specialized in processing that type of data, enabling more efficient and specialized learning.&lt;/p>
&lt;h2 id="2-core-components-macro-and-micro-analysis">2. Core Components: Macro and Micro Analysis&lt;/h2>
&lt;p>From a macro perspective, MoE layers typically serve as efficient alternatives to standard Feed-Forward Network (FFN) layers in Transformer models. While traditional FFN layers apply identical transformations to every token in a sequence, MoE layers introduce the concept of &lt;strong>Conditional Computation&lt;/strong>: for each token, the model dynamically selects a small subset of &amp;ldquo;expert&amp;rdquo; networks to process it, rather than engaging the entire model's parameters. This mechanism allows models to maintain relatively constant computation costs despite having enormous parameter counts.&lt;/p>
&lt;p>An MoE layer consists of two core components: &lt;strong>Expert Networks&lt;/strong> and a &lt;strong>Gating Network&lt;/strong>.
Below is a visualization of the macro architecture of an MoE layer:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[Input Token] --&amp;gt; B{Gating Network};
B -- Routing Decision --&amp;gt; C1[Expert 1];
B -- Routing Decision --&amp;gt; C2[Expert 2];
B -- ... --&amp;gt; Cn[Expert n];
C1 --&amp;gt; D[Output];
C2 --&amp;gt; D;
Cn --&amp;gt; D;
&lt;/code>&lt;/pre>
&lt;h3 id="21-expert-networks-specialized-processors">2.1. Expert Networks: Specialized Processors&lt;/h3>
&lt;h4 id="underlying-structure-and-variants">Underlying Structure and Variants&lt;/h4>
&lt;p>At the foundational level, each &amp;ldquo;expert&amp;rdquo; is typically an independent feed-forward neural network (FFN). In standard Transformer architectures, an FFN usually consists of two linear layers and a non-linear activation function (such as GeLU or SwiGLU).&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Homogeneous Experts&lt;/strong>: In most MoE models, all experts share identical network structures. For example, in the Mixtral 8x7B model, each MoE layer contains 8 structurally identical expert FFNs. This design facilitates implementation and optimization.&lt;/li>
&lt;li>&lt;strong>Heterogeneous Experts&lt;/strong>: Though less common, experts can theoretically be heterogeneous, using different activation functions, hidden layer dimensions, or even more complex structures (like convolutional layers). This might allow the model to learn more diverse features but increases implementation complexity.&lt;/li>
&lt;/ul>
&lt;h4 id="functional-specialization-from-general-to-specialized">Functional Specialization: From General to Specialized&lt;/h4>
&lt;p>During training, although all experts start identical, the routing mechanism of the gating network guides them to develop different &amp;ldquo;specializations.&amp;rdquo; For example, in natural language processing tasks, after sufficient training, we might observe:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Grammar Experts&lt;/strong>: Specialized in processing tokens related to sentence structure, parts of speech, etc.&lt;/li>
&lt;li>&lt;strong>Semantic Experts&lt;/strong>: Focused on understanding word meanings and contextual relationships.&lt;/li>
&lt;li>&lt;strong>Domain-Specific Knowledge Experts&lt;/strong>: For instance, one expert might specialize in &amp;ldquo;legal&amp;rdquo; text, while another becomes more sensitive to &amp;ldquo;biomedical&amp;rdquo; domain knowledge.&lt;/li>
&lt;/ul>
&lt;p>This functional specialization is a key source of MoE models&amp;rsquo; efficiency, as it allows the model to process specific types of information with dedicated subnetworks rather than using a single large, general network for all information.&lt;/p>
&lt;h3 id="22-gating-network-intelligent-routing-and-dispatch-center">2.2. Gating Network: Intelligent Routing and Dispatch Center&lt;/h3>
&lt;p>The gating network is the core decision-making unit of MoE, responsible for assigning the most appropriate experts to each input token.&lt;/p>
&lt;h4 id="technical-details">Technical Details&lt;/h4>
&lt;p>The gating network implementation is typically concise and efficient. Its workflow is as follows:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Generate Logits&lt;/strong>: For the vector representation &lt;code>x&lt;/code> of an input token (typically the output from a self-attention layer), the gating network calculates routing logits through a simple trainable linear layer &lt;code>W_g&lt;/code>: &lt;code>logits = einsum(&amp;quot;d,de-&amp;gt;e&amp;quot;, x, W_g)&lt;/code>, where &lt;code>d&lt;/code> is the token dimension and &lt;code>e&lt;/code> is the number of experts. This operation produces a vector of length &lt;code>e&lt;/code>, with each element representing the &amp;ldquo;score&amp;rdquo; for the corresponding expert.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Top-K Routing Mechanism&lt;/strong>: To achieve sparse computation, tokens are not sent to all experts. The gating network selects the &lt;code>k&lt;/code> highest scores from the logits vector. This &lt;code>k&lt;/code> value is an important hyperparameter; in Mixtral 8x7B, &lt;code>k=2&lt;/code>. This means each token is processed by only the two most relevant experts.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Calculate Gating Weights (Softmax)&lt;/strong>: The selected &lt;code>k&lt;/code> logits are normalized through a Softmax function, generating &lt;code>k&lt;/code> gating weights that determine how to combine the outputs of these &lt;code>k&lt;/code> experts.
&lt;code>weights = softmax(top_k_logits)&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Calculate Final Output&lt;/strong>: The input token &lt;code>x&lt;/code> is sent to the selected &lt;code>k&lt;/code> experts, producing &lt;code>k&lt;/code> expert outputs. The final output is the weighted sum of these &lt;code>k&lt;/code> expert outputs, with weights being the gating weights calculated in the previous step.
&lt;code>output = sum(weights[i] * expert_i(x) for i in top_k_indices)&lt;/code>&lt;/p>
&lt;/li>
&lt;/ol>
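&lt;p>The four steps above can be sketched directly in PyTorch. This is a minimal single-token illustration with randomly initialized experts and gate, not a production implementation; the dimensions are made up.&lt;/p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, num_experts, k = 16, 8, 2  # token dim, number of experts, top-k
torch.manual_seed(0)

# Each expert is a small FFN; the gate is a single linear layer W_g
experts = nn.ModuleList(
    nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    for _ in range(num_experts)
)
W_g = nn.Linear(d, num_experts, bias=False)

x = torch.randn(d)  # one token's hidden state

# 1. Generate logits: one score per expert
logits = W_g(x)

# 2. Top-K routing: keep only the k highest-scoring experts
top_logits, top_idx = logits.topk(k)

# 3. Softmax over the selected logits gives the gating weights
weights = F.softmax(top_logits, dim=-1)

# 4. Weighted sum of the selected experts' outputs
output = sum(w * experts[i](x) for w, i in zip(weights, top_idx.tolist()))
```

&lt;p>Only k of the num_experts FFNs run for this token, which is where the computational savings come from.&lt;/p>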
&lt;p>Below is a visualization of this workflow:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Input Token x] --&amp;gt; B{Multiply by Gating Weight Matrix W_g};
B --&amp;gt; C{Calculate Logits};
C --&amp;gt; D{Top-K Selection};
D -- k highest scores --&amp;gt; E{Softmax};
E -- Normalized weights --&amp;gt; F[Weighted Sum];
A -- Send to Top-K experts --&amp;gt; G1[&amp;quot;Expert i processes x&amp;quot;];
A -- Send to Top-K experts --&amp;gt; G2[&amp;quot;Expert j processes x&amp;quot;];
G1 --&amp;gt; F;
G2 --&amp;gt; F;
F --&amp;gt; H[Final Output];
&lt;/code>&lt;/pre>
&lt;h4 id="key-challenge-load-balancing">Key Challenge: Load Balancing&lt;/h4>
&lt;p>A critical challenge for the gating network is the &amp;ldquo;Matthew Effect&amp;rdquo;: some experts may receive more training opportunities due to slightly higher initial weights, becoming stronger and subsequently being selected more frequently, causing other experts to be &amp;ldquo;starved.&amp;rdquo; To address this issue, MoE introduces an &lt;strong>Auxiliary Load Balancing Loss&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Principle&lt;/strong>: This loss function aims to encourage the gating network to distribute tokens as evenly as possible across all experts. In the Switch Transformer formulation, for example, it is computed per batch as the sum over experts of (the fraction of tokens dispatched to each expert) times (the mean router probability assigned to that expert), scaled by the number of experts and an adjustable hyperparameter &lt;code>α&lt;/code>. The loss value increases as the distribution becomes more unbalanced.&lt;/li>
&lt;li>&lt;strong>Optimization&lt;/strong>: This auxiliary loss is added to the model's main task loss (such as cross-entropy loss for language models) to form the final total loss function. By optimizing both losses during backpropagation, the model is incentivized to maintain load balance among experts while completing its main task.&lt;/li>
&lt;/ul>
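&lt;p>The auxiliary loss described above can be sketched as follows. This follows the Switch Transformer form as a hedged illustration; the function name, tensor shapes, and default &lt;code>α&lt;/code> are made up for the example.&lt;/p>

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_indices, num_experts, alpha=0.01):
    """Switch-style auxiliary loss: alpha * N * sum_i f_i * P_i.

    f_i = fraction of tokens dispatched to expert i
    P_i = mean router probability assigned to expert i
    """
    probs = F.softmax(router_logits, dim=-1)  # (tokens, experts)
    f = torch.bincount(expert_indices, minlength=num_experts).float()
    f = f / expert_indices.numel()
    P = probs.mean(dim=0)
    return alpha * num_experts * (f * P).sum()

# Toy batch: 6 tokens, 4 experts, top-1 routing
torch.manual_seed(0)
logits = torch.randn(6, 4)
assignments = logits.argmax(dim=-1)
aux = load_balancing_loss(logits, assignments, num_experts=4)
```

&lt;p>Under a perfectly uniform router and uniform assignments the loss collapses to &lt;code>α&lt;/code>; skewed routing pushes it higher, which is the gradient signal that counteracts the Matthew Effect.&lt;/p>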
&lt;h2 id="3-moe-model-training-methods-addressing-scale-challenges">3. MoE Model Training Methods: Addressing Scale Challenges&lt;/h2>
&lt;p>Due to the enormous parameter count of MoE models (despite sparse computation), their training poses significant challenges to computational resources, especially memory. To effectively train MoE models, complex parallelization strategies must be employed.&lt;/p>
&lt;h3 id="31-expert-parallelism">3.1. Expert Parallelism&lt;/h3>
&lt;p>This is the core parallelization strategy for training MoE models.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: Distribute different experts across different computing devices (such as GPUs). For example, in a scenario with an MoE layer containing 8 experts and 8 GPUs, each GPU is responsible for storing and computing one expert. Other parts of the model (such as self-attention layers) can be replicated on each GPU.&lt;/li>
&lt;li>&lt;strong>Workflow and Communication Overhead&lt;/strong>: In each forward pass, tokens from various GPUs, after being processed by the gating network, need to be sent to the GPUs storing the corresponding experts based on routing decisions. This process involves a global &lt;strong>All-to-All&lt;/strong> communication operation, where each GPU needs to send and receive data to and from all other GPUs. After computation, results are sent back to the original GPUs through another All-to-All communication. This intensive communication is the main performance bottleneck in expert parallel mode.&lt;/li>
&lt;/ul>
&lt;h3 id="32-combining-with-other-parallelism-strategies">3.2. Combining with Other Parallelism Strategies&lt;/h3>
&lt;p>To address different scales of models and hardware configurations, expert parallelism often needs to be combined with other parallelism strategies:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Data Parallelism&lt;/strong>: This is the most common parallelism approach. When the number of GPUs exceeds the number of experts, multiple GPUs can form a data parallel group, with each group containing a complete set of experts (distributed through expert parallelism). For example, with 64 GPUs and 8 experts, 8 data parallel groups can be created, each with 8 GPUs, with each GPU responsible for one expert.&lt;/li>
&lt;li>&lt;strong>Model Parallelism and Pipeline Parallelism&lt;/strong>: For ultra-large models where even a single expert or non-MoE layer cannot fit into a single GPU, tensor model parallelism and pipeline parallelism need to be introduced to further split the model.&lt;/li>
&lt;/ul>
&lt;p>In summary, training MoE is a complex multi-dimensional parallel engineering task that requires careful design of parallelism strategies based on factors such as model size, number of experts, number of GPUs, and network bandwidth.&lt;/p>
&lt;h2 id="4-advantages-of-moe">4. Advantages of MoE&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Enormous Model Capacity&lt;/strong>: MoE allows models to have massive parameters (e.g., trillions of parameters) without needing to compute all parameters in each forward pass. This enables the model to learn more complex and detailed knowledge.&lt;/li>
&lt;li>&lt;strong>Controllable Computational Cost&lt;/strong>: Due to the sparse activation strategy (activating only a few experts), the training and inference costs of MoE models are comparable to dense models with far fewer total parameters.&lt;/li>
&lt;li>&lt;strong>Faster Training and Inference&lt;/strong>: Under the same computational budget, MoE models typically converge faster and have faster inference speeds compared to dense models.&lt;/li>
&lt;/ul>
&lt;h2 id="5-challenges-of-moe">5. Challenges of MoE&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Training Instability&lt;/strong>: The gating network may tend to always select a few &amp;ldquo;popular&amp;rdquo; experts, preventing other experts from being adequately trained. To address this issue, a &amp;ldquo;load balancing loss&amp;rdquo; is typically introduced to encourage the gating network to distribute inputs evenly across all experts.&lt;/li>
&lt;li>&lt;strong>High Communication Cost&lt;/strong>: In distributed training, since different experts may be distributed across different computing devices, routing input data from the gating network to selected experts incurs significant communication overhead.&lt;/li>
&lt;li>&lt;strong>Complex Implementation&lt;/strong>: Compared to standard dense models, MoE models are more complex to implement and deploy, requiring specialized parallel computing strategies and hardware support.&lt;/li>
&lt;li>&lt;strong>Memory Consumption&lt;/strong>: Although computation is sparse, all parameters of the model (all experts) need to be stored in memory, placing high demands on hardware.&lt;/li>
&lt;/ul>
&lt;h2 id="6-key-technologies-and-recent-advances">6. Key Technologies and Recent Advances&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Switch Transformers&lt;/strong>: This is a simplified MoE architecture proposed by Google that simplifies the top-k strategy to top-1, meaning each token is routed to only one expert. This design greatly simplifies routing logic and reduces communication costs.&lt;/li>
&lt;li>&lt;strong>GShard&lt;/strong>: This is a system for training MoE models on ultra-large-scale clusters. It effectively addresses the communication bottleneck in MoE training through clever data and model parallelism strategies.&lt;/li>
&lt;li>&lt;strong>Expert Capacity Factor&lt;/strong>: To handle load imbalance issues, a &amp;ldquo;capacity&amp;rdquo; can be set for each expert, defining the maximum number of tokens it can process in a batch. If an expert is selected more times than its capacity, excess tokens will be &amp;ldquo;dropped&amp;rdquo; or routed to other experts.&lt;/li>
&lt;li>&lt;strong>Latest Routing Strategies&lt;/strong>: Researchers are exploring more advanced routing strategies, such as allowing tokens to be routed to multiple experts with weighted combination of their outputs, or using more complex gating networks to make smarter routing decisions.&lt;/li>
&lt;li>&lt;strong>Applications in Computer Vision&lt;/strong>: MoE is not limited to NLP; it has also been successfully applied to computer vision tasks such as pose estimation, enhancing model performance by training specialized experts for different datasets or pose types.&lt;/li>
&lt;/ul>
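&lt;p>The expert capacity mechanism in the list above amounts to a per-expert counter: once an expert has accepted its quota of tokens for a batch, further tokens routed to it are dropped (in practice they pass through unchanged via the residual connection). A minimal sketch with hypothetical names and shapes:&lt;/p>

```python
import torch

def apply_capacity(assignments, num_experts, capacity):
    """Return a boolean mask of tokens that fit within each expert's capacity."""
    kept = torch.zeros_like(assignments, dtype=torch.bool)
    counts = [0] * num_experts
    for t, e in enumerate(assignments.tolist()):  # earlier tokens win
        if counts[e] < capacity:
            counts[e] += 1
            kept[t] = True
    return kept

# 6 tokens routed between 2 experts, each expert can take at most 2 tokens
assignments = torch.tensor([0, 0, 0, 1, 1, 1])
mask = apply_capacity(assignments, num_experts=2, capacity=2)
print(mask.tolist())  # [True, True, False, True, True, False]
```

&lt;p>The capacity is usually set as &lt;code>capacity_factor * tokens_per_batch / num_experts&lt;/code>, trading dropped tokens against wasted padding.&lt;/p>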
&lt;h2 id="7-summary-and-outlook">7. Summary and Outlook&lt;/h2>
&lt;p>MoE models have successfully achieved massive model scaling at controllable computational costs by introducing sparsely activated expert networks, becoming a key technology for building ultra-large-scale language and vision models.&lt;/p>
&lt;p>Despite challenges in training stability and communication overhead, with the continued maturation of technologies like Switch Transformers and GShard, as well as the emergence of new routing strategies and hardware optimizations, the application prospects for MoE are increasingly broad. In the future, we can expect to see more, larger, and more efficient MoE models playing important roles across various domains.&lt;/p></description></item><item><title>LLM Hyperparameter Tuning Guide: A Comprehensive Analysis from Generation to Deployment</title><link>https://ziyanglin.netlify.app/en/post/llm-hyperparameters-documentation/</link><pubDate>Fri, 27 Jun 2025 03:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/llm-hyperparameters-documentation/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;h2 id="span-stylefontsize-09embehind-the-powerful-capabilities-of-large-language-models-llms-is-a-series-of-complex-hyperparameters-working-silently-whether-youre-deploying-a-local-inference-service-like-vllm-or-calling-openais-api-precisely-tuning-these-parameters-is-crucial-for-achieving-ideal-performance-cost-and-output-quality-this-document-provides-a-detailed-analysis-of-two-key-categories-of-hyperparameters-generation-sampling-parameters-and-deployment-serving-parameters-helping-you-fully-master-their-functions-values-impacts-and-best-practices-across-different-scenariosspan">&lt;span style="font-size: 0.9em;">Behind the powerful capabilities of large language models (LLMs) is a series of complex hyperparameters working silently. Whether you're deploying a local inference service like vLLM or calling OpenAI's API, precisely tuning these parameters is crucial for achieving ideal performance, cost, and output quality. This document provides a detailed analysis of two key categories of hyperparameters: &lt;strong>Generation (Sampling) Parameters&lt;/strong> and &lt;strong>Deployment (Serving) Parameters&lt;/strong>, helping you fully master their functions, values, impacts, and best practices across different scenarios.&lt;/span>&lt;/h2>
&lt;h3 id="part-1-generation-sampling-parameters--controlling-model-creativity-and-determinism">Part 1: Generation (Sampling) Parameters — Controlling Model Creativity and Determinism&lt;/h3>
&lt;p>Generation parameters directly control the model's behavior when generating the next token. They primarily revolve around a core question: how to select from thousands of possible next words in the probability distribution provided by the model.&lt;/p>
&lt;h3 id="1-temperature">1. &lt;code>temperature&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Controls the randomness of generated text. Higher &lt;code>temperature&lt;/code> increases randomness, making responses more creative and diverse; lower &lt;code>temperature&lt;/code> decreases randomness, making responses more deterministic and conservative.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle:&lt;/strong>
When generating the next token, the model calculates &lt;code>logits&lt;/code> (raw, unnormalized prediction scores) for all words in the vocabulary. Typically, we use the &lt;code>Softmax&lt;/code> function to convert these &lt;code>logits&lt;/code> into a probability distribution. The &lt;code>temperature&lt;/code> parameter is introduced before the &lt;code>Softmax&lt;/code> calculation, &amp;ldquo;smoothing&amp;rdquo; or &amp;ldquo;sharpening&amp;rdquo; this probability distribution.&lt;/p>
&lt;p>The standard Softmax formula is: &lt;code>P(i) = exp(logit_i) / Σ_j(exp(logit_j))&lt;/code>&lt;/p>
&lt;p>With &lt;code>temperature&lt;/code> (T) introduced, the formula becomes: &lt;code>P(i) = exp(logit_i / T) / Σ_j(exp(logit_j / T))&lt;/code>&lt;/p>
&lt;ul>
&lt;li>When &lt;code>T&lt;/code> -&amp;gt; 0, the differences in &lt;code>logit_i / T&lt;/code> become dramatically amplified. The token with the highest logit approaches a probability of 1, while all other tokens approach 0. This causes the model to almost always choose the most likely word, behaving very deterministically and &amp;ldquo;greedily.&amp;rdquo;&lt;/li>
&lt;li>When &lt;code>T&lt;/code> = 1, the formula reverts to standard Softmax, and the model behaves in its &amp;ldquo;original&amp;rdquo; state.&lt;/li>
&lt;li>When &lt;code>T&lt;/code> &amp;gt; 1, the differences in &lt;code>logit_i / T&lt;/code> are reduced. Tokens with originally lower probabilities get boosted, making the entire probability distribution &amp;ldquo;flatter.&amp;rdquo; This increases the chance of selecting less common words, introducing more randomness and creativity.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> &lt;code>[0.0, 2.0]&lt;/code> (theoretically can be higher, but OpenAI API typically limits to 2.0).&lt;/li>
&lt;li>&lt;strong>&lt;code>temperature&lt;/code> = 0.0:&lt;/strong> Suitable for scenarios requiring deterministic, reproducible, and highly accurate outputs. Examples: code generation, factual Q&amp;amp;A, text classification, data extraction. With identical inputs, outputs will be almost identical (unless the model itself is updated).&lt;/li>
&lt;li>&lt;strong>Low &lt;code>temperature&lt;/code> (e.g., &lt;code>0.1&lt;/code> - &lt;code>0.4&lt;/code>):&lt;/strong> Suitable for semi-creative tasks requiring rigor and fidelity to source material. Examples: article summarization, translation, customer service bots. Outputs will vary slightly but remain faithful to core content.&lt;/li>
&lt;li>&lt;strong>Medium &lt;code>temperature&lt;/code> (e.g., &lt;code>0.5&lt;/code> - &lt;code>0.8&lt;/code>):&lt;/strong> A good balance between creativity and consistency, recommended as the default for most applications. Examples: writing emails, marketing copy, brainstorming.&lt;/li>
&lt;li>&lt;strong>High &lt;code>temperature&lt;/code> (e.g., &lt;code>0.9&lt;/code> - &lt;code>1.5&lt;/code>):&lt;/strong> Suitable for highly creative tasks. Examples: poetry writing, story creation, dialogue script generation. Outputs will be very diverse and sometimes surprising, but may occasionally produce meaningless or incoherent content.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Note:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>OpenAI's documentation advises adjusting either &lt;code>temperature&lt;/code> or &lt;code>top_p&lt;/code>, but not both at once; changing one at a time makes the effect of each adjustment easier to reason about.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
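&lt;p>The temperature-scaled Softmax above is easy to verify numerically. A minimal sketch in plain Python (the logits are illustrative, not from a real model):&lt;/p>

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]                        # hypothetical scores for three tokens
print(softmax_with_temperature(logits, 1.0))    # standard Softmax
print(softmax_with_temperature(logits, 0.2))    # sharp: top token dominates
print(softmax_with_temperature(logits, 2.0))    # flat: probabilities move closer together
```

&lt;p>Lowering &lt;code>T&lt;/code> pushes almost all probability mass onto the top token; raising it flattens the distribution, exactly as described above.&lt;/p>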
&lt;h3 id="2-topp-nucleus-sampling">2. &lt;code>top_p&lt;/code> (Nucleus Sampling)&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Controls generation diversity by dynamically determining the sampling pool size through a cumulative probability threshold (&lt;code>p&lt;/code>) of the highest probability tokens.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle:&lt;/strong>
&lt;code>top_p&lt;/code> is a more intelligent sampling strategy than &lt;code>temperature&lt;/code>, also known as &lt;strong>Nucleus Sampling&lt;/strong>. Instead of adjusting all token probabilities, it directly defines a &amp;ldquo;core&amp;rdquo; candidate set.&lt;/p>
&lt;p>The specific steps are as follows:&lt;/p>
&lt;ol>
&lt;li>The model calculates the probability distribution for all candidate tokens.&lt;/li>
&lt;li>All tokens are sorted by probability from highest to lowest.&lt;/li>
&lt;li>Starting from the highest probability token, their probabilities are cumulatively added until this sum exceeds the set &lt;code>top_p&lt;/code> threshold.&lt;/li>
&lt;li>All tokens included in this cumulative sum form the &amp;ldquo;nucleus&amp;rdquo; for sampling.&lt;/li>
&lt;li>The model will only sample from this nucleus (typically renormalizing their probabilities), and all other tokens are ignored.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Example:&lt;/strong> Assume &lt;code>top_p&lt;/code> = &lt;code>0.9&lt;/code>.&lt;/p>
&lt;ul>
&lt;li>If the highest probability token &amp;ldquo;the&amp;rdquo; has a probability of &lt;code>0.95&lt;/code>, then the nucleus will contain only &amp;ldquo;the&amp;rdquo;, and the model will choose it 100%.&lt;/li>
&lt;li>If &amp;ldquo;the&amp;rdquo; has a probability of &lt;code>0.5&lt;/code>, &amp;ldquo;a&amp;rdquo; has &lt;code>0.3&lt;/code>, and &amp;ldquo;an&amp;rdquo; has &lt;code>0.1&lt;/code>, then the cumulative probability of these three words is &lt;code>0.9&lt;/code>. The nucleus will contain {&amp;ldquo;the&amp;rdquo;, &amp;ldquo;a&amp;rdquo;, &amp;ldquo;an&amp;rdquo;}. The model will sample from these three words according to their (renormalized) probabilities.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> &lt;code>(0.0, 1.0]&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;code>top_p&lt;/code> = 1.0:&lt;/strong> Means the model considers all tokens without any truncation (equivalent to no &lt;code>top_p&lt;/code>).&lt;/li>
&lt;li>&lt;strong>High &lt;code>top_p&lt;/code> (e.g., &lt;code>0.9&lt;/code> - &lt;code>1.0&lt;/code>):&lt;/strong> Allows for more diverse choices, suitable for creative tasks, similar in effect to higher &lt;code>temperature&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Low &lt;code>top_p&lt;/code> (e.g., &lt;code>0.1&lt;/code> - &lt;code>0.3&lt;/code>):&lt;/strong> Greatly restricts the model's range of choices, making its output very deterministic and conservative, similar in effect to extremely low &lt;code>temperature&lt;/code>.&lt;/li>
&lt;li>&lt;strong>General Recommended Value:&lt;/strong> &lt;code>0.9&lt;/code> is a very common default value as it maintains high quality while allowing for some diversity.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>top_p&lt;/code> vs &lt;code>temperature&lt;/code>:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;code>top_p&lt;/code> is more dynamic and adaptive. When the model is very confident about the next step (sharp probability distribution), &lt;code>top_p&lt;/code> automatically narrows the candidate set, ensuring quality. When the model is less confident (flat distribution), it expands the candidate set, increasing diversity.&lt;/li>
&lt;li>&lt;code>temperature&lt;/code> adjusts the entire distribution &amp;ldquo;equally,&amp;rdquo; regardless of whether the distribution itself is sharp or flat.&lt;/li>
&lt;li>Therefore, &lt;code>top_p&lt;/code> is generally considered a safer and more robust method for controlling diversity than &lt;code>temperature&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
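&lt;p>The five steps of nucleus sampling can be sketched directly. This is an illustrative reference implementation, not vLLM's or OpenAI's internal code; the probabilities mirror the &amp;ldquo;the&amp;rdquo;/&amp;ldquo;a&amp;rdquo;/&amp;ldquo;an&amp;rdquo; example above:&lt;/p>

```python
import random

def nucleus_sample(probs, top_p, rng):
    """Sample a token id from the smallest prefix of tokens whose
    cumulative probability reaches top_p (the 'nucleus')."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # renormalize within the nucleus, then sample from it
    total = sum(probs[i] for i in nucleus)
    weights = [probs[i] / total for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

rng = random.Random(0)
probs = [0.5, 0.3, 0.1, 0.06, 0.04]   # cumulative 0.5, 0.8, 0.9, ...
picks = {nucleus_sample(probs, 0.9, rng) for _ in range(200)}
print(picks)  # only token ids 0, 1 and 2 can ever be chosen
```

&lt;p>With &lt;code>top_p = 0.9&lt;/code>, the nucleus here is exactly the first three tokens; the two low-probability tail tokens are never sampled.&lt;/p>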
&lt;h3 id="3-topk">3. &lt;code>top_k&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Simply and directly samples only from the &lt;code>k&lt;/code> tokens with the highest probabilities.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle:&lt;/strong> This is the simplest truncation sampling method. It directly selects the &lt;code>k&lt;/code> tokens with the highest probabilities to form the candidate set, then samples from these &lt;code>k&lt;/code> tokens. All other tokens are ignored.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> Integers, such as &lt;code>1&lt;/code>, &lt;code>10&lt;/code>, &lt;code>50&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;code>top_k&lt;/code> = 1:&lt;/strong> Equivalent to greedy search, always choosing the most likely word.&lt;/li>
&lt;li>&lt;strong>Recommendation:&lt;/strong> &lt;code>top_k&lt;/code> is typically not the preferred sampling strategy because its cutoff is too rigid: when the probability distribution is very flat, a fixed &lt;code>k&lt;/code> may exclude many reasonable words, and when the distribution is very sharp, it may include many extremely low-probability, useless words. &lt;code>top_p&lt;/code> is usually a better choice.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
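&lt;p>The truncation itself is a one-liner; a toy sketch with illustrative probabilities:&lt;/p>

```python
def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens and renormalize them."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}

probs = [0.02, 0.55, 0.25, 0.10, 0.08]
print(top_k_filter(probs, 2))  # only token ids 1 and 2 remain, renormalized
```

&lt;p>Note how the cutoff ignores the shape of the distribution entirely, which is exactly the rigidity criticized above.&lt;/p>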
&lt;h3 id="4-repetitionpenalty">4. &lt;code>repetition_penalty&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Applies a penalty to tokens that have already appeared in the context, reducing their probability of being selected again, thereby reducing repetitive content.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle:&lt;/strong> After calculating &lt;code>logits&lt;/code> but before &lt;code>Softmax&lt;/code>, this parameter iterates through all candidate tokens. If a token has already appeared in the previous context, its &lt;code>logit&lt;/code> is pushed toward lower probability. In the common Hugging Face implementation, positive logits are divided by the penalty and negative logits are multiplied by it, so the adjustment always works in the same direction:&lt;/p>
&lt;p>&lt;code>new_logit = logit / penalty&lt;/code> (if token has appeared and &lt;code>logit&lt;/code> &amp;gt; 0)
&lt;code>new_logit = logit * penalty&lt;/code> (if token has appeared and &lt;code>logit&lt;/code> &amp;lt; 0)
&lt;code>new_logit = logit&lt;/code> (if token has not appeared)&lt;/p>
&lt;p>This way, the final probability of words that have already appeared decreases.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> &lt;code>1.0&lt;/code> to &lt;code>2.0&lt;/code> is common.&lt;/li>
&lt;li>&lt;strong>&lt;code>1.0&lt;/code>:&lt;/strong> No penalty applied (default value).&lt;/li>
&lt;li>&lt;strong>&lt;code>1.1&lt;/code> - &lt;code>1.3&lt;/code>:&lt;/strong> A relatively safe range that can effectively reduce unnecessary repetition without overly affecting normal language expression (such as necessary articles like &amp;ldquo;the&amp;rdquo;).&lt;/li>
&lt;li>&lt;strong>Too High Values:&lt;/strong> May cause the model to deliberately avoid common words, producing unnatural or even strange sentences.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
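&lt;p>A minimal sketch of this divide-or-multiply rule (modeled on Hugging Face's &lt;code>RepetitionPenaltyLogitsProcessor&lt;/code>; the logits and token ids are illustrative):&lt;/p>

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Penalize tokens already present in the context: positive logits are
    divided by the penalty, negative ones multiplied, so the adjusted
    logit always moves toward lower probability."""
    out = list(logits)
    for t in seen_token_ids:
        if out[t] > 0:
            out[t] = out[t] / penalty
        else:
            out[t] = out[t] * penalty
    return out

logits = [3.0, -1.0, 2.0]
print(apply_repetition_penalty(logits, seen_token_ids={0, 1}, penalty=1.2))
# token 0: 3.0 / 1.2 = 2.5 ; token 1: -1.0 * 1.2 = -1.2 ; token 2 untouched
```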
&lt;h3 id="5-frequencypenalty--presencepenalty">5. &lt;code>frequency_penalty&lt;/code> &amp;amp; &lt;code>presence_penalty&lt;/code>&lt;/h3>
&lt;p>These two parameters are more refined versions of &lt;code>repetition_penalty&lt;/code>.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>&lt;code>presence_penalty&lt;/code>:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Function:&lt;/strong> Applies a fixed penalty to all tokens that have &lt;strong>appeared at least once&lt;/strong> in the context. It doesn't care how many times the token has appeared; as long as it has appeared, it gets penalized.&lt;/li>
&lt;li>&lt;strong>Underlying Principle:&lt;/strong> &lt;code>new_logit = logit - presence_penalty&lt;/code> (if token has appeared at least once).&lt;/li>
&lt;li>&lt;strong>Scenario:&lt;/strong> This parameter is useful when you want to encourage the model to introduce entirely new concepts and vocabulary, rather than repeatedly discussing topics that have already been mentioned.&lt;/li>
&lt;li>&lt;strong>Range:&lt;/strong> &lt;code>-2.0&lt;/code> to &lt;code>2.0&lt;/code> in the OpenAI API. Positive values penalize tokens that have already appeared, nudging the model toward new topics; negative values encourage repetition.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>frequency_penalty&lt;/code>:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Function:&lt;/strong> The penalty is proportional to the &lt;strong>frequency&lt;/strong> of the token in the context. The more times a word appears, the heavier the penalty it receives.&lt;/li>
&lt;li>&lt;strong>Underlying Principle:&lt;/strong> &lt;code>new_logit = logit - count(token) * frequency_penalty&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Scenario:&lt;/strong> This parameter is effective when you find the model tends to repeatedly use certain specific high-frequency words (even if they are necessary), leading to monotonous language.&lt;/li>
&lt;li>&lt;strong>Range:&lt;/strong> &lt;code>-2.0&lt;/code> to &lt;code>2.0&lt;/code> in the OpenAI API; positive values discourage verbatim repetition.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Summary:&lt;/strong> &lt;code>presence_penalty&lt;/code> addresses the question of &amp;ldquo;whether it has appeared,&amp;rdquo; while &lt;code>frequency_penalty&lt;/code> addresses &amp;ldquo;how many times it has appeared.&amp;rdquo;&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="6-seed">6. &lt;code>seed&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> By providing a fixed &lt;code>seed&lt;/code>, you can make the model's output reproducible when other parameters (such as &lt;code>temperature&lt;/code>) remain the same.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Function:&lt;/strong> In machine learning, many operations that seem random are actually &amp;ldquo;pseudo-random,&amp;rdquo; determined by an initial &amp;ldquo;seed.&amp;rdquo; Setting the same seed will produce the same sequence of random numbers. In LLMs, this means the sampling process will be completely deterministic.&lt;/li>
&lt;li>&lt;strong>Scenarios:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Debugging and Testing:&lt;/strong> When you need to verify whether a change has affected the output, fixing the &lt;code>seed&lt;/code> can eliminate randomness interference.&lt;/li>
&lt;li>&lt;strong>Reproducible Research:&lt;/strong> Reproducibility is crucial in academic research.&lt;/li>
&lt;li>&lt;strong>Generating Consistent Content:&lt;/strong> When you need the model to consistently produce outputs in the same style for the same input.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Note:&lt;/strong> For complete reproduction, &lt;strong>all&lt;/strong> generation parameters (&lt;code>prompt&lt;/code>, &lt;code>model&lt;/code>, &lt;code>temperature&lt;/code>, &lt;code>top_p&lt;/code>, etc.) must be identical. Even then, determinism is best-effort: backend or model updates can still change outputs, which is why the OpenAI API also returns a &lt;code>system_fingerprint&lt;/code> field for detecting such changes.&lt;/li>
&lt;/ul>
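&lt;p>The effect of seeding is easy to demonstrate with any pseudo-random sampler; this toy example stands in for an LLM's sampling loop:&lt;/p>

```python
import random

def sample_tokens(probs, n, seed):
    """Draw n token ids from a fixed distribution using a seeded RNG."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=n)

probs = [0.6, 0.3, 0.1]
run_a = sample_tokens(probs, 10, seed=42)
run_b = sample_tokens(probs, 10, seed=42)
print(run_a == run_b)  # True: same seed, identical token sequence
print(sample_tokens(probs, 10, seed=7))  # a different seed gives a new draw
```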
&lt;hr>
&lt;h3 id="part-2-deployment-serving-parameters--optimizing-service-performance-and-capacity">Part 2: Deployment (Serving) Parameters — Optimizing Service Performance and Capacity&lt;/h3>
&lt;p>Deployment parameters determine how an LLM inference service manages GPU resources, handles concurrent requests, and optimizes overall throughput and latency. These parameters are particularly important in high-performance inference engines like vLLM.&lt;/p>
&lt;h3 id="1-gpumemoryutilization">1. &lt;code>gpu_memory_utilization&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Controls the proportion of GPU memory that vLLM can use, with the core purpose of reserving space for the &lt;strong>KV Cache&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle (PagedAttention):&lt;/strong>
The core of vLLM is the PagedAttention mechanism. Traditional attention mechanisms pre-allocate a continuous, maximum-length memory space for each request to store the Key-Value (KV) Cache. This leads to severe memory waste, as most requests are far shorter than the maximum length.&lt;/p>
&lt;p>PagedAttention manages the KV Cache like virtual memory in an operating system:&lt;/p>
&lt;ol>
&lt;li>It breaks down each sequence's KV Cache into many small, fixed-size &amp;ldquo;blocks.&amp;rdquo;&lt;/li>
&lt;li>These blocks can be stored non-contiguously in GPU memory.&lt;/li>
&lt;li>A central &amp;ldquo;Block Manager&amp;rdquo; is responsible for allocating and releasing these blocks.&lt;/li>
&lt;/ol>
&lt;p>&lt;code>gpu_memory_utilization&lt;/code> tells vLLM: &amp;ldquo;You can use this much proportion of the total GPU memory for free management (mainly storing model weights and physical blocks of KV Cache).&amp;rdquo;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Impact:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> &lt;code>(0.0, 1.0]&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Default Value:&lt;/strong> &lt;code>0.9&lt;/code> (i.e., 90%).&lt;/li>
&lt;li>&lt;strong>Higher Values (e.g., &lt;code>0.95&lt;/code>):&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> vLLM has more memory for KV Cache, supporting longer contexts and larger batch sizes, thereby increasing throughput.&lt;/li>
&lt;li>&lt;strong>Risk:&lt;/strong> If set too high, there might not be enough spare memory for CUDA kernels, drivers, or other system processes, easily leading to &lt;strong>OOM (Out of Memory)&lt;/strong> errors.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Lower Values (e.g., &lt;code>0.8&lt;/code>):&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> Safer, less prone to OOM, reserves more memory for the system and other applications.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> Reduced available space for KV Cache, potentially causing vLLM to struggle with high concurrency or long sequence requests, degrading performance. When KV Cache is insufficient, vLLM triggers &lt;strong>Preemption&lt;/strong>, swapping out some running sequences and waiting to swap them back in when there's enough space, severely affecting latency. vLLM's warning log &lt;code>&amp;quot;there is not enough KV cache space. This can affect the end-to-end performance.&amp;quot;&lt;/code> is reminding you of this issue.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Start with the default value of &lt;code>0.9&lt;/code>.&lt;/li>
&lt;li>If you encounter OOM, gradually lower this value.&lt;/li>
&lt;li>If you encounter many preemption warnings and confirm no other processes are occupying large amounts of GPU memory, you can gradually increase this value.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
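&lt;p>A back-of-envelope calculation shows why this knob matters. The sketch below estimates how many tokens of KV Cache fit in the memory vLLM manages; the model shapes (32 layers, 32 KV heads, head dim 128, FP16) are illustrative of a 7B-class model, not taken from any specific config:&lt;/p>

```python
def kv_cache_tokens(gpu_mem_gib, gpu_memory_utilization, weights_gib,
                    num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Rough estimate of how many tokens of KV Cache fit in vLLM's budget.
    Per token, each layer stores one K and one V vector per KV head."""
    budget_gib = gpu_mem_gib * gpu_memory_utilization - weights_gib
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return int(budget_gib * 1024 ** 3 / bytes_per_token)

# Illustrative 7B-class model (about 14 GiB of FP16 weights) on an 80 GiB GPU
print(kv_cache_tokens(80, 0.9, 14, num_layers=32, num_kv_heads=32, head_dim=128))
```

&lt;p>Raising &lt;code>gpu_memory_utilization&lt;/code> from &lt;code>0.9&lt;/code> to &lt;code>0.95&lt;/code> in this sketch buys roughly 8 GiB more KV Cache, i.e., thousands of extra cached tokens for batching.&lt;/p>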
&lt;h3 id="2-maxnumseqs">2. &lt;code>max_num_seqs&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Limits the maximum number of sequences (requests) that the vLLM scheduler can process &lt;strong>in one iteration (or one batch)&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle:&lt;/strong>
vLLM's scheduler selects a batch of requests from the waiting queue in each processing cycle. This parameter directly limits the size of this &amp;ldquo;batch.&amp;rdquo; Together with &lt;code>max_num_batched_tokens&lt;/code> (which limits the total number of tokens across all sequences in a batch), it determines the scale of batch processing.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Impact:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> Positive integers, such as &lt;code>16&lt;/code>, &lt;code>64&lt;/code>, &lt;code>256&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Higher Values:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> Allows for higher concurrency, potentially improving GPU utilization and overall throughput.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> Requires more intermediate memory (e.g., for storing &lt;code>logits&lt;/code> and sampling states) and may increase the latency of individual batches. If set too high, even if KV Cache still has space, OOM might occur due to insufficient temporary memory.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Lower Values:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> More memory-friendly, potentially lower latency for individual batches.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> Limits concurrency capability, potentially leading to underutilization of GPU and decreased throughput.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>This value needs to be adjusted based on your GPU memory size, model size, and expected concurrent load.&lt;/li>
&lt;li>For high-concurrency scenarios, try gradually increasing this value while monitoring GPU utilization and memory usage.&lt;/li>
&lt;li>For interactive, low-latency scenarios, consider setting this value lower.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="3-maxmodellen">3. &lt;code>max_model_len&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Sets the &lt;strong>maximum context length&lt;/strong> the model can process (including both prompt and generated tokens).&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle:&lt;/strong>
This parameter directly determines how much logical space vLLM needs to reserve for the KV Cache. For example, if &lt;code>max_model_len&lt;/code> = &lt;code>4096&lt;/code>, vLLM must ensure its memory management mechanism can support storing KV pairs for up to &lt;code>4096&lt;/code> tokens per sequence. vLLM derives the default from the model's config (e.g., &lt;code>max_position_embeddings&lt;/code>) and plans its KV Cache memory at startup accordingly.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Impact:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> Positive integers, cannot exceed the maximum length the model was originally trained on.&lt;/li>
&lt;li>&lt;strong>Higher Values:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> Can handle longer documents and more complex contexts.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> &lt;strong>Significantly increases&lt;/strong> memory consumption. Each token needs to store KV Cache; doubling the length roughly doubles the memory usage. Even if current requests are short, vLLM needs to prepare for potentially long requests, which occupies more KV Cache blocks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Lower Values:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> &lt;strong>Significantly saves&lt;/strong> GPU memory. If you know your application scenario will never exceed 1024 tokens, setting this value to 1024 instead of the default 4096 or 8192 will free up a large amount of KV Cache space, supporting higher concurrency.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> Any requests exceeding this length will be rejected or truncated.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Set as needed!&lt;/strong> This is one of the most effective parameters for optimizing vLLM memory usage. Based on your actual application scenario, set this value to a reasonable maximum with some margin.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="4-tensorparallelsize--pipelineparallelsize">4. &lt;code>tensor_parallel_size&lt;/code> &amp;amp; &lt;code>pipeline_parallel_size&lt;/code>&lt;/h3>
&lt;p>These two parameters are used for deploying extremely large models across multiple GPUs or nodes.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>&lt;code>tensor_parallel_size&lt;/code>:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Function:&lt;/strong> Divides &lt;strong>each layer&lt;/strong> of the model (such as a large weight matrix) into &lt;code>N&lt;/code> parts (&lt;code>N&lt;/code> = &lt;code>tensor_parallel_size&lt;/code>), placing them on &lt;code>N&lt;/code> different GPUs. During computation, each GPU only processes its own portion of the data, then exchanges necessary results through high-speed interconnects (like NVLink) via All-Reduce operations, finally merging to get the complete output.&lt;/li>
&lt;li>&lt;strong>Scenario:&lt;/strong> Used when a single model's volume exceeds the memory of a single GPU. For example, a 70B model's FP16 weights (roughly 140 GB) cannot fit into a single 80GB A100; the model can instead be sharded across two such GPUs with &lt;code>tensor_parallel_size=2&lt;/code> (tight, little KV Cache headroom) or four with &lt;code>tensor_parallel_size=4&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Impact:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> Achieves model parallelism, solving the problem of models not fitting on a single card.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> Introduces significant cross-GPU communication overhead, potentially affecting latency. Requires high-speed interconnects between GPUs.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>pipeline_parallel_size&lt;/code>:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Function:&lt;/strong> Assigns &lt;strong>different layers&lt;/strong> of the model to different GPUs or nodes. For example, placing layers 1-10 on GPU 1, layers 11-20 on GPU 2, and so on. Data flows through these GPUs like a pipeline.&lt;/li>
&lt;li>&lt;strong>Scenario:&lt;/strong> Used when the model is extremely large and needs to be deployed across multiple nodes (machines).&lt;/li>
&lt;li>&lt;strong>Impact:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> Can scale the model to any number of GPUs/nodes.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> Creates &amp;ldquo;pipeline bubbles&amp;rdquo; as additional overhead, where some GPUs are idle during the start and end phases of the pipeline, reducing utilization.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Combined Use:&lt;/strong>
vLLM supports using both parallelism strategies simultaneously for efficient deployment of giant models on large clusters.&lt;/p>
&lt;/li>
&lt;/ul>
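&lt;p>The column-sharding idea behind tensor parallelism needs no GPU to illustrate. In this toy sketch (pure Python, made-up numbers), each &amp;ldquo;GPU&amp;rdquo; holds half of a weight matrix's output columns, computes its partial result, and the results are gathered by concatenation:&lt;/p>

```python
def matmul(x, w):
    """x: input vector; w: weight matrix stored as a list of output columns."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in w]

w = [[1, 0], [0, 1], [2, 3], [4, 5]]   # 4 output columns, input dim 2
shard_0, shard_1 = w[:2], w[2:]        # tensor_parallel_size = 2

x = [10, 1]
partial_0 = matmul(x, shard_0)          # computed on "GPU 0"
partial_1 = matmul(x, shard_1)          # computed on "GPU 1"
full = partial_0 + partial_1            # gather: concatenate partial outputs
print(full)  # [10, 1, 23, 45], identical to matmul(x, w) on one device
```

&lt;p>Real tensor parallelism alternates column- and row-sharded layers and uses All-Reduce over NVLink instead of a Python concatenation, but the partitioning principle is the same.&lt;/p>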
&lt;hr>
&lt;h3 id="summary-and-best-practices">Summary and Best Practices&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Scenario&lt;/th>
&lt;th align="left">&lt;code>temperature&lt;/code>&lt;/th>
&lt;th align="left">&lt;code>top_p&lt;/code>&lt;/th>
&lt;th align="left">&lt;code>repetition_penalty&lt;/code>&lt;/th>
&lt;th align="left">&lt;code>gpu_memory_utilization&lt;/code>&lt;/th>
&lt;th align="left">&lt;code>max_num_seqs&lt;/code>&lt;/th>
&lt;th align="left">&lt;code>max_model_len&lt;/code>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">&lt;strong>Code Generation/Factual Q&amp;amp;A&lt;/strong>&lt;/td>
&lt;td align="left">&lt;code>0.0&lt;/code> - &lt;code>0.2&lt;/code>&lt;/td>
&lt;td align="left">(Not recommended to modify)&lt;/td>
&lt;td align="left">&lt;code>1.0&lt;/code>&lt;/td>
&lt;td align="left">&lt;code>0.9&lt;/code> (Default)&lt;/td>
&lt;td align="left">Adjust based on concurrency&lt;/td>
&lt;td align="left">Set as needed&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Article Summarization/Translation&lt;/strong>&lt;/td>
&lt;td align="left">&lt;code>0.2&lt;/code> - &lt;code>0.5&lt;/code>&lt;/td>
&lt;td align="left">(Not recommended to modify)&lt;/td>
&lt;td align="left">&lt;code>1.1&lt;/code>&lt;/td>
&lt;td align="left">&lt;code>0.9&lt;/code>&lt;/td>
&lt;td align="left">Adjust based on concurrency&lt;/td>
&lt;td align="left">Set to maximum possible document length&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>General Chat/Copywriting&lt;/strong>&lt;/td>
&lt;td align="left">&lt;code>0.7&lt;/code> (Default)&lt;/td>
&lt;td align="left">&lt;code>0.9&lt;/code> (Recommended)&lt;/td>
&lt;td align="left">&lt;code>1.1&lt;/code> - &lt;code>1.2&lt;/code>&lt;/td>
&lt;td align="left">&lt;code>0.9&lt;/code>&lt;/td>
&lt;td align="left">Adjust based on concurrency&lt;/td>
&lt;td align="left">Set as needed, e.g., &lt;code>4096&lt;/code>|&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Creative Writing/Brainstorming&lt;/strong>&lt;/td>
&lt;td align="left">&lt;code>0.8&lt;/code> - &lt;code>1.2&lt;/code>&lt;/td>
&lt;td align="left">&lt;code>0.95&lt;/code>&lt;/td>
&lt;td align="left">&lt;code>1.0&lt;/code>&lt;/td>
&lt;td align="left">&lt;code>0.9&lt;/code>&lt;/td>
&lt;td align="left">Adjust based on concurrency&lt;/td>
&lt;td align="left">Set as needed&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>High Concurrency Throughput Optimization&lt;/strong>&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">Try &lt;code>0.9&lt;/code> - &lt;code>0.95&lt;/code>&lt;/td>
&lt;td align="left">Gradually increase&lt;/td>
&lt;td align="left">Set to the &lt;strong>minimum&lt;/strong> value that meets business needs&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Low Latency Interaction Optimization&lt;/strong>&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">&lt;code>0.9&lt;/code> (Default)&lt;/td>
&lt;td align="left">Set to lower values (e.g., &lt;code>16-64&lt;/code>)&lt;/td>
&lt;td align="left">Set as needed&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Extremely Memory Constrained&lt;/strong>&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">Lower to &lt;code>0.8&lt;/code>&lt;/td>
&lt;td align="left">Set to lower values&lt;/td>
&lt;td align="left">Set to the &lt;strong>minimum&lt;/strong> value that meets business needs&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>Final Recommendations:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Start with Generation Parameters:&lt;/strong> First adjust &lt;code>temperature&lt;/code> or &lt;code>top_p&lt;/code> to achieve satisfactory output quality.&lt;/li>
&lt;li>&lt;strong>Set Deployment Parameters as Needed:&lt;/strong> When deploying, first set &lt;code>max_model_len&lt;/code> to a reasonable minimum value based on your application scenario.&lt;/li>
&lt;li>&lt;strong>Monitor and Iterate:&lt;/strong> Start with the default &lt;code>gpu_memory_utilization=0.9&lt;/code> and a moderate &lt;code>max_num_seqs&lt;/code>. Observe memory usage and preemption situations through monitoring tools (such as &lt;code>nvidia-smi&lt;/code> and vLLM logs), then gradually adjust these values to find the optimal balance for your specific hardware and workload.&lt;/li>
&lt;/ol></description></item><item><title>Model Quantization Guide: A Comprehensive Analysis from Theory to Practice</title><link>https://ziyanglin.netlify.app/en/post/model-quantization-documentation/</link><pubDate>Fri, 27 Jun 2025 00:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/model-quantization-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>As large language models (LLMs) continue to grow in scale and complexity, their deployment and inference costs have become increasingly expensive. Model quantization, as a key optimization technique, significantly reduces model storage requirements, memory consumption, and computational load by lowering the numerical precision of model weights and activation values, enabling efficient inference on resource-constrained devices such as mobile and edge devices.&lt;/p>
&lt;p>This document aims to provide a clear and comprehensive introduction to the core concepts of deep learning model quantization, mainstream approaches, and specific implementations in two leading inference frameworks—&lt;code>llama.cpp&lt;/code> and &lt;code>vLLM&lt;/code>. We will explore in detail the quantization types they support, underlying principles, usage methods, and future trends in quantization technology.&lt;/p>
&lt;h2 id="2-quantization-fundamentals">2. Quantization Fundamentals&lt;/h2>
&lt;p>Before diving into specific frameworks, we need to understand some basic concepts of quantization.&lt;/p>
&lt;h3 id="21-what-is-model-quantization">2.1 What is Model Quantization?&lt;/h3>
&lt;p>Model quantization refers to the process of converting floating-point numbers in a model (typically 32-bit floating-point, or &lt;code>FP32&lt;/code>) to integers with fewer bits (such as &lt;code>INT8&lt;/code>, &lt;code>INT4&lt;/code>) or lower-precision floating-point numbers (such as &lt;code>FP16&lt;/code>, &lt;code>FP8&lt;/code>). This process is essentially a form of information compression that attempts to significantly reduce model complexity while preserving model accuracy as much as possible.&lt;/p>
&lt;h3 id="22-why-is-quantization-needed">2.2 Why is Quantization Needed?&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Reduced Model Size&lt;/strong>: Lower bit-width numerical representations can significantly reduce the size of model files. For example, quantizing an &lt;code>FP32&lt;/code> model to &lt;code>INT8&lt;/code> shrinks it to roughly one quarter of its original size.&lt;/li>
&lt;li>&lt;strong>Lower Memory Bandwidth&lt;/strong>: Smaller data types mean less bandwidth is occupied when transferring data between memory and computational units, which is crucial for memory bandwidth-sensitive hardware.&lt;/li>
&lt;li>&lt;strong>Accelerated Computation&lt;/strong>: Many modern processors (CPUs, GPUs, TPUs) support integer operations more efficiently than floating-point operations, providing higher throughput and lower latency.&lt;/li>
&lt;li>&lt;strong>Reduced Power Consumption&lt;/strong>: Integer operations typically consume less energy than floating-point operations.&lt;/li>
&lt;/ul>
&lt;h3 id="23-quantization-principles-mapping-and-dequantization">2.3 Quantization Principles: Mapping and Dequantization&lt;/h3>
&lt;p>The core of quantization is mapping a larger range of floating-point values to a smaller range of fixed-point integer values. This process is defined by the following formula:&lt;/p>
&lt;pre>&lt;code>Q(r) = round(r / S + Z)
&lt;/code>&lt;/pre>
&lt;p>Where:&lt;/p>
&lt;ul>
&lt;li>&lt;code>r&lt;/code> is the original floating-point value.&lt;/li>
&lt;li>&lt;code>Q(r)&lt;/code> is the quantized integer value.&lt;/li>
&lt;li>&lt;code>S&lt;/code> is the &lt;strong>Scale factor&lt;/strong>, representing the floating-point value size corresponding to each quantized integer step.&lt;/li>
&lt;li>&lt;code>Z&lt;/code> is the &lt;strong>Zero-point&lt;/strong>, representing the quantized integer value corresponding to floating-point zero.&lt;/li>
&lt;/ul>
&lt;p>When performing calculations, the quantized values need to be dequantized back to the floating-point domain:&lt;/p>
&lt;pre>&lt;code>r' = S * (Q(r) - Z)
&lt;/code>&lt;/pre>
&lt;p>&lt;code>r'&lt;/code> is the dequantized floating-point number, which has some quantization error compared to the original value &lt;code>r&lt;/code>.&lt;/p>
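The two formulas above can be exercised with a small, self-contained sketch. This is an illustrative plain-Python implementation of asymmetric INT8 quantization (the helper names are invented for this example, not taken from any framework):

```python
def quant_params(values, bits=8):
    """Pick scale S and zero-point Z so that [min, max] maps onto [0, 2^bits - 1]."""
    qmax = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax          # float size of one integer step
    zero_point = round(-lo / scale)   # integer that represents float 0.0
    return scale, zero_point

def quantize(r, scale, zero_point, qmax=255):
    # Q(r) = round(r / S + Z), clamped to the integer range
    return max(0, min(qmax, round(r / scale + zero_point)))

def dequantize(q, scale, zero_point):
    # r' = S * (Q(r) - Z)
    return scale * (q - zero_point)

weights = [-0.62, -0.10, 0.0, 0.33, 1.24]
s, z = quant_params(weights)
restored = [dequantize(quantize(w, s, z), s, z) for w in weights]
# Each restored value differs from the original by at most one step (the scale S),
# and float 0.0 survives the round trip exactly because Z is an integer.
```

Note that keeping `Z` an integer is what makes floating-point zero representable without error, which matters for operations like zero-padding.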
&lt;h3 id="24-symmetric-vs-asymmetric-quantization">2.4 Symmetric vs. Asymmetric Quantization&lt;/h3>
&lt;p>Based on the choice of zero-point, quantization can be divided into two modes:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Symmetric Quantization&lt;/strong>: Maps the floating-point range &lt;code>[-abs_max, abs_max]&lt;/code> symmetrically to the integer range. In this mode, the zero-point &lt;code>Z&lt;/code> is typically 0 (for signed integers) or &lt;code>2^(bits-1)&lt;/code> (for unsigned integer offset). Computation is relatively simple.&lt;/li>
&lt;li>&lt;strong>Asymmetric Quantization&lt;/strong>: Maps the complete floating-point range &lt;code>[min, max]&lt;/code> to the integer range. In this mode, the zero-point &lt;code>Z&lt;/code> is generally a non-zero integer determined by the data distribution. It can represent asymmetrically distributed data more accurately but is slightly more complex to compute with.&lt;/li>
&lt;/ul>
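The practical difference between the two modes shows up in the scale they produce for skewed data. The sketch below is illustrative (invented helper names, not library code):

```python
def symmetric_params(values, bits=8):
    """Symmetric mode: map [-abs_max, abs_max] to signed integers; Z is fixed at 0."""
    abs_max = max(abs(v) for v in values)
    scale = abs_max / ((1 << (bits - 1)) - 1)   # 127 for INT8
    return scale, 0

def asymmetric_params(values, bits=8):
    """Asymmetric mode: map [min, max] to unsigned integers; Z shifts the range."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / ((1 << bits) - 1)       # 255 for INT8
    zero_point = round(-lo / scale)
    return scale, zero_point

# Skewed, mostly-positive data (typical of post-ReLU activations):
acts = [0.0, 0.1, 0.5, 2.0, 3.9]
s_sym, z_sym = symmetric_params(acts)     # ~0.0307, Z = 0
s_asym, z_asym = asymmetric_params(acts)  # ~0.0153, Z = 0 here because min(acts) is 0.0
# The asymmetric scale is about half the symmetric one for this data: each integer
# step covers a smaller float range, so the representation is finer.
```

For data that is already roughly symmetric around zero (as weights often are), the two modes produce nearly identical scales and symmetric quantization is preferred for its simpler arithmetic.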
&lt;h3 id="25-perlayer-vs-pergroupperchannel-quantization">2.5 Per-Layer vs. Per-Group/Per-Channel Quantization&lt;/h3>
&lt;p>The granularity of calculating scale factor &lt;code>S&lt;/code> and zero-point &lt;code>Z&lt;/code> also affects quantization accuracy:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Per-Layer/Per-Tensor&lt;/strong>: The entire weight tensor (or all weights in a layer) shares the same set of &lt;code>S&lt;/code> and &lt;code>Z&lt;/code>. This approach is the simplest, but if the value distribution within the tensor is uneven, it may lead to larger errors.&lt;/li>
&lt;li>&lt;strong>Per-Channel&lt;/strong>: Each output channel of a layer's weight tensor (e.g., in convolutional or linear layers) uses its own independent &lt;code>S&lt;/code> and &lt;code>Z&lt;/code>, which better tracks per-channel value ranges.&lt;/li>
&lt;li>&lt;strong>Grouped Quantization&lt;/strong>: The weight tensor is divided into several groups, with each group using independent &lt;code>S&lt;/code> and &lt;code>Z&lt;/code>. This is currently a very popular approach in LLM quantization as it achieves a good balance between accuracy and overhead. The group size is a key hyperparameter.&lt;/li>
&lt;/ul>
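A small pure-Python sketch (illustrative, not any framework's code) shows why granularity matters when one tensor mixes very different magnitudes:

```python
def quant_roundtrip(vals, bits=4):
    """Symmetric round-trip quantization of a block using one shared scale."""
    qmax = (1 << (bits - 1)) - 1
    scale = max(abs(v) for v in vals) / qmax   # largest value dictates the scale
    return [round(v / scale) * scale for v in vals]

def max_error(orig, rec):
    return max(abs(a - b) for a, b in zip(orig, rec))

# One tensor containing two very differently scaled regions:
w = [0.01, -0.02, 0.03, -0.01] + [5.0, -4.0, 3.0, -6.0]

per_tensor = quant_roundtrip(w)                              # one S for all 8 weights
per_group = quant_roundtrip(w[:4]) + quant_roundtrip(w[4:])  # one S per group of 4
# With a per-tensor scale, the large weights dictate S (~0.86), so every small
# weight rounds to zero; per-group scales preserve the small region.
```

This collapse of small values under a shared scale is exactly the error that per-channel and grouped quantization avoid, at the cost of storing more `S`/`Z` pairs.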
&lt;h3 id="26-common-quantization-paradigms">2.6 Common Quantization Paradigms&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Post-Training Quantization (PTQ)&lt;/strong>: This is the most commonly used and convenient quantization method. It is performed after the model has been fully trained, without requiring retraining. PTQ typically needs a small calibration dataset to calculate the optimal quantization parameters (&lt;code>S&lt;/code> and &lt;code>Z&lt;/code>) by analyzing the distribution of weights and activation values.&lt;/li>
&lt;li>&lt;strong>Quantization-Aware Training (QAT)&lt;/strong>: This simulates the errors introduced by quantization during the model training process. By inserting pseudo-quantization nodes in the forward pass during training, it allows the model to adapt to the accuracy loss caused by quantization. QAT typically achieves higher accuracy than PTQ but requires a complete training process and dataset, making it more costly.&lt;/li>
&lt;/ul>
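The "pseudo-quantization node" used in QAT can be sketched in a framework-free way. This is an illustrative simplification: real QAT implementations also define a straight-through gradient for the rounding step.

```python
def fake_quant(x, scale, zero_point, qmin=0, qmax=255):
    """QAT-style pseudo-quantization node: quantize, then immediately dequantize,
    so the forward pass experiences the rounding error while all values stay
    in floating point."""
    q = max(qmin, min(qmax, round(x / scale + zero_point)))
    return scale * (q - zero_point)

# During QAT, weights and activations are routed through fake_quant on every
# forward pass, so the training loss already reflects quantization error and
# the optimizer can learn weights that are robust to it.
scale, zp = 0.02, 128
x_q = fake_quant(0.137, scale, zp)  # carries the same rounding error a real INT8 kernel would
```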
&lt;p>Now that we have the basic knowledge of quantization, let's delve into the specific implementations in &lt;code>llama.cpp&lt;/code> and &lt;code>vLLM&lt;/code>.&lt;/p>
&lt;h2 id="3-quantization-schemes-in-llamacpp">3. Quantization Schemes in llama.cpp&lt;/h2>
&lt;p>&lt;code>llama.cpp&lt;/code> is an efficient LLM inference engine written in C/C++, renowned for its excellent cross-platform performance and support for resource-constrained devices. One of its core advantages is its powerful and flexible quantization support, which revolves around the &lt;code>GGUF&lt;/code> file format (the successor to GGML, developed within Georgi Gerganov's ggml project).&lt;/p>
&lt;h3 id="31-gguf-format-and-quantization">3.1 GGUF Format and Quantization&lt;/h3>
&lt;p>GGUF is a binary format specifically designed for LLMs, used to store model metadata, vocabulary, and weights. A key feature is its native support for various quantized weights, allowing different precision tensors to be mixed within the same file. This enables &lt;code>llama.cpp&lt;/code> to directly use quantized weights when loading models, without additional conversion steps.&lt;/p>
&lt;h3 id="32-quantization-type-nomenclature-in-llamacpp">3.2 Quantization Type Nomenclature in &lt;code>llama.cpp&lt;/code>&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> defines a very specific quantization type naming convention, typically in the format &lt;code>Q&amp;lt;bits&amp;gt;_&amp;lt;type&amp;gt;&lt;/code>. Understanding these names is key to mastering &lt;code>llama.cpp&lt;/code> quantization.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>Q&lt;/code>&lt;/strong>: Represents quantization.&lt;/li>
&lt;li>&lt;strong>&lt;code>&amp;lt;bits&amp;gt;&lt;/code>&lt;/strong>: Indicates the average number of bits per weight, such as &lt;code>2&lt;/code>, &lt;code>3&lt;/code>, &lt;code>4&lt;/code>, &lt;code>5&lt;/code>, &lt;code>6&lt;/code>, &lt;code>8&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;code>&amp;lt;type&amp;gt;&lt;/code>&lt;/strong>: Indicates the specific quantization method or variant.&lt;/li>
&lt;/ul>
&lt;p>Below are some of the most common quantization types and their explanations:&lt;/p>
&lt;h4 id="321-basic-quantization-types-legacy">3.2.1 Basic Quantization Types (Legacy)&lt;/h4>
&lt;p>These are earlier quantization methods, most of which have now been replaced by &lt;code>K-Quants&lt;/code>, but are still retained for compatibility.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>Q4_0&lt;/code>, &lt;code>Q4_1&lt;/code>&lt;/strong>: 4-bit quantization. &lt;code>Q4_0&lt;/code> stores only a per-block scale (symmetric), while &lt;code>Q4_1&lt;/code> additionally stores a per-block minimum (asymmetric), typically achieving slightly higher accuracy at a slightly larger size.&lt;/li>
&lt;li>&lt;strong>&lt;code>Q5_0&lt;/code>, &lt;code>Q5_1&lt;/code>&lt;/strong>: 5-bit quantization.&lt;/li>
&lt;li>&lt;strong>&lt;code>Q8_0&lt;/code>&lt;/strong>: 8-bit symmetric quantization using block-wise scale factors. This is one of the quantization types closest to the original &lt;code>FP16&lt;/code> precision and often serves as a benchmark for performance and quality.&lt;/li>
&lt;li>&lt;strong>&lt;code>Q2_K&lt;/code>, &lt;code>Q3_K&lt;/code>, &lt;code>Q4_K&lt;/code>, &lt;code>Q5_K&lt;/code>, &lt;code>Q6_K&lt;/code>&lt;/strong>: These belong to the &lt;code>K-Quants&lt;/code> series, described in the next section.&lt;/li>
&lt;/ul>
&lt;h4 id="322-kquants-recommended">3.2.2 K-Quants (Recommended)&lt;/h4>
&lt;p>&lt;code>K-Quants&lt;/code> is a more advanced and flexible quantization scheme introduced in &lt;code>llama.cpp&lt;/code>. They achieve better precision preservation at extremely low bit rates through more refined block structures and the concept of super-blocks.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Block&lt;/strong>: Weights are divided into fixed-size blocks (typically 256 weights).&lt;/li>
&lt;li>&lt;strong>Super-block&lt;/strong>: Multiple blocks form a super-block. More detailed quantization parameters (such as min/max scale factors) are stored at the super-block level.&lt;/li>
&lt;/ul>
&lt;p>&lt;code>K-Quants&lt;/code> naming typically includes a suffix like &lt;code>_S&lt;/code>, &lt;code>_M&lt;/code>, &lt;code>_L&lt;/code>, indicating different sizes/complexities:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>S&lt;/code> (Small)&lt;/strong>: The smallest version, typically with the lowest precision.&lt;/li>
&lt;li>&lt;strong>&lt;code>M&lt;/code> (Medium)&lt;/strong>: Medium size, balancing precision and size.&lt;/li>
&lt;li>&lt;strong>&lt;code>L&lt;/code> (Large)&lt;/strong>: The largest version, typically with the highest precision.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Common K-Quants Types:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>Q4_K_M&lt;/code>&lt;/strong>: 4-bit K-Quant, medium size. This is currently one of the most commonly used and recommended 4-bit quantization types, achieving a good balance between size and performance.&lt;/li>
&lt;li>&lt;strong>&lt;code>Q4_K_S&lt;/code>&lt;/strong>: 4-bit K-Quant, small version.&lt;/li>
&lt;li>&lt;strong>&lt;code>Q5_K_M&lt;/code>&lt;/strong>: 5-bit K-Quant, medium size. Provides better precision than 4-bit while being smaller than &lt;code>Q8_0&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;code>Q6_K&lt;/code>&lt;/strong>: 6-bit K-Quant. Provides very high precision, close to &lt;code>Q8_0&lt;/code>, but with a smaller size.&lt;/li>
&lt;li>&lt;strong>&lt;code>IQ2_XXS&lt;/code>, &lt;code>IQ2_XS&lt;/code>, &lt;code>IQ2_S&lt;/code>&lt;/strong>: 2-bit variants from the newer &lt;code>IQ&lt;/code> (&amp;ldquo;i-quants&amp;rdquo;) family, which typically relies on an importance matrix. They target extreme model compression at the cost of larger precision loss.&lt;/li>
&lt;/ul>
&lt;h3 id="33-how-to-use-the-llamaquantize-tool">3.3 How to Use the &lt;code>llama-quantize&lt;/code> Tool&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> provides a command-line tool called &lt;code>llama-quantize&lt;/code> for converting &lt;code>FP32&lt;/code> or &lt;code>FP16&lt;/code> GGUF models to quantized GGUF models.&lt;/p>
&lt;p>&lt;strong>Basic Usage:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-quantize &amp;lt;input-gguf-file&amp;gt; &amp;lt;output-gguf-file&amp;gt; &amp;lt;quantization-type&amp;gt;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Example: Quantizing an FP16 Model to Q4_K_M&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash"># First, convert the original model (e.g., PyTorch format) to FP16 GGUF
python3 convert.py models/my-model/
# Then, use llama-quantize for quantization
./llama-quantize ./models/my-model/ggml-model-f16.gguf ./models/my-model/ggml-model-Q4_K_M.gguf Q4_K_M
&lt;/code>&lt;/pre>
&lt;h3 id="34-importance-matrix">3.4 Importance Matrix&lt;/h3>
&lt;p>To further reduce precision loss from quantization, &lt;code>llama.cpp&lt;/code> introduced the importance matrix (&lt;code>imatrix&lt;/code>). The matrix is computed by running the model on a calibration dataset and estimates how important each weight is to the model's output. During quantization, &lt;code>llama-quantize&lt;/code> consults this matrix to apply smaller quantization errors to the more important weights, thereby protecting critical information in the model.&lt;/p>
&lt;p>&lt;strong>Using &lt;code>imatrix&lt;/code> for Quantization:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash"># 1. Generate the importance matrix
./llama-imatrix -m model-f16.gguf -f calibration-data.txt -o imatrix.dat
# 2. Use imatrix for quantization
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-Q4_K_M-imatrix.gguf Q4_K_M
&lt;/code>&lt;/pre>
&lt;h3 id="35-summary">3.5 Summary&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code>'s quantization scheme is centered around the &lt;code>GGUF&lt;/code> format, providing a rich, efficient, and battle-tested set of quantization types. Its &lt;code>K-Quants&lt;/code> series performs exceptionally well in low-bit quantization, and when combined with advanced techniques like importance matrices, it can maximize model performance while significantly compressing the model. For scenarios requiring LLM deployment on CPUs or resource-limited hardware, &lt;code>llama.cpp&lt;/code> is an excellent choice.&lt;/p>
&lt;h2 id="4-vllms-quantization-ecosystem">4. vLLM's Quantization Ecosystem&lt;/h2>
&lt;p>Unlike &lt;code>llama.cpp&lt;/code>'s cohesive, self-contained quantization system, &lt;code>vLLM&lt;/code>, as a service engine focused on high-performance, high-throughput GPU inference, adopts a &amp;ldquo;best of all worlds&amp;rdquo; quantization strategy. &lt;code>vLLM&lt;/code> doesn't invent new quantization formats but instead embraces compatibility, supporting and integrating the most mainstream and cutting-edge quantization schemes and tool libraries from academia and industry.&lt;/p>
&lt;h3 id="41-mainstream-quantization-schemes-supported-by-vllm">4.1 Mainstream Quantization Schemes Supported by vLLM&lt;/h3>
&lt;p>&lt;code>vLLM&lt;/code> supports directly loading models quantized by various popular algorithms and tool libraries:&lt;/p>
&lt;h4 id="411-gptq-generalpurpose-posttraining-quantization">4.1.1 GPTQ (General-purpose Post-Training Quantization)&lt;/h4>
&lt;p>GPTQ is one of the earliest widely applied LLM PTQ algorithms. It quantizes weights column by column and updates weights using Hessian matrix information to minimize quantization error.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: Iteratively quantize each column of weights and update the remaining unquantized weights to compensate for errors introduced by already quantized columns.&lt;/li>
&lt;li>&lt;strong>vLLM Support&lt;/strong>: Can directly load GPTQ quantized models generated by libraries like &lt;code>AutoGPTQ&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Suitable Scenarios&lt;/strong>: Pursuing good 4-bit quantization performance with a large number of pre-quantized models available in the community.&lt;/li>
&lt;/ul>
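GPTQ's error-compensation idea can be caricatured in one dimension. The sketch below is a drastic simplification (no Hessian, a single calibration input, invented names) meant only to illustrate how updating the remaining weights absorbs the error of already-quantized ones:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quantize_naive(w, scale):
    """Round every weight to the grid independently."""
    return [round(v / scale) * scale for v in w]

def quantize_compensated(w, x, scale):
    """Quantize weights one by one; after each step, fold the resulting
    dot-product error into the next (still unquantized) weight -- a 1-D
    caricature of GPTQ's 'update remaining weights to absorb the error'."""
    w = list(w)
    for i in range(len(w)):
        q = round(w[i] / scale) * scale
        err = (w[i] - q) * x[i]          # error contributed to w . x
        w[i] = q
        if i + 1 < len(w) and x[i + 1] != 0:
            w[i + 1] += err / x[i + 1]   # compensate in the next weight
    return w

w = [0.37, -0.81, 0.12, 0.55]
x = [1.0, 2.0, -1.5, 0.5]   # a fixed "calibration" input
scale = 0.25
naive = quantize_naive(w, scale)
comp = quantize_compensated(w, x, scale)
# comp tracks the original output w . x much more closely than naive rounding
```

Real GPTQ generalizes this to whole weight matrices, uses second-order (Hessian) information from many calibration samples to decide how to spread the error, and processes columns in blocks for efficiency.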
&lt;h4 id="412-awq-activationaware-weight-quantization">4.1.2 AWQ (Activation-aware Weight Quantization)&lt;/h4>
&lt;p>AWQ observes that not all weights in a model are equally important, with a small portion of &amp;ldquo;significant weights&amp;rdquo; having a huge impact on model performance. Similar uneven distributions also exist in activation values.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: By analyzing the scale of activation values, identify and protect those &amp;ldquo;significant weights&amp;rdquo; that multiply with large activation values, giving them higher precision during quantization. It doesn't quantize activation values but makes weights adapt to the distribution of activation values.&lt;/li>
&lt;li>&lt;strong>vLLM Support&lt;/strong>: Can directly load AWQ quantized models generated by the &lt;code>AutoAWQ&lt;/code> library.&lt;/li>
&lt;li>&lt;strong>Suitable Scenarios&lt;/strong>: Seeking higher model precision than GPTQ at extremely low bits (such as 4-bit), especially when handling complex tasks.&lt;/li>
&lt;/ul>
&lt;h4 id="413-fp8-8bit-floating-point">4.1.3 FP8 (8-bit Floating Point)&lt;/h4>
&lt;p>FP8 is the latest low-precision floating-point format, pushed by hardware manufacturers like NVIDIA. It has a wider dynamic range than traditional &lt;code>INT8&lt;/code>, making it more suitable for representing extremely unevenly distributed activation values in LLMs.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: Use 8-bit floating-point numbers (typically in &lt;code>E4M3&lt;/code> or &lt;code>E5M2&lt;/code> format) to represent weights and/or activation values.&lt;/li>
&lt;li>&lt;strong>vLLM Support&lt;/strong>: Through integration with &lt;code>llm-compressor&lt;/code> and AMD's &lt;code>Quark&lt;/code> library, &lt;code>vLLM&lt;/code> provides strong support for FP8, including both dynamic and static quantization.&lt;/li>
&lt;li>&lt;strong>Suitable Scenarios&lt;/strong>: Pursuing ultimate inference speed and throughput on modern GPUs (such as H100) that support FP8 acceleration.&lt;/li>
&lt;/ul>
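To see why a floating-point format helps with uneven distributions, the toy sketch below rounds values to a simplified 8-bit float (E4M3-like; the exponent range and special values are simplified assumptions, not the exact IEEE-style definition) and compares it with a fixed-scale INT8 mapping:

```python
import math

def to_e4m3_toy(x):
    """Round x to the nearest value of a toy 8-bit float (1 sign, 4 exponent,
    3 mantissa bits). Exponent clamping and special values are simplified."""
    if x == 0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    x = abs(x)
    e = max(-6, min(8, math.floor(math.log2(x))))  # clamp the exponent
    m = round(x / 2**e * 8) / 8                    # 3 mantissa bits => steps of 1/8
    return sign * m * 2**e

def to_int8_fixed(x, scale=100 / 127):
    """INT8 with a single scale chosen to cover [-100, 100]."""
    return max(-127, min(127, round(x / scale))) * scale

# The toy FP8 keeps a roughly constant *relative* error across magnitudes,
# while the fixed INT8 step (~0.79 here) wipes out values far below the scale:
small, big = 0.013, 87.0
```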
&lt;h4 id="414-fp8-kv-cache">4.1.4 FP8 KV Cache&lt;/h4>
&lt;p>This is a quantization technique specifically targeting the KV Cache, a major memory consumer during inference.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: Quantize the Key-Value cache stored in GPU memory from &lt;code>FP16&lt;/code> or &lt;code>BF16&lt;/code> to &lt;code>FP8&lt;/code>, thereby halving this portion of memory usage, allowing the model to support longer context windows or larger batch sizes.&lt;/li>
&lt;li>&lt;strong>vLLM Support&lt;/strong>: &lt;code>vLLM&lt;/code> provides native support, which can be enabled at startup with the parameter &lt;code>--kv-cache-dtype fp8&lt;/code>.&lt;/li>
&lt;/ul>
&lt;h4 id="415-bitsandbytes">4.1.5 BitsAndBytes&lt;/h4>
&lt;p>This is a very popular quantization library, known for its ease of use and &amp;ldquo;on-the-fly&amp;rdquo; quantization.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: Dynamically quantize during model loading, without needing pre-prepared quantized model files.&lt;/li>
&lt;li>&lt;strong>vLLM Support&lt;/strong>: &lt;code>vLLM&lt;/code> integrates &lt;code>BitsAndBytes&lt;/code>, allowing users to easily enable 4-bit quantization by setting the &lt;code>quantization=&amp;quot;bitsandbytes&amp;quot;&lt;/code> parameter.&lt;/li>
&lt;li>&lt;strong>Suitable Scenarios&lt;/strong>: Quick experimentation, user-friendly, avoiding complex offline quantization processes.&lt;/li>
&lt;/ul>
&lt;h4 id="416-other-schemes">4.1.6 Other Schemes&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>SqueezeLLM&lt;/strong>: A non-uniform quantization method that places quantization levels according to weight sensitivity rather than on a uniform grid, and keeps a small number of outlier values at higher precision.&lt;/li>
&lt;li>&lt;strong>TorchAO&lt;/strong>: PyTorch's official quantization tool library, which &lt;code>vLLM&lt;/code> is beginning to support.&lt;/li>
&lt;li>&lt;strong>BitBLAS&lt;/strong>: A low-level computation library aimed at accelerating low-bit (such as 1-bit, 2-bit, 4-bit) matrix operations through optimized kernel functions.&lt;/li>
&lt;/ul>
&lt;h3 id="42-how-to-use-quantized-models-in-vllm">4.2 How to Use Quantized Models in vLLM&lt;/h3>
&lt;p>Using quantization in &lt;code>vLLM&lt;/code> is very simple, typically just requiring specifying the &lt;code>quantization&lt;/code> parameter in the &lt;code>LLM&lt;/code> constructor. &lt;code>vLLM&lt;/code> will automatically detect the quantization type from the model's configuration file (&lt;code>config.json&lt;/code>).&lt;/p>
&lt;p>&lt;strong>Example: Loading an AWQ Quantized Model&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM
# vLLM will automatically recognize awq quantization from &amp;quot;TheBloke/My-Model-AWQ&amp;quot;'s config.json
llm = LLM(model=&amp;quot;TheBloke/My-Model-AWQ&amp;quot;, quantization=&amp;quot;awq&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Example: Enabling FP8 KV Cache&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM
llm = LLM(model=&amp;quot;meta-llama/Llama-2-7b-chat-hf&amp;quot;,
kv_cache_dtype=&amp;quot;fp8&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h2 id="5-llamacpp-vs-vllm-comparison-and-summary">5. llama.cpp vs. vLLM: Comparison and Summary&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Feature&lt;/th>
&lt;th align="left">llama.cpp&lt;/th>
&lt;th align="left">vLLM&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">&lt;strong>Target Platform&lt;/strong>&lt;/td>
&lt;td align="left">CPU, Cross-platform, Edge devices&lt;/td>
&lt;td align="left">High-performance GPU servers&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Core Philosophy&lt;/strong>&lt;/td>
&lt;td align="left">Cohesive, self-contained, extreme optimization&lt;/td>
&lt;td align="left">Open, integrated, high throughput&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>File Format&lt;/strong>&lt;/td>
&lt;td align="left">GGUF (custom format)&lt;/td>
&lt;td align="left">Standard Hugging Face format&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Quantization Schemes&lt;/strong>&lt;/td>
&lt;td align="left">Built-in &lt;code>K-Quants&lt;/code>, &lt;code>IQ&lt;/code>, etc.&lt;/td>
&lt;td align="left">Integrates GPTQ, AWQ, FP8, BnB, etc.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Ease of Use&lt;/strong>&lt;/td>
&lt;td align="left">Requires &lt;code>llama-quantize&lt;/code> conversion&lt;/td>
&lt;td align="left">Direct loading, automatic detection&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Ecosystem&lt;/strong>&lt;/td>
&lt;td align="left">Self-contained ecosystem&lt;/td>
&lt;td align="left">Embraces the entire Python AI ecosystem&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Latest Technology&lt;/strong>&lt;/td>
&lt;td align="left">Quickly follows up and implements own versions&lt;/td>
&lt;td align="left">Quickly integrates latest open-source libraries&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="6-latest-quantization-trends-and-outlook">6. Latest Quantization Trends and Outlook&lt;/h2>
&lt;p>The field of model quantization is still rapidly evolving. Here are some trends worth noting:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>1-bit/Binary Neural Networks (BNNs)&lt;/strong>: Ultimate model compression, restricting weights to +1 or -1. Although currently suffering significant precision loss in LLMs, its potential is enormous, with related research emerging constantly.&lt;/li>
&lt;li>&lt;strong>Non-uniform Quantization&lt;/strong>: Like SqueezeLLM, dynamically allocating bit numbers based on data distribution, theoretically superior to uniform quantization.&lt;/li>
&lt;li>&lt;strong>Hardware-Algorithm Co-design&lt;/strong>: New hardware (such as FP8, FP4, INT4 support) is driving the development of new quantization algorithms, while new algorithms are guiding future hardware design.&lt;/li>
&lt;li>&lt;strong>Combining Quantization with Sparsification&lt;/strong>: Combining quantization with sparsification techniques like pruning holds promise for achieving higher rates of model compression.&lt;/li>
&lt;/ul>
&lt;h2 id="7-conclusion">7. Conclusion&lt;/h2>
&lt;p>Model quantization is a key technology for addressing the challenges of the large model era. &lt;code>llama.cpp&lt;/code> and &lt;code>vLLM&lt;/code> represent two different quantization philosophies: &lt;code>llama.cpp&lt;/code> provides ultimate local inference performance for resource-constrained devices through its elegant GGUF format and built-in K-Quants; while &lt;code>vLLM&lt;/code> has become the king of GPU cloud inference services through its open ecosystem and integration of various cutting-edge quantization schemes.&lt;/p>
&lt;p>Understanding the quantization implementations of these two frameworks not only helps us choose the right tool for specific scenarios but also gives us insight into the development trajectory and future directions of the entire LLM inference optimization field.&lt;/p></description></item><item><title>SGLang Technical Guide: High-Performance Structured Generation Framework</title><link>https://ziyanglin.netlify.app/en/post/sglang-documentation/</link><pubDate>Thu, 26 Jun 2025 01:07:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/sglang-documentation/</guid><description>&lt;h2 id="1-sglang-introduction">1. SGLang Introduction&lt;/h2>
&lt;p>SGLang (Structured Generation Language) is a high-performance service framework designed for large language models (LLMs) and vision language models (VLMs). Its core goal is to address the challenges faced by complex LLM programs in real-world applications, maximizing inference performance while maintaining flexibility.&lt;/p>
&lt;p>Traditional LLM service frameworks (like vLLM) excel at handling simple, one-shot prompting but face limitations in complex scenarios requiring multi-turn interactions, structured outputs, function calls, or control flow. SGLang effectively bridges this gap by introducing a novel frontend language and an efficient backend runtime.&lt;/p>
&lt;p>&lt;strong>Core advantages of SGLang include:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Exceptional Performance:&lt;/strong> SGLang introduces &lt;strong>RadixAttention&lt;/strong>, an innovative attention mechanism that automatically and losslessly reuses key-value caches (KV Cache), significantly improving inference speed in scenarios with complex prompts (like CoT, ReAct) or multi-turn conversations. Compared to leading frameworks like vLLM, SGLang can achieve several times higher throughput in these scenarios.&lt;/li>
&lt;li>&lt;strong>Powerful Programming Capabilities:&lt;/strong> SGLang provides an intuitive domain-specific language (DSL) that allows developers to orchestrate complex generation tasks in a Pythonic way. You can easily define variables, use loops and conditional statements, call external tools, and seamlessly integrate these logic elements with the LLM's generation process. This makes building complex AI agents, multi-turn dialogue systems, and structured data extraction tasks unprecedentedly simple.&lt;/li>
&lt;li>&lt;strong>Unified Frontend-Backend Interface:&lt;/strong> SGLang decouples frontend programming logic from backend inference services. The frontend defines &amp;ldquo;what to generate,&amp;rdquo; while the backend handles &amp;ldquo;how to efficiently generate it.&amp;rdquo; This design not only simplifies the development process but also makes SGLang compatible with OpenAI's API standards, allowing users to easily migrate existing applications to SGLang and immediately benefit from performance gains.&lt;/li>
&lt;li>&lt;strong>Flexible Structured Output:&lt;/strong> SGLang provides powerful structured output constraint capabilities. Whether through regular expressions, EBNF grammar, or JSON Schema, you can precisely control the output format of the LLM, ensuring that the generated content conforms to the expected structure, which is crucial for applications requiring reliable data formats.&lt;/li>
&lt;/ul>
&lt;p>In summary, SGLang is not just an LLM inference acceleration engine but a complete programming and execution framework for complex generation tasks. It aims to enable developers to fully unleash the potential of large language models in an efficient and intuitive way.&lt;/p>
&lt;h2 id="2-core-features">2. Core Features&lt;/h2>
&lt;p>The power of SGLang lies in its unique design, which combines an intuitive frontend programming model with an efficient backend execution engine. Below are detailed introductions to several of its core features.&lt;/p>
&lt;h3 id="21-radixattention-kv-cache-optimization-for-complex-prompts">2.1 RadixAttention: KV Cache Optimization for Complex Prompts&lt;/h3>
&lt;p>When processing complex LLM programs, such as Chain-of-Thought, multi-turn dialogues, or agents that need to call tools, prompts often contain large shared prefixes. Traditional attention mechanisms produce redundant computation and storage when handling these shared prefixes.&lt;/p>
&lt;p>SGLang introduces &lt;strong>RadixAttention&lt;/strong>, a novel KV cache optimization technique. Its core idea is to organize prompts into a radix tree and perform attention calculations on this tree.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Automatic Sharing and Reuse&lt;/strong>: RadixAttention can automatically identify and share common prefixes between different requests, avoiding duplicate computation and storage. For example, in multi-turn dialogues, the conversation history of each turn can be losslessly reused by subsequent turns.&lt;/li>
&lt;li>&lt;strong>Performance Improvement&lt;/strong>: By maximizing KV cache reuse, RadixAttention significantly reduces memory usage and computational load, increasing throughput by 2 to 5 times, especially when handling long prompts or high-concurrency requests.&lt;/li>
&lt;/ul>
&lt;p>Below is a Mermaid diagram that visually demonstrates how RadixAttention handles requests with shared prefixes:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;Traditional Method (No Sharing)&amp;quot;
req1[&amp;quot;Request 1: 'A B C D'&amp;quot;]
req2[&amp;quot;Request 2: 'A B E F'&amp;quot;]
kv1[&amp;quot;KV Cache: [A, B, C, D]&amp;quot;]
kv2[&amp;quot;KV Cache: [A, B, E, F]&amp;quot;]
req1 --&amp;gt; kv1
req2 --&amp;gt; kv2
end
subgraph &amp;quot;SGLang RadixAttention&amp;quot;
Root(&amp;quot;Root&amp;quot;) --&amp;gt; A(&amp;quot;Token 'A'&amp;quot;);
A --&amp;gt; B(&amp;quot;Token 'B'&amp;quot;);
B --&amp;gt; C(&amp;quot;Token 'C'&amp;quot;);
B --&amp;gt; E(&amp;quot;Token 'E'&amp;quot;);
C --&amp;gt; D(&amp;quot;Token 'D'&amp;quot;);
E --&amp;gt; F(&amp;quot;Token 'F'&amp;quot;);
style A fill:#9f9
style B fill:#9f9
end
&lt;/code>&lt;/pre>
&lt;p>In the diagram above, for two requests &lt;code>'A B C D'&lt;/code> and &lt;code>'A B E F'&lt;/code>, the traditional method creates two independent KV caches. RadixAttention, however, organizes them into a tree, sharing the computation and storage of the common prefix &lt;code>'A B'&lt;/code> (green nodes), creating new branches only for the different parts (C, D, E, F). This greatly improves memory and computational efficiency.&lt;/p>
&lt;h3 id="22-unified-frontend-programming-language-dsl">2.2 Unified Frontend Programming Language (DSL)&lt;/h3>
&lt;p>SGLang provides an expressive domain-specific language (DSL) deeply integrated with Python, allowing developers to build complex generation logic in a natural and intuitive way.&lt;/p>
&lt;h3 id="sglang-architecture-overview">SGLang Architecture Overview&lt;/h3>
&lt;p>To better understand how SGLang works, we can observe its core architecture through the following flowchart:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph User Side
A[Developer defines SGLang program&amp;lt;br&amp;gt;using function decorator] --&amp;gt; B{Call run method};
end
subgraph SGLang Frontend
B --&amp;gt; C[1. Parse Python AST&amp;lt;br&amp;gt;Separate deterministic logic and generation instructions];
C --&amp;gt; D[2. Build portable&amp;lt;br&amp;gt;SGLang IR intermediate representation];
end
subgraph Network Communication
D -- HTTP Request --&amp;gt; E[SGLang backend service SRT];
end
subgraph SGLang Backend SRT
E --&amp;gt; F[3. Receive IR and schedule];
F --&amp;gt; G{RadixAttention engine};
G --&amp;gt; H[4. Efficient execution&amp;lt;br&amp;gt;KV cache reuse];
H --&amp;gt; I[LLM/VLM model];
I --&amp;gt; J[5. Generate results];
end
subgraph Return Path
J -- HTTP Response --&amp;gt; K[Return results to frontend];
K --&amp;gt; L[6. Fill state object `s`];
L --&amp;gt; M[User gets final results];
end
style B fill:#f9f,stroke:#333,stroke-width:2px
style E fill:#ccf,stroke:#333,stroke-width:2px
style G fill:#9cf,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;p>This diagram clearly shows how SGLang decouples and combines the programming convenience of the frontend with the high-performance execution engine of the backend.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Pythonic Control Flow&lt;/strong>: You can directly use standard Python control flow statements like &lt;code>if/else&lt;/code> and &lt;code>for&lt;/code> loops in SGLang functions to dynamically build prompts.&lt;/li>
&lt;li>&lt;strong>Integration of Generation and Logic&lt;/strong>: Through the &lt;code>@function&lt;/code> decorator and &lt;code>gen()&lt;/code> instruction, SGLang seamlessly combines the LLM's generation process (the &amp;ldquo;non-deterministic&amp;rdquo; part) with the program's deterministic logic.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example: Generating Different Content Based on Conditions&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from sglang import function, system, user, assistant, gen
@function
def tool_use(s, question):
s += system(&amp;quot;You are a helpful assistant.&amp;quot;)
s += user(question)
s += assistant(
&amp;quot;To answer this question, I need to use a &amp;quot;
+ gen(&amp;quot;tool&amp;quot;, choices=[&amp;quot;calculator&amp;quot;, &amp;quot;search engine&amp;quot;])
+ &amp;quot;. &amp;quot;
)
if s[&amp;quot;tool&amp;quot;] == &amp;quot;calculator&amp;quot;:
s += assistant(&amp;quot;The math expression is: &amp;quot; + gen(&amp;quot;expression&amp;quot;))
elif s[&amp;quot;tool&amp;quot;] == &amp;quot;search engine&amp;quot;:
s += assistant(&amp;quot;The key word to search is: &amp;quot; + gen(&amp;quot;word&amp;quot;))
state = tool_use.run(&amp;quot;What is the population of London?&amp;quot;)
print(state[&amp;quot;tool&amp;quot;])
# Output: search engine
print(state[&amp;quot;word&amp;quot;])
# Output: population of London
&lt;/code>&lt;/pre>
&lt;p>In this example, the program first asks the LLM to choose between &amp;ldquo;calculator&amp;rdquo; and &amp;ldquo;search engine&amp;rdquo; as a tool, then executes different logic branches based on the LLM's choice, guiding the LLM to generate the next step of content.&lt;/p>
&lt;h3 id="23-powerful-structured-output">2.3 Powerful Structured Output&lt;/h3>
&lt;p>To ensure that content generated by the LLM can be reliably parsed and used by downstream programs, SGLang provides multiple powerful structured output constraint mechanisms.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Regular Expressions (Regex)&lt;/strong>: You can provide a regular expression to force the model's output to strictly match that pattern. This is useful for generating identifiers, numbers, or simple text fragments in specific formats.&lt;/p>
&lt;pre>&lt;code class="language-python">response = client.chat.completions.create(
model=&amp;quot;deepseek-ai/DeepSeek-R1-Distill-Qwen-7B&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;assistant&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;What is the capital of France?&amp;quot;}],
extra_body={&amp;quot;regex&amp;quot;: &amp;quot;(Paris|London)&amp;quot;},
)
# response.choices[0].message.content will necessarily be &amp;quot;Paris&amp;quot; or &amp;quot;London&amp;quot;
&lt;/code>&lt;/pre>
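&lt;p>The same pattern can be verified client-side with Python's built-in &lt;code>re&lt;/code> module; this standalone check (no server required) illustrates exactly what the constraint guarantees:&lt;/p>

```python
import re

pattern = '(Paris|London)'
# Every string the server can emit under this constraint matches the full pattern
print(bool(re.fullmatch(pattern, 'Paris')))   # True
print(bool(re.fullmatch(pattern, 'Madrid')))  # False
```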
&lt;/li>
&lt;li>
&lt;p>&lt;strong>EBNF Grammar&lt;/strong>: For more complex grammatical structures, you can use Extended Backus-Naur Form (EBNF) to define a complete grammar. This allows you to generate code, DSLs, or other structured text that strictly adheres to specific syntax.&lt;/p>
&lt;pre>&lt;code class="language-python">ebnf_grammar = &amp;quot;&amp;quot;&amp;quot;
root ::= city &amp;quot; is the capital of &amp;quot; country
city ::= &amp;quot;London&amp;quot; | &amp;quot;Paris&amp;quot; | &amp;quot;Berlin&amp;quot; | &amp;quot;Rome&amp;quot;
country ::= &amp;quot;England&amp;quot; | &amp;quot;France&amp;quot; | &amp;quot;Germany&amp;quot; | &amp;quot;Italy&amp;quot;
&amp;quot;&amp;quot;&amp;quot;
response = client.chat.completions.create(
model=&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Give me the information of the capital of France.&amp;quot;}],
extra_body={&amp;quot;ebnf&amp;quot;: ebnf_grammar},
)
# response.choices[0].message.content will be &amp;quot;Paris is the capital of France&amp;quot;
&lt;/code>&lt;/pre>
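&lt;p>This grammar admits exactly 16 sentences (every city paired with every country); constrained decoding restricts sampling to these strings, from which the model picks the most probable one. Enumerating them locally makes the constraint concrete:&lt;/p>

```python
from itertools import product

cities = ['London', 'Paris', 'Berlin', 'Rome']
countries = ['England', 'France', 'Germany', 'Italy']
# All strings derivable from the root rule of the EBNF grammar above
sentences = [f'{c} is the capital of {n}' for c, n in product(cities, countries)]
print(len(sentences))  # 16
```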
&lt;/li>
&lt;li>
&lt;p>&lt;strong>JSON Schema&lt;/strong>: SGLang supports using JSON Schema to constrain the model to generate structured JSON objects. You can directly define a JSON Schema or use a Pydantic model to automatically generate one. This is crucial for APIs and data processing tasks that require reliable, verifiable JSON output.&lt;/p>
&lt;pre>&lt;code class="language-python">from pydantic import BaseModel
class CapitalInfo(BaseModel):
name: str
population: int
response = client.chat.completions.create(
model=&amp;quot;deepseek-ai/DeepSeek-R1-Distill-Qwen-7B&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;assistant&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Give me the information and population of the capital of France in the JSON format.&amp;quot;}],
response_format={
&amp;quot;type&amp;quot;: &amp;quot;json_schema&amp;quot;,
&amp;quot;json_schema&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;capital_info&amp;quot;,
&amp;quot;schema&amp;quot;: CapitalInfo.model_json_schema(),
},
},
)
# response.choices[0].message.content will be a JSON string conforming to the CapitalInfo structure
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
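&lt;p>Once a constrained response arrives, client-side handling reduces to plain JSON parsing; no defensive regexes or retry loops are needed. A minimal sketch, where the response content is a hypothetical stand-in for an actual server reply:&lt;/p>

```python
import json

# Hypothetical response content conforming to the CapitalInfo schema
content = json.dumps({'name': 'Paris', 'population': 2102650})

info = json.loads(content)
print(info['name'], info['population'])
```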
&lt;h2 id="3-quick-start">3. Quick Start&lt;/h2>
&lt;p>This section will guide you through installing SGLang, starting the service, and basic usage, allowing you to experience SGLang's powerful features in just a few minutes.&lt;/p>
&lt;h3 id="31-installation">3.1 Installation&lt;/h3>
&lt;p>SGLang can be installed via &lt;code>pip&lt;/code> or the faster &lt;code>uv&lt;/code>. For the best experience and full functionality, it's recommended to install the &lt;code>all&lt;/code> version.&lt;/p>
&lt;p>&lt;strong>Using pip:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install --upgrade pip
pip install &amp;quot;sglang[all]&amp;quot;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Using uv (recommended, faster):&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install uv
uv pip install &amp;quot;sglang[all]&amp;quot;
&lt;/code>&lt;/pre>
&lt;blockquote>
&lt;p>&lt;strong>Note&lt;/strong>: The installation process may require compiling CUDA kernels (such as &lt;code>flashinfer&lt;/code>). Please ensure that the &lt;code>CUDA_HOME&lt;/code> environment variable is correctly configured in your environment and that the CUDA version is compatible with your PyTorch version.&lt;/p>
&lt;/blockquote>
&lt;h3 id="32-starting-the-backend-service-srt">3.2 Starting the Backend Service (SRT)&lt;/h3>
&lt;p>After installation, the next step is to start SGLang's backend service (SRT, SGLang Runtime). This service will load the specified language model and provide an interface compatible with the OpenAI API.&lt;/p>
&lt;p>Run the following command in your terminal:&lt;/p>
&lt;pre>&lt;code class="language-bash">python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Parameter Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;code>--model-path&lt;/code>: Specifies the path to the model to load. This can be a model name on the Hugging Face Hub (as shown in this example) or a local model path.&lt;/li>
&lt;li>&lt;code>--host&lt;/code>: The host address the service listens on. &lt;code>0.0.0.0&lt;/code> means allowing access from any network interface.&lt;/li>
&lt;li>&lt;code>--port&lt;/code>: The port number the service listens on.&lt;/li>
&lt;/ul>
&lt;p>When the service starts successfully, you'll see output similar to the following, indicating that the model has been loaded and is ready to receive requests.&lt;/p>
&lt;pre>&lt;code>INFO: Started server process [12345]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
&lt;/code>&lt;/pre>
&lt;h3 id="33-sending-your-first-request">3.3 Sending Your First Request&lt;/h3>
&lt;p>With the service running, we can now interact with it using OpenAI's Python client library.&lt;/p>
&lt;p>Create a Python file named &lt;code>test_sglang.py&lt;/code> and fill it with the following content:&lt;/p>
&lt;pre>&lt;code class="language-python">import openai
# Initialize the client, pointing to our locally started SGLang service
client = openai.Client(
base_url=&amp;quot;http://127.0.0.1:30000/v1&amp;quot;,
api_key=&amp;quot;EMPTY&amp;quot; # SGLang service doesn't require an API Key
)
# Create a chat completion request
response = client.chat.completions.create(
model=&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;, # Must match the model loaded by the service
messages=[
{&amp;quot;role&amp;quot;: &amp;quot;system&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;You are a helpful assistant.&amp;quot;},
{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;What is the capital of France and why is it famous?&amp;quot;},
],
temperature=0.7,
max_tokens=150,
)
# Print the model's response
print(response.choices[0].message.content)
&lt;/code>&lt;/pre>
&lt;p>Run this script:&lt;/p>
&lt;pre>&lt;code class="language-bash">python test_sglang.py
&lt;/code>&lt;/pre>
&lt;p>You'll see the model's detailed answer about Paris. At this point, you've successfully completed the entire process from service deployment to inference request using SGLang!&lt;/p>
&lt;h2 id="4-frontend-language-sglang-dsl">4. Frontend Language (SGLang DSL)&lt;/h2>
&lt;p>SGLang's frontend language (DSL) is the core of its usability. It allows you to define complex generation processes in a declarative way, perfectly combining Python's flexibility with the generative capabilities of LLMs.&lt;/p>
&lt;h3 id="41-function-decorator">4.1 &lt;code>@function&lt;/code> Decorator&lt;/h3>
&lt;p>All SGLang programs begin with a Python function decorated by &lt;code>@function&lt;/code>. This decorator transforms an ordinary Python function into an executable SGLang program template.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>State Management&lt;/strong>: The first parameter of the function (typically named &lt;code>s&lt;/code>) represents the current generation state. It's a dictionary-like object used to store and pass all variables produced during the generation process.&lt;/li>
&lt;li>&lt;strong>Delayed Execution&lt;/strong>: Functions decorated with &lt;code>@function&lt;/code> are not executed immediately when defined. Instead, they create a reusable template. The program only executes when the &lt;code>.run()&lt;/code> or &lt;code>.run_batch()&lt;/code> method is called.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Interaction Flow&lt;/strong>&lt;/p>
&lt;p>The following sequence diagram shows a typical end-to-end interaction, in which an SGLang program orchestrates a tool call (tool usage itself is covered in detail in Section 6):&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant User
participant App as Application (Python)
participant SGLang as SGLang Service
participant Tool as External Tool (e.g., Weather API)
User-&amp;gt;&amp;gt;+App: &amp;quot;What's the weather like in Boston?&amp;quot;
App-&amp;gt;&amp;gt;+SGLang: Send request with messages and tools
SGLang-&amp;gt;&amp;gt;SGLang: Model decides to call get_current_weather
SGLang--&amp;gt;&amp;gt;-App: Return tool_calls with function name and parameters
App-&amp;gt;&amp;gt;App: Parse tool_calls
App-&amp;gt;&amp;gt;+Tool: Call get_current_weather(city=&amp;quot;Boston&amp;quot;, unit=&amp;quot;fahrenheit&amp;quot;)
Tool--&amp;gt;&amp;gt;-App: Return weather result: &amp;quot;68°F&amp;quot;
App-&amp;gt;&amp;gt;+SGLang: Send new request with weather result
SGLang-&amp;gt;&amp;gt;SGLang: Model generates final reply based on weather result
SGLang--&amp;gt;&amp;gt;-App: Return final natural language reply
App--&amp;gt;&amp;gt;-User: &amp;quot;It's currently 68°F in Boston.&amp;quot;
&lt;/code>&lt;/pre>
&lt;p>This sequence diagram clearly shows the complete loop from user question to model decision, tool call, result integration, and final response.&lt;/p>
&lt;h3 id="42-core-instructions">4.2 Core Instructions&lt;/h3>
&lt;p>Within SGLang functions, you use a series of instructions to build prompts and control the generation flow.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Role Instructions&lt;/strong>: &lt;code>system()&lt;/code>, &lt;code>user()&lt;/code>, &lt;code>assistant()&lt;/code>
These instructions are used to define different parts of a conversation, conforming to the standard multi-turn dialogue format. You can pass strings directly to them.&lt;/li>
&lt;li>&lt;strong>Generation Instruction&lt;/strong>: &lt;code>gen()&lt;/code>
This is the most important instruction in SGLang. It tells the LLM to generate text at the current position.
&lt;ul>
&lt;li>&lt;code>s += gen(&amp;quot;variable_name&amp;quot;, ...)&lt;/code>: The first parameter of &lt;code>gen()&lt;/code> is required and specifies the variable name in which the generation result will be stored in the state &lt;code>s&lt;/code>.&lt;/li>
&lt;li>&lt;code>max_tokens&lt;/code>: Limits the maximum number of tokens to generate.&lt;/li>
&lt;li>&lt;code>stop&lt;/code>: Defines one or more stop strings. When the model generates these strings, the generation process ends early.&lt;/li>
&lt;li>&lt;code>choices&lt;/code>: Provides a list of strings, forcing the model to choose one of these options for generation.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example: A Complete Frontend Function&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from sglang import function, system, user, assistant, gen, set_default_backend, OpenAI
# Set the backend to the OpenAI-compatible service provided by SGLang
set_default_backend(OpenAI(&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;))
@function
def multi_turn_qa(s, question1, question2):
s += system(&amp;quot;You are a helpful assistant.&amp;quot;)
s += user(question1)
s += assistant(gen(&amp;quot;answer1&amp;quot;, max_tokens=128))
s += user(question2)
s += assistant(gen(&amp;quot;answer2&amp;quot;, max_tokens=128))
# Execute the SGLang program
state = multi_turn_qa.run(
question1=&amp;quot;What is the capital of the UK?&amp;quot;,
question2=&amp;quot;What is its population?&amp;quot;,
temperature=0.1
)
print(&amp;quot;Answer 1:&amp;quot;, state[&amp;quot;answer1&amp;quot;])
print(&amp;quot;Answer 2:&amp;quot;, state[&amp;quot;answer2&amp;quot;])
&lt;/code>&lt;/pre>
&lt;h3 id="43-streaming-output">4.3 Streaming Output&lt;/h3>
&lt;p>For applications requiring real-time feedback, SGLang supports streaming output. Simply set &lt;code>stream=True&lt;/code> in the &lt;code>.run()&lt;/code> method and iterate over the &lt;code>.text_iter()&lt;/code> method of the returned state object.&lt;/p>
&lt;pre>&lt;code class="language-python">state = multi_turn_qa.run(
question1=&amp;quot;Write a short story about a robot.&amp;quot;,
question2=&amp;quot;Continue the story.&amp;quot;,
stream=True
)
for out in state.text_iter(&amp;quot;answer2&amp;quot;):
print(out, end=&amp;quot;&amp;quot;, flush=True)
&lt;/code>&lt;/pre>
&lt;h2 id="5-backend-service-srt-and-api-reference">5. Backend Service (SRT) and API Reference&lt;/h2>
&lt;p>SGLang's backend, the SGLang Runtime (SRT), is a high-performance inference server implemented in Python. It's responsible for loading models, managing KV caches (through RadixAttention), and handling requests from clients. SRT provides two main API endpoints.&lt;/p>
&lt;h3 id="51-native-api-generate">5.1 Native API: &lt;code>/generate&lt;/code>&lt;/h3>
&lt;p>This is a lower-level API that provides the finest control over the generation process.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Endpoint&lt;/strong>: &lt;code>POST /generate&lt;/code>&lt;/li>
&lt;li>&lt;strong>Description&lt;/strong>: Generate text starting from a given text prompt.&lt;/li>
&lt;li>&lt;strong>Core Parameters&lt;/strong>:
&lt;ul>
&lt;li>&lt;code>text&lt;/code> (string, required): The input text prompt.&lt;/li>
&lt;li>&lt;code>sampling_params&lt;/code> (object, optional): A JSON object containing sampling parameters.
&lt;ul>
&lt;li>&lt;code>temperature&lt;/code> (float): Sampling temperature.&lt;/li>
&lt;li>&lt;code>max_new_tokens&lt;/code> (int): Maximum number of new tokens to generate.&lt;/li>
&lt;li>&lt;code>stop&lt;/code> (string or list[string]): Stop tokens.&lt;/li>
&lt;li>&lt;code>json_schema&lt;/code> (string): JSON Schema string for constraining output.&lt;/li>
&lt;li>&lt;code>regex&lt;/code> (string): Regular expression for constraining output.&lt;/li>
&lt;li>&lt;code>ebnf&lt;/code> (string): EBNF grammar for constraining output.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;code>stream&lt;/code> (boolean, optional): Whether to use streaming.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example (using &lt;code>requests&lt;/code>)&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-python">import requests
import json
url = &amp;quot;http://127.0.0.1:30000/generate&amp;quot;
data = {
&amp;quot;text&amp;quot;: &amp;quot;The capital of France is&amp;quot;,
&amp;quot;sampling_params&amp;quot;: {
&amp;quot;temperature&amp;quot;: 0,
&amp;quot;max_new_tokens&amp;quot;: 16,
}
}
response = requests.post(url, json=data)
print(response.json())
# {'text': ' Paris.\n\nThe capital of France is Paris. It is the most populous city in', 'meta': ...}
&lt;/code>&lt;/pre>
&lt;h3 id="52-openai-compatible-api-v1chatcompletions">5.2 OpenAI Compatible API: &lt;code>/v1/chat/completions&lt;/code>&lt;/h3>
&lt;p>For easy migration and integration, SGLang provides a chat completion API fully compatible with OpenAI. You can seamlessly use OpenAI's official client library.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Endpoint&lt;/strong>: &lt;code>POST /v1/chat/completions&lt;/code>&lt;/li>
&lt;li>&lt;strong>Description&lt;/strong>: Perform chat-style text generation.&lt;/li>
&lt;li>&lt;strong>Core Parameters&lt;/strong>:
&lt;ul>
&lt;li>&lt;code>model&lt;/code> (string, required): The name of the model.&lt;/li>
&lt;li>&lt;code>messages&lt;/code> (list[object], required): List of conversation messages.&lt;/li>
&lt;li>&lt;code>temperature&lt;/code>, &lt;code>max_tokens&lt;/code>, &lt;code>stream&lt;/code>, etc.&lt;/li>
&lt;li>&lt;code>response_format&lt;/code> (object, optional): For specifying structured output, such as &lt;code>{&amp;quot;type&amp;quot;: &amp;quot;json_schema&amp;quot;, &amp;quot;json_schema&amp;quot;: ...}&lt;/code>.&lt;/li>
&lt;li>&lt;code>extra_body&lt;/code> (object, optional): SGLang-specific extension parameters, such as &lt;code>{&amp;quot;regex&amp;quot;: &amp;quot;...&amp;quot;}&lt;/code> or &lt;code>{&amp;quot;ebnf&amp;quot;: &amp;quot;...&amp;quot;}&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example (using the &lt;code>openai&lt;/code> library)&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-python">import openai
client = openai.Client(base_url=&amp;quot;http://127.0.0.1:30000/v1&amp;quot;, api_key=&amp;quot;EMPTY&amp;quot;)
response = client.chat.completions.create(
model=&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;List 3 countries and their capitals.&amp;quot;}],
temperature=0,
max_tokens=64,
)
print(response.choices[0].message.content)
&lt;/code>&lt;/pre>
&lt;h2 id="6-advanced-usage-function-callingtool-usage">6. Advanced Usage: Function Calling/Tool Usage&lt;/h2>
&lt;p>SGLang's powerful programming model makes it very suitable for building AI agents capable of calling external tools. This is typically achieved through structured output, where the model is guided to generate text in a specific format (usually JSON) describing a function call.&lt;/p>
&lt;p>Here are the steps to build a simple weather query agent:&lt;/p>
&lt;p>&lt;strong>1. Define Tool Schema&lt;/strong>&lt;/p>
&lt;p>First, use JSON Schema to define your tool. This tells the model the name of the tool, its purpose, and what parameters it needs.&lt;/p>
&lt;pre>&lt;code class="language-python">tools = [
{
&amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;,
&amp;quot;function&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;get_current_weather&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;Get the current weather in a given location&amp;quot;,
&amp;quot;parameters&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
&amp;quot;properties&amp;quot;: {
&amp;quot;city&amp;quot;: {&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;, &amp;quot;description&amp;quot;: &amp;quot;The city name&amp;quot;},
&amp;quot;unit&amp;quot;: {&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;, &amp;quot;enum&amp;quot;: [&amp;quot;celsius&amp;quot;, &amp;quot;fahrenheit&amp;quot;]},
},
&amp;quot;required&amp;quot;: [&amp;quot;city&amp;quot;, &amp;quot;unit&amp;quot;],
},
},
}
]
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>2. Guide the Model to Make Function Calls&lt;/strong>&lt;/p>
&lt;p>In the &lt;code>messages&lt;/code> sent to the model, include a system prompt indicating that the model can use these tools. Then, pass &lt;code>tools&lt;/code> and &lt;code>tool_choice=&amp;quot;auto&amp;quot;&lt;/code> in the API call.&lt;/p>
&lt;pre>&lt;code class="language-python">import json
messages = [
{&amp;quot;role&amp;quot;: &amp;quot;system&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;You are a helpful assistant that can access external tools.&amp;quot;},
{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;What's the weather like in Boston in fahrenheit?&amp;quot;}
]
response = client.chat.completions.create(
model=&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;,
messages=messages,
tools=tools,
tool_choice=&amp;quot;auto&amp;quot;,
)
# Check if the model decided to call a tool
response_message = response.choices[0].message
tool_calls = response_message.tool_calls
if tool_calls:
# Model decided to call a tool
for tool_call in tool_calls:
function_name = tool_call.function.name
function_args = json.loads(tool_call.function.arguments)
print(f&amp;quot;Function Call: {function_name}&amp;quot;)
print(f&amp;quot;Arguments: {function_args}&amp;quot;)
# Here, you could actually execute the function call
# e.g., result = get_current_weather(**function_args)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Output:&lt;/strong>&lt;/p>
&lt;pre>&lt;code>Function Call: get_current_weather
Arguments: {'city': 'Boston', 'unit': 'fahrenheit'}
&lt;/code>&lt;/pre>
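&lt;p>&lt;strong>3. Execute the Tool and Return the Result&lt;/strong>&lt;/p>
&lt;p>To close the loop shown in the sequence diagram in Section 4.1, execute the function locally and send its output back to the model as a &lt;code>tool&lt;/code> message. The weather function below is a hypothetical stub for illustration only:&lt;/p>

```python
import json

def get_current_weather(city, unit):
    # Hypothetical stub; a real implementation would query a weather API
    return json.dumps({'city': city, 'temperature': 68, 'unit': unit})

result = get_current_weather('Boston', 'fahrenheit')
print(result)

# Then append the result and request the final natural-language reply:
# messages.append({'role': 'tool', 'tool_call_id': tool_call.id, 'content': result})
# final = client.chat.completions.create(model=model_name, messages=messages)
```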
&lt;p>In this way, you can build powerful AI applications capable of interacting with the external world.&lt;/p></description></item><item><title>Llama.cpp Technical Guide: Lightweight LLM Inference Engine</title><link>https://ziyanglin.netlify.app/en/post/llama-cpp-documentation/</link><pubDate>Thu, 26 Jun 2025 01:06:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/llama-cpp-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>Llama.cpp is a high-performance, lightweight inference framework for large language models (LLMs) written in C/C++. It focuses on efficiently running LLMs on consumer-grade hardware, making local inference possible on ordinary laptops and even smartphones.&lt;/p>
&lt;p>&lt;strong>Core Advantages:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High Performance:&lt;/strong> Achieves extremely fast inference speeds through optimized C/C++ code, quantization techniques, and hardware acceleration support (such as Apple Metal, CUDA, OpenCL, SYCL).&lt;/li>
&lt;li>&lt;strong>Lightweight:&lt;/strong> Extremely low memory and computational resource consumption, eliminating the need for expensive GPUs.&lt;/li>
&lt;li>&lt;strong>Cross-Platform:&lt;/strong> Supports multiple platforms including macOS, Linux, Windows, Docker, Android, and iOS.&lt;/li>
&lt;li>&lt;strong>Open Ecosystem:&lt;/strong> Features an active community and rich ecosystem, including Python bindings, UI tools, and OpenAI-compatible servers.&lt;/li>
&lt;li>&lt;strong>Continuous Innovation:&lt;/strong> Quickly follows and implements the latest model architectures and inference optimization techniques.&lt;/li>
&lt;/ul>
&lt;h2 id="2-core-concepts">2. Core Concepts&lt;/h2>
&lt;h3 id="21-gguf-model-format">2.1. GGUF Model Format&lt;/h3>
&lt;p>GGUF is the core model file format used by &lt;code>llama.cpp&lt;/code>, the successor to the earlier GGML format. It is a binary format designed for fast loading and memory mapping.&lt;/p>
&lt;p>&lt;strong>Key Features:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Unified File:&lt;/strong> Packages model metadata, vocabulary, and all tensors (weights) in a single file.&lt;/li>
&lt;li>&lt;strong>Extensibility:&lt;/strong> Allows adding new metadata without breaking compatibility.&lt;/li>
&lt;li>&lt;strong>Backward Compatibility:&lt;/strong> Guarantees compatibility with older versions of GGUF models.&lt;/li>
&lt;li>&lt;strong>Memory Efficiency:&lt;/strong> Supports memory mapping (mmap), allowing multiple processes to share the same model weights, thereby saving memory.&lt;/li>
&lt;/ul>
&lt;h3 id="22-quantization">2.2. Quantization&lt;/h3>
&lt;p>Quantization is one of the core advantages of &lt;code>llama.cpp&lt;/code>. It is a technique that converts model weights from high-precision floating-point numbers (such as 32-bit or 16-bit) to low-precision integers (such as 4-bit, 5-bit, or 8-bit).&lt;/p>
&lt;p>&lt;strong>Main Benefits:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Reduced Model Size:&lt;/strong> Significantly reduces the size of model files, making them easier to distribute and store.&lt;/li>
&lt;li>&lt;strong>Lower Memory Usage:&lt;/strong> Reduces the RAM required to load the model into memory.&lt;/li>
&lt;li>&lt;strong>Faster Inference:&lt;/strong> Low-precision calculations are typically faster than high-precision ones, especially on CPUs.&lt;/li>
&lt;/ul>
&lt;p>&lt;code>llama.cpp&lt;/code> supports various quantization methods, particularly &lt;strong>k-quants&lt;/strong>, an advanced quantization technique that achieves extremely high compression rates while maintaining high model performance.&lt;/p>
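&lt;p>A back-of-envelope calculation shows the effect for a 7B-parameter model. Assuming roughly 4.5 bits per weight for a 4-bit quant (the extra half bit approximates per-block scale overhead; exact figures vary by quantization method):&lt;/p>

```python
params = 7_000_000_000

def size_gb(bits_per_weight):
    # Approximate model size in gigabytes at the given precision
    return params * bits_per_weight / 8 / 1e9

fp16_gb = size_gb(16)   # about 14.0 GB
q4_gb = size_gb(4.5)    # about 3.9 GB
print(round(fp16_gb, 1), round(q4_gb, 1))
```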
&lt;h3 id="23-multimodal-support">2.3. Multimodal Support&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> is not limited to text models; it has evolved into a powerful multimodal inference engine that supports processing text, images, and even audio simultaneously.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Supported Models:&lt;/strong> Supports various mainstream multimodal models such as LLaVA, MobileVLM, Granite, Qwen2.5 Omni, InternVL, SmolVLM, etc.&lt;/li>
&lt;li>&lt;strong>Working Principle:&lt;/strong> Typically converts images into embedding vectors through a vision encoder (such as CLIP), and then inputs these vectors along with text embedding vectors into the LLM.&lt;/li>
&lt;li>&lt;strong>Tools:&lt;/strong> &lt;code>llama-mtmd-cli&lt;/code> and &lt;code>llama-server&lt;/code> provide native support for multimodal models.&lt;/li>
&lt;/ul>
&lt;h2 id="3-usage-methods">3. Usage Methods&lt;/h2>
&lt;h3 id="31-compilation">3.1. Compilation&lt;/h3>
&lt;p>Compiling &lt;code>llama.cpp&lt;/code> from source is very simple.&lt;/p>
&lt;pre>&lt;code class="language-bash">git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
make
&lt;/code>&lt;/pre>
&lt;p>For specific hardware acceleration (such as CUDA or Metal), use the corresponding compilation options:&lt;/p>
&lt;pre>&lt;code class="language-bash"># For CUDA
make LLAMA_CUDA=1
# For Metal (on macOS)
make LLAMA_METAL=1
&lt;/code>&lt;/pre>
&lt;h3 id="32-basic-inference">3.2. Basic Inference&lt;/h3>
&lt;p>After compilation, you can use the &lt;code>llama-cli&lt;/code> tool for inference.&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m ./models/7B/ggml-model-q4_0.gguf -p &amp;quot;Building a website can be done in 10 simple steps:&amp;quot; -n 400
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;code>-m&lt;/code>: Specifies the path to the GGUF model file.&lt;/li>
&lt;li>&lt;code>-p&lt;/code>: Specifies the prompt.&lt;/li>
&lt;li>&lt;code>-n&lt;/code>: Specifies the maximum number of tokens to generate.&lt;/li>
&lt;/ul>
&lt;h3 id="33-openai-compatible-server">3.3. OpenAI Compatible Server&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> provides a built-in HTTP server with an API compatible with OpenAI's API. This makes it easy to integrate with existing tools like LangChain and LlamaIndex.&lt;/p>
&lt;p>Starting the server:&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-server -m models/7B/ggml-model-q4_0.gguf -c 4096
&lt;/code>&lt;/pre>
&lt;p>You can then send requests to &lt;code>http://localhost:8080/v1/chat/completions&lt;/code> just like you would with the OpenAI API.&lt;/p>
&lt;h2 id="4-advanced-features">4. Advanced Features&lt;/h2>
&lt;h3 id="41-speculative-decoding">4.1. Speculative Decoding&lt;/h3>
&lt;p>This is an advanced inference optimization technique that significantly accelerates generation speed by using a small &amp;ldquo;draft&amp;rdquo; model to predict the output of the main model.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Working Principle:&lt;/strong> The draft model quickly generates a draft token sequence, which is then validated all at once by the main model. If validated, it saves the time of generating tokens one by one.&lt;/li>
&lt;li>&lt;strong>Usage:&lt;/strong> Use the &lt;code>--model-draft&lt;/code> (&lt;code>-md&lt;/code>) parameter in &lt;code>llama-cli&lt;/code> or &lt;code>llama-server&lt;/code> to specify a small, fast draft model.&lt;/li>
&lt;/ul>
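&lt;p>A toy model conveys the expected gain. If each drafted token is accepted independently with probability &lt;code>a&lt;/code> and the draft length is &lt;code>k&lt;/code>, the expected number of tokens produced per main-model verification step is a truncated geometric sum (a deliberate simplification of the analysis in the speculative decoding literature):&lt;/p>

```python
def expected_tokens(a, k):
    # Expected tokens per verification step under a simple i.i.d. acceptance model
    return (1 - a**(k + 1)) / (1 - a)

# With an 80% acceptance rate and a draft length of 4 tokens:
print(round(expected_tokens(0.8, 4), 2))  # 3.36, versus 1.0 without a draft model
```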
&lt;h3 id="42-lora-support">4.2. LoRA Support&lt;/h3>
&lt;p>LoRA (Low-Rank Adaptation) allows fine-tuning a model's behavior by training a small adapter without modifying the original model weights. &lt;code>llama.cpp&lt;/code> supports loading one or more LoRA adapters during inference.&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m base-model.gguf --lora lora-adapter.gguf
&lt;/code>&lt;/pre>
&lt;p>You can even set different weights for different LoRA adapters:&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m base.gguf --lora-scaled lora_A.gguf 0.5 --lora-scaled lora_B.gguf 0.5
&lt;/code>&lt;/pre>
&lt;h3 id="43-grammars">4.3. Grammars&lt;/h3>
&lt;p>Grammars are a very powerful feature that allows you to force the model's output to follow a specific format, such as a strict JSON schema.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Format:&lt;/strong> Uses a format called GBNF (GGML BNF) to define grammar rules.&lt;/li>
&lt;li>&lt;strong>Application:&lt;/strong> By providing GBNF rules through the &lt;code>grammar&lt;/code> parameter in API requests, you can ensure that the model returns correctly formatted, directly parsable JSON data, avoiding output format errors and tedious post-processing.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example:&lt;/strong> Using a Pydantic model to generate a JSON Schema, then converting it to GBNF to ensure the model output conforms to the expected Python object structure.&lt;/p>
&lt;pre>&lt;code class="language-python">import json
from typing import List
from pydantic import BaseModel
class QAPair(BaseModel):
question: str
answer: str
class Summary(BaseModel):
key_facts: List[str]
qa_pairs: List[QAPair]
# Generate JSON Schema and print
schema = Summary.model_json_schema()
print(json.dumps(schema, indent=2))
&lt;/code>&lt;/pre>
&lt;h2 id="5-ecosystem">5. Ecosystem&lt;/h2>
&lt;p>The success of &lt;code>llama.cpp&lt;/code> has spawned a vibrant ecosystem:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://github.com/abetlen/llama-cpp-python">llama-cpp-python&lt;/a>:&lt;/strong> The most popular Python binding, providing interfaces to almost all features of &lt;code>llama.cpp&lt;/code> and deeply integrated with frameworks like LangChain and LlamaIndex.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://ollama.com/">Ollama&lt;/a>:&lt;/strong> A tool for packaging, distributing, and running models, using &lt;code>llama.cpp&lt;/code> under the hood, greatly simplifying the process of running LLMs locally.&lt;/li>
&lt;li>&lt;strong>Numerous UI Tools:&lt;/strong> The community has developed a large number of graphical interface tools, allowing non-technical users to easily interact with local models.&lt;/li>
&lt;/ul>
&lt;h2 id="6-conclusion">6. Conclusion&lt;/h2>
&lt;p>&lt;code>llama.cpp&lt;/code> is not just an inference engine; it has become a key force in driving the localization and popularization of LLMs. Through its excellent performance, highly optimized resource usage, and continuously expanding feature set (such as multimodality and grammar constraints), &lt;code>llama.cpp&lt;/code> provides developers and researchers with a powerful and flexible platform, enabling them to explore and deploy AI applications on various devices, ushering in a new era of low-cost, privacy-protecting local AI.&lt;/p></description></item><item><title>vLLM Technical Guide: High-Performance LLM Inference Engine</title><link>https://ziyanglin.netlify.app/en/post/vllm-documentation/</link><pubDate>Thu, 26 Jun 2025 01:05:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/vllm-documentation/</guid><description>&lt;h2 id="1-introduction-to-vllm">1. Introduction to vLLM&lt;/h2>
&lt;p>vLLM is an open-source inference and serving engine designed for large language models (LLMs), renowned for its high throughput and memory efficiency. In the field of LLM serving, vLLM addresses a core pain point: traditional inference systems are inefficient when handling the key-value cache (KV Cache) in Transformer models&amp;rsquo; attention mechanism, resulting in significant memory waste and limited inference speed.&lt;/p>
&lt;p>The memory bottleneck in LLM inference primarily stems from the KV Cache. This cache stores attention keys and values for each previous token in a sequence to accelerate the generation of subsequent tokens. However, the size of the KV Cache is dynamic and difficult to predict, creating enormous challenges for memory management. Traditional systems (like HuggingFace Transformers) typically pre-allocate a large continuous memory space to store the KV Cache, leading to severe memory fragmentation and waste.&lt;/p>
&lt;p>vLLM fundamentally solves this problem by introducing its core innovation: the &lt;strong>PagedAttention&lt;/strong> mechanism.&lt;/p>
&lt;h2 id="2-core-features-and-advantages">2. Core Features and Advantages&lt;/h2>
&lt;p>vLLM stands out among numerous LLM inference frameworks thanks to several key features:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Extremely High Throughput&lt;/strong>: Through PagedAttention and Continuous Batching, vLLM significantly improves GPU utilization. Its throughput is several times higher than HuggingFace Transformers and outperforms other mainstream inference libraries.&lt;/li>
&lt;li>&lt;strong>Efficient Memory Management&lt;/strong>: The PagedAttention mechanism divides the KV Cache into non-continuous memory blocks, greatly reducing internal and external memory fragmentation. According to official data, it can save up to 55% of memory, meaning you can load larger models or serve more concurrent requests with the same hardware.&lt;/li>
&lt;li>&lt;strong>Flexible Decoding Strategies&lt;/strong>: vLLM supports various complex decoding algorithms, including Parallel Sampling, Beam Search, and Top-K/Top-P sampling, meeting the needs of different application scenarios.&lt;/li>
&lt;li>&lt;strong>OpenAI API Compatibility&lt;/strong>: vLLM provides a service endpoint that is fully compatible with the OpenAI API. This means you can seamlessly integrate vLLM into existing application ecosystems built on the OpenAI API with just a few configuration changes.&lt;/li>
&lt;li>&lt;strong>Distributed Inference&lt;/strong>: For ultra-large models that cannot fit on a single GPU, vLLM supports Tensor Parallelism, distributing model weights and computational load across multiple GPUs for efficient distributed inference.&lt;/li>
&lt;li>&lt;strong>Streaming and Structured Output&lt;/strong>: Supports streaming of generated tokens and can produce structured outputs in specific formats (such as JSON Schema or regular expressions) through Guided Generation.&lt;/li>
&lt;/ul>
&lt;h2 id="3-core-architecture-deep-dive-into-pagedattention">3. Core Architecture: Deep Dive into PagedAttention&lt;/h2>
&lt;p>PagedAttention is the soul of vLLM, with its design inspiration coming from the paging technique used in modern operating systems to manage virtual memory.&lt;/p>
&lt;h3 id="31-working-principle">3.1 Working Principle&lt;/h3>
&lt;p>In traditional methods, the KV Cache for each sequence is stored in continuous memory space. While this approach seems simple, it leads to severe memory fragmentation due to the vast differences in sequence lengths.&lt;/p>
&lt;p>PagedAttention divides each sequence's KV Cache into fixed-size &lt;strong>blocks&lt;/strong>. Each block can store keys and values for a fixed number of tokens. During inference, vLLM's core scheduler dynamically allocates these blocks to sequences as needed.&lt;/p>
&lt;p>The advantages of this design include:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Minimizing Internal Fragmentation&lt;/strong>: Since blocks are of fixed size, only a sequence's last block may have unused space, and this waste is far less than that caused by reserving a continuous memory region for the entire sequence up front.&lt;/li>
&lt;li>&lt;strong>Flexible Memory Allocation&lt;/strong>: Blocks are stored in non-continuous memory space, making memory management more flexible, similar to how operating systems manage physical memory pages.&lt;/li>
&lt;li>&lt;strong>Efficient Memory Sharing&lt;/strong>: PagedAttention makes sharing KV Cache between different sequences exceptionally simple and efficient. For example, in parallel sampling or beam search, multiple candidate sequences originate from the same prompt. vLLM allows these sequences to share KV blocks storing the prompt portion, only needing to allocate new, independent blocks for each sequence when generating new tokens. This &amp;ldquo;Copy-on-Write&amp;rdquo; mechanism greatly reduces the memory overhead of complex decoding algorithms.&lt;/li>
&lt;/ol>
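&lt;p>The block-table bookkeeping and copy-on-write sharing described above can be illustrated with a toy allocator in plain Python. This is a simplified sketch for intuition only (in real vLLM, a block holds GPU tensors for a fixed number of tokens; here a &amp;ldquo;block&amp;rdquo; is just an integer ID):&lt;/p>

```python
# Toy PagedAttention-style bookkeeping: each sequence maps logical block
# indices to physical block IDs via a block table; forked sequences share
# the parent's prompt blocks and only allocate private blocks for new tokens.

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block IDs
        self.refcount = {}                   # block ID -> sequences referencing it

    def allocate(self):
        block = self.free.pop(0)
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1            # another sequence references this block

class Sequence:
    def __init__(self, allocator, prompt_blocks=None):
        self.allocator = allocator
        if prompt_blocks is None:
            self.block_table = [allocator.allocate()]      # fresh prompt block
        else:
            # Fork (e.g. parallel sampling): share prompt blocks, don't copy
            self.block_table = list(prompt_blocks)
            for b in prompt_blocks:
                allocator.share(b)

    def append_block(self):
        # Newly generated tokens go into a fresh, sequence-private block
        self.block_table.append(self.allocator.allocate())

alloc = BlockAllocator(num_blocks=8)
parent = Sequence(alloc)                        # prompt occupies block 0
fork_a = Sequence(alloc, parent.block_table)    # sampling branch a
fork_b = Sequence(alloc, parent.block_table)    # sampling branch b
fork_a.append_block()                           # branch-specific tokens
fork_b.append_block()

print(fork_a.block_table)   # [0, 1] -- shares prompt block 0
print(fork_b.block_table)   # [0, 2] -- shares prompt block 0
print(alloc.refcount[0])    # 3 references to the shared prompt block
```

&lt;p>The prompt's KV data exists once in physical memory no matter how many branches reference it, which is exactly the saving that makes parallel sampling and beam search cheap.&lt;/p>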
&lt;p>Below is a Mermaid diagram that more intuitively illustrates PagedAttention's memory management approach:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph Physical_Memory [KV Cache Physical Memory]
direction LR
B1(Block 1)
B2(Block 2)
B3(Block 3)
B4(Block 4)
B5(Block 5)
B6(Block 6)
B7(Block 7)
B8(Block 8)
end
subgraph Logical_View [Sequence Logical View]
direction TB
subgraph Seq1 [Sequence 1]
P1(Prompt) --&amp;gt; T1(Token 1)
end
subgraph Seq2 [Sequence 2]
P2(Prompt) --&amp;gt; T2(Token 1) --&amp;gt; T3(Token 2)
end
subgraph Seq3 [Parallel Sampling]
P3(Prompt) --&amp;gt; T4(Token 1a)
P3 --&amp;gt; T5(Token 1b)
end
end
subgraph Block_Table [Block Table]
direction TB
Map1[&amp;quot;Seq 1: [B1, B5]&amp;quot;]
Map2[&amp;quot;Seq 2: [B2, B6, B8]&amp;quot;]
Map3[&amp;quot;Seq 3a: [B3, B7]&amp;quot;]
Map4[&amp;quot;Seq 3b: [B3, B4]&amp;quot;]
end
Seq1 --&amp;gt; Map1
Seq2 --&amp;gt; Map2
Seq3 --&amp;gt; Map3
Seq3 --&amp;gt; Map4
Map1 --&amp;gt; B1
Map1 --&amp;gt; B5
Map2 --&amp;gt; B2
Map2 --&amp;gt; B6
Map2 --&amp;gt; B8
Map3 --&amp;gt; B3
Map3 --&amp;gt; B7
Map4 --&amp;gt; B3
Map4 --&amp;gt; B4
style B3 fill:#f9f,stroke:#333,stroke-width:2px
linkStyle 8 stroke-width:2px,stroke:green,fill:none;
linkStyle 11 stroke-width:2px,stroke:green,fill:none;
linkStyle 12 stroke-width:2px,stroke:green,fill:none;
&lt;/code>&lt;/pre>
&lt;p>&lt;em>Diagram explanation:&lt;/em>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>KV Cache Physical Memory&lt;/strong>: Represents non-continuous physical memory blocks on the GPU.&lt;/li>
&lt;li>&lt;strong>Sequence Logical View&lt;/strong>: Represents multiple requests (sequences) being processed.&lt;/li>
&lt;li>&lt;strong>Block Table&lt;/strong>: vLLM's core component that maps logical token positions to physical memory blocks.&lt;/li>
&lt;li>&lt;strong>Memory Sharing&lt;/strong>: Note that the two branches in &amp;ldquo;Parallel Sampling&amp;rdquo; (3a and 3b) share the same Prompt block (B3), demonstrating PagedAttention's efficient memory sharing.&lt;/li>
&lt;/ul>
&lt;h3 id="32-continuous-batching">3.2 Continuous Batching&lt;/h3>
&lt;p>Based on PagedAttention, vLLM implements a more advanced batching strategy—continuous batching. Traditional static batching requires waiting for all sequences in a batch to complete generation before processing the next batch. Continuous batching, however, allows new requests to be inserted into the batch immediately after a sequence in the batch completes generation, avoiding GPU idle waiting and further improving throughput.&lt;/p>
&lt;p>Below is a comparison of the two batching methods using a Mermaid sequence diagram:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant C as Client
participant S as Server
participant G as GPU
note over C, G: --- Static Batching ---
C-&amp;gt;&amp;gt;S: Request [R1, R2, R3, R4]
S-&amp;gt;&amp;gt;G: Process Batch 1 [R1, R2, R3, R4]
note right of G: All requests process in parallel
G--&amp;gt;&amp;gt;S: Batch 1 Finished
note right of S: Wait for the entire batch to complete
S--&amp;gt;&amp;gt;C: Response [O1, O2, O3, O4]
C-&amp;gt;&amp;gt;S: Request [R5, R6]
S-&amp;gt;&amp;gt;G: Process Batch 2 [R5, R6]
note over C, G: --- Continuous Batching ---
C-&amp;gt;&amp;gt;S: Request [R1, R2, R3, R4]
S-&amp;gt;&amp;gt;G: Process [R1, R2, R3, R4]
G--&amp;gt;&amp;gt;S: R2 Finished
S--&amp;gt;&amp;gt;C: Response O2
C-&amp;gt;&amp;gt;S: New Request R5
S-&amp;gt;&amp;gt;G: Add R5 to queue (GPU is not idle)
note right of G: R1, R3, R4, R5 are now processing
G--&amp;gt;&amp;gt;S: R4 Finished
S--&amp;gt;&amp;gt;C: Response O4
&lt;/code>&lt;/pre>
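&lt;p>The throughput effect shown in the diagram can be reproduced with a small scheduling simulation. This is a hypothetical sketch, not vLLM code: each request is just the number of decode steps it needs, the GPU runs up to 4 sequences per step, and we count the steps each strategy needs to finish all requests:&lt;/p>

```python
# Compare static batching (wait for the whole batch to finish) with
# continuous batching (refill freed slots immediately).

def static_batching(requests, batch_size=4):
    steps = 0
    for i in range(0, len(requests), batch_size):
        batch = requests[i:i + batch_size]
        steps += max(batch)   # the batch ends only when its longest request ends
    return steps

def continuous_batching(requests, batch_size=4):
    pending = list(requests)
    running = []
    steps = 0
    while pending or running:
        while pending and len(running) < batch_size:
            running.append(pending.pop(0))           # refill free slots right away
        steps += 1
        # one decode step for every running sequence; finished ones leave
        running = [r - 1 for r in running if r > 1]
    return steps

requests = [10, 2, 3, 10, 2, 3]
print(static_batching(requests))      # 13: batch of [10,2,3,10] then [2,3]
print(continuous_batching(requests))  # 10: slots freed by short requests are reused
```

&lt;p>Short requests no longer force the GPU to idle until the longest request in their batch finishes, which is where the throughput gain comes from.&lt;/p>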
&lt;h2 id="4-quick-start-guide">4. Quick Start Guide&lt;/h2>
&lt;p>Below, we'll demonstrate how to install and use vLLM through a few simple steps.&lt;/p>
&lt;h3 id="41-installation">4.1 Installation&lt;/h3>
&lt;p>You can install vLLM using either &lt;code>pip&lt;/code> or &lt;code>uv&lt;/code> (a faster package installation tool). Using &lt;code>uv&lt;/code> is recommended as it can automatically detect your CUDA version and install the matching PyTorch backend.&lt;/p>
&lt;p>&lt;strong>Using uv (recommended):&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash"># Create and activate a virtual environment
uv venv
source .venv/bin/activate
# Install vLLM
uv pip install vllm --torch-backend=auto
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Using pip:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install vllm
&lt;/code>&lt;/pre>
&lt;h3 id="42-offline-inference">4.2 Offline Inference&lt;/h3>
&lt;p>The &lt;code>vllm.LLM&lt;/code> class makes offline inference very convenient.&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM, SamplingParams
# Define input prompts
prompts = [
&amp;quot;Hello, my name is&amp;quot;,
&amp;quot;The capital of France is&amp;quot;,
&amp;quot;The future of AI is&amp;quot;,
]
# Define sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Initialize the LLM engine (model will be automatically downloaded from Hugging Face)
llm = LLM(model=&amp;quot;facebook/opt-125m&amp;quot;)
# Generate text
outputs = llm.generate(prompts, sampling_params)
# Print results
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f&amp;quot;Prompt: {prompt!r}, Generated text: {generated_text!r}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h3 id="43-launching-an-openaicompatible-server">4.3 Launching an OpenAI-Compatible Server&lt;/h3>
&lt;p>One of vLLM's most powerful features is its built-in API server. With just one command, you can start a service compatible with the OpenAI API.&lt;/p>
&lt;pre>&lt;code class="language-bash">vllm serve Qwen/Qwen2.5-1.5B-Instruct
&lt;/code>&lt;/pre>
&lt;p>By default, the server will run on &lt;code>http://localhost:8000&lt;/code>.&lt;/p>
&lt;h3 id="44-interacting-with-the-server">4.4 Interacting with the Server&lt;/h3>
&lt;p>You can interact with the server using &lt;code>curl&lt;/code> or the &lt;code>openai&lt;/code> Python client.&lt;/p>
&lt;p>&lt;strong>Using curl:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">curl http://localhost:8000/v1/completions \
-H &amp;quot;Content-Type: application/json&amp;quot; \
-d '{
&amp;quot;model&amp;quot;: &amp;quot;Qwen/Qwen2.5-1.5B-Instruct&amp;quot;,
&amp;quot;prompt&amp;quot;: &amp;quot;San Francisco is a&amp;quot;,
&amp;quot;max_tokens&amp;quot;: 7,
&amp;quot;temperature&amp;quot;: 0
}'
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Using the OpenAI Python client:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from openai import OpenAI
client = OpenAI(
base_url=&amp;quot;http://localhost:8000/v1&amp;quot;,
api_key=&amp;quot;not-used&amp;quot; # API key is not required
)
completion = client.chat.completions.create(
model=&amp;quot;Qwen/Qwen2.5-1.5B-Instruct&amp;quot;,
messages=[
{&amp;quot;role&amp;quot;: &amp;quot;system&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;You are a helpful assistant.&amp;quot;},
{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Who won the world series in 2020?&amp;quot;}
]
)
print(completion.choices[0].message)
&lt;/code>&lt;/pre>
&lt;h2 id="5-model-serving">5. Model Serving&lt;/h2>
&lt;h3 id="51-distributed-serving">5.1 Distributed Serving&lt;/h3>
&lt;p>If a model is too large to fit on a single GPU, you can distribute it across multiple GPUs using tensor parallelism.&lt;/p>
&lt;pre>&lt;code class="language-bash"># Start a service on 4 GPUs
vllm serve facebook/opt-13b --tensor-parallel-size 4
&lt;/code>&lt;/pre>
&lt;h3 id="52-docker-deployment">5.2 Docker Deployment&lt;/h3>
&lt;p>vLLM provides official Docker images for convenient containerized deployment.&lt;/p>
&lt;pre>&lt;code class="language-bash">docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env &amp;quot;HUGGING_FACE_HUB_TOKEN=&amp;lt;your-hf-token&amp;gt;&amp;quot; \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-v0.1
&lt;/code>&lt;/pre>
&lt;h2 id="6-advanced-features">6. Advanced Features&lt;/h2>
&lt;h3 id="61-structured-outputs">6.1 Structured Outputs&lt;/h3>
&lt;p>vLLM supports various ways to constrain the model's output format, which is crucial for applications requiring reliable, parsable outputs.&lt;/p>
&lt;p>&lt;strong>Generating JSON using Pydantic models:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from pydantic import BaseModel
from openai import OpenAI
client = OpenAI(base_url=&amp;quot;http://localhost:8000/v1&amp;quot;, api_key=&amp;quot;dummy&amp;quot;)
model = client.models.list().data[0].id
class People(BaseModel):
name: str
age: int
completion = client.chat.completions.create(
model=model,
messages=[
{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Generate a JSON with the name and age of one random person.&amp;quot;}
],
response_format={
&amp;quot;type&amp;quot;: &amp;quot;json_schema&amp;quot;,
&amp;quot;json_schema&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;people&amp;quot;,
&amp;quot;schema&amp;quot;: People.model_json_schema()
}
},
)
print(completion.choices[0].message.content)
&lt;/code>&lt;/pre>
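&lt;p>To make the contract concrete: the &lt;code>response_format&lt;/code> above carries an ordinary JSON Schema (roughly what &lt;code>People.model_json_schema()&lt;/code> emits), and the server constrains decoding so the returned content parses against it. A minimal local sketch of checking such a payload, needing no server and no Pydantic:&lt;/p>

```python
import json

# Approximately the schema Pydantic generates for the People model above
people_schema = {
    "title": "People",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

def matches(schema, payload):
    """Tiny subset of JSON Schema validation: required keys + primitive types."""
    type_map = {"string": str, "integer": int}
    obj = json.loads(payload)
    return all(
        key in obj and isinstance(obj[key], type_map[prop["type"]])
        for key, prop in schema["properties"].items()
    )

# The kind of content a guided-generation response contains
print(matches(people_schema, '{"name": "Ada Lovelace", "age": 36}'))  # True
print(matches(people_schema, '{"name": "Ada"}'))                      # False, missing "age"
```

&lt;p>Without guided generation you would have to retry or repair malformed outputs yourself; with it, &lt;code>json.loads&lt;/code> on the response content is safe by construction.&lt;/p>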
&lt;h3 id="62-lora-support">6.2 LoRA Support&lt;/h3>
&lt;p>vLLM can efficiently serve multiple LoRA adapters on the same base model. This is particularly useful for scenarios requiring customized models for different customers or tasks.&lt;/p>
&lt;p>&lt;strong>Enabling LoRA support in the offline engine:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM
llm = LLM(model=&amp;quot;meta-llama/Llama-2-7b-hf&amp;quot;, enable_lora=True)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Specifying a LoRA adapter in a request:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">curl http://localhost:8000/v1/completions \
-H &amp;quot;Content-Type: application/json&amp;quot; \
-d '{
&amp;quot;model&amp;quot;: &amp;quot;sql-lora&amp;quot;, # Specify the LoRA model ID
&amp;quot;prompt&amp;quot;: &amp;quot;San Francisco is a&amp;quot;,
&amp;quot;max_tokens&amp;quot;: 7
}'
&lt;/code>&lt;/pre>
&lt;h3 id="63-quantization">6.3 Quantization&lt;/h3>
&lt;p>Quantization is a technique to reduce model size and memory usage by lowering the precision of model weights. vLLM supports weight quantization schemes such as AWQ, and can additionally quantize the KV cache itself to FP8.&lt;/p>
&lt;p>&lt;strong>Enabling FP8 KV cache:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM
llm = LLM(
model=&amp;quot;meta-llama/Llama-2-7b-chat-hf&amp;quot;,
kv_cache_dtype=&amp;quot;fp8&amp;quot;,
calculate_kv_scales=True # Dynamically calculate quantization scales
)
&lt;/code>&lt;/pre>
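&lt;p>A back-of-the-envelope calculation shows why halving KV-cache precision matters. Assuming Llama-2-7B's published dimensions (32 layers, hidden size 4096, keys and values both cached per layer), the cache cost per token and per sequence is:&lt;/p>

```python
# KV cache size = 2 (K and V) * num_layers * hidden_size * bytes_per_element * seq_len
num_layers, hidden_size = 32, 4096   # Llama-2-7B dimensions

def kv_cache_bytes(seq_len, bytes_per_element):
    return 2 * num_layers * hidden_size * bytes_per_element * seq_len

fp16 = kv_cache_bytes(seq_len=4096, bytes_per_element=2)
fp8 = kv_cache_bytes(seq_len=4096, bytes_per_element=1)

print(fp16 / 2**30)   # 2.0 GiB for one full-length sequence with an fp16 cache
print(fp8 / 2**30)    # 1.0 GiB with an fp8 cache
```

&lt;p>At 0.5 MiB per token in fp16, a handful of long concurrent sequences already dominates GPU memory, so an fp8 KV cache roughly doubles how many such sequences fit alongside the model weights.&lt;/p>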
&lt;h2 id="7-framework-integration">7. Framework Integration&lt;/h2>
&lt;p>vLLM can be easily integrated with popular LLM application frameworks like LangChain and LlamaIndex for building complex systems such as Retrieval-Augmented Generation (RAG). Typically, vLLM serves as a backend providing fast LLM inference and embedding generation services.&lt;/p>
&lt;p>&lt;strong>Installing related dependencies:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install -U vllm langchain_openai langchain_community
&lt;/code>&lt;/pre>
&lt;p>Afterward, in LangChain, you can point the &lt;code>base_url&lt;/code> of &lt;code>ChatOpenAI&lt;/code> or &lt;code>OpenAIEmbeddings&lt;/code> to your vLLM server's address to complete the integration.&lt;/p>
&lt;h2 id="8-conclusion">8. Conclusion&lt;/h2>
&lt;p>Through its innovative PagedAttention architecture, vLLM successfully addresses memory management and performance bottlenecks in LLM inference, providing developers with an extremely efficient, flexible, and easy-to-use inference serving engine. Whether conducting quick offline experiments or deploying production-grade, high-concurrency LLM services, vLLM demonstrates excellent performance and powerful functionality. As the community continues to develop, vLLM is becoming one of the standard tools in the field of LLM serving.&lt;/p></description></item><item><title>LoRA Technical Guide: Parameter-Efficient Fine-Tuning for Large Models</title><link>https://ziyanglin.netlify.app/en/post/lora-documentation/</link><pubDate>Thu, 26 Jun 2025 00:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/lora-documentation/</guid><description>&lt;h2 id="1-introduction-why-lora">1. Introduction: Why LoRA?&lt;/h2>
&lt;p>In today's rapidly evolving landscape of Large Language Models (LLMs) and generative AI, we've witnessed an explosive growth in model sizes, ranging from hundreds of millions to trillions of parameters. These massive models demonstrate remarkable capabilities across various tasks. However, a significant challenge emerges: how can we fine-tune these models for specific downstream tasks?&lt;/p>
&lt;p>The traditional &lt;strong>Full Fine-Tuning&lt;/strong> approach, which updates all parameters of a model, faces severe challenges:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High computational cost&lt;/strong>: Fine-tuning a model with billions of parameters requires enormous computational resources and hundreds of GB of GPU memory, which is prohibitively expensive for most developers and small to medium-sized enterprises.&lt;/li>
&lt;li>&lt;strong>Massive storage requirements&lt;/strong>: Each fine-tuned model for a specific task requires storing a complete model copy, leading to rapidly escalating storage costs.&lt;/li>
&lt;li>&lt;strong>Deployment difficulties&lt;/strong>: Maintaining and switching between multiple massive model copies for different tasks in a production environment is a nightmare.&lt;/li>
&lt;/ul>
&lt;p>To address these pain points, &lt;strong>Parameter-Efficient Fine-Tuning (PEFT)&lt;/strong> techniques have emerged. The core idea is to freeze most parameters of the pre-trained model during fine-tuning and only adjust a small portion (typically far less than 1% of the total) of new or specific parameters.&lt;/p>
&lt;p>Among the various PEFT techniques, &lt;strong>LoRA (Low-Rank Adaptation of Large Language Models)&lt;/strong> stands out for its excellent performance, efficiency, and implementation simplicity, becoming one of the most mainstream and widely applied solutions today. This document will provide an in-depth yet accessible introduction to the core principles of LoRA and offer detailed practical guidance.&lt;/p>
&lt;h2 id="2-core-principles-the-magic-of-lora">2. Core Principles: The Magic of LoRA&lt;/h2>
&lt;p>LoRA's core assumption is that &lt;strong>the weight changes in large language models when adapting to new tasks are low-rank&lt;/strong>. In other words, although the weight matrix &lt;code>W&lt;/code> of the pre-trained model is very large (e.g., &lt;code>d x d&lt;/code> dimensions), the weight change &lt;code>ΔW&lt;/code> during fine-tuning has a very low &amp;ldquo;intrinsic rank.&amp;rdquo;&lt;/p>
&lt;p>Based on this assumption, LoRA doesn't directly update &lt;code>W&lt;/code>, but instead approximates &lt;code>ΔW&lt;/code> by training two smaller, low-rank matrices &lt;code>B&lt;/code> and &lt;code>A&lt;/code>, such that &lt;code>ΔW ≈ BA&lt;/code>.&lt;/p>
&lt;ul>
&lt;li>&lt;code>W&lt;/code> is the pre-trained, frozen weight matrix.&lt;/li>
&lt;li>&lt;code>A&lt;/code> is an &lt;code>r x d&lt;/code> dimensional matrix, where &lt;code>r&lt;/code> is a rank much smaller than &lt;code>d&lt;/code>.&lt;/li>
&lt;li>&lt;code>B&lt;/code> is a &lt;code>d x r&lt;/code> dimensional matrix.&lt;/li>
&lt;/ul>
&lt;p>During fine-tuning, only the parameters of matrices &lt;code>A&lt;/code> and &lt;code>B&lt;/code> are trainable. The forward propagation computation process is accordingly changed to:&lt;/p>
&lt;p>&lt;code>h = Wx + BAx&lt;/code>&lt;/p>
&lt;p>Here's a diagram that illustrates this process more intuitively:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Input x] --&amp;gt; B(Pre-trained weights W);
A --&amp;gt; C(Low-rank matrix A);
C --&amp;gt; D(Low-rank matrix B);
B --&amp;gt; E[Wx];
D --&amp;gt; F[BAx];
E --&amp;gt; G((Sum));
F --&amp;gt; G;
G --&amp;gt; H[Final output h];
style B fill:#eee,stroke:#333,stroke-width:2px,stroke-dasharray: 5, 5
style C fill:#9cf,stroke:#333,stroke-width:2px
style D fill:#9cf,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;p>Where &lt;code>x&lt;/code> is the input and &lt;code>h&lt;/code> is the output. This approach greatly reduces the number of parameters that need to be trained. For example, if &lt;code>d = 4096&lt;/code> and &lt;code>r = 8&lt;/code>, the original matrix &lt;code>W&lt;/code> has &lt;code>4096 * 4096 ≈ 16.7M&lt;/code> parameters, while &lt;code>A&lt;/code> and &lt;code>B&lt;/code> together have only &lt;code>4096 * 8 + 8 * 4096 ≈ 65K&lt;/code> parameters, reducing the parameter count by approximately 256 times!&lt;/p>
&lt;p>&lt;strong>Key parameter &lt;code>r&lt;/code>&lt;/strong>: The rank &lt;code>r&lt;/code> is the most important hyperparameter in LoRA. It controls the size of the low-rank matrices and directly determines the number of new parameters.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Smaller &lt;code>r&lt;/code>&lt;/strong>: Fewer trainable parameters, faster training speed, lower memory usage, but may not fully capture complex features of the task.&lt;/li>
&lt;li>&lt;strong>Larger &lt;code>r&lt;/code>&lt;/strong>: More trainable parameters, stronger model fitting capability, but increases computational cost and risk of overfitting.&lt;/li>
&lt;/ul>
&lt;p>In practice, &lt;code>r&lt;/code> is typically set to 8, 16, 32, or 64, which achieves a good balance between performance and efficiency.&lt;/p>
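&lt;p>The parameter arithmetic above is easy to verify with a few lines of Python, assuming a square &lt;code>d x d&lt;/code> weight matrix as in the example:&lt;/p>

```python
# Trainable-parameter count for LoRA vs. full fine-tuning of one weight matrix.
d, r = 4096, 8

full = d * d            # parameters in the frozen matrix W
lora = d * r + r * d    # parameters in B (d x r) plus A (r x d)

print(full)             # 16777216  (~16.7M)
print(lora)             # 65536     (~65K)
print(full // lora)     # 256-fold reduction, i.e. d / (2 * r)
```

&lt;p>The reduction factor is exactly &lt;code>d / (2r)&lt;/code>, which is why doubling &lt;code>r&lt;/code> halves the savings but still leaves the adapter tiny relative to &lt;code>W&lt;/code>.&lt;/p>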
&lt;h2 id="3-significant-advantages-of-lora">3. Significant Advantages of LoRA&lt;/h2>
&lt;p>Compared to full fine-tuning, LoRA demonstrates overwhelming advantages in multiple aspects:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Extreme parameter efficiency&lt;/strong>: As mentioned above, LoRA only requires training a tiny fraction of parameters. We can see this intuitively through the &lt;code>print_trainable_parameters()&lt;/code> function, where the proportion of trained parameters is typically less than 1%.&lt;/li>
&lt;li>&lt;strong>Faster training speed&lt;/strong>: With a significantly reduced number of parameters for gradient computation and updates, training time is also shortened, accelerating the iteration cycle.&lt;/li>
&lt;li>&lt;strong>Lower hardware requirements&lt;/strong>: LoRA significantly reduces GPU memory (VRAM) usage during training, making it possible to fine-tune models with tens of billions of parameters on consumer-grade GPUs (such as RTX 3090/4090).&lt;/li>
&lt;li>&lt;strong>Flexibility in deployment and management&lt;/strong>: This is one of LoRA's most attractive advantages. The pre-trained model remains unchanged and can be shared across all tasks. For each downstream task, we only need to save a lightweight (typically just a few MB to tens of MB) LoRA adapter (i.e., the weights of matrices A and B). During deployment, the appropriate adapter can be loaded dynamically according to needs, greatly simplifying model management and switching in multi-task scenarios.&lt;/li>
&lt;/ol>
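&lt;p>The adapter size claimed above (&amp;ldquo;a few MB to tens of MB&amp;rdquo;) can also be checked with quick arithmetic. The numbers below are a hypothetical but typical configuration: a Llama-2-7B-class model (32 layers, hidden size 4096), LoRA applied to &lt;code>q_proj&lt;/code> and &lt;code>v_proj&lt;/code> with &lt;code>r=16&lt;/code>, weights stored in fp16:&lt;/p>

```python
# Approximate on-disk size of a LoRA adapter (illustrative configuration).
num_layers = 32          # transformer blocks in a Llama-2-7B-class model
hidden = 4096            # q_proj / v_proj are hidden x hidden matrices here
r = 16                   # LoRA rank
targets_per_layer = 2    # q_proj and v_proj
bytes_per_param = 2      # fp16

params_per_module = 2 * hidden * r   # A (r x hidden) + B (hidden x r)
total_params = num_layers * targets_per_layer * params_per_module
size_mb = total_params * bytes_per_param / 2**20

print(total_params)   # 8388608 trainable parameters
print(size_mb)        # 16.0 MB on disk
```

&lt;p>So one base model plus dozens of 16 MB adapters replaces dozens of multi-GB full-model copies, which is the storage argument for LoRA in multi-task deployments.&lt;/p>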
&lt;h2 id="4-handson-practice-lora-training-methods">4. Hands-on Practice: LoRA Training Methods&lt;/h2>
&lt;p>Below, we'll demonstrate a complete example of how to fine-tune a large model using LoRA with the &lt;code>transformers&lt;/code>, &lt;code>peft&lt;/code>, and &lt;code>trl&lt;/code> libraries from the Hugging Face ecosystem.&lt;/p>
&lt;h3 id="step-1-environment-preparation">Step 1: Environment Preparation&lt;/h3>
&lt;p>First, ensure you have installed the necessary Python libraries:&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install transformers peft trl datasets torch
&lt;/code>&lt;/pre>
&lt;h3 id="step-2-load-model-tokenizer-and-dataset">Step 2: Load Model, Tokenizer, and Dataset&lt;/h3>
&lt;p>We select a pre-trained model as the foundation and load the corresponding tokenizer. At the same time, we load a dataset from the Hugging Face Hub for fine-tuning.&lt;/p>
&lt;pre>&lt;code class="language-python">from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from datasets import load_dataset
# Model ID, can be any supported Causal LM
model_id = &amp;quot;facebook/opt-350m&amp;quot;
# Load pre-trained model
model = AutoModelForCausalLM.from_pretrained(model_id)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load dataset (using English quotes dataset as an example)
dataset = load_dataset(&amp;quot;Abirate/english_quotes&amp;quot;, split=&amp;quot;train&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h3 id="step-3-configure-lora-loraconfig">Step 3: Configure LoRA (&lt;code>LoraConfig&lt;/code>)&lt;/h3>
&lt;p>This is the core step of LoRA fine-tuning. We need to create a &lt;code>LoraConfig&lt;/code> object to define the behavior of the LoRA adapter.&lt;/p>
&lt;pre>&lt;code class="language-python">from peft import LoraConfig
lora_config = LoraConfig(
r=16, # Rank of the low-rank matrices, recommended values are 8, 16, 32
lora_alpha=32, # Scaling factor, typically set to twice the value of r
target_modules=[&amp;quot;q_proj&amp;quot;, &amp;quot;v_proj&amp;quot;], # Specify which model layers to apply LoRA to. For Transformer models, typically q_proj and v_proj
lora_dropout=0.05, # Dropout probability for LoRA layers
bias=&amp;quot;none&amp;quot;, # Whether to train bias terms, &amp;quot;none&amp;quot; means not training
task_type=&amp;quot;CAUSAL_LM&amp;quot; # Task type, here it's causal language modeling
)
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;code>target_modules&lt;/code>: This parameter is crucial. It tells the PEFT library which modules (typically &lt;code>nn.Linear&lt;/code> layers) in the model should have LoRA applied. For most Transformer models, applying it to the query and value projection layers in the Attention mechanism (i.e., &lt;code>q_proj&lt;/code> and &lt;code>v_proj&lt;/code>) is a common practice. You can print the &lt;code>model&lt;/code> object to see the names of all its modules to determine which can be targeted.&lt;/li>
&lt;/ul>
&lt;h3 id="step-4-apply-lora-and-train-with-sfttrainer">Step 4: Apply LoRA and Train with &lt;code>SFTTrainer&lt;/code>&lt;/h3>
&lt;p>The &lt;code>SFTTrainer&lt;/code> (Supervised Fine-tuning Trainer) provided by the &lt;code>trl&lt;/code> library greatly simplifies the fine-tuning process. It has built-in support for &lt;code>peft&lt;/code>, so we just need to pass the model, tokenizer, dataset, and &lt;code>peft_config&lt;/code> to it.&lt;/p>
&lt;pre>&lt;code class="language-python">from trl import SFTTrainer
# Define training parameters
training_args = TrainingArguments(
output_dir=&amp;quot;./lora_finetuned_model&amp;quot;, # Model output directory
num_train_epochs=3, # Number of training epochs
per_device_train_batch_size=4, # Training batch size per device
logging_dir='./logs', # Logging directory
logging_steps=50, # Log every this many steps
learning_rate=2e-4, # Learning rate
)
# Initialize SFTTrainer
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
args=training_args,
train_dataset=dataset,
peft_config=lora_config, # Pass in LoRA configuration
dataset_text_field=&amp;quot;quote&amp;quot;, # Field name containing text in the dataset
)
# Start training
trainer.train()
# Save the trained LoRA adapter
trainer.save_model()
&lt;/code>&lt;/pre>
&lt;p>After training is complete, an &lt;code>adapter_model.bin&lt;/code> file (named &lt;code>adapter_model.safetensors&lt;/code> in newer &lt;code>peft&lt;/code> versions) and an &lt;code>adapter_config.json&lt;/code> file will be generated in the &lt;code>output_dir&lt;/code> directory. Together, these form the lightweight LoRA adapter we've trained.&lt;/p>
&lt;h3 id="step-5-inference-with-the-trained-lora-adapter">Step 5: Inference with the Trained LoRA Adapter&lt;/h3>
&lt;p>For inference, we first load the original pre-trained model, then load the trained LoRA adapter weights.&lt;/p>
&lt;pre>&lt;code class="language-python">from peft import PeftModel
# Load the original, non-fine-tuned model
base_model = AutoModelForCausalLM.from_pretrained(model_id)
# Load the LoRA adapter
model_with_lora = PeftModel.from_pretrained(base_model, &amp;quot;./lora_finetuned_model&amp;quot;)
# Now model_with_lora is a model with LoRA weights integrated, ready for inference
prompt = &amp;quot;The best way to predict the future is to&amp;quot;
inputs = tokenizer(prompt, return_tensors=&amp;quot;pt&amp;quot;)
# Generate text
outputs = model_with_lora.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
&lt;/code>&lt;/pre>
&lt;h2 id="5-lora-model-deployment-from-static-to-dynamic">5. LoRA Model Deployment: From Static to Dynamic&lt;/h2>
&lt;p>After training, efficiently deploying LoRA models into production environments is the crucial next step. LoRA deployment strategies mainly fall into two categories: &lt;strong>Weight Merging (Static Deployment)&lt;/strong> and &lt;strong>Dynamic Adapter Loading (Dynamic Deployment)&lt;/strong>. The following flowcharts illustrate these two paths:&lt;/p>
&lt;p>&lt;strong>Option 1: Weight Merging (Static Deployment)&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[LoRA Training Complete] --&amp;gt; B[Base Model + LoRA Adapter];
B --&amp;gt; C[&amp;quot;Call merge_and_unload()&amp;quot;];
C --&amp;gt; D[Generate standalone full model];
D --&amp;gt; E[Standard deployment];
style D fill:#c9f,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Option 2: Dynamic Adapter Loading (Dynamic Deployment)&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[LoRA Training Complete] --&amp;gt; B[vLLM / TGI server];
B --&amp;gt; C[Load Base Model];
C --&amp;gt; D[Load multiple LoRA Adapters];
D --&amp;gt; E[On-demand inference combinations];
style E fill:#9cf,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;h3 id="option-1-weight-merging-and-standard-deployment-static">Option 1: Weight Merging and Standard Deployment (Static)&lt;/h3>
&lt;p>This is the simplest and most direct deployment approach. The core idea is to merge the lightweight LoRA adapter weights into the original base model weights, generating a new, standalone full model.&lt;/p>
&lt;p>&lt;strong>Method&lt;/strong>:
This can be done easily with the &lt;code>merge_and_unload()&lt;/code> method from the &lt;code>peft&lt;/code> library.&lt;/p>
&lt;pre>&lt;code class="language-python">from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Assuming model_id is defined as before
base_model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model_with_lora = PeftModel.from_pretrained(base_model, &amp;quot;./lora_finetuned_model&amp;quot;)
# Merge weights
merged_model = model_with_lora.merge_and_unload()
# Now merged_model is a standard Transformers model
# You can save it like any other model
merged_model.save_pretrained(&amp;quot;./merged_lora_model&amp;quot;)
tokenizer.save_pretrained(&amp;quot;./merged_lora_model&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>Afterward, you can load and use this &lt;code>merged_lora_model&lt;/code> just like any regular Hugging Face model.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Advantages&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Zero inference latency&lt;/strong>: After merging, the inference process is identical to a standard model, with no additional computational overhead.&lt;/li>
&lt;li>&lt;strong>Simple deployment&lt;/strong>: No need for any additional inference framework support, can be used directly with standard libraries like &lt;code>transformers&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Loss of flexibility&lt;/strong>: For each LoRA adapter, you need to save and load a complete model copy, defeating the lightweight purpose of LoRA.&lt;/li>
&lt;li>&lt;strong>High storage cost&lt;/strong>: If you have multiple adapters, the storage overhead is enormous.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="option-2-highperformance-dynamic-deployment-with-vllm-recommended">Option 2: High-Performance Dynamic Deployment with vLLM (Recommended)&lt;/h3>
&lt;p>For scenarios requiring simultaneous service of multiple LoRA adapters, &lt;strong>vLLM&lt;/strong> is currently the industry-leading high-performance inference and serving engine. Through core technologies such as &lt;strong>PagedAttention&lt;/strong>, it manages and dynamically loads many LoRA adapters efficiently, delivering extremely high throughput without significantly sacrificing per-request latency.&lt;/p>
&lt;p>&lt;strong>Method&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Install vLLM&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install vllm
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Start vLLM server&lt;/strong>:
Use the &lt;code>vllm serve&lt;/code> command to start an OpenAI-compatible API server. The key is to enable LoRA support with &lt;code>--enable-lora&lt;/code> and optionally preload adapters with &lt;code>--lora-modules&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-bash"># lora_path points to your trained adapter directory
vllm serve meta-llama/Llama-2-7b-hf \
--enable-lora \
--lora-modules my_sql_lora=/path/to/your/sql_lora_adapter
&lt;/code>&lt;/pre>
&lt;p>Here, we've preloaded an adapter named &lt;code>my_sql_lora&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Send inference requests&lt;/strong>:
You can send requests to the vLLM server using &lt;code>curl&lt;/code> or any HTTP client. Just specify the &lt;code>model&lt;/code> in the request body as the name of your loaded LoRA adapter.&lt;/p>
&lt;pre>&lt;code class="language-bash">curl http://localhost:8000/v1/completions \
-H &amp;quot;Content-Type: application/json&amp;quot; \
-d '{
&amp;quot;model&amp;quot;: &amp;quot;my_sql_lora&amp;quot;,
&amp;quot;prompt&amp;quot;: &amp;quot;Write a SQL query for all users.&amp;quot;,
&amp;quot;max_tokens&amp;quot;: 64
}'
&lt;/code>&lt;/pre>
&lt;p>vLLM will automatically route the request to the corresponding LoRA adapter for inference.&lt;/p>
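&lt;p>The same request can be issued from Python. Below is a minimal sketch that builds the request body with a small helper (the helper name is ours, not part of vLLM); actually sending it requires the server above to be running, so that step is shown as a comment:&lt;/p>

```python
import json

def build_lora_completion_request(adapter_name, prompt, max_tokens=64):
    """Build the JSON body for vLLM's OpenAI-compatible /v1/completions endpoint.

    The "model" field names the LoRA adapter that should serve the request.
    """
    return {"model": adapter_name, "prompt": prompt, "max_tokens": max_tokens}

body = build_lora_completion_request("my_sql_lora", "Write a SQL query for all users.")
print(json.dumps(body, indent=2))

# To send it against a running vLLM server:
#   import requests
#   resp = requests.post("http://localhost:8000/v1/completions", json=body)
#   print(resp.json()["choices"][0]["text"])
```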
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Using Python Client&lt;/strong>:
vLLM also provides a Python API for direct calls in code.&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
# Initialize LLM engine with LoRA support
llm = LLM(model=&amp;quot;meta-llama/Llama-2-7b-hf&amp;quot;, enable_lora=True)
sampling_params = SamplingParams(max_tokens=64)
# In the generate call, specify which adapter to use via lora_request
outputs = llm.generate(
&amp;quot;Write a SQL query for all users.&amp;quot;,
sampling_params,
lora_request=LoRARequest(&amp;quot;my_sql_lora&amp;quot;, 1, &amp;quot;/path/to/your/sql_lora_adapter&amp;quot;)
)
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;strong>Advantages&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Extremely high throughput&lt;/strong>: Designed for large-scale concurrent inference.&lt;/li>
&lt;li>&lt;strong>Dynamic flexibility&lt;/strong>: Can simultaneously serve hundreds or thousands of LoRA adapters, loading them on demand, perfect for multi-tenant scenarios.&lt;/li>
&lt;li>&lt;strong>Memory efficient&lt;/strong>: PagedAttention mechanism effectively manages GPU memory, avoiding waste.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Slightly more complex deployment&lt;/strong>: Requires additional learning and configuration of vLLM service.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="option-3-other-dynamic-deployment-options-eg-tgi">Option 3: Other Dynamic Deployment Options (e.g., TGI)&lt;/h3>
&lt;p>Hugging Face's own &lt;strong>Text Generation Inference (TGI)&lt;/strong> is another powerful production-grade inference server. Similar to vLLM, TGI supports loading multiple LoRA adapters at startup and dynamically applying them based on the adapter specified in each request. It integrates most tightly with the Hugging Face ecosystem and is a strong competitor to vLLM.&lt;/p>
&lt;h3 id="deployment-options-comparison-summary">Deployment Options Comparison Summary&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Feature&lt;/th>
&lt;th align="left">Weight Merging (Static)&lt;/th>
&lt;th align="left">vLLM (Dynamic)&lt;/th>
&lt;th align="left">TGI (Dynamic)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">&lt;strong>Performance/Throughput&lt;/strong>&lt;/td>
&lt;td align="left">Highest (lowest single request latency)&lt;/td>
&lt;td align="left">Very High&lt;/td>
&lt;td align="left">High&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Flexibility&lt;/strong>&lt;/td>
&lt;td align="left">Low (no dynamic capability)&lt;/td>
&lt;td align="left">Very High&lt;/td>
&lt;td align="left">High&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Deployment Complexity&lt;/strong>&lt;/td>
&lt;td align="left">Low&lt;/td>
&lt;td align="left">Medium&lt;/td>
&lt;td align="left">Medium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Memory Usage&lt;/strong>&lt;/td>
&lt;td align="left">Very High (N adapters = N times memory)&lt;/td>
&lt;td align="left">Low (efficient sharing)&lt;/td>
&lt;td align="left">Low (efficient sharing)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Suitable Scenarios&lt;/strong>&lt;/td>
&lt;td align="left">Single, fixed tasks&lt;/td>
&lt;td align="left">Multi-tenant, high-concurrency, multi-task scenarios&lt;/td>
&lt;td align="left">Production deployment in Hugging Face ecosystem&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="6-advanced-topics">6. Advanced Topics&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Multi-adapter Management&lt;/strong>: PEFT supports dynamically adding, switching, and disabling multiple adapters on a single model using methods like &lt;code>model.add_adapter()&lt;/code> and &lt;code>model.set_adapter()&lt;/code>, providing great convenience for building flexible multi-task systems.&lt;/li>
&lt;/ul>
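&lt;p>A minimal sketch of this multi-adapter workflow (the adapter paths and names are hypothetical; &lt;code>load_adapter&lt;/code> and &lt;code>set_adapter&lt;/code> are &lt;code>PeftModel&lt;/code> methods):&lt;/p>

```python
def load_multi_adapter_model(base_model_id="meta-llama/Llama-2-7b-hf"):
    """Attach several LoRA adapters to one base model and select one at runtime."""
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(base_model_id)
    # The first adapter is attached and named when the PeftModel is created
    model = PeftModel.from_pretrained(base, "./lora_sql_adapter", adapter_name="sql")
    # Additional adapters share the same frozen base weights
    model.load_adapter("./lora_summary_adapter", adapter_name="summary")
    # Route subsequent forward passes through the chosen adapter
    model.set_adapter("summary")
    return model
```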
&lt;h2 id="7-conclusion">7. Conclusion&lt;/h2>
&lt;p>As a revolutionary parameter-efficient fine-tuning technique, LoRA successfully addresses the high cost of fine-tuning in the era of large models. Its clever low-rank decomposition greatly reduces computational and storage requirements while preserving fine-tuning quality. Combined with advanced inference engines like vLLM, deploying and serving LoRA models has become unprecedentedly efficient and flexible, driving the adoption of large models in more specialized scenarios.&lt;/p>