<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>llama.cpp | Ziyang Lin</title><link>https://ziyanglin.netlify.app/en/tags/llama.cpp/</link><atom:link href="https://ziyanglin.netlify.app/en/tags/llama.cpp/index.xml" rel="self" type="application/rss+xml"/><description>llama.cpp</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Fri, 27 Jun 2025 00:00:00 +0000</lastBuildDate><image><url>https://ziyanglin.netlify.app/img/icon-192.png</url><title>llama.cpp</title><link>https://ziyanglin.netlify.app/en/tags/llama.cpp/</link></image><item><title>Model Quantization Guide: A Comprehensive Analysis from Theory to Practice</title><link>https://ziyanglin.netlify.app/en/post/model-quantization-documentation/</link><pubDate>Fri, 27 Jun 2025 00:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/model-quantization-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>As large language models (LLMs) continue to grow in scale and complexity, their deployment and inference costs have become increasingly expensive. Model quantization, as a key optimization technique, significantly reduces model storage requirements, memory consumption, and computational load by lowering the numerical precision of model weights and activation values, enabling efficient inference on resource-constrained devices such as mobile and edge devices.&lt;/p>
&lt;p>This document aims to provide a clear and comprehensive introduction to the core concepts of deep learning model quantization, mainstream approaches, and specific implementations in two leading inference frameworks—&lt;code>llama.cpp&lt;/code> and &lt;code>vLLM&lt;/code>. We will explore in detail the quantization types they support, underlying principles, usage methods, and future trends in quantization technology.&lt;/p>
&lt;h2 id="2-quantization-fundamentals">2. Quantization Fundamentals&lt;/h2>
&lt;p>Before diving into specific frameworks, we need to understand some basic concepts of quantization.&lt;/p>
&lt;h3 id="21-what-is-model-quantization">2.1 What is Model Quantization?&lt;/h3>
&lt;p>Model quantization refers to the process of converting floating-point numbers in a model (typically 32-bit floating-point, or &lt;code>FP32&lt;/code>) to integers with fewer bits (such as &lt;code>INT8&lt;/code>, &lt;code>INT4&lt;/code>) or lower-precision floating-point numbers (such as &lt;code>FP16&lt;/code>, &lt;code>FP8&lt;/code>). This process is essentially a form of information compression that attempts to significantly reduce model complexity while preserving model accuracy as much as possible.&lt;/p>
&lt;h3 id="22-why-is-quantization-needed">2.2 Why is Quantization Needed?&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Reduced Model Size&lt;/strong>: Lower bit-width numerical representations significantly reduce the size of model files. For example, quantizing an &lt;code>FP32&lt;/code> model to &lt;code>INT8&lt;/code> shrinks the weights to roughly one quarter of their original size (8 bits per value instead of 32).&lt;/li>
&lt;li>&lt;strong>Lower Memory Bandwidth&lt;/strong>: Smaller data types mean less bandwidth is occupied when transferring data between memory and computational units, which is crucial for memory bandwidth-sensitive hardware.&lt;/li>
&lt;li>&lt;strong>Accelerated Computation&lt;/strong>: Many modern processors (CPUs, GPUs, TPUs) support integer operations more efficiently than floating-point operations, providing higher throughput and lower latency.&lt;/li>
&lt;li>&lt;strong>Reduced Power Consumption&lt;/strong>: Integer operations typically consume less energy than floating-point operations.&lt;/li>
&lt;/ul>
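&lt;p>The size reduction can be estimated with simple arithmetic. The sketch below assumes a hypothetical 7-billion-parameter model and ignores metadata and scale-factor overhead, so the figures are approximate.&lt;/p>
&lt;pre>&lt;code class="language-python"># Rough weight-storage estimate: parameters * bits / 8 bytes.
# Ignores metadata and per-block scale overhead (a simplifying assumption).
BITS_PER_WEIGHT = {'FP32': 32, 'FP16': 16, 'INT8': 8, 'INT4': 4}

def weight_size_gib(num_params, bits):
    # Approximate size of the weights in GiB.
    return num_params * bits / 8 / 2**30

num_params = 7_000_000_000  # hypothetical 7B-parameter model
for name, bits in BITS_PER_WEIGHT.items():
    print(f'{name}: {weight_size_gib(num_params, bits):.1f} GiB')
&lt;/code>&lt;/pre>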
&lt;h3 id="23-quantization-principles-mapping-and-dequantization">2.3 Quantization Principles: Mapping and Dequantization&lt;/h3>
&lt;p>The core of quantization is mapping a larger range of floating-point values to a smaller range of fixed-point integer values. This process is defined by the following formula:&lt;/p>
&lt;pre>&lt;code>Q(r) = round(r / S + Z)
&lt;/code>&lt;/pre>
&lt;p>Where:&lt;/p>
&lt;ul>
&lt;li>&lt;code>r&lt;/code> is the original floating-point value.&lt;/li>
&lt;li>&lt;code>Q(r)&lt;/code> is the quantized integer value.&lt;/li>
&lt;li>&lt;code>S&lt;/code> is the &lt;strong>Scale factor&lt;/strong>, representing the floating-point value size corresponding to each quantized integer step.&lt;/li>
&lt;li>&lt;code>Z&lt;/code> is the &lt;strong>Zero-point&lt;/strong>, representing the quantized integer value corresponding to floating-point zero.&lt;/li>
&lt;/ul>
&lt;p>When performing calculations, the quantized values need to be dequantized back to the floating-point domain:&lt;/p>
&lt;pre>&lt;code>r' = S * (Q(r) - Z)
&lt;/code>&lt;/pre>
&lt;p>&lt;code>r'&lt;/code> is the dequantized floating-point number, which has some quantization error compared to the original value &lt;code>r&lt;/code>.&lt;/p>
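&lt;p>The two formulas can be tried out directly. The following is a minimal sketch rather than any framework's implementation: the scale and zero-point are hypothetical values, and the clamp to the signed 8-bit range is an added assumption that the formulas above leave implicit.&lt;/p>
&lt;pre>&lt;code class="language-python"># Affine quantization: Q(r) = round(r / S + Z) and r' = S * (Q(r) - Z).
# Note: Python's round() rounds exact halves to the nearest even integer.

def quantize(r, scale, zero_point, qmin=-128, qmax=127):
    q = round(r / scale + zero_point)
    # Clamp to the representable integer range (assumption: signed 8-bit).
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)

scale, zero_point = 0.05, 0  # hypothetical quantization parameters
for r in [-1.0, 0.0, 0.333, 2.5]:
    q = quantize(r, scale, zero_point)
    r_hat = dequantize(q, scale, zero_point)
    print(f'r={r:+.3f}  q={q:4d}  r_hat={r_hat:+.3f}  error={abs(r - r_hat):.4f}')
&lt;/code>&lt;/pre>
&lt;p>As long as no clamping occurs, the round-trip error stays within half a quantization step, i.e. &lt;code>S / 2&lt;/code>.&lt;/p>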
&lt;h3 id="24-symmetric-vs-asymmetric-quantization">2.4 Symmetric vs. Asymmetric Quantization&lt;/h3>
&lt;p>Based on the choice of zero-point, quantization can be divided into two modes:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Symmetric Quantization&lt;/strong>: Maps the floating-point range &lt;code>[-abs_max, abs_max]&lt;/code> symmetrically to the integer range. In this mode, the zero-point &lt;code>Z&lt;/code> is typically 0 (for signed integers) or &lt;code>2^(bits-1)&lt;/code> (for unsigned integer offset). Computation is relatively simple.&lt;/li>
&lt;li>&lt;strong>Asymmetric Quantization&lt;/strong>: Maps the complete floating-point range &lt;code>[min, max]&lt;/code> to the integer range. In this mode, the zero-point &lt;code>Z&lt;/code> is an integer offset derived from the data distribution so that floating-point zero maps to a representable integer. It represents asymmetrically distributed data more accurately but is slightly more complex to compute.&lt;/li>
&lt;/ul>
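&lt;p>One way to derive &lt;code>S&lt;/code> and &lt;code>Z&lt;/code> in each mode is sketched below. The function names and the sample data are hypothetical, the zero-point is rounded to an integer in line with its definition in section 2.3, and the code assumes the values span a non-zero range.&lt;/p>
&lt;pre>&lt;code class="language-python"># Deriving (S, Z) for 8-bit quantization in both modes; pure-Python sketch.

def symmetric_params(values, bits=8):
    # Signed range [-(2^(bits-1) - 1), 2^(bits-1) - 1]; zero-point fixed at 0.
    qmax = 2 ** (bits - 1) - 1
    abs_max = max(abs(v) for v in values)
    return abs_max / qmax, 0

def asymmetric_params(values, bits=8):
    # Unsigned range [0, 2^bits - 1]; the zero-point maps min(values) to 0.
    qmax = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax
    zero_point = round(-lo / scale)
    return scale, zero_point

activations = [0.0, 0.4, 1.3, 2.9, 6.0]  # hypothetical skewed, non-negative data
print('symmetric :', symmetric_params(activations))
print('asymmetric:', asymmetric_params(activations))
&lt;/code>&lt;/pre>
&lt;p>For data that lies entirely on one side of zero, the asymmetric mode spends the whole integer range on values that actually occur, which yields a smaller step size than the symmetric mode.&lt;/p>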
&lt;h3 id="25-perlayer-vs-pergroupperchannel-quantization">2.5 Per-Layer vs. Per-Group/Per-Channel Quantization&lt;/h3>
&lt;p>The granularity of calculating scale factor &lt;code>S&lt;/code> and zero-point &lt;code>Z&lt;/code> also affects quantization accuracy:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Per-Layer/Per-Tensor&lt;/strong>: The entire weight tensor (or all weights in a layer) shares the same set of &lt;code>S&lt;/code> and &lt;code>Z&lt;/code>. This approach is the simplest, but if the value distribution within the tensor is uneven, it may lead to larger errors.&lt;/li>
&lt;li>&lt;strong>Per-Channel&lt;/strong>: For weights in convolutional layers, each output channel uses independent &lt;code>S&lt;/code> and &lt;code>Z&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Grouped Quantization&lt;/strong>: The weight tensor is divided into several groups, with each group using independent &lt;code>S&lt;/code> and &lt;code>Z&lt;/code>. This is currently a very popular approach in LLM quantization as it achieves a good balance between accuracy and overhead. The group size is a key hyperparameter.&lt;/li>
&lt;/ul>
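&lt;p>The effect of granularity can be illustrated with a toy example. The sketch below applies symmetric 4-bit quantization to a hypothetical weight list containing one outlier, once with a single shared scale and once with a group size of 4; all names and values are illustrative.&lt;/p>
&lt;pre>&lt;code class="language-python"># Compare per-tensor and grouped symmetric 4-bit quantization on toy weights.

def quantize_dequantize(values, bits=4):
    # Symmetric round-trip using one shared scale for all given values.
    qmax = 2 ** (bits - 1) - 1
    abs_max = max(abs(v) for v in values)
    if abs_max == 0:
        return list(values)
    scale = abs_max / qmax
    return [round(v / scale) * scale for v in values]

def mean_abs_error(values, group_size):
    # Quantize each group independently, then average the absolute error.
    total = 0.0
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        restored = quantize_dequantize(group)
        total += sum(abs(a - b) for a, b in zip(group, restored))
    return total / len(values)

# Hypothetical weights: small values plus one outlier at the end.
weights = [0.01, -0.02, 0.03, -0.01, 0.02, -0.03, 0.01, 8.0]
print('per-tensor error:', mean_abs_error(weights, group_size=len(weights)))
print('grouped error   :', mean_abs_error(weights, group_size=4))
&lt;/code>&lt;/pre>
&lt;p>With a single scale, the outlier forces every small weight to round to zero; with groups, only the group that contains the outlier is affected.&lt;/p>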
&lt;h3 id="26-common-quantization-paradigms">2.6 Common Quantization Paradigms&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Post-Training Quantization (PTQ)&lt;/strong>: This is the most commonly used and convenient quantization method. It is performed after the model has been fully trained, without requiring retraining. PTQ typically needs a small calibration dataset to calculate the optimal quantization parameters (&lt;code>S&lt;/code> and &lt;code>Z&lt;/code>) by analyzing the distribution of weights and activation values.&lt;/li>
&lt;li>&lt;strong>Quantization-Aware Training (QAT)&lt;/strong>: This simulates the errors introduced by quantization during the model training process. By inserting pseudo-quantization nodes in the forward pass during training, it allows the model to adapt to the accuracy loss caused by quantization. QAT typically achieves higher accuracy than PTQ but requires a complete training process and dataset, making it more costly.&lt;/li>
&lt;/ul>
&lt;p>Now that we have the basic knowledge of quantization, let's delve into the specific implementations in &lt;code>llama.cpp&lt;/code> and &lt;code>vLLM&lt;/code>.&lt;/p>
&lt;h2 id="3-quantization-schemes-in-llamacpp">3. Quantization Schemes in llama.cpp&lt;/h2>
&lt;p>&lt;code>llama.cpp&lt;/code> is an efficient LLM inference engine written in C/C++, renowned for its excellent cross-platform performance and support for resource-constrained devices. One of its core advantages is its powerful and flexible quantization support, which revolves around its self-developed &lt;code>GGUF&lt;/code> (Georgi Gerganov Universal Format) file format.&lt;/p>
&lt;h3 id="31-gguf-format-and-quantization">3.1 GGUF Format and Quantization&lt;/h3>
&lt;p>GGUF is a binary format specifically designed for LLMs, used to store model metadata, vocabulary, and weights. A key feature is its native support for various quantized weights, allowing different precision tensors to be mixed within the same file. This enables &lt;code>llama.cpp&lt;/code> to directly use quantized weights when loading models, without additional conversion steps.&lt;/p>
&lt;h3 id="32-quantization-type-nomenclature-in-llamacpp">3.2 Quantization Type Nomenclature in &lt;code>llama.cpp&lt;/code>&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> defines a very specific quantization type naming convention, typically in the format &lt;code>Q&amp;lt;bits&amp;gt;_&amp;lt;type&amp;gt;&lt;/code>. Understanding these names is key to mastering &lt;code>llama.cpp&lt;/code> quantization.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>Q&lt;/code>&lt;/strong>: Represents quantization.&lt;/li>
&lt;li>&lt;strong>&lt;code>&amp;lt;bits&amp;gt;&lt;/code>&lt;/strong>: Indicates the nominal number of bits per weight, such as &lt;code>2&lt;/code>, &lt;code>3&lt;/code>, &lt;code>4&lt;/code>, &lt;code>5&lt;/code>, &lt;code>6&lt;/code>, &lt;code>8&lt;/code>. The effective average is slightly higher because scale factors are stored alongside the weights.&lt;/li>
&lt;li>&lt;strong>&lt;code>&amp;lt;type&amp;gt;&lt;/code>&lt;/strong>: Indicates the specific quantization method or variant.&lt;/li>
&lt;/ul>
&lt;p>Below are some of the most common quantization types and their explanations:&lt;/p>
&lt;h4 id="321-basic-quantization-types-legacy">3.2.1 Basic Quantization Types (Legacy)&lt;/h4>
&lt;p>These are earlier quantization methods, most of which have now been replaced by &lt;code>K-Quants&lt;/code>, but are still retained for compatibility.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>Q4_0&lt;/code>, &lt;code>Q4_1&lt;/code>&lt;/strong>: 4-bit quantization. &lt;code>Q4_0&lt;/code> stores a single scale factor per block (symmetric), while &lt;code>Q4_1&lt;/code> additionally stores a per-block minimum (asymmetric), which typically gives higher accuracy at the cost of a slightly larger file.&lt;/li>
&lt;li>&lt;strong>&lt;code>Q5_0&lt;/code>, &lt;code>Q5_1&lt;/code>&lt;/strong>: 5-bit quantization.&lt;/li>
&lt;li>&lt;strong>&lt;code>Q8_0&lt;/code>&lt;/strong>: 8-bit symmetric quantization using block-wise scale factors. This is one of the quantization types closest to the original &lt;code>FP16&lt;/code> precision and often serves as a benchmark for performance and quality.&lt;/li>
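&lt;li>&lt;strong>&lt;code>Q2_K&lt;/code>, &lt;code>Q3_K&lt;/code>, &lt;code>Q4_K&lt;/code>, &lt;code>Q5_K&lt;/code>, &lt;code>Q6_K&lt;/code>&lt;/strong>: These belong to the newer &lt;code>K-Quants&lt;/code> series rather than the legacy types; they are listed here only for contrast and are described in the next subsection.&lt;/li>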
&lt;/ul>
&lt;h4 id="322-kquants-recommended">3.2.2 K-Quants (Recommended)&lt;/h4>
&lt;p>&lt;code>K-Quants&lt;/code> is a more advanced and flexible quantization scheme introduced in &lt;code>llama.cpp&lt;/code>. They achieve better precision preservation at extremely low bit rates through more refined block structures and the concept of super-blocks.&lt;/p>
&lt;ul>
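&lt;li>&lt;strong>Block&lt;/strong>: Weights are divided into small fixed-size blocks (16 or 32 weights, depending on the type), each with its own scale and, for some types, its own minimum.&lt;/li>
&lt;li>&lt;strong>Super-block&lt;/strong>: Multiple blocks form a super-block of 256 weights. The per-block scales and minimums are themselves quantized to a few bits, and the higher-precision factors needed to decode them are stored once at the super-block level.&lt;/li>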
&lt;/ul>
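&lt;p>The storage cost of this two-level structure can be worked out by hand. The sketch below assumes the commonly described &lt;code>Q4_K&lt;/code> layout (a 256-weight super-block split into 8 blocks of 32 weights, 6-bit block scales and minimums, and two FP16 super-block factors). These layout numbers are assumptions stated for illustration and are not read from the llama.cpp source here.&lt;/p>
&lt;pre>&lt;code class="language-python"># Storage arithmetic for one Q4_K-style super-block (assumed layout, see text).
WEIGHTS_PER_SUPER_BLOCK = 256
BLOCKS_PER_SUPER_BLOCK = 8       # 8 blocks of 32 weights each
WEIGHT_BITS = 4                  # each weight is stored as a 4-bit integer
BLOCK_PARAM_BITS = 6             # one 6-bit scale and one 6-bit minimum per block
SUPER_BLOCK_PARAM_BITS = 2 * 16  # two FP16 factors that decode the 6-bit values

total_bits = (
    WEIGHTS_PER_SUPER_BLOCK * WEIGHT_BITS
    + BLOCKS_PER_SUPER_BLOCK * 2 * BLOCK_PARAM_BITS
    + SUPER_BLOCK_PARAM_BITS
)
bits_per_weight = total_bits / WEIGHTS_PER_SUPER_BLOCK
print(f'{total_bits // 8} bytes per super-block, {bits_per_weight} bits per weight')
&lt;/code>&lt;/pre>
&lt;p>Under these assumptions the overhead of the quantization parameters is half a bit per weight, which is why a nominal 4-bit type ends up slightly above 4 bits on average.&lt;/p>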
&lt;p>&lt;code>K-Quants&lt;/code> naming typically includes a suffix like &lt;code>_S&lt;/code>, &lt;code>_M&lt;/code>, &lt;code>_L&lt;/code>, indicating different sizes/complexities:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>S&lt;/code> (Small)&lt;/strong>: The smallest version, typically with the lowest precision.&lt;/li>
&lt;li>&lt;strong>&lt;code>M&lt;/code> (Medium)&lt;/strong>: Medium size, balancing precision and size.&lt;/li>
&lt;li>&lt;strong>&lt;code>L&lt;/code> (Large)&lt;/strong>: The largest version, typically with the highest precision.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Common K-Quants Types:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>Q4_K_M&lt;/code>&lt;/strong>: 4-bit K-Quant, medium size. This is currently one of the most commonly used and recommended 4-bit quantization types, achieving a good balance between size and performance.&lt;/li>
&lt;li>&lt;strong>&lt;code>Q4_K_S&lt;/code>&lt;/strong>: 4-bit K-Quant, small version.&lt;/li>
&lt;li>&lt;strong>&lt;code>Q5_K_M&lt;/code>&lt;/strong>: 5-bit K-Quant, medium size. Provides better precision than 4-bit while being smaller than &lt;code>Q8_0&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;code>Q6_K&lt;/code>&lt;/strong>: 6-bit K-Quant. Provides very high precision, close to &lt;code>Q8_0&lt;/code>, but with a smaller size.&lt;/li>
&lt;li>&lt;strong>&lt;code>IQ2_XXS&lt;/code>, &lt;code>IQ2_XS&lt;/code>, &lt;code>IQ2_S&lt;/code>&lt;/strong>: 2-bit variants from the separate i-quant (&lt;code>IQ&lt;/code>) family rather than K-Quants proper. They target extreme compression using codebook-based quantization, incur larger precision loss than the 4-bit types, and generally rely on an importance matrix (see 3.4) to remain usable.&lt;/li>
&lt;/ul>
&lt;h3 id="33-how-to-use-the-llamaquantize-tool">3.3 How to Use the &lt;code>llama-quantize&lt;/code> Tool&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> provides a command-line tool called &lt;code>llama-quantize&lt;/code> for converting &lt;code>FP32&lt;/code> or &lt;code>FP16&lt;/code> GGUF models to quantized GGUF models.&lt;/p>
&lt;p>&lt;strong>Basic Usage:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-quantize &amp;lt;input-gguf-file&amp;gt; &amp;lt;output-gguf-file&amp;gt; &amp;lt;quantization-type&amp;gt;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Example: Quantizing an FP16 Model to Q4_K_M&lt;/strong>&lt;/p>
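&lt;pre>&lt;code class="language-bash"># First, convert the original Hugging Face model directory to an FP16 GGUF file
# (recent llama.cpp versions ship convert_hf_to_gguf.py; older releases used convert.py)
python3 convert_hf_to_gguf.py models/my-model/ --outtype f16 --outfile models/my-model/ggml-model-f16.gguf
# Then, use llama-quantize for quantization
./llama-quantize ./models/my-model/ggml-model-f16.gguf ./models/my-model/ggml-model-Q4_K_M.gguf Q4_K_M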
&lt;/code>&lt;/pre>
&lt;h3 id="34-importance-matrix">3.4 Importance Matrix&lt;/h3>
&lt;p>To further reduce precision loss from quantization, &lt;code>llama.cpp&lt;/code> introduced the concept of an importance matrix (&lt;code>imatrix&lt;/code>). It is computed by running the model on a calibration dataset and recording activation statistics that indicate how strongly each weight influences the output. During quantization, &lt;code>llama-quantize&lt;/code> uses this matrix to weight the rounding error, so that the more important weights are reproduced more accurately and critical information in the model is protected.&lt;/p>
&lt;p>&lt;strong>Using &lt;code>imatrix&lt;/code> for Quantization:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash"># 1. Generate the importance matrix
./llama-imatrix -m model-f16.gguf -f calibration-data.txt -o imatrix.dat
# 2. Use imatrix for quantization
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-Q4_K_M-imatrix.gguf Q4_K_M
&lt;/code>&lt;/pre>
&lt;h3 id="35-summary">3.5 Summary&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code>'s quantization scheme is centered around the &lt;code>GGUF&lt;/code> format, providing a rich, efficient, and battle-tested set of quantization types. Its &lt;code>K-Quants&lt;/code> series performs exceptionally well in low-bit quantization, and when combined with advanced techniques like importance matrices, it can maximize model performance while significantly compressing the model. For scenarios requiring LLM deployment on CPUs or resource-limited hardware, &lt;code>llama.cpp&lt;/code> is an excellent choice.&lt;/p>
&lt;h2 id="4-vllms-quantization-ecosystem">4. vLLM's Quantization Ecosystem&lt;/h2>
&lt;p>Unlike &lt;code>llama.cpp&lt;/code>'s cohesive, self-contained quantization system, &lt;code>vLLM&lt;/code>, as a service engine focused on high-performance, high-throughput GPU inference, adopts a &amp;ldquo;best of all worlds&amp;rdquo; quantization strategy. &lt;code>vLLM&lt;/code> doesn't invent new quantization formats but instead embraces compatibility, supporting and integrating the most mainstream and cutting-edge quantization schemes and tool libraries from academia and industry.&lt;/p>
&lt;h3 id="41-mainstream-quantization-schemes-supported-by-vllm">4.1 Mainstream Quantization Schemes Supported by vLLM&lt;/h3>
&lt;p>&lt;code>vLLM&lt;/code> supports directly loading models quantized by various popular algorithms and tool libraries:&lt;/p>
&lt;h4 id="411-gptq-generalpurpose-posttraining-quantization">4.1.1 GPTQ (Post-Training Quantization for Generative Pre-trained Transformers)&lt;/h4>
&lt;p>GPTQ is one of the earliest widely applied LLM PTQ algorithms. It quantizes weights column by column and updates weights using Hessian matrix information to minimize quantization error.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: Iteratively quantize each column of weights and update the remaining unquantized weights to compensate for errors introduced by already quantized columns.&lt;/li>
&lt;li>&lt;strong>vLLM Support&lt;/strong>: Can directly load GPTQ quantized models generated by libraries like &lt;code>AutoGPTQ&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Suitable Scenarios&lt;/strong>: Pursuing good 4-bit quantization performance with a large number of pre-quantized models available in the community.&lt;/li>
&lt;/ul>
&lt;h4 id="412-awq-activationaware-weight-quantization">4.1.2 AWQ (Activation-aware Weight Quantization)&lt;/h4>
&lt;p>AWQ observes that not all weights in a model are equally important, with a small portion of &amp;ldquo;significant weights&amp;rdquo; having a huge impact on model performance. Similar uneven distributions also exist in activation values.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: By analyzing the scale of activation values, identify and protect those &amp;ldquo;significant weights&amp;rdquo; that multiply with large activation values, giving them higher precision during quantization. It doesn't quantize activation values but makes weights adapt to the distribution of activation values.&lt;/li>
&lt;li>&lt;strong>vLLM Support&lt;/strong>: Can directly load AWQ quantized models generated by the &lt;code>AutoAWQ&lt;/code> library.&lt;/li>
&lt;li>&lt;strong>Suitable Scenarios&lt;/strong>: Seeking higher model precision than GPTQ at extremely low bits (such as 4-bit), especially when handling complex tasks.&lt;/li>
&lt;/ul>
&lt;h4 id="413-fp8-8bit-floating-point">4.1.3 FP8 (8-bit Floating Point)&lt;/h4>
&lt;p>FP8 is a recent low-precision floating-point format, promoted by hardware manufacturers such as NVIDIA. It has a wider dynamic range than traditional &lt;code>INT8&lt;/code>, making it more suitable for representing the extremely unevenly distributed activation values in LLMs.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: Use 8-bit floating-point numbers (typically in &lt;code>E4M3&lt;/code> or &lt;code>E5M2&lt;/code> format) to represent weights and/or activation values.&lt;/li>
&lt;li>&lt;strong>vLLM Support&lt;/strong>: Through integration with &lt;code>llm-compressor&lt;/code> and AMD's &lt;code>Quark&lt;/code> library, &lt;code>vLLM&lt;/code> provides strong support for FP8, including both dynamic and static quantization.&lt;/li>
&lt;li>&lt;strong>Suitable Scenarios&lt;/strong>: Pursuing ultimate inference speed and throughput on modern GPUs (such as H100) that support FP8 acceleration.&lt;/li>
&lt;/ul>
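&lt;p>The difference in dynamic range can be checked numerically. This sketch computes the largest finite value of each format, assuming the &lt;code>E4M3&lt;/code> variant that reserves only a single NaN pattern (often written E4M3FN) and an IEEE-style &lt;code>E5M2&lt;/code> that reserves the top exponent for infinity and NaN. The function name is hypothetical.&lt;/p>
&lt;pre>&lt;code class="language-python"># Largest finite value of an FP8 format: max_mantissa * 2^(max_exponent - bias).

def max_finite(exp_bits, man_bits, ieee_like):
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # E5M2: the all-ones exponent field is reserved for inf/NaN.
        max_exp_field = 2 ** exp_bits - 2
        max_mantissa = 2 - 2 ** (-man_bits)
    else:
        # E4M3 (FN variant): only all-ones exponent with all-ones mantissa is NaN,
        # so the top exponent is usable but the largest mantissa code is not.
        max_exp_field = 2 ** exp_bits - 1
        max_mantissa = 2 - 2 ** (1 - man_bits)
    return max_mantissa * 2 ** (max_exp_field - bias)

print('INT8 max:', 127)
print('E4M3 max:', max_finite(4, 3, ieee_like=False))
print('E5M2 max:', max_finite(5, 2, ieee_like=True))
&lt;/code>&lt;/pre>
&lt;p>&lt;code>E4M3&lt;/code> trades range for an extra mantissa bit, while &lt;code>E5M2&lt;/code> trades precision for range.&lt;/p>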
&lt;h4 id="414-fp8-kv-cache">4.1.4 FP8 KV Cache&lt;/h4>
&lt;p>This is a quantization technique specifically targeting the KV Cache, a major memory consumer during inference.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: Quantize the Key-Value cache stored in GPU memory from &lt;code>FP16&lt;/code> or &lt;code>BF16&lt;/code> to &lt;code>FP8&lt;/code>, thereby halving this portion of memory usage, allowing the model to support longer context windows or larger batch sizes.&lt;/li>
&lt;li>&lt;strong>vLLM Support&lt;/strong>: &lt;code>vLLM&lt;/code> provides native support, which can be enabled at startup with the parameter &lt;code>--kv-cache-dtype fp8&lt;/code>.&lt;/li>
&lt;/ul>
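&lt;p>The memory saving is easy to estimate. The sketch below uses the published Llama-2-7B shape (32 layers, 32 KV heads, head dimension 128) and a single 4096-token sequence; the function name is hypothetical, and any per-tensor scaling metadata is ignored.&lt;/p>
&lt;pre>&lt;code class="language-python"># KV cache size for one sequence, in bytes.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value):
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Llama-2-7B shape with one 4096-token sequence.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
fp16 = kv_cache_bytes(**shape, bytes_per_value=2)
fp8 = kv_cache_bytes(**shape, bytes_per_value=1)
print(f'FP16: {fp16 / 2**30:.1f} GiB, FP8: {fp8 / 2**30:.1f} GiB')
&lt;/code>&lt;/pre>
&lt;p>The saving scales linearly with both sequence length and batch size, which is why it matters most for long-context or high-concurrency serving.&lt;/p>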
&lt;h4 id="415-bitsandbytes">4.1.5 BitsAndBytes&lt;/h4>
&lt;p>This is a very popular quantization library, known for its ease of use and &amp;ldquo;on-the-fly&amp;rdquo; quantization.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: Dynamically quantize during model loading, without needing pre-prepared quantized model files.&lt;/li>
&lt;li>&lt;strong>vLLM Support&lt;/strong>: &lt;code>vLLM&lt;/code> integrates &lt;code>BitsAndBytes&lt;/code>, allowing users to easily enable 4-bit quantization by setting the &lt;code>quantization=&amp;quot;bitsandbytes&amp;quot;&lt;/code> parameter.&lt;/li>
&lt;li>&lt;strong>Suitable Scenarios&lt;/strong>: Quick experimentation, user-friendly, avoiding complex offline quantization processes.&lt;/li>
&lt;/ul>
&lt;h4 id="416-other-schemes">4.1.6 Other Schemes&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>SqueezeLLM&lt;/strong>: A non-uniform quantization method that places quantization levels according to weight sensitivity rather than on a uniform grid, and keeps a small number of outlier and highly sensitive weights in a sparse, higher-precision format.&lt;/li>
&lt;li>&lt;strong>TorchAO&lt;/strong>: PyTorch's official quantization tool library, which &lt;code>vLLM&lt;/code> is beginning to support.&lt;/li>
&lt;li>&lt;strong>BitBLAS&lt;/strong>: A low-level computation library aimed at accelerating low-bit (such as 1-bit, 2-bit, 4-bit) matrix operations through optimized kernel functions.&lt;/li>
&lt;/ul>
&lt;h3 id="42-how-to-use-quantized-models-in-vllm">4.2 How to Use Quantized Models in vLLM&lt;/h3>
&lt;p>Using quantization in &lt;code>vLLM&lt;/code> is simple. For pre-quantized checkpoints, &lt;code>vLLM&lt;/code> reads the quantization method from the model's configuration file (&lt;code>config.json&lt;/code>), so the &lt;code>quantization&lt;/code> parameter of the &lt;code>LLM&lt;/code> constructor is usually optional. Passing it explicitly makes the intent clear, and load-time schemes such as BitsAndBytes are enabled through it.&lt;/p>
&lt;p>&lt;strong>Example: Loading an AWQ Quantized Model&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM
# vLLM will automatically recognize awq quantization from &amp;quot;TheBloke/My-Model-AWQ&amp;quot;'s config.json
llm = LLM(model=&amp;quot;TheBloke/My-Model-AWQ&amp;quot;, quantization=&amp;quot;awq&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Example: Enabling FP8 KV Cache&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM
llm = LLM(model=&amp;quot;meta-llama/Llama-2-7b-chat-hf&amp;quot;,
kv_cache_dtype=&amp;quot;fp8&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h2 id="5-llamacpp-vs-vllm-comparison-and-summary">5. llama.cpp vs. vLLM: Comparison and Summary&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Feature&lt;/th>
&lt;th align="left">llama.cpp&lt;/th>
&lt;th align="left">vLLM&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">&lt;strong>Target Platform&lt;/strong>&lt;/td>
&lt;td align="left">CPU, Cross-platform, Edge devices&lt;/td>
&lt;td align="left">High-performance GPU servers&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Core Philosophy&lt;/strong>&lt;/td>
&lt;td align="left">Cohesive, self-contained, extreme optimization&lt;/td>
&lt;td align="left">Open, integrated, high throughput&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>File Format&lt;/strong>&lt;/td>
&lt;td align="left">GGUF (custom format)&lt;/td>
&lt;td align="left">Standard Hugging Face format&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Quantization Schemes&lt;/strong>&lt;/td>
&lt;td align="left">Built-in &lt;code>K-Quants&lt;/code>, &lt;code>IQ&lt;/code>, etc.&lt;/td>
&lt;td align="left">Integrates GPTQ, AWQ, FP8, BnB, etc.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Ease of Use&lt;/strong>&lt;/td>
&lt;td align="left">Requires &lt;code>llama-quantize&lt;/code> conversion&lt;/td>
&lt;td align="left">Direct loading, automatic detection&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Ecosystem&lt;/strong>&lt;/td>
&lt;td align="left">Self-contained ecosystem&lt;/td>
&lt;td align="left">Embraces the entire Python AI ecosystem&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Latest Technology&lt;/strong>&lt;/td>
&lt;td align="left">Quickly follows up and implements own versions&lt;/td>
&lt;td align="left">Quickly integrates latest open-source libraries&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="6-latest-quantization-trends-and-outlook">6. Latest Quantization Trends and Outlook&lt;/h2>
&lt;p>The field of model quantization is still rapidly evolving. Here are some trends worth noting:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>1-bit/Binary Neural Networks (BNNs)&lt;/strong>: Ultimate model compression, restricting weights to +1 or -1. Although currently suffering significant precision loss in LLMs, its potential is enormous, with related research emerging constantly.&lt;/li>
&lt;li>&lt;strong>Non-uniform Quantization&lt;/strong>: As in SqueezeLLM, quantization levels are placed according to the data distribution instead of on an evenly spaced grid, which can represent the same weights more accurately than uniform quantization at an equal bit width.&lt;/li>
&lt;li>&lt;strong>Hardware-Algorithm Co-design&lt;/strong>: New hardware (such as FP8, FP4, INT4 support) is driving the development of new quantization algorithms, while new algorithms are guiding future hardware design.&lt;/li>
&lt;li>&lt;strong>Combining Quantization with Sparsification&lt;/strong>: Combining quantization with sparsification techniques like pruning holds promise for achieving higher rates of model compression.&lt;/li>
&lt;/ul>
&lt;h2 id="7-conclusion">7. Conclusion&lt;/h2>
&lt;p>Model quantization is a key technology for addressing the challenges of the large model era. &lt;code>llama.cpp&lt;/code> and &lt;code>vLLM&lt;/code> represent two different quantization philosophies: &lt;code>llama.cpp&lt;/code> delivers strong local inference performance on resource-constrained devices through its GGUF format and built-in K-Quants, while &lt;code>vLLM&lt;/code> has become a leading choice for GPU cloud inference services through its open ecosystem and integration of various cutting-edge quantization schemes.&lt;/p>
&lt;p>Understanding the quantization implementations of these two frameworks not only helps us choose the right tool for specific scenarios but also gives us insight into the development trajectory and future directions of the entire LLM inference optimization field.&lt;/p></description></item><item><title>Llama.cpp Technical Guide: Lightweight LLM Inference Engine</title><link>https://ziyanglin.netlify.app/en/post/llama-cpp-documentation/</link><pubDate>Thu, 26 Jun 2025 01:06:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/llama-cpp-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>Llama.cpp is a high-performance, lightweight inference framework for large language models (LLMs) written in C/C++. It focuses on efficiently running LLMs on consumer-grade hardware, making local inference possible on ordinary laptops and even smartphones.&lt;/p>
&lt;p>&lt;strong>Core Advantages:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High Performance:&lt;/strong> Achieves extremely fast inference speeds through optimized C/C++ code, quantization techniques, and hardware acceleration support (such as Apple Metal, CUDA, OpenCL, SYCL).&lt;/li>
&lt;li>&lt;strong>Lightweight:&lt;/strong> Extremely low memory and computational resource consumption, eliminating the need for expensive GPUs.&lt;/li>
&lt;li>&lt;strong>Cross-Platform:&lt;/strong> Supports multiple platforms including macOS, Linux, Windows, Docker, Android, and iOS.&lt;/li>
&lt;li>&lt;strong>Open Ecosystem:&lt;/strong> Features an active community and rich ecosystem, including Python bindings, UI tools, and OpenAI-compatible servers.&lt;/li>
&lt;li>&lt;strong>Continuous Innovation:&lt;/strong> Quickly follows and implements the latest model architectures and inference optimization techniques.&lt;/li>
&lt;/ul>
&lt;h2 id="2-core-concepts">2. Core Concepts&lt;/h2>
&lt;h3 id="21-gguf-model-format">2.1. GGUF Model Format&lt;/h3>
&lt;p>GGUF (Georgi Gerganov Universal Format) is the core model file format used by &lt;code>llama.cpp&lt;/code>, an evolution of its predecessor GGML. GGUF is a binary format designed for fast loading and memory mapping.&lt;/p>
&lt;p>&lt;strong>Key Features:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Unified File:&lt;/strong> Packages model metadata, vocabulary, and all tensors (weights) in a single file.&lt;/li>
&lt;li>&lt;strong>Extensibility:&lt;/strong> Allows adding new metadata without breaking compatibility.&lt;/li>
&lt;li>&lt;strong>Backward Compatibility:&lt;/strong> Guarantees compatibility with older versions of GGUF models.&lt;/li>
&lt;li>&lt;strong>Memory Efficiency:&lt;/strong> Supports memory mapping (mmap), allowing multiple processes to share the same model weights, thereby saving memory.&lt;/li>
&lt;/ul>
&lt;h3 id="22-quantization">2.2. Quantization&lt;/h3>
&lt;p>Quantization is one of the core advantages of &lt;code>llama.cpp&lt;/code>. It is a technique that converts model weights from high-precision floating-point numbers (such as 32-bit or 16-bit) to low-precision integers (such as 4-bit, 5-bit, or 8-bit).&lt;/p>
&lt;p>&lt;strong>Main Benefits:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Reduced Model Size:&lt;/strong> Significantly reduces the size of model files, making them easier to distribute and store.&lt;/li>
&lt;li>&lt;strong>Lower Memory Usage:&lt;/strong> Reduces the RAM required to load the model into memory.&lt;/li>
&lt;li>&lt;strong>Faster Inference:&lt;/strong> Low-precision calculations are typically faster than high-precision ones, especially on CPUs.&lt;/li>
&lt;/ul>
&lt;p>&lt;code>llama.cpp&lt;/code> supports various quantization methods, particularly &lt;strong>k-quants&lt;/strong>, an advanced quantization technique that achieves extremely high compression rates while maintaining high model performance.&lt;/p>
&lt;h3 id="23-multimodal-support">2.3. Multimodal Support&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> is not limited to text models; it has evolved into a powerful multimodal inference engine that supports processing text, images, and even audio simultaneously.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Supported Models:&lt;/strong> Supports various mainstream multimodal models such as LLaVA, MobileVLM, Granite, Qwen2.5 Omni, InternVL, SmolVLM, etc.&lt;/li>
&lt;li>&lt;strong>Working Principle:&lt;/strong> Typically converts images into embedding vectors through a vision encoder (such as CLIP), and then inputs these vectors along with text embedding vectors into the LLM.&lt;/li>
&lt;li>&lt;strong>Tools:&lt;/strong> &lt;code>llama-mtmd-cli&lt;/code> and &lt;code>llama-server&lt;/code> provide native support for multimodal models.&lt;/li>
&lt;/ul>
&lt;h2 id="3-usage-methods">3. Usage Methods&lt;/h2>
&lt;h3 id="31-compilation">3.1. Compilation&lt;/h3>
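&lt;p>Current versions of &lt;code>llama.cpp&lt;/code> are built from source with CMake; the older &lt;code>make&lt;/code>-based build has been deprecated. The compiled tools are placed in &lt;code>build/bin/&lt;/code>, so either run them from there or adjust the paths in the commands that follow.&lt;/p>
&lt;pre>&lt;code class="language-bash">git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release
&lt;/code>&lt;/pre>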
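&lt;p>For specific hardware acceleration (such as CUDA or Metal), pass the corresponding CMake option when configuring the build:&lt;/p>
&lt;pre>&lt;code class="language-bash"># For CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
# For Metal (on macOS, where it is enabled by default)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release
&lt;/code>&lt;/pre>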
&lt;h3 id="32-basic-inference">3.2. Basic Inference&lt;/h3>
&lt;p>After compilation, you can use the &lt;code>llama-cli&lt;/code> tool for inference.&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m ./models/7B/ggml-model-q4_0.gguf -p &amp;quot;Building a website can be done in 10 simple steps:&amp;quot; -n 400
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;code>-m&lt;/code>: Specifies the path to the GGUF model file.&lt;/li>
&lt;li>&lt;code>-p&lt;/code>: Specifies the prompt.&lt;/li>
&lt;li>&lt;code>-n&lt;/code>: Specifies the maximum number of tokens to generate.&lt;/li>
&lt;/ul>
&lt;h3 id="33-openai-compatible-server">3.3. OpenAI Compatible Server&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> includes an HTTP server, &lt;code>llama-server&lt;/code>, that exposes an OpenAI-compatible API. This makes it easy to integrate with existing tools like LangChain and LlamaIndex.&lt;/p>
&lt;p>Starting the server:&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-server -m models/7B/ggml-model-q4_0.gguf -c 4096
&lt;/code>&lt;/pre>
&lt;p>You can then send requests to &lt;code>http://localhost:8080/v1/chat/completions&lt;/code> just like you would with the OpenAI API.&lt;/p>
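&lt;p>As a minimal sketch of such a request using only the Python standard library: the helper names &lt;code>build_chat_request&lt;/code> and &lt;code>send&lt;/code> are made up for this example, and the URL assumes the server is running locally on port 8080 as above. The request is built and printed but not sent, so the snippet runs without a server.&lt;/p>
&lt;pre>&lt;code class="language-python">import json
import urllib.request

# Assumes llama-server is listening on localhost:8080.
SERVER_URL = 'http://localhost:8080/v1/chat/completions'


def build_chat_request(user_message, max_tokens=128):
    # Build an OpenAI-style chat completion request without sending it.
    payload = {
        'messages': [{'role': 'user', 'content': user_message}],
        'max_tokens': max_tokens,
    }
    return urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


def send(request):
    # Send the request and return the assistant's reply text.
    # Requires a running llama-server; not called in this sketch.
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body['choices'][0]['message']['content']


request = build_chat_request('Say hello in one sentence.')
print(request.full_url)
print(request.data.decode('utf-8'))
&lt;/code>&lt;/pre>
&lt;p>With the server running, &lt;code>send(request)&lt;/code> should return the generated reply. Existing OpenAI client libraries can usually be pointed at the same base URL instead.&lt;/p>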
&lt;h2 id="4-advanced-features">4. Advanced Features&lt;/h2>
&lt;h3 id="41-speculative-decoding">4.1. Speculative Decoding&lt;/h3>
&lt;p>Speculative decoding is an inference optimization that can substantially speed up generation by using a small &amp;ldquo;draft&amp;rdquo; model to predict the output of the main model.&lt;/p>
&lt;ul>
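&lt;li>&lt;strong>Working Principle:&lt;/strong> The draft model quickly generates a short sequence of draft tokens, which the main model then validates in a single pass. Tokens the main model agrees with are accepted in bulk; at the first disagreement the main model's own token is used and drafting resumes from there. Every accepted token saves a separate pass of the main model.&lt;/li>
&lt;li>&lt;strong>Usage:&lt;/strong> Use the &lt;code>--model-draft&lt;/code> (&lt;code>-md&lt;/code>) option, available in &lt;code>llama-server&lt;/code> and the &lt;code>llama-speculative&lt;/code> example program, to specify a small, fast draft model.&lt;/li>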
&lt;/ul>
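&lt;p>The draft-and-verify loop can be illustrated with a toy example. This is a simplified greedy variant under stated assumptions, not &lt;code>llama.cpp&lt;/code>'s implementation: both &amp;ldquo;models&amp;rdquo; are hypothetical Python functions that return the next character, and the main model's verification is written as a loop where a real engine would use one batched forward pass.&lt;/p>
&lt;pre>&lt;code class="language-python"># Toy greedy speculative decoding. Both models are plain functions that
# map a token sequence to the next token; they stand in for real LLMs.

TARGET_TEXT = list('the quick brown fox')


def target_next(tokens):
    # The main model: always continues TARGET_TEXT correctly.
    return TARGET_TEXT[len(tokens)]


def draft_next(tokens):
    # The draft model: cheap, but wrong whenever the true token is 'o'.
    token = TARGET_TEXT[len(tokens)]
    return '?' if token == 'o' else token


def speculative_decode(total_tokens, draft_len=4):
    tokens = []
    target_passes = 0
    while len(tokens) != total_tokens:
        # 1. The draft model proposes up to draft_len tokens.
        budget = min(draft_len, total_tokens - len(tokens))
        draft = []
        for _ in range(budget):
            draft.append(draft_next(tokens + draft))
        # 2. The main model checks every draft position. A real engine
        #    does this in a single batched forward pass.
        target_passes += 1
        for i in range(budget):
            expected = target_next(tokens + draft[:i])
            if draft[i] != expected:
                # 3. First mismatch: keep the main model's token and redraft.
                tokens.extend(draft[:i] + [expected])
                break
        else:
            # Every draft token was accepted.
            tokens.extend(draft)
    return ''.join(tokens), target_passes


text, passes = speculative_decode(len(TARGET_TEXT))
print(text, passes)
&lt;/code>&lt;/pre>
&lt;p>The output is identical to what the main model would produce on its own, but it needs fewer main-model passes than one per token. The better the draft model matches the main model, the larger the saving.&lt;/p>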
&lt;h3 id="42-lora-support">4.2. LoRA Support&lt;/h3>
&lt;p>LoRA (Low-Rank Adaptation) allows fine-tuning a model's behavior by training a small adapter without modifying the original model weights. &lt;code>llama.cpp&lt;/code> supports loading one or more LoRA adapters during inference.&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m base-model.gguf --lora lora-adapter.gguf
&lt;/code>&lt;/pre>
&lt;p>You can even set different weights for different LoRA adapters:&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m base.gguf --lora-scaled lora_A.gguf 0.5 --lora-scaled lora_B.gguf 0.5
&lt;/code>&lt;/pre>
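&lt;p>Mathematically, each adapter contributes a low-rank update to a weight matrix, multiplied by its scale. The sketch below illustrates that arithmetic with plain nested lists. The helper names and the tiny matrices are made up for illustration, LoRA's own alpha/rank factor is omitted, and &lt;code>llama.cpp&lt;/code> may apply the update at runtime rather than building a merged matrix as done here.&lt;/p>
&lt;pre>&lt;code class="language-python"># Toy sketch of scaled LoRA adapters: W_eff = W + sum(scale * B @ A).
# Matrices are lists of rows; shapes and values are made up.

def matmul(left, right):
    # Multiply two matrices stored as lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*right)]
            for row in left]


def apply_adapters(base, adapters):
    # Add each adapter's low-rank update, multiplied by its scale.
    result = [row[:] for row in base]  # copy so the base weights stay untouched
    for lora_b, lora_a, scale in adapters:
        delta = matmul(lora_b, lora_a)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                result[i][j] += scale * value
    return result


base = [[1.0, 0.0], [0.0, 1.0]]
# Rank-1 adapters: B is 2x1 and A is 1x2, so B @ A is 2x2.
adapter_a = ([[1.0], [0.0]], [[2.0, 0.0]], 0.5)
adapter_b = ([[0.0], [1.0]], [[0.0, 4.0]], 0.5)

effective = apply_adapters(base, [adapter_a, adapter_b])
print(effective)
&lt;/code>&lt;/pre>
&lt;p>Because the base matrix is never modified, the same base model can be combined with different adapters or scales without reloading it.&lt;/p>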
&lt;h3 id="43-grammars">4.3. Grammars&lt;/h3>
&lt;p>Grammars let you constrain the model's output to a specific format, such as JSON that conforms to a strict schema.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Format:&lt;/strong> Uses a format called GBNF (GGML BNF) to define grammar rules.&lt;/li>
&lt;li>&lt;strong>Application:&lt;/strong> By providing GBNF rules through the &lt;code>grammar&lt;/code> parameter in API requests, you can ensure that the model returns correctly formatted, directly parsable JSON data, avoiding output format errors and tedious post-processing.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example:&lt;/strong> Using a Pydantic model to generate a JSON Schema. The schema can then be converted to GBNF (for example with the &lt;code>json_schema_to_grammar.py&lt;/code> script in the &lt;code>llama.cpp&lt;/code> repository) so that the model output conforms to the expected Python object structure.&lt;/p>
&lt;pre>&lt;code class="language-python">import json
from typing import List

from pydantic import BaseModel


class QAPair(BaseModel):
    question: str
    answer: str


class Summary(BaseModel):
    key_facts: List[str]
    qa_pairs: List[QAPair]


# Generate the JSON Schema (Pydantic v2 API) and print it
schema = Summary.model_json_schema()
print(json.dumps(schema, indent=2))
&lt;/code>&lt;/pre>
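&lt;p>A schema like this can also be sent to the server directly. The sketch below builds a request body under the assumption that the server's &lt;code>/completion&lt;/code> endpoint accepts a &lt;code>json_schema&lt;/code> field and converts it to a grammar internally; check the server documentation for your version. The schema is written by hand to mirror the &lt;code>Summary&lt;/code> model so the snippet has no third-party dependencies, and &lt;code>build_completion_payload&lt;/code> is a hypothetical helper name.&lt;/p>
&lt;pre>&lt;code class="language-python">import json

# Hand-written JSON Schema mirroring the Summary model.
SUMMARY_SCHEMA = {
    'type': 'object',
    'properties': {
        'key_facts': {'type': 'array', 'items': {'type': 'string'}},
        'qa_pairs': {
            'type': 'array',
            'items': {
                'type': 'object',
                'properties': {
                    'question': {'type': 'string'},
                    'answer': {'type': 'string'},
                },
                'required': ['question', 'answer'],
            },
        },
    },
    'required': ['key_facts', 'qa_pairs'],
}


def build_completion_payload(prompt, schema, max_tokens=256):
    # Body for a POST to the /completion endpoint (assumed, see lead-in).
    return {'prompt': prompt, 'n_predict': max_tokens, 'json_schema': schema}


payload = build_completion_payload('Summarize the article as JSON.', SUMMARY_SCHEMA)
print(json.dumps(payload, indent=2))
&lt;/code>&lt;/pre>
&lt;p>If the server honors the schema, the generated text can be passed straight to &lt;code>json.loads&lt;/code> and validated with the Pydantic model.&lt;/p>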
&lt;h2 id="5-ecosystem">5. Ecosystem&lt;/h2>
&lt;p>The success of &lt;code>llama.cpp&lt;/code> has spawned a vibrant ecosystem:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://github.com/abetlen/llama-cpp-python">llama-cpp-python&lt;/a>:&lt;/strong> The most popular Python binding, providing interfaces to almost all features of &lt;code>llama.cpp&lt;/code> and deeply integrated with frameworks like LangChain and LlamaIndex.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://ollama.com/">Ollama&lt;/a>:&lt;/strong> A tool for packaging, distributing, and running models, using &lt;code>llama.cpp&lt;/code> under the hood, greatly simplifying the process of running LLMs locally.&lt;/li>
&lt;li>&lt;strong>Numerous UI Tools:&lt;/strong> The community has developed a large number of graphical interface tools, allowing non-technical users to easily interact with local models.&lt;/li>
&lt;/ul>
&lt;h2 id="6-conclusion">6. Conclusion&lt;/h2>
&lt;p>&lt;code>llama.cpp&lt;/code> is more than an inference engine; it has become a key force in making LLMs practical to run locally and widely accessible. With strong performance, efficient resource usage, and a steadily expanding feature set (such as multimodality and grammar constraints), it gives developers and researchers a flexible platform for exploring and deploying AI applications on a wide range of devices, supporting low-cost, privacy-preserving local AI.&lt;/p></description></item></channel></rss>