llama.cpp | Ziyang Lin

Model Quantization Guide: A Comprehensive Analysis from Theory to Practice

Fri, 27 Jun 2025 00:00:00 +0000

1. Introduction

As large language models (LLMs) continue to grow in scale and complexity, their deployment and inference costs have become increasingly expensive. Model quantization, as a key optimization technique, significantly reduces model storage requirements, memory consumption, and computational load by lowering the numerical precision of model weights and activation values, enabling efficient inference on resource-constrained devices such as mobile and edge devices.

This document aims to provide a clear and comprehensive introduction to the core concepts of deep learning model quantization, mainstream approaches, and specific implementations in two leading inference frameworks—llama.cpp and vLLM. We will explore in detail the quantization types they support, underlying principles, usage methods, and future trends in quantization technology.

2. Quantization Fundamentals

Before diving into specific frameworks, we need to understand some basic concepts of quantization.

2.1 What is Model Quantization?

Model quantization refers to the process of converting floating-point numbers in a model (typically 32-bit floating-point, or FP32) to integers with fewer bits (such as INT8, INT4) or lower-precision floating-point numbers (such as FP16, FP8). This process is essentially a form of information compression that attempts to significantly reduce model complexity while preserving model accuracy as much as possible.

2.2 Why is Quantization Needed?

Reduced Model Size: Lower bit-width numerical representations can significantly reduce the size of model files. For example, quantizing an FP32 model to INT8 can reduce the model size by approximately 4 times.
Lower Memory Bandwidth: Smaller data types mean less bandwidth is occupied when transferring data between memory and computational units, which is crucial for memory bandwidth-sensitive hardware.
Accelerated Computation: Many modern processors (CPUs, GPUs, TPUs) support integer operations more efficiently than floating-point operations, providing higher throughput and lower latency.
Reduced Power Consumption: Integer operations typically consume less energy than floating-point operations.

2.3 Quantization Principles: Mapping and Dequantization

The core of quantization is mapping a larger range of floating-point values to a smaller range of fixed-point integer values. This process is defined by the following formula:

Q(r) = round(r / S + Z)

Where:

r is the original floating-point value.
Q(r) is the quantized integer value.
S is the Scale factor, representing the floating-point value size corresponding to each quantized integer step.
Z is the Zero-point, representing the quantized integer value corresponding to floating-point zero.

When performing calculations, the quantized values need to be dequantized back to the floating-point domain:

r' = S * (Q(r) - Z)

r' is the dequantized floating-point number, which has some quantization error compared to the original value r.

2.4 Symmetric vs. Asymmetric Quantization

Based on the choice of zero-point, quantization can be divided into two modes:

Symmetric Quantization: Maps the floating-point range [-abs_max, abs_max] symmetrically to the integer range. In this mode, the zero-point Z is typically 0 (for signed integers) or 2^(bits-1) (for unsigned integer offset). Computation is relatively simple.
Asymmetric Quantization: Maps the complete floating-point range [min, max] to the integer range. In this mode, the zero-point Z is a floating-point number that can be adjusted according to data distribution. It can more accurately represent asymmetrically distributed data but is slightly more complex in computation.

2.5 Per-Layer vs. Per-Group/Per-Channel Quantization

The granularity of calculating scale factor S and zero-point Z also affects quantization accuracy:

Per-Layer/Per-Tensor: The entire weight tensor (or all weights in a layer) shares the same set of S and Z. This approach is the simplest, but if the value distribution within the tensor is uneven, it may lead to larger errors.
Per-Channel: For weights in convolutional layers, each output channel uses independent S and Z.
Grouped Quantization: The weight tensor is divided into several groups, with each group using independent S and Z. This is currently a very popular approach in LLM quantization as it achieves a good balance between accuracy and overhead. The group size is a key hyperparameter.

2.6 Common Quantization Paradigms

Post-Training Quantization (PTQ): This is the most commonly used and convenient quantization method. It is performed after the model has been fully trained, without requiring retraining. PTQ typically needs a small calibration dataset to calculate the optimal quantization parameters (S and Z) by analyzing the distribution of weights and activation values.
Quantization-Aware Training (QAT): This simulates the errors introduced by quantization during the model training process. By inserting pseudo-quantization nodes in the forward pass during training, it allows the model to adapt to the accuracy loss caused by quantization. QAT typically achieves higher accuracy than PTQ but requires a complete training process and dataset, making it more costly.

Now that we have the basic knowledge of quantization, let's delve into the specific implementations in llama.cpp and vLLM.

3. Quantization Schemes in llama.cpp

llama.cpp is an efficient LLM inference engine written in C/C++, renowned for its excellent cross-platform performance and support for resource-constrained devices. One of its core advantages is its powerful and flexible quantization support, which revolves around its self-developed GGUF (Georgi Gerganov Universal Format) file format.

3.1 GGUF Format and Quantization

GGUF is a binary format specifically designed for LLMs, used to store model metadata, vocabulary, and weights. A key feature is its native support for various quantized weights, allowing different precision tensors to be mixed within the same file. This enables llama.cpp to directly use quantized weights when loading models, without additional conversion steps.

3.2 Quantization Type Nomenclature in `llama.cpp`

llama.cpp defines a very specific quantization type naming convention, typically in the format Q<bits>_<type>. Understanding these names is key to mastering llama.cpp quantization.

Q: Represents quantization.
<bits>: Indicates the average number of bits per weight, such as 2, 3, 4, 5, 6, 8.
<type>: Indicates the specific quantization method or variant.

Below are some of the most common quantization types and their explanations:

3.2.1 Basic Quantization Types (Legacy)

These are earlier quantization methods, most of which have now been replaced by K-Quants, but are still retained for compatibility.

Q4_0, Q4_1: 4-bit quantization. Q4_1 uses higher precision scale factors than Q4_0, thus typically achieving higher accuracy.
Q5_0, Q5_1: 5-bit quantization.
Q8_0: 8-bit symmetric quantization using block-wise scale factors. This is one of the quantization types closest to the original FP16 precision and often serves as a benchmark for performance and quality.
Q2_K, Q3_K, Q4_K, Q5_K, Q6_K: These are the K-Quants series.

3.2.2 K-Quants (Recommended)

K-Quants is a more advanced and flexible quantization scheme introduced in llama.cpp. They achieve better precision preservation at extremely low bit rates through more refined block structures and the concept of super-blocks.

Block: Weights are divided into fixed-size blocks (typically 256 weights).
Super-block: Multiple blocks form a super-block. More detailed quantization parameters (such as min/max scale factors) are stored at the super-block level.

K-Quants naming typically includes a suffix like _S, _M, _L, indicating different sizes/complexities:

S (Small): The smallest version, typically with the lowest precision.
M (Medium): Medium size, balancing precision and size.
L (Large): The largest version, typically with the highest precision.

Common K-Quants Types:

Q4_K_M: 4-bit K-Quant, medium size. This is currently one of the most commonly used and recommended 4-bit quantization types, achieving a good balance between size and performance.
Q4_K_S: 4-bit K-Quant, small version.
Q5_K_M: 5-bit K-Quant, medium size. Provides better precision than 4-bit while being smaller than Q8_0.
Q6_K: 6-bit K-Quant. Provides very high precision, close to Q8_0, but with a smaller size.
IQ2_XS, IQ2_S, IQ2_XXS: 2-bit quantization variants, where IQ stands for “Inaccurate Quantization,” aimed at extreme model compression but with larger precision loss.

3.3 How to Use the `llama-quantize` Tool

llama.cpp provides a command-line tool called llama-quantize for converting FP32 or FP16 GGUF models to quantized GGUF models.

Basic Usage:

./llama-quantize <input-gguf-file> <output-gguf-file> <quantization-type>

Example: Quantizing an FP16 Model to Q4_K_M

# First, convert the original model (e.g., PyTorch format) to FP16 GGUF
python3 convert.py models/my-model/
# Then, use llama-quantize for quantization
./llama-quantize ./models/my-model/ggml-model-f16.gguf ./models/my-model/ggml-model-Q4_K_M.gguf Q4_K_M

3.4 Importance Matrix

To further reduce precision loss from quantization, llama.cpp introduced the concept of an importance matrix (imatrix). This matrix calculates the importance of each weight by running the model on a calibration dataset. During quantization, llama-quantize references this matrix to apply smaller quantization errors to more important weights, thereby protecting critical information in the model.

Using imatrix for Quantization:

# 1. Generate the importance matrix
./llama-imatrix -m model-f16.gguf -f calibration-data.txt -o imatrix.dat
# 2. Use imatrix for quantization
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-Q4_K_M-imatrix.gguf Q4_K_M

3.5 Summary

llama.cpp's quantization scheme is centered around the GGUF format, providing a rich, efficient, and battle-tested set of quantization types. Its K-Quants series performs exceptionally well in low-bit quantization, and when combined with advanced techniques like importance matrices, it can maximize model performance while significantly compressing the model. For scenarios requiring LLM deployment on CPUs or resource-limited hardware, llama.cpp is an excellent choice.

4. vLLM's Quantization Ecosystem

Unlike llama.cpp's cohesive, self-contained quantization system, vLLM, as a service engine focused on high-performance, high-throughput GPU inference, adopts a “best of all worlds” quantization strategy. vLLM doesn't invent new quantization formats but instead embraces compatibility, supporting and integrating the most mainstream and cutting-edge quantization schemes and tool libraries from academia and industry.

4.1 Mainstream Quantization Schemes Supported by vLLM

vLLM supports directly loading models quantized by various popular algorithms and tool libraries:

4.1.1 GPTQ (General-purpose Post-Training Quantization)

GPTQ is one of the earliest widely applied LLM PTQ algorithms. It quantizes weights column by column and updates weights using Hessian matrix information to minimize quantization error.

Core Idea: Iteratively quantize each column of weights and update the remaining unquantized weights to compensate for errors introduced by already quantized columns.
vLLM Support: Can directly load GPTQ quantized models generated by libraries like AutoGPTQ.
Suitable Scenarios: Pursuing good 4-bit quantization performance with a large number of pre-quantized models available in the community.

4.1.2 AWQ (Activation-aware Weight Quantization)

AWQ observes that not all weights in a model are equally important, with a small portion of “significant weights” having a huge impact on model performance. Similar uneven distributions also exist in activation values.

Core Idea: By analyzing the scale of activation values, identify and protect those “significant weights” that multiply with large activation values, giving them higher precision during quantization. It doesn't quantize activation values but makes weights adapt to the distribution of activation values.
vLLM Support: Can directly load AWQ quantized models generated by the AutoAWQ library.
Suitable Scenarios: Seeking higher model precision than GPTQ at extremely low bits (such as 4-bit), especially when handling complex tasks.

4.1.3 FP8 (8-bit Floating Point)

FP8 is the latest low-precision floating-point format, pushed by hardware manufacturers like NVIDIA. It has a wider dynamic range than traditional INT8, making it more suitable for representing extremely unevenly distributed activation values in LLMs.

Core Idea: Use 8-bit floating-point numbers (typically in E4M3 or E5M2 format) to represent weights and/or activation values.
vLLM Support: Through integration with llm-compressor and AMD's Quark library, vLLM provides strong support for FP8, including both dynamic and static quantization.
Suitable Scenarios: Pursuing ultimate inference speed and throughput on modern GPUs (such as H100) that support FP8 acceleration.

4.1.4 FP8 KV Cache

This is a quantization technique specifically targeting the KV Cache, a major memory consumer during inference.

Core Idea: Quantize the Key-Value cache stored in GPU memory from FP16 or BF16 to FP8, thereby halving this portion of memory usage, allowing the model to support longer context windows or larger batch sizes.
vLLM Support: vLLM provides native support, which can be enabled at startup with the parameter --kv-cache-dtype fp8.

4.1.5 BitsAndBytes

This is a very popular quantization library, known for its ease of use and “on-the-fly” quantization.

Core Idea: Dynamically quantize during model loading, without needing pre-prepared quantized model files.
vLLM Support: vLLM integrates BitsAndBytes, allowing users to easily enable 4-bit quantization by setting the quantization="bitsandbytes" parameter.
Suitable Scenarios: Quick experimentation, user-friendly, avoiding complex offline quantization processes.

4.1.6 Other Schemes

SqueezeLLM: A non-uniform quantization method that believes weight importance is related to numerical size, thus using fewer bits for smaller weight values and more bits for larger weight values.
TorchAO: PyTorch's official quantization tool library, which vLLM is beginning to support.
BitBLAS: A low-level computation library aimed at accelerating low-bit (such as 1-bit, 2-bit, 4-bit) matrix operations through optimized kernel functions.

4.2 How to Use Quantized Models in vLLM

Using quantization in vLLM is very simple, typically just requiring specifying the quantization parameter in the LLM constructor. vLLM will automatically detect the quantization type from the model's configuration file (config.json).

Example: Loading an AWQ Quantized Model

from vllm import LLM
# vLLM will automatically recognize awq quantization from "TheBloke/My-Model-AWQ"'s config.json
llm = LLM(model="TheBloke/My-Model-AWQ", quantization="awq")

Example: Enabling FP8 KV Cache

from vllm import LLM
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
kv_cache_dtype="fp8")

5. llama.cpp vs. vLLM: Comparison and Summary

Feature	llama.cpp	vLLM
Target Platform	CPU, Cross-platform, Edge devices	High-performance GPU servers
Core Philosophy	Cohesive, self-contained, extreme optimization	Open, integrated, high throughput
File Format	GGUF (custom format)	Standard Hugging Face format
Quantization Schemes	Built-in `K-Quants`, `IQ`, etc.	Integrates GPTQ, AWQ, FP8, BnB, etc.
Ease of Use	Requires `llama-quantize` conversion	Direct loading, automatic detection
Ecosystem	Self-contained ecosystem	Embraces the entire Python AI ecosystem
Latest Technology	Quickly follows up and implements own versions	Quickly integrates latest open-source libraries

6. Latest Quantization Trends and Outlook

The field of model quantization is still rapidly evolving. Here are some trends worth noting:

1-bit/Binary Neural Networks (BNNs): Ultimate model compression, restricting weights to +1 or -1. Although currently suffering significant precision loss in LLMs, its potential is enormous, with related research emerging constantly.
Non-uniform Quantization: Like SqueezeLLM, dynamically allocating bit numbers based on data distribution, theoretically superior to uniform quantization.
Hardware-Algorithm Co-design: New hardware (such as FP8, FP4, INT4 support) is driving the development of new quantization algorithms, while new algorithms are guiding future hardware design.
Combining Quantization with Sparsification: Combining quantization with sparsification techniques like pruning holds promise for achieving higher rates of model compression.

7. Conclusion

Model quantization is a key technology for addressing the challenges of the large model era. llama.cpp and vLLM represent two different quantization philosophies: llama.cpp provides ultimate local inference performance for resource-constrained devices through its elegant GGUF format and built-in K-Quants; while vLLM has become the king of GPU cloud inference services through its open ecosystem and integration of various cutting-edge quantization schemes.

Understanding the quantization implementations of these two frameworks not only helps us choose the right tool for specific scenarios but also gives us insight into the development trajectory and future directions of the entire LLM inference optimization field.

Llama.cpp Technical Guide: Lightweight LLM Inference Engine

Thu, 26 Jun 2025 01:06:00 +0000

1. Introduction

Llama.cpp is a high-performance, lightweight inference framework for large language models (LLMs) written in C/C++. It focuses on efficiently running LLMs on consumer-grade hardware, making local inference possible on ordinary laptops and even smartphones.

Core Advantages:

High Performance: Achieves extremely fast inference speeds through optimized C/C++ code, quantization techniques, and hardware acceleration support (such as Apple Metal, CUDA, OpenCL, SYCL).
Lightweight: Extremely low memory and computational resource consumption, eliminating the need for expensive GPUs.
Cross-Platform: Supports multiple platforms including macOS, Linux, Windows, Docker, Android, and iOS.
Open Ecosystem: Features an active community and rich ecosystem, including Python bindings, UI tools, and OpenAI-compatible servers.
Continuous Innovation: Quickly follows and implements the latest model architectures and inference optimization techniques.

2. Core Concepts

2.1. GGUF Model Format

GGUF (Georgi Gerganov Universal Format) is the core model file format used by llama.cpp, an evolution of its predecessor GGML. GGUF is a binary format designed for fast loading and memory mapping.

Key Features:

Unified File: Packages model metadata, vocabulary, and all tensors (weights) in a single file.
Extensibility: Allows adding new metadata without breaking compatibility.
Backward Compatibility: Guarantees compatibility with older versions of GGUF models.
Memory Efficiency: Supports memory mapping (mmap), allowing multiple processes to share the same model weights, thereby saving memory.

2.2. Quantization

Quantization is one of the core advantages of llama.cpp. It is a technique that converts model weights from high-precision floating-point numbers (such as 32-bit or 16-bit) to low-precision integers (such as 4-bit, 5-bit, or 8-bit).

Main Benefits:

Reduced Model Size: Significantly reduces the size of model files, making them easier to distribute and store.
Lower Memory Usage: Reduces the RAM required to load the model into memory.
Faster Inference: Low-precision calculations are typically faster than high-precision ones, especially on CPUs.

llama.cpp supports various quantization methods, particularly k-quants, an advanced quantization technique that achieves extremely high compression rates while maintaining high model performance.

2.3. Multimodal Support

llama.cpp is not limited to text models; it has evolved into a powerful multimodal inference engine that supports processing text, images, and even audio simultaneously.

Supported Models: Supports various mainstream multimodal models such as LLaVA, MobileVLM, Granite, Qwen2.5 Omni, InternVL, SmolVLM, etc.
Working Principle: Typically converts images into embedding vectors through a vision encoder (such as CLIP), and then inputs these vectors along with text embedding vectors into the LLM.
Tools: llama-mtmd-cli and llama-server provide native support for multimodal models.

3. Usage Methods

3.1. Compilation

Compiling llama.cpp from source is very simple.

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
make

For specific hardware acceleration (such as CUDA or Metal), use the corresponding compilation options:

# For CUDA
make LLAMA_CUDA=1
# For Metal (on macOS)
make LLAMA_METAL=1

3.2. Basic Inference

After compilation, you can use the llama-cli tool for inference.

./llama-cli -m ./models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400

-m: Specifies the path to the GGUF model file.
-p: Specifies the prompt.
-n: Specifies the maximum number of tokens to generate.

3.3. OpenAI Compatible Server

llama.cpp provides a built-in HTTP server with an API compatible with OpenAI's API. This makes it easy to integrate with existing tools like LangChain and LlamaIndex.

Starting the server:

./llama-server -m models/7B/ggml-model-q4_0.gguf -c 4096

You can then send requests to http://localhost:8080/v1/chat/completions just like you would with the OpenAI API.

4. Advanced Features

4.1. Speculative Decoding

This is an advanced inference optimization technique that significantly accelerates generation speed by using a small “draft” model to predict the output of the main model.

Working Principle: The draft model quickly generates a draft token sequence, which is then validated all at once by the main model. If validated, it saves the time of generating tokens one by one.
Usage: Use the --draft-model parameter in llama-cli or llama-server to specify a small, fast draft model.

4.2. LoRA Support

LoRA (Low-Rank Adaptation) allows fine-tuning a model's behavior by training a small adapter without modifying the original model weights. llama.cpp supports loading one or more LoRA adapters during inference.

./llama-cli -m base-model.gguf --lora lora-adapter.gguf

You can even set different weights for different LoRA adapters:

./llama-cli -m base.gguf --lora-scaled lora_A.gguf 0.5 --lora-scaled lora_B.gguf 0.5

4.3. Grammars

Grammars are a very powerful feature that allows you to force the model's output to follow a specific format, such as a strict JSON schema.

Format: Uses a format called GBNF (GGML BNF) to define grammar rules.
Application: By providing GBNF rules through the grammar parameter in API requests, you can ensure that the model returns correctly formatted, directly parsable JSON data, avoiding output format errors and tedious post-processing.

Example: Using a Pydantic model to generate a JSON Schema, then converting it to GBNF to ensure the model output conforms to the expected Python object structure.

import json
from typing import List
from pydantic import BaseModel
class QAPair(BaseModel):
question: str
answer: str
class Summary(BaseModel):
key_facts: List[str]
qa_pairs: List[QAPair]
# Generate JSON Schema and print
schema = Summary.model_json_schema()
print(json.dumps(schema, indent=2))

5. Ecosystem

The success of llama.cpp has spawned a vibrant ecosystem:

llama-cpp-python: The most popular Python binding, providing interfaces to almost all features of llama.cpp and deeply integrated with frameworks like LangChain and LlamaIndex.
Ollama: A tool for packaging, distributing, and running models, using llama.cpp under the hood, greatly simplifying the process of running LLMs locally.
Numerous UI Tools: The community has developed a large number of graphical interface tools, allowing non-technical users to easily interact with local models.

6. Conclusion

llama.cpp is not just an inference engine; it has become a key force in driving the localization and popularization of LLMs. Through its excellent performance, highly optimized resource usage, and continuously expanding feature set (such as multimodality and grammar constraints), llama.cpp provides developers and researchers with a powerful and flexible platform, enabling them to explore and deploy AI applications on various devices, ushering in a new era of low-cost, privacy-protecting local AI.

llama.cpp | Ziyang Lin

Model Quantization Guide: A Comprehensive Analysis from Theory to Practice

1. Introduction

2. Quantization Fundamentals

2.1 What is Model Quantization?

2.2 Why is Quantization Needed?

2.3 Quantization Principles: Mapping and Dequantization

2.4 Symmetric vs. Asymmetric Quantization

2.5 Per-Layer vs. Per-Group/Per-Channel Quantization

2.6 Common Quantization Paradigms

3. Quantization Schemes in llama.cpp

3.1 GGUF Format and Quantization

3.2 Quantization Type Nomenclature in llama.cpp

3.2.1 Basic Quantization Types (Legacy)

3.2.2 K-Quants (Recommended)

3.3 How to Use the llama-quantize Tool

3.4 Importance Matrix

3.5 Summary

4. vLLM's Quantization Ecosystem

4.1 Mainstream Quantization Schemes Supported by vLLM

4.1.1 GPTQ (General-purpose Post-Training Quantization)

4.1.2 AWQ (Activation-aware Weight Quantization)

4.1.3 FP8 (8-bit Floating Point)

4.1.4 FP8 KV Cache

4.1.5 BitsAndBytes

4.1.6 Other Schemes

4.2 How to Use Quantized Models in vLLM

5. llama.cpp vs. vLLM: Comparison and Summary

6. Latest Quantization Trends and Outlook

7. Conclusion

Llama.cpp Technical Guide: Lightweight LLM Inference Engine

1. Introduction

2. Core Concepts

2.1. GGUF Model Format

2.2. Quantization

2.3. Multimodal Support

3. Usage Methods

3.1. Compilation

3.2. Basic Inference

3.3. OpenAI Compatible Server

4. Advanced Features

4.1. Speculative Decoding

4.2. LoRA Support

4.3. Grammars

5. Ecosystem

6. Conclusion

3.2 Quantization Type Nomenclature in `llama.cpp`

3.3 How to Use the `llama-quantize` Tool