<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Model Inference | Ziyang Lin</title><link>https://ziyanglin.netlify.app/en/tags/model-inference/</link><atom:link href="https://ziyanglin.netlify.app/en/tags/model-inference/index.xml" rel="self" type="application/rss+xml"/><description>Model Inference</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Mon, 30 Jun 2025 06:00:00 +0000</lastBuildDate><image><url>https://ziyanglin.netlify.app/img/icon-192.png</url><title>Model Inference</title><link>https://ziyanglin.netlify.app/en/tags/model-inference/</link></image><item><title>TensorRT In-Depth: High-Performance Deep Learning Inference Engine</title><link>https://ziyanglin.netlify.app/en/post/tensorrt-documentation/</link><pubDate>Mon, 30 Jun 2025 06:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/tensorrt-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>NVIDIA® TensorRT™ is a software development kit (SDK) for high-performance deep learning inference on NVIDIA GPUs. It is designed to optimize and accelerate trained neural networks, enabling them to run in production environments with low latency and high throughput. TensorRT takes models from mainstream deep learning frameworks (such as TensorFlow and PyTorch, typically imported via the ONNX format), applies a series of sophisticated optimization techniques, and generates a highly optimized runtime engine.&lt;/p>
&lt;p>This document will provide an in-depth yet accessible introduction to TensorRT's core concepts, key features, workflow, and latest functionalities (including TensorRT-LLM specifically designed for accelerating large language models), helping developers fully leverage its powerful performance advantages.&lt;/p>
&lt;h2 id="2-core-concepts">2. Core Concepts&lt;/h2>
&lt;p>Understanding TensorRT's core components is the first step to using it effectively.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Engine&lt;/strong>: The core of TensorRT. It is an optimized model representation that includes a computation graph and weights generated for a specific GPU architecture and configuration (such as batch size, precision). The Engine is immutable and is the final product for deployment.&lt;/li>
&lt;li>&lt;strong>Builder (&lt;code>IBuilder&lt;/code>)&lt;/strong>: This is the main interface for creating an Engine. The Builder takes a network definition and applies various optimizations, ultimately generating an optimized plan for the target GPU, which can be serialized into an Engine.&lt;/li>
&lt;li>&lt;strong>Network Definition (&lt;code>INetworkDefinition&lt;/code>)&lt;/strong>: This is where you define the model structure. You can build the network manually from scratch or import it from a model file using a Parser.&lt;/li>
&lt;li>&lt;strong>Parser&lt;/strong>: Used to parse models from different frameworks (primarily ONNX format) and convert them into TensorRT's network definition. TensorRT provides a powerful ONNX parser.&lt;/li>
&lt;li>&lt;strong>Profiler (&lt;code>IProfiler&lt;/code>)&lt;/strong>: An optional interface that collects per-layer timing information as an execution context runs inference. This helps with debugging and understanding which layers are performance bottlenecks.&lt;/li>
&lt;li>&lt;strong>Execution Context (&lt;code>IExecutionContext&lt;/code>)&lt;/strong>: This is the main interface for executing inference. An Engine can have multiple Execution Contexts, allowing concurrent execution of inference tasks. Each context maintains its own inputs, outputs, and state.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Model Building Offline&amp;quot;
A[Original Model&amp;lt;br&amp;gt;TensorFlow/PyTorch] --&amp;gt; B{ONNX Parser};
B --&amp;gt; C[Network Definition];
C --&amp;gt; D[Builder];
D -- Optimization Config --&amp;gt; E[Optimized Plan];
E --&amp;gt; F((Engine));
end
subgraph &amp;quot;Inference Deployment Online&amp;quot;
F --&amp;gt; G[Execution Context];
H[Input Data] --&amp;gt; G;
G --&amp;gt; I[Output Results];
end
style F fill:#f9f,stroke:#333,stroke-width:2px
style G fill:#ccf,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;h2 id="3-key-features-and-optimization-techniques">3. Key Features and Optimization Techniques&lt;/h2>
&lt;p>TensorRT's high performance stems from its advanced optimization techniques.&lt;/p>
&lt;h3 id="31-precision-calibration--quantization">3.1. Precision Calibration &amp;amp; Quantization&lt;/h3>
&lt;p>TensorRT supports multiple precisions for inference, including FP32, FP16, INT8, and the latest FP8. Among these, INT8 quantization is a key technology for improving performance and reducing memory usage.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Post-Training Quantization (PTQ)&lt;/strong>: Determines the scaling factors needed to convert FP32 weights and activation values to INT8 through a calibration dataset, without retraining the model.&lt;/li>
&lt;li>&lt;strong>Quantization-Aware Training (QAT)&lt;/strong>: Simulates quantization operations during training, making the model more robust to quantization errors, thus achieving higher accuracy when converted to INT8.&lt;/li>
&lt;/ul>
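&lt;p>The core idea behind PTQ can be sketched in a few lines: a calibration pass records the maximum absolute value observed for a tensor, and that maximum defines a single scale factor mapping FP32 values onto the signed INT8 range. The pure-Python sketch below shows only this basic max-calibration idea; TensorRT's real calibrators (e.g. entropy calibration) are considerably more sophisticated.&lt;/p>
&lt;pre>&lt;code class="language-python"># Illustrative sketch of symmetric INT8 post-training quantization.
# This is NOT TensorRT's actual calibrator, just the underlying idea.

def calibrate_scale(calibration_values):
    # One scale for the whole tensor: amax / 127
    amax = max(abs(v) for v in calibration_values)
    return amax / 127.0

def quantize(x, scale):
    q = round(x / scale)
    return max(-127, min(127, q))   # clamp to the INT8 range

def dequantize(q, scale):
    return q * scale

acts = [0.5, -1.2, 3.0, -2.7]       # pretend calibration data
scale = calibrate_scale(acts)
deq = [dequantize(quantize(v, scale), scale) for v in acts]
&lt;/code>&lt;/pre>
&lt;p>The quantization error per value is bounded by half the scale, which is why the calibration data must be representative of real inputs.&lt;/p>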
&lt;p>You can use &lt;code>QuantizationSpec&lt;/code> to precisely control which layers or types of layers need to be quantized.&lt;/p>
&lt;pre>&lt;code class="language-python"># Example: Only quantize 'Conv2D' type layers
q_spec = QuantizationSpec()
q_spec.add(name='Conv2D', is_keras_class=True)
q_model = quantize_model(model, quantization_mode='partial', quantization_spec=q_spec)
&lt;/code>&lt;/pre>
&lt;h3 id="32-layer--tensor-fusion">3.2. Layer &amp;amp; Tensor Fusion&lt;/h3>
&lt;p>TensorRT intelligently merges multiple independent layers into a single, more complex layer. This reduces the number of CUDA kernel launches and memory reads/writes, significantly lowering latency.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Vertical Fusion&lt;/strong>: Merges consecutive layers with the same data dependencies (such as Conv, Bias, ReLU) into a single CBR layer.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Before Fusion&amp;quot;
A[Input] --&amp;gt; B(Conv);
B --&amp;gt; C(Bias);
C --&amp;gt; D(ReLU);
D --&amp;gt; E[Output];
end
subgraph &amp;quot;After Fusion&amp;quot;
A2[Input] --&amp;gt; F((Conv + Bias + ReLU));
F --&amp;gt; E2[Output];
end
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Horizontal Fusion&lt;/strong>: Merges parallel layers that have the same input but perform different operations.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Before Fusion&amp;quot;
A[Input] --&amp;gt; B(Conv A);
A --&amp;gt; C(Conv B);
B --&amp;gt; D[Output A];
C --&amp;gt; E[Output B];
end
subgraph &amp;quot;After Fusion&amp;quot;
A2[Input] --&amp;gt; F((Conv A + Conv B));
F --&amp;gt; D2[Output A];
F --&amp;gt; E2[Output B];
end
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
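&lt;p>The point of vertical fusion is that the fused kernel is mathematically identical to the separate layers but touches memory once instead of three times. A toy elementwise sketch (scalar stand-ins for the real Conv/Bias/ReLU kernels) makes this concrete:&lt;/p>
&lt;pre>&lt;code class="language-python"># Toy illustration of vertical fusion: three separate passes vs one fused
# pass. In TensorRT this happens at the CUDA-kernel level; here we only show
# that the fused computation is identical with fewer passes over the data.

def unfused(xs, w, b):
    conv = [w * x for x in xs]              # pass 1: "conv" (toy multiply)
    biased = [c + b for c in conv]          # pass 2: bias add
    return [max(0.0, y) for y in biased]    # pass 3: ReLU

def fused(xs, w, b):
    # single pass: conv + bias + ReLU per element (one read, one write)
    return [max(0.0, w * x + b) for x in xs]
&lt;/code>&lt;/pre>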
&lt;h3 id="33-kernel-autotuning">3.3. Kernel Auto-Tuning&lt;/h3>
&lt;p>For specific target GPU architectures, TensorRT selects the optimal CUDA kernel for each layer from a library containing multiple implementations. It tests different algorithms and implementations based on the current batch size, input dimensions, and parameters to find the fastest one.&lt;/p>
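&lt;p>Conceptually, auto-tuning reduces to benchmarking every candidate implementation ("tactic") for the layer's actual shapes and keeping the fastest. The sketch below uses synthetic latencies in place of real CUDA kernel benchmarks:&lt;/p>
&lt;pre>&lt;code class="language-python"># Sketch of tactic selection as TensorRT's builder does conceptually:
# time every candidate kernel for the given shape, keep the fastest.
# The timings below are synthetic stand-ins for measured kernel latencies.

def pick_best_tactic(tactic_times):
    # tactic_times: dict mapping tactic name to measured latency (ms)
    return min(tactic_times, key=tactic_times.get)

measured = {
    'implicit_gemm': 0.42,
    'winograd': 0.31,
    'fft_conv': 0.77,
}
best = pick_best_tactic(measured)   # 'winograd' for these timings
&lt;/code>&lt;/pre>
&lt;p>Because the winner depends on batch size and input dimensions, engines built on one GPU model generally should not be reused on another.&lt;/p>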
&lt;h3 id="34-dynamic-shapes">3.4. Dynamic Shapes&lt;/h3>
&lt;p>TensorRT can handle models with input tensor dimensions that vary at runtime. When building an Engine, you can specify an optimization profile that includes minimum, optimal, and maximum dimensions for inputs. TensorRT will generate an Engine that can efficiently handle any input dimensions within the specified range.&lt;/p>
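&lt;p>At its core, an optimization profile is a per-dimension [min, max] range that runtime shapes must fall within. The helper below is illustrative only, not the real API (the actual calls are &lt;code>IOptimizationProfile.set_shape&lt;/code> on the builder side and &lt;code>set_input_shape&lt;/code> on the execution context):&lt;/p>
&lt;pre>&lt;code class="language-python"># Sketch of the min/opt/max range check behind an optimization profile.
# Function and variable names here are illustrative, not TensorRT API.

def shape_in_profile(shape, min_dims, max_dims):
    return all(
        lo &amp;lt;= d &amp;lt;= hi
        for d, lo, hi in zip(shape, min_dims, max_dims)
    )

profile = {'min': (1, 3, 224, 224), 'opt': (8, 3, 224, 224), 'max': (32, 3, 224, 224)}
ok = shape_in_profile((16, 3, 224, 224), profile['min'], profile['max'])
&lt;/code>&lt;/pre>
&lt;p>The 'opt' shape is the one TensorRT tunes kernels for, so it should match the shape you expect most often in production.&lt;/p>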
&lt;h3 id="35-plugins">3.5. Plugins&lt;/h3>
&lt;p>For custom or special layers not natively supported by TensorRT, you can implement your own logic through the plugin API (&lt;code>IPluginV2&lt;/code>). This provides great extensibility for TensorRT.&lt;/p>
&lt;p>The latest versions of TensorRT have greatly simplified the plugin registration process through decorators, especially for the Python API.&lt;/p>
&lt;pre>&lt;code class="language-python"># Example: Register a simple element-wise addition plugin
import tensorrt.plugin as trtp
@trtp.register(&amp;quot;sample::elemwise_add_plugin&amp;quot;)
def add_plugin_desc(inp0: trtp.TensorDesc, block_size: int) -&amp;gt; trtp.TensorDesc:
    return inp0.like()
&lt;/code>&lt;/pre>
&lt;h3 id="36-sparsity">3.6. Sparsity&lt;/h3>
&lt;p>TensorRT supports leveraging structured sparsity features on NVIDIA Ampere and higher architecture GPUs. If your model weights have a 2:4 sparsity pattern, TensorRT can utilize sparse tensor cores to further accelerate computation, nearly doubling performance.&lt;/p>
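&lt;p>The 2:4 pattern means that in every contiguous group of four weights, at least two must be zero. A small checker makes the constraint concrete (pruning a model into this pattern is normally done with NVIDIA's ASP tooling during training):&lt;/p>
&lt;pre>&lt;code class="language-python"># Check the 2:4 structured sparsity pattern that sparse tensor cores
# require: every contiguous group of 4 weights must contain at least 2 zeros.

def is_2_4_sparse(weights):
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        zeros = sum(1 for w in group if w == 0.0)
        if len(group) == 4 and zeros &amp;lt; 2:
            return False
    return True

dense = [0.3, 0.1, 0.7, 0.2, 0.5, 0.4, 0.6, 0.8]
pruned = [0.3, 0.0, 0.7, 0.0, 0.0, 0.4, 0.0, 0.8]   # valid 2:4 pattern
&lt;/code>&lt;/pre>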
&lt;h2 id="4-workflow">4. Workflow&lt;/h2>
&lt;p>A typical TensorRT deployment workflow is as follows:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant D as Developer
participant TF as TensorFlow/PyTorch
participant ONNX
participant Poly as Polygraphy
participant TRT as TensorRT (trtexec/API)
participant App as Application
D-&amp;gt;&amp;gt;TF: Train Model
TF--&amp;gt;&amp;gt;D: Generate Trained Model
D-&amp;gt;&amp;gt;ONNX: Export to ONNX Format
ONNX--&amp;gt;&amp;gt;D: .onnx File
D-&amp;gt;&amp;gt;Poly: Use Polygraphy to Check and Optimize
Poly--&amp;gt;&amp;gt;D: Optimized .onnx File
D-&amp;gt;&amp;gt;TRT: Build Engine (FP16/INT8)
TRT--&amp;gt;&amp;gt;D: Generate .engine File
D-&amp;gt;&amp;gt;App: Deploy Engine
App-&amp;gt;&amp;gt;App: Load Engine and Create Execution Context
loop Inference Loop
App-&amp;gt;&amp;gt;App: Prepare Input Data
App-&amp;gt;&amp;gt;App: Execute Inference
App-&amp;gt;&amp;gt;App: Get Output Results
end
&lt;/code>&lt;/pre>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Model Export&lt;/strong>: Export your trained model from your training framework (such as PyTorch or TensorFlow) to ONNX format. ONNX is an open model exchange format that serves as a bridge between training and inference.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Model Inspection and Optimization (Polygraphy)&lt;/strong>: Before building an Engine, it is strongly recommended to use the &lt;strong>Polygraphy&lt;/strong> toolkit to inspect, modify, and optimize your ONNX model. Polygraphy is a powerful tool that can:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Inspect Models&lt;/strong>: Display information about the model's layers, inputs, outputs, etc.&lt;/li>
&lt;li>&lt;strong>Constant Folding&lt;/strong>: Pre-compute constant expressions in the model, simplifying the computation graph.
&lt;pre>&lt;code class="language-bash">polygraphy surgeon sanitize model.onnx -o folded.onnx --fold-constants
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Compare Outputs from Different Frameworks&lt;/strong>: Verify that TensorRT's output is consistent with the original framework (such as ONNX Runtime) to troubleshoot precision issues.
&lt;pre>&lt;code class="language-bash">polygraphy run model.onnx --trt --onnxrt
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Handle Data-Dependent Shapes (DDS)&lt;/strong>: Identify and set upper bounds for tensors with data-dependent shapes.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Build Engine&lt;/strong>: Use the &lt;code>trtexec&lt;/code> command-line tool or TensorRT's C++/Python API to build an Engine.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>trtexec&lt;/code>&lt;/strong>: A convenient command-line tool for quickly building an Engine from an ONNX file and conducting performance benchmarking.
&lt;pre>&lt;code class="language-bash">trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>API&lt;/strong>: Provides more flexible control, such as defining optimization profiles for dynamic shapes, configuring plugins, etc.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Deployment and Inference&lt;/strong>: Load the serialized Engine file into your application and use an Execution Context to perform inference.&lt;/p>
&lt;pre>&lt;code class="language-python"># Using Polygraphy's TrtRunner for inference
from polygraphy.backend.trt import TrtRunner, EngineFromBytes
# Load Engine
engine = EngineFromBytes(open(&amp;quot;model.engine&amp;quot;, &amp;quot;rb&amp;quot;).read())
with TrtRunner(engine) as runner:
# Prepare input data
feed_dict = {&amp;quot;input_name&amp;quot;: input_data}
# Execute inference
outputs = runner.infer(feed_dict=feed_dict)
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ol>
&lt;h2 id="5-latest-feature-highlights">5. Latest Feature Highlights&lt;/h2>
&lt;p>TensorRT is rapidly iterating, and here are some of the latest important features:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Polygraphy Tool Enhancements&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Simplified CLI Syntax&lt;/strong>: Allows specifying both script and function name in a single parameter (&lt;code>my_script.py:my_func&lt;/code>).&lt;/li>
&lt;li>&lt;strong>Improved Input Specification&lt;/strong>: Uses a new list-style syntax (&lt;code>--input-shapes input0:[x,y,z]&lt;/code>) to avoid ambiguity.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Quickly Deployable Plugins&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>The Python API has introduced the &lt;code>@trtp.register&lt;/code> and &lt;code>@trt.plugin.autotune&lt;/code> decorators, making it remarkably simple to define, register, and auto-tune plugins without writing C++ code.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>CUDA Graphs&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Through the &lt;code>--use-cuda-graph&lt;/code> flag, TensorRT can leverage CUDA Graphs to capture the entire inference process, further reducing CPU overhead and kernel launch latency, particularly suitable for scenarios with fixed model structures.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>FP8 Support&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>On Hopper and higher architecture GPUs, TensorRT supports FP8 inference, providing higher performance and lower memory usage for large language models and other applications.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="6-appendix-common-commands">6. Appendix: Common Commands&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Install Polygraphy&lt;/strong>:
&lt;pre>&lt;code class="language-bash">python3 -m pip install polygraphy --extra-index-url https://pypi.ngc.nvidia.com
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Build and Install TensorRT Open Source Components&lt;/strong>:
&lt;pre>&lt;code class="language-bash"># From source directory
make install
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Run pytest Tests&lt;/strong>:
&lt;pre>&lt;code class="language-bash">pytest --verbose
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h2 id="7-tensorrtllm-born-for-large-language-model-inference">7. TensorRT-LLM: Born for Large Language Model Inference&lt;/h2>
&lt;p>As the scale and complexity of large language models (LLMs) grow exponentially, traditional inference optimization methods face unprecedented challenges. To address these challenges, NVIDIA has introduced TensorRT-LLM, an open-source library specifically designed to accelerate and optimize LLM inference. It is built on top of TensorRT and encapsulates a series of cutting-edge optimization techniques for LLMs.&lt;/p>
&lt;h3 id="71-what-is-tensorrtllm">7.1. What is TensorRT-LLM?&lt;/h3>
&lt;p>TensorRT-LLM can be thought of as an &amp;ldquo;LLM expert version&amp;rdquo; of TensorRT. It provides a Python API that allows developers to easily define LLM models and automatically apply various state-of-the-art optimizations. Ultimately, it generates a high-performance TensorRT engine that can be directly deployed.&lt;/p>
&lt;p>Unlike general TensorRT which mainly handles static graphs, TensorRT-LLM specifically addresses the dynamic characteristics in LLM inference, such as:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Autoregressive Generation&lt;/strong>: Each newly generated token depends on the previous tokens, resulting in dynamically changing input sequence lengths.&lt;/li>
&lt;li>&lt;strong>Enormous Model Scale&lt;/strong>: Model parameters often number in the billions or even hundreds of billions, making it impossible to deploy on a single GPU.&lt;/li>
&lt;li>&lt;strong>Massive KV Cache&lt;/strong>: The inference process requires storing a large number of key-value pairs (Key-Value Cache), placing extremely high demands on memory bandwidth and capacity.&lt;/li>
&lt;/ul>
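&lt;p>The KV cache pressure is easy to quantify: per sequence it is 2 tensors (K and V) per layer, each of size kv_heads × seq_len × head_dim. Plugging in the commonly cited Llama-2-7B configuration (32 layers, 32 heads, head dimension 128, FP16) shows how fast this grows:&lt;/p>
&lt;pre>&lt;code class="language-python"># Back-of-the-envelope KV cache size: 2 tensors (K and V) per layer,
# each [kv_heads, seq_len, head_dim], stored in FP16 (2 bytes).
# Dimensions below are the commonly cited Llama-2-7B configuration.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

llama7b = dict(layers=32, kv_heads=32, head_dim=128)
per_seq = kv_cache_bytes(seq_len=4096, **llama7b)
gib = per_seq / 2**30    # 2 GiB for a single 4096-token sequence
&lt;/code>&lt;/pre>
&lt;p>At batch size 32 that is 64 GiB of cache alone, which is exactly the pressure that Paged KV Cache (below) and quantization are designed to relieve.&lt;/p>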
&lt;h3 id="72-core-architecture-and-components">7.2. Core Architecture and Components&lt;/h3>
&lt;p>TensorRT-LLM's architecture is divided into frontend and backend:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Python API (&lt;code>tensorrt_llm&lt;/code>)&lt;/strong>: This is the main interface for user interaction. It defines models in a declarative way (similar to PyTorch), allowing developers to avoid dealing with the complex underlying TensorRT C++ API.&lt;/li>
&lt;li>&lt;strong>C++ Backend&lt;/strong>: This is the core that actually performs the optimization, containing pre-written, highly optimized CUDA kernels, LLM-specific optimization passes, and a runtime that can efficiently handle LLM tasks.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Frontend (Python API)&amp;quot;
A[Hugging Face / Custom Model] --&amp;gt;|Weights| B(Model Definition&amp;lt;br&amp;gt;tensorrt_llm.Module);
B --&amp;gt; C{Builder};
C -- Generate Network and Config --&amp;gt; D[Network Definition];
end
subgraph &amp;quot;Backend (C++ Runtime)&amp;quot;
D --&amp;gt; E[TensorRT-LLM Optimization];
E --&amp;gt; F((LLM Optimized Engine));
end
subgraph &amp;quot;Inference&amp;quot;
F --&amp;gt; G[C++/Python Runtime];
H[Input Prompts] --&amp;gt; G;
G --&amp;gt; I[Output Tokens];
end
style F fill:#c9f,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;h3 id="73-key-optimization-techniques-llmspecific">7.3. Key Optimization Techniques (LLM-Specific)&lt;/h3>
&lt;p>The magic of TensorRT-LLM lies in its optimization techniques specifically designed for LLMs.&lt;/p>
&lt;h4 id="731-inflight-batching-also-known-as-continuous-batching">7.3.1. In-Flight Batching (also known as Continuous Batching)&lt;/h4>
&lt;p>&lt;strong>Problem&lt;/strong>: Traditional static batching requires all requests to wait until a batch is formed before processing them together. Due to the varying generation lengths of each request, this leads to significant GPU idle time (&amp;ldquo;bubbles&amp;rdquo;), as the batch must wait for the slowest request to complete.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: In-Flight Batching allows the server to dynamically add new requests while the GPU is running. Once a request completes, its computational resources are immediately released and allocated to new requests in the waiting queue. This greatly improves GPU utilization and overall system throughput.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">gantt
title GPU Utilization Comparison
dateFormat X
axisFormat %S
section Static Batching
Request A: 0, 6
Request B: 0, 3
Request C: 0, 5
GPU Waiting : 3, 3
GPU Waiting : 5, 1
section In-Flight Batching
Request A : 0, 6
Request B : 0, 3
Request C : 0, 5
New Request D : 3, 4
&lt;/code>&lt;/pre>
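&lt;p>The gain sketched in the chart above can be reproduced with a toy scheduling model: static batching makes each batch wait for its slowest member, while in-flight batching hands a freed slot straight to the next queued request. This is a deliberately simplified model, not TensorRT-LLM's actual scheduler:&lt;/p>
&lt;pre>&lt;code class="language-python"># Toy scheduling model comparing static vs in-flight batching.
# Each request needs some number of decode steps; the GPU runs
# batch_size slots per step. Fewer total steps means higher throughput.

def static_batching_steps(lengths, batch_size):
    # fixed batches; each batch takes as long as its longest request
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def inflight_batching_steps(lengths, batch_size):
    # a freed slot immediately picks up the next queued request,
    # i.e. greedy assignment of work to the least-loaded slot
    slots = [0] * batch_size
    for need in lengths:
        idx = slots.index(min(slots))
        slots[idx] += need
    return max(slots)

reqs = [6, 3, 5, 4]   # decode lengths of four queued requests
&lt;/code>&lt;/pre>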
&lt;h4 id="732-paged-kv-cache--attention">7.3.2. Paged KV Cache &amp;amp; Attention&lt;/h4>
&lt;p>&lt;strong>Problem&lt;/strong>: In the autoregressive generation process, the KV cache grows linearly with sequence length, consuming large amounts of GPU memory. The traditional approach is to pre-allocate a continuous memory block for each request that can accommodate the maximum sequence length, leading to severe memory fragmentation and waste.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: Inspired by operating system virtual memory paging, TensorRT-LLM introduced Paged KV Cache. It divides the KV cache into fixed-size &amp;ldquo;blocks&amp;rdquo; and allocates them as needed.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Non-contiguous Storage&lt;/strong>: KV caches for logically continuous tokens can be stored in physically non-contiguous blocks.&lt;/li>
&lt;li>&lt;strong>Memory Sharing&lt;/strong>: For complex scenarios (such as parallel sampling, Beam Search), different sequences can share the same KV cache blocks (e.g., sharing the cache for the prompt portion), significantly saving memory.&lt;/li>
&lt;li>&lt;strong>Optimized Attention Kernels&lt;/strong>: TensorRT-LLM uses specially optimized Attention kernels such as FlashAttention and MQA/GQA that can directly operate on these non-contiguous cache blocks, avoiding data copy overhead.&lt;/li>
&lt;/ul>
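&lt;p>The mechanics of Paged KV Cache can be sketched as a block-table allocator: fixed-size blocks handed out on demand, with reference counting so multiple sequences can share the same physical prompt blocks. The class below is a minimal illustration, far simpler than TensorRT-LLM's real block manager:&lt;/p>
&lt;pre>&lt;code class="language-python"># Minimal sketch of a paged KV cache: fixed-size blocks allocated on
# demand, with reference counting so sequences can share prompt blocks.

BLOCK_SIZE = 16   # tokens per block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self, num_tokens):
        # ceil-divide tokens into blocks; block ids need not be contiguous
        needed = -(-num_tokens // BLOCK_SIZE)
        blocks = [self.free.pop() for _ in range(needed)]
        for b in blocks:
            self.refcount[b] = 1
        return blocks

    def share(self, blocks):
        # another sequence reuses the same physical blocks (e.g. a shared prompt)
        for b in blocks:
            self.refcount[b] += 1
        return list(blocks)

    def release(self, blocks):
        for b in blocks:
            self.refcount[b] -= 1
            if self.refcount[b] == 0:
                self.free.append(b)
&lt;/code>&lt;/pre>
&lt;p>Because a block is only reclaimed when its last user releases it, parallel sampling and beam search can fan out from a shared prompt without copying its cache.&lt;/p>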
&lt;h4 id="733-tensor--pipeline-parallelism">7.3.3. Tensor &amp;amp; Pipeline Parallelism&lt;/h4>
&lt;p>For large models that cannot fit on a single GPU, TensorRT-LLM has built-in seamless support for tensor parallelism and pipeline parallelism. Developers only need to specify the parallelism degree (&lt;code>tp_size&lt;/code>, &lt;code>pp_size&lt;/code>) during building, and TensorRT-LLM will automatically handle model splitting and cross-GPU communication.&lt;/p>
&lt;pre>&lt;code class="language-bash"># Example: Build a Llama model with 2-way tensor parallelism
python3 examples/llama/convert_checkpoint.py \
--model_dir ./llama-7b-hf \
--output_dir ./tllm_checkpoint_tp2 \
--dtype float16 \
--tp_size 2
&lt;/code>&lt;/pre>
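&lt;p>What &lt;code>tp_size&lt;/code> does under the hood can be illustrated with column-wise tensor parallelism: each rank holds a slice of the weight matrix's columns, computes its partial matmul, and the partial outputs are concatenated (the role of the all-gather in a real multi-GPU run). A single-process sketch:&lt;/p>
&lt;pre>&lt;code class="language-python"># Sketch of column-wise tensor parallelism. Each of tp_size ranks holds a
# column slice of the weight matrix; concatenating the partial results
# reproduces the full single-GPU matmul output.

def split_columns(weight_rows, tp_size):
    cols = len(weight_rows[0])
    per_rank = cols // tp_size
    return [
        [row[r * per_rank:(r + 1) * per_rank] for row in weight_rows]
        for r in range(tp_size)
    ]

def matmul_vec(x, weight_rows):
    # y_j = sum_i x_i * W[i][j]
    cols = len(weight_rows[0])
    return [sum(x[i] * weight_rows[i][j] for i in range(len(x)))
            for j in range(cols)]

W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
x = [1.0, 1.0]
full = matmul_vec(x, W)                                  # single-GPU reference
shards = split_columns(W, tp_size=2)
parallel = sum((matmul_vec(x, s) for s in shards), [])   # concat partial outputs
&lt;/code>&lt;/pre>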
&lt;h4 id="734-advanced-quantization-support-fp8int4int8">7.3.4. Advanced Quantization Support (FP8/INT4/INT8)&lt;/h4>
&lt;p>The enormous parameter count of LLMs makes them ideal candidates for quantization. TensorRT-LLM supports various advanced quantization schemes:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>FP8&lt;/strong>: On NVIDIA Hopper and higher architecture GPUs, FP8 provides precision close to FP16 while significantly improving performance and reducing memory usage.&lt;/li>
&lt;li>&lt;strong>INT8 SmoothQuant&lt;/strong>: A technique that quantizes both activations and weights, achieving INT8 acceleration while maintaining high precision.&lt;/li>
&lt;li>&lt;strong>INT4/INT8 Weight-Only Quantization (W4A16/W8A16)&lt;/strong>: This is a very popular technique that only quantizes model weights (the largest part of parameters) to INT4 or INT8, while keeping activations in FP16. This greatly reduces memory usage with minimal impact on accuracy.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-bash"># Example: Build a model with INT4 weight-only quantization
python convert_checkpoint.py --model_dir ./gpt-j-6b \
--dtype float16 \
--use_weight_only \
--weight_only_precision int4 \
--output_dir ./trt_ckpt/gptj_int4wo_tp1/
&lt;/code>&lt;/pre>
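&lt;p>The idea behind W4A16 can be sketched in a few lines: each output channel's weights are mapped onto the symmetric INT4 range with a single FP16 scale, and dequantized on the fly at matmul time while activations stay FP16. This toy version omits the packing of two 4-bit values per byte and group-wise scaling that real implementations use:&lt;/p>
&lt;pre>&lt;code class="language-python"># Sketch of INT4 weight-only quantization (W4A16): weights become 4-bit
# integers plus one scale per output channel; activations stay FP16 and
# weights are dequantized on the fly at matmul time.

def quantize_channel_int4(weights):
    amax = max(abs(w) for w in weights)
    scale = amax / 7.0                 # symmetric INT4 range is [-7, 7]
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_channel(q, scale):
    return [v * scale for v in q]

channel = [0.42, -0.7, 0.13, 0.35]
q, scale = quantize_channel_int4(channel)
restored = dequantize_channel(q, scale)
# storage: 4 bits per weight + one scale, vs 16 bits per weight (~4x smaller)
&lt;/code>&lt;/pre>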
&lt;h3 id="74-tensorrtllm-workflow">7.4. TensorRT-LLM Workflow&lt;/h3>
&lt;p>A typical TensorRT-LLM workflow is as follows:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant D as Developer
participant HF as Hugging Face Hub
participant Conv as convert_checkpoint.py
participant Build as trtllm-build
participant App as Inference Application (Python/C++)
D-&amp;gt;&amp;gt;HF: Download Model Weights
HF--&amp;gt;&amp;gt;D: model_dir
D-&amp;gt;&amp;gt;Conv: Run Conversion Script (Specify Precision, Parallelism, etc.)
Conv--&amp;gt;&amp;gt;D: Generate TensorRT-LLM Checkpoint
D-&amp;gt;&amp;gt;Build: Run Build Command (Specify Plugins, BatchSize, etc.)
Build--&amp;gt;&amp;gt;D: Generate Optimized .engine File
D-&amp;gt;&amp;gt;App: Load Engine and Run Inference
App--&amp;gt;&amp;gt;D: Return Generation Results
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>End-to-End Example (Using Llama-7B)&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Convert Weights&lt;/strong>:
&lt;pre>&lt;code class="language-bash">git clone https://huggingface.co/meta-llama/Llama-2-7b-hf
python3 examples/llama/convert_checkpoint.py \
--model_dir ./Llama-2-7b-hf \
--output_dir ./tllm_checkpoint_1gpu \
--dtype float16
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Build Engine&lt;/strong>:
&lt;pre>&lt;code class="language-bash">trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu \
--output_dir ./trt_engines/llama_7b \
--gpt_attention_plugin float16 \
--gemm_plugin float16
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Run Inference&lt;/strong>:
&lt;pre>&lt;code class="language-bash">python3 examples/run.py --max_output_len=100 \
--tokenizer_dir ./Llama-2-7b-hf \
--engine_dir=./trt_engines/llama_7b
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ol>
&lt;h3 id="75-convenient-highlevel-api-llm">7.5. Convenient High-Level API (&lt;code>LLM&lt;/code>)&lt;/h3>
&lt;p>To further simplify the development process, TensorRT-LLM provides a high-level API called &lt;code>LLM&lt;/code>. This interface encapsulates model loading, building, saving, and inference into a simple class, allowing developers to complete all operations in just a few lines of code.&lt;/p>
&lt;pre>&lt;code class="language-python">from tensorrt_llm import LLM
# 1. Initialize LLM object, if the engine doesn't exist, it will automatically build from HuggingFace model
# All optimizations like In-Flight Batching, Paged KV-Cache will be applied here
llm = LLM(
model=&amp;quot;meta-llama/Llama-2-7b-hf&amp;quot;,
tensor_parallel_size=1,
)
# 2. (Optional) Save the built engine for later use
llm.save(&amp;quot;llama_engine_dir&amp;quot;)
# 3. Run inference
prompt = &amp;quot;NVIDIA TensorRT-LLM is&amp;quot;
for output in llm.generate([prompt], max_new_tokens=50):
print(output)
&lt;/code>&lt;/pre>
&lt;p>This high-level API is ideal for rapid prototyping and deployment.&lt;/p>
&lt;h3 id="76-conclusion">7.6. Conclusion&lt;/h3>
&lt;p>TensorRT-LLM is not simply applying TensorRT to LLMs, but a comprehensive solution fundamentally redesigned for LLM inference, containing multiple state-of-the-art optimizations. Through In-Flight Batching, Paged KV-Cache, native parallel support, and advanced quantization schemes, it can maximize the hardware performance of NVIDIA GPUs, providing a solid foundation for deploying high-performance, high-throughput LLM services.&lt;/p></description></item><item><title>SGLang Technical Guide: High-Performance Structured Generation Framework</title><link>https://ziyanglin.netlify.app/en/post/sglang-documentation/</link><pubDate>Thu, 26 Jun 2025 01:07:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/sglang-documentation/</guid><description>&lt;h2 id="1-sglang-introduction">1. SGLang Introduction&lt;/h2>
&lt;p>SGLang (Structured Generation Language) is a high-performance service framework designed for large language models (LLMs) and vision language models (VLMs). Its core goal is to address the challenges faced by complex LLM programs in real-world applications, maximizing inference performance while maintaining flexibility.&lt;/p>
&lt;p>Traditional LLM service frameworks (like vLLM) excel at handling simple, one-shot prompting but face limitations in complex scenarios requiring multi-turn interactions, structured outputs, function calls, or control flow. SGLang effectively bridges this gap by introducing a novel frontend language and an efficient backend runtime.&lt;/p>
&lt;p>&lt;strong>Core advantages of SGLang include:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Exceptional Performance:&lt;/strong> SGLang introduces &lt;strong>RadixAttention&lt;/strong>, an innovative attention mechanism that automatically and losslessly reuses key-value caches (KV Cache), significantly improving inference speed in scenarios with complex prompts (like CoT, ReAct) or multi-turn conversations. Compared to leading frameworks like vLLM, SGLang can achieve several times higher throughput in these scenarios.&lt;/li>
&lt;li>&lt;strong>Powerful Programming Capabilities:&lt;/strong> SGLang provides an intuitive domain-specific language (DSL) that allows developers to orchestrate complex generation tasks in a Pythonic way. You can easily define variables, use loops and conditional statements, call external tools, and seamlessly integrate these logic elements with the LLM's generation process. This makes building complex AI agents, multi-turn dialogue systems, and structured data extraction tasks unprecedentedly simple.&lt;/li>
&lt;li>&lt;strong>Unified Frontend-Backend Interface:&lt;/strong> SGLang decouples frontend programming logic from backend inference services. The frontend defines &amp;ldquo;what to generate,&amp;rdquo; while the backend handles &amp;ldquo;how to efficiently generate it.&amp;rdquo; This design not only simplifies the development process but also makes SGLang compatible with OpenAI's API standards, allowing users to easily migrate existing applications to SGLang and immediately benefit from performance gains.&lt;/li>
&lt;li>&lt;strong>Flexible Structured Output:&lt;/strong> SGLang provides powerful structured output constraint capabilities. Whether through regular expressions, EBNF grammar, or JSON Schema, you can precisely control the output format of the LLM, ensuring that the generated content conforms to the expected structure, which is crucial for applications requiring reliable data formats.&lt;/li>
&lt;/ul>
&lt;p>In summary, SGLang is not just an LLM inference acceleration engine but a complete programming and execution framework for complex generation tasks. It aims to enable developers to fully unleash the potential of large language models in an efficient and intuitive way.&lt;/p>
&lt;h2 id="2-core-features">2. Core Features&lt;/h2>
&lt;p>The power of SGLang lies in its unique design, which combines an intuitive frontend programming model with an efficient backend execution engine. Below are detailed introductions to several of its core features.&lt;/p>
&lt;h3 id="21-radixattention-kv-cache-optimization-for-complex-prompts">2.1 RadixAttention: KV Cache Optimization for Complex Prompts&lt;/h3>
&lt;p>When processing complex LLM programs, such as Chain-of-Thought, multi-turn dialogues, or agents that need to call tools, prompts often contain large shared prefixes. Traditional attention mechanisms produce redundant computation and storage when handling these shared prefixes.&lt;/p>
&lt;p>SGLang introduces &lt;strong>RadixAttention&lt;/strong>, a novel KV cache optimization technique. Its core idea is to organize prompts into a radix tree and perform attention calculations on this tree.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Automatic Sharing and Reuse&lt;/strong>: RadixAttention can automatically identify and share common prefixes between different requests, avoiding duplicate computation and storage. For example, in multi-turn dialogues, the conversation history of each turn can be losslessly reused by subsequent turns.&lt;/li>
&lt;li>&lt;strong>Performance Improvement&lt;/strong>: By maximizing KV cache reuse, RadixAttention significantly reduces memory usage and computational load, increasing throughput by 2 to 5 times, especially when handling long prompts or high-concurrency requests.&lt;/li>
&lt;/ul>
&lt;p>Below is a Mermaid diagram that visually demonstrates how RadixAttention handles requests with shared prefixes:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;Traditional Method (No Sharing)&amp;quot;
req1[&amp;quot;Request 1: 'A B C D'&amp;quot;]
req2[&amp;quot;Request 2: 'A B E F'&amp;quot;]
kv1[&amp;quot;KV Cache: [A, B, C, D]&amp;quot;]
kv2[&amp;quot;KV Cache: [A, B, E, F]&amp;quot;]
req1 --&amp;gt; kv1
req2 --&amp;gt; kv2
end
subgraph &amp;quot;SGLang RadixAttention&amp;quot;
Root(&amp;quot;Root&amp;quot;) --&amp;gt; A(&amp;quot;Token 'A'&amp;quot;);
A --&amp;gt; B(&amp;quot;Token 'B'&amp;quot;);
B --&amp;gt; C(&amp;quot;Token 'C'&amp;quot;);
B --&amp;gt; E(&amp;quot;Token 'E'&amp;quot;);
C --&amp;gt; D(&amp;quot;Token 'D'&amp;quot;);
E --&amp;gt; F(&amp;quot;Token 'F'&amp;quot;);
style A fill:#9f9
style B fill:#9f9
end
&lt;/code>&lt;/pre>
&lt;p>In the diagram above, for two requests &lt;code>'A B C D'&lt;/code> and &lt;code>'A B E F'&lt;/code>, the traditional method creates two independent KV caches. RadixAttention, however, organizes them into a tree, sharing the computation and storage of the common prefix &lt;code>'A B'&lt;/code> (green nodes), creating new branches only for the different parts (C, D, E, F). This greatly improves memory and computational efficiency.&lt;/p>
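&lt;p>The sharing in the diagram can be sketched as a prefix tree: inserting a request walks existing nodes (cache hits, no recomputation) and only creates new nodes for the unshared suffix. Real RadixAttention manages actual KV tensors plus eviction; this minimal version only counts sharing:&lt;/p>
&lt;pre>&lt;code class="language-python"># Minimal prefix tree in the spirit of RadixAttention: returns how many
# leading tokens of a new request were already cached by earlier requests.

class PrefixTree:
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        hits = 0
        for t in tokens:
            if t in node:
                hits += 1            # KV cache reused for this token
            else:
                node[t] = {}         # new KV entries must be computed
            node = node[t]
        return hits

tree = PrefixTree()
first = tree.insert(['A', 'B', 'C', 'D'])   # cold start: nothing reused
second = tree.insert(['A', 'B', 'E', 'F'])  # reuses the shared 'A B' prefix
&lt;/code>&lt;/pre>
&lt;p>In a multi-turn conversation the shared prefix is the entire dialogue history, which is why the savings compound with every turn.&lt;/p>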
&lt;h3 id="22-unified-frontend-programming-language-dsl">2.2 Unified Frontend Programming Language (DSL)&lt;/h3>
&lt;p>SGLang provides an expressive domain-specific language (DSL) deeply integrated with Python, allowing developers to build complex generation logic in a natural and intuitive way.&lt;/p>
&lt;h3 id="sglang-architecture-overview">SGLang Architecture Overview&lt;/h3>
&lt;p>To better understand how SGLang works, we can observe its core architecture through the following flowchart:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph User Side
A[Developer defines SGLang program&amp;lt;br&amp;gt;using function decorator] --&amp;gt; B{Call run method};
end
subgraph SGLang Frontend
B --&amp;gt; C[1. Parse Python AST&amp;lt;br&amp;gt;Separate deterministic logic and generation instructions];
C --&amp;gt; D[2. Build portable&amp;lt;br&amp;gt;SGLang IR intermediate representation];
end
subgraph Network Communication
D -- HTTP Request --&amp;gt; E[SGLang backend service SRT];
end
subgraph SGLang Backend SRT
E --&amp;gt; F[3. Receive IR and schedule];
F --&amp;gt; G{RadixAttention engine};
G --&amp;gt; H[4. Efficient execution&amp;lt;br&amp;gt;KV cache reuse];
H --&amp;gt; I[LLM/VLM model];
I --&amp;gt; J[5. Generate results];
end
subgraph Return Path
J -- HTTP Response --&amp;gt; K[Return results to frontend];
K --&amp;gt; L[6. Fill state object `s`];
L --&amp;gt; M[User gets final results];
end
style B fill:#f9f,stroke:#333,stroke-width:2px
style E fill:#ccf,stroke:#333,stroke-width:2px
style G fill:#9cf,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;p>This diagram clearly shows how SGLang decouples and combines the programming convenience of the frontend with the high-performance execution engine of the backend.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Pythonic Control Flow&lt;/strong>: You can directly use standard Python control flow statements like &lt;code>if/else&lt;/code> and &lt;code>for&lt;/code> loops in SGLang functions to dynamically build prompts.&lt;/li>
&lt;li>&lt;strong>Integration of Generation and Logic&lt;/strong>: Through the &lt;code>@function&lt;/code> decorator and &lt;code>gen()&lt;/code> instruction, SGLang seamlessly combines the LLM's generation process (the &amp;ldquo;non-deterministic&amp;rdquo; part) with the program's deterministic logic.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example: Generating Different Content Based on Conditions&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from sglang import function, system, user, assistant, gen
@function
def tool_use(s, question):
s += system(&amp;quot;You are a helpful assistant.&amp;quot;)
s += user(question)
s += assistant(
&amp;quot;To answer this question, I need to use a &amp;quot;
+ gen(&amp;quot;tool&amp;quot;, choices=[&amp;quot;calculator&amp;quot;, &amp;quot;search engine&amp;quot;])
+ &amp;quot;. &amp;quot;
)
if s[&amp;quot;tool&amp;quot;] == &amp;quot;calculator&amp;quot;:
s += assistant(&amp;quot;The math expression is: &amp;quot; + gen(&amp;quot;expression&amp;quot;))
elif s[&amp;quot;tool&amp;quot;] == &amp;quot;search engine&amp;quot;:
s += assistant(&amp;quot;The key word to search is: &amp;quot; + gen(&amp;quot;word&amp;quot;))
state = tool_use.run(&amp;quot;What is the population of London?&amp;quot;)
print(state[&amp;quot;tool&amp;quot;])
# Output: search engine
print(state[&amp;quot;word&amp;quot;])
# Output: population of London
&lt;/code>&lt;/pre>
&lt;p>In this example, the program first asks the LLM to choose between &amp;ldquo;calculator&amp;rdquo; and &amp;ldquo;search engine&amp;rdquo; as a tool, then executes different logic branches based on the LLM's choice, guiding the LLM to generate the next step of content.&lt;/p>

&lt;h3 id="23-powerful-structured-output">2.3 Powerful Structured Output&lt;/h3>
&lt;p>To ensure that content generated by the LLM can be reliably parsed and used by downstream programs, SGLang provides multiple powerful structured output constraint mechanisms.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Regular Expressions (Regex)&lt;/strong>: You can provide a regular expression to force the model's output to strictly match that pattern. This is useful for generating identifiers, numbers, or simple text fragments in specific formats.&lt;/p>
&lt;pre>&lt;code class="language-python">response = client.chat.completions.create(
model=&amp;quot;deepseek-ai/DeepSeek-R1-Distill-Qwen-7B&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;What is the capital of France?&amp;quot;}],
extra_body={&amp;quot;regex&amp;quot;: &amp;quot;(Paris|London)&amp;quot;},
)
# response.choices[0].message.content will necessarily be &amp;quot;Paris&amp;quot; or &amp;quot;London&amp;quot;
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>EBNF Grammar&lt;/strong>: For more complex grammatical structures, you can use Extended Backus-Naur Form (EBNF) to define a complete grammar. This allows you to generate code, DSLs, or other structured text that strictly adheres to specific syntax.&lt;/p>
&lt;pre>&lt;code class="language-python">ebnf_grammar = &amp;quot;&amp;quot;&amp;quot;
root ::= city &amp;quot; is the capital of &amp;quot; country
city ::= &amp;quot;London&amp;quot; | &amp;quot;Paris&amp;quot; | &amp;quot;Berlin&amp;quot; | &amp;quot;Rome&amp;quot;
country ::= &amp;quot;England&amp;quot; | &amp;quot;France&amp;quot; | &amp;quot;Germany&amp;quot; | &amp;quot;Italy&amp;quot;
&amp;quot;&amp;quot;&amp;quot;
response = client.chat.completions.create(
model=&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Give me the information of the capital of France.&amp;quot;}],
extra_body={&amp;quot;ebnf&amp;quot;: ebnf_grammar},
)
# response.choices[0].message.content will be &amp;quot;Paris is the capital of France&amp;quot;
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>JSON Schema&lt;/strong>: SGLang supports using JSON Schema to constrain the model to generate structured JSON objects. You can directly define a JSON Schema or use a Pydantic model to automatically generate one. This is crucial for APIs and data processing tasks that require reliable, verifiable JSON output.&lt;/p>
&lt;pre>&lt;code class="language-python">from pydantic import BaseModel
class CapitalInfo(BaseModel):
name: str
population: int
response = client.chat.completions.create(
model=&amp;quot;deepseek-ai/DeepSeek-R1-Distill-Qwen-7B&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Give me the information and population of the capital of France in the JSON format.&amp;quot;}],
response_format={
&amp;quot;type&amp;quot;: &amp;quot;json_schema&amp;quot;,
&amp;quot;json_schema&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;capital_info&amp;quot;,
&amp;quot;schema&amp;quot;: CapitalInfo.model_json_schema(),
},
},
)
# response.choices[0].message.content will be a JSON string conforming to the CapitalInfo structure
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
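Constrained decoding guarantees syntactic validity, but it is still good practice to parse and validate the response string before using it downstream. A small sketch with manual field checks (with Pydantic you would instead call `CapitalInfo.model_validate_json(...)`; the sample JSON below is illustrative):

```python
import json

def parse_capital_info(raw: str) -> dict:
    """Parse and lightly validate JSON produced under the CapitalInfo schema."""
    data = json.loads(raw)
    assert isinstance(data.get("name"), str), "missing string field 'name'"
    assert isinstance(data.get("population"), int), "missing int field 'population'"
    return data

info = parse_capital_info('{"name": "Paris", "population": 2102650}')
print(info["name"])  # Paris
```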
&lt;h2 id="3-quick-start">3. Quick Start&lt;/h2>
&lt;p>This section will guide you through installing SGLang, starting the service, and basic usage, allowing you to experience SGLang's powerful features in just a few minutes.&lt;/p>
&lt;h3 id="31-installation">3.1 Installation&lt;/h3>
&lt;p>SGLang can be installed via &lt;code>pip&lt;/code> or the faster &lt;code>uv&lt;/code>. For the best experience and full functionality, it's recommended to install the &lt;code>all&lt;/code> version.&lt;/p>
&lt;p>&lt;strong>Using pip:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install --upgrade pip
pip install &amp;quot;sglang[all]&amp;quot;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Using uv (recommended, faster):&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install uv
uv pip install &amp;quot;sglang[all]&amp;quot;
&lt;/code>&lt;/pre>
&lt;blockquote>
&lt;p>&lt;strong>Note&lt;/strong>: The installation process may require compiling CUDA kernels (such as &lt;code>flashinfer&lt;/code>). Please ensure that the &lt;code>CUDA_HOME&lt;/code> environment variable is correctly configured in your environment and that the CUDA version is compatible with your PyTorch version.&lt;/p>
&lt;/blockquote>
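Before installing, a quick sanity check of the environment can save a failed kernel compile. The sketch below only inspects `CUDA_HOME` and the `nvcc` binary; the matching-major-version rule of thumb is an assumption, so consult the PyTorch/flashinfer compatibility notes for exact requirements:

```python
import os
import shutil

def cuda_versions_compatible(nvcc_version: str, torch_cuda_version: str) -> bool:
    """Rough heuristic: treat toolkit and PyTorch CUDA builds as compatible
    when their major versions match (e.g. 12.4 vs 12.1)."""
    return nvcc_version.split(".")[0] == torch_cuda_version.split(".")[0]

cuda_home = os.environ.get("CUDA_HOME")
print("CUDA_HOME:", cuda_home or "not set")
print("nvcc on PATH:", bool(shutil.which("nvcc")))

# If PyTorch is installed, compare its CUDA build against your toolkit:
# import torch
# print(cuda_versions_compatible("12.4", torch.version.cuda))
```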
&lt;h3 id="32-starting-the-backend-service-srt">3.2 Starting the Backend Service (SRT)&lt;/h3>
&lt;p>After installation, the next step is to start SGLang's backend service (SRT, SGLang Runtime). This service will load the specified language model and provide an interface compatible with the OpenAI API.&lt;/p>
&lt;p>Run the following command in your terminal:&lt;/p>
&lt;pre>&lt;code class="language-bash">python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Parameter Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;code>--model-path&lt;/code>: Specifies the path to the model to load. This can be a model name on the Hugging Face Hub (as shown in this example) or a local model path.&lt;/li>
&lt;li>&lt;code>--host&lt;/code>: The host address the service listens on. &lt;code>0.0.0.0&lt;/code> means allowing access from any network interface.&lt;/li>
&lt;li>&lt;code>--port&lt;/code>: The port number the service listens on.&lt;/li>
&lt;/ul>
&lt;p>When the service starts successfully, you'll see output similar to the following, indicating that the model has been loaded and is ready to receive requests.&lt;/p>
&lt;pre>&lt;code>INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
INFO: Started server process [12345]
INFO: Waiting for application startup.
INFO: Application startup complete.
&lt;/code>&lt;/pre>
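Before sending traffic, you can poll the server until it answers. A minimal readiness check using only the standard library; `/v1/models` is the standard OpenAI-compatible model-listing endpoint, assumed here to be served by SGLang:

```python
import urllib.request
import urllib.error

def server_ready(base_url: str, path: str = "/v1/models", timeout: float = 2.0) -> bool:
    """Return True once the server responds with HTTP 200 on the given endpoint."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(server_ready("http://127.0.0.1:30000"))
```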
&lt;h3 id="33-sending-your-first-request">3.3 Sending Your First Request&lt;/h3>
&lt;p>With the service running, we can now interact with it using OpenAI's Python client library.&lt;/p>
&lt;p>Create a Python file named &lt;code>test_sglang.py&lt;/code> and fill it with the following content:&lt;/p>
&lt;pre>&lt;code class="language-python">import openai
# Initialize the client, pointing to our locally started SGLang service
client = openai.Client(
base_url=&amp;quot;http://127.0.0.1:30000/v1&amp;quot;,
api_key=&amp;quot;EMPTY&amp;quot; # SGLang service doesn't require an API Key
)
# Create a chat completion request
response = client.chat.completions.create(
model=&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;, # Must match the model loaded by the service
messages=[
{&amp;quot;role&amp;quot;: &amp;quot;system&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;You are a helpful assistant.&amp;quot;},
{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;What is the capital of France and why is it famous?&amp;quot;},
],
temperature=0.7,
max_tokens=150,
)
# Print the model's response
print(response.choices[0].message.content)
&lt;/code>&lt;/pre>
&lt;p>Run this script:&lt;/p>
&lt;pre>&lt;code class="language-bash">python test_sglang.py
&lt;/code>&lt;/pre>
&lt;p>You'll see the model's detailed answer about Paris. At this point, you've successfully completed the entire process from service deployment to inference request using SGLang!&lt;/p>
&lt;h2 id="4-frontend-language-sglang-dsl">4. Frontend Language (SGLang DSL)&lt;/h2>
&lt;p>SGLang's frontend language (DSL) is the core of its usability. It allows you to define complex generation processes in a declarative way, perfectly combining Python's flexibility with the generative capabilities of LLMs.&lt;/p>
&lt;h3 id="41-function-decorator">4.1 &lt;code>@function&lt;/code> Decorator&lt;/h3>
&lt;p>All SGLang programs begin with a Python function decorated by &lt;code>@function&lt;/code>. This decorator transforms an ordinary Python function into an executable SGLang program template.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>State Management&lt;/strong>: The first parameter of the function (typically named &lt;code>s&lt;/code>) represents the current generation state. It's a dictionary-like object used to store and pass all variables produced during the generation process.&lt;/li>
&lt;li>&lt;strong>Delayed Execution&lt;/strong>: Functions decorated with &lt;code>@function&lt;/code> are not executed immediately when defined. Instead, they create a reusable template. The program only executes when the &lt;code>.run()&lt;/code> or &lt;code>.run_batch()&lt;/code> method is called.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Interaction Flow&lt;/strong>&lt;/p>
&lt;p>The sequence diagram below illustrates a complete end-to-end interaction, using the tool-calling scenario detailed in Section 6 as a concrete example:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant User
participant App as Application (Python)
participant SGLang as SGLang Service
participant Tool as External Tool (e.g., Weather API)
User-&amp;gt;&amp;gt;+App: &amp;quot;What's the weather like in Boston?&amp;quot;
App-&amp;gt;&amp;gt;+SGLang: Send request with messages and tools
SGLang-&amp;gt;&amp;gt;SGLang: Model decides to call get_current_weather
SGLang--&amp;gt;&amp;gt;-App: Return tool_calls with function name and parameters
App-&amp;gt;&amp;gt;App: Parse tool_calls
App-&amp;gt;&amp;gt;+Tool: Call get_current_weather(city=&amp;quot;Boston&amp;quot;, unit=&amp;quot;fahrenheit&amp;quot;)
Tool--&amp;gt;&amp;gt;-App: Return weather result: &amp;quot;68°F&amp;quot;
App-&amp;gt;&amp;gt;+SGLang: Send new request with weather result
SGLang-&amp;gt;&amp;gt;SGLang: Model generates final reply based on weather result
SGLang--&amp;gt;&amp;gt;-App: Return final natural language reply
App--&amp;gt;&amp;gt;-User: &amp;quot;It's currently 68°F in Boston.&amp;quot;
&lt;/code>&lt;/pre>
&lt;p>This sequence diagram clearly shows the complete loop from user question to model decision, tool call, result integration, and final response.&lt;/p>
&lt;h3 id="42-core-instructions">4.2 Core Instructions&lt;/h3>
&lt;p>Within SGLang functions, you use a series of instructions to build prompts and control the generation flow.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Role Instructions&lt;/strong>: &lt;code>system()&lt;/code>, &lt;code>user()&lt;/code>, &lt;code>assistant()&lt;/code>
These instructions are used to define different parts of a conversation, conforming to the standard multi-turn dialogue format. You can pass strings directly to them.&lt;/li>
&lt;li>&lt;strong>Generation Instruction&lt;/strong>: &lt;code>gen()&lt;/code>
This is the most important instruction in SGLang. It tells the LLM to generate text at the current position.
&lt;ul>
&lt;li>&lt;code>s += gen(&amp;quot;variable_name&amp;quot;, ...)&lt;/code>: The first parameter of &lt;code>gen()&lt;/code> is required and specifies the variable name in which the generation result will be stored in the state &lt;code>s&lt;/code>.&lt;/li>
&lt;li>&lt;code>max_tokens&lt;/code>: Limits the maximum number of tokens to generate.&lt;/li>
&lt;li>&lt;code>stop&lt;/code>: Defines one or more stop strings. When the model generates these strings, the generation process ends early.&lt;/li>
&lt;li>&lt;code>choices&lt;/code>: Provides a list of strings, forcing the model to choose one of these options for generation.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example: A Complete Frontend Function&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from sglang import function, system, user, assistant, gen, set_default_backend, OpenAI
# Set the backend to the OpenAI-compatible service provided by SGLang
set_default_backend(OpenAI(&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;))
@function
def multi_turn_qa(s, question1, question2):
s += system(&amp;quot;You are a helpful assistant.&amp;quot;)
s += user(question1)
s += assistant(gen(&amp;quot;answer1&amp;quot;, max_tokens=128))
s += user(question2)
s += assistant(gen(&amp;quot;answer2&amp;quot;, max_tokens=128))
# Execute the SGLang program
state = multi_turn_qa.run(
question1=&amp;quot;What is the capital of the UK?&amp;quot;,
question2=&amp;quot;What is its population?&amp;quot;,
temperature=0.1
)
print(&amp;quot;Answer 1:&amp;quot;, state[&amp;quot;answer1&amp;quot;])
print(&amp;quot;Answer 2:&amp;quot;, state[&amp;quot;answer2&amp;quot;])
&lt;/code>&lt;/pre>
&lt;h3 id="43-streaming-output">4.3 Streaming Output&lt;/h3>
&lt;p>For applications requiring real-time feedback, SGLang supports streaming output. Simply set &lt;code>stream=True&lt;/code> in the &lt;code>.run()&lt;/code> method and iterate over the &lt;code>.text_iter()&lt;/code> method of the returned state object.&lt;/p>
&lt;pre>&lt;code class="language-python">state = multi_turn_qa.run(
question1=&amp;quot;Write a short story about a robot.&amp;quot;,
question2=&amp;quot;Continue the story.&amp;quot;,
stream=True
)
for out in state.text_iter(&amp;quot;answer2&amp;quot;):
print(out, end=&amp;quot;&amp;quot;, flush=True)
&lt;/code>&lt;/pre>
&lt;h2 id="5-backend-service-srt-and-api-reference">5. Backend Service (SRT) and API Reference&lt;/h2>
&lt;p>SGLang's backend, the SGLang Runtime (SRT), is a high-performance inference server implemented in Python. It's responsible for loading models, managing KV caches (through RadixAttention), and handling requests from clients. SRT provides two main API endpoints.&lt;/p>
&lt;h3 id="51-native-api-generate">5.1 Native API: &lt;code>/generate&lt;/code>&lt;/h3>
&lt;p>This is a lower-level API that provides the finest control over the generation process.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Endpoint&lt;/strong>: &lt;code>POST /generate&lt;/code>&lt;/li>
&lt;li>&lt;strong>Description&lt;/strong>: Generate text starting from a given text prompt.&lt;/li>
&lt;li>&lt;strong>Core Parameters&lt;/strong>:
&lt;ul>
&lt;li>&lt;code>text&lt;/code> (string, required): The input text prompt.&lt;/li>
&lt;li>&lt;code>sampling_params&lt;/code> (object, optional): A JSON object containing sampling parameters.
&lt;ul>
&lt;li>&lt;code>temperature&lt;/code> (float): Sampling temperature.&lt;/li>
&lt;li>&lt;code>max_new_tokens&lt;/code> (int): Maximum number of new tokens to generate.&lt;/li>
&lt;li>&lt;code>stop&lt;/code> (string or list[string]): Stop tokens.&lt;/li>
&lt;li>&lt;code>json_schema&lt;/code> (string): JSON Schema string for constraining output.&lt;/li>
&lt;li>&lt;code>regex&lt;/code> (string): Regular expression for constraining output.&lt;/li>
&lt;li>&lt;code>ebnf&lt;/code> (string): EBNF grammar for constraining output.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;code>stream&lt;/code> (boolean, optional): Whether to use streaming.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example (using &lt;code>requests&lt;/code>)&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-python">import requests
import json
url = &amp;quot;http://127.0.0.1:30000/generate&amp;quot;
data = {
&amp;quot;text&amp;quot;: &amp;quot;The capital of France is&amp;quot;,
&amp;quot;sampling_params&amp;quot;: {
&amp;quot;temperature&amp;quot;: 0,
&amp;quot;max_new_tokens&amp;quot;: 16,
}
}
response = requests.post(url, json=data)
print(response.json())
# {'text': ' Paris.\n\nThe capital of France is Paris. It is the most populous city in', 'meta': ...}
&lt;/code>&lt;/pre>
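When `stream` is set to true, the `/generate` endpoint emits incremental results as server-sent-event lines. Below is a hedged sketch of a parser for such a stream; the exact wire format can differ between SGLang versions, and the `data: [DONE]` terminator is an assumption borrowed from the OpenAI streaming convention:

```python
import json

def parse_sse_chunks(lines):
    """Yield decoded JSON payloads from 'data: ...' lines of an event stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments, keep-alives, and blank separators
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)

# Example lines as they might arrive over the wire (illustrative content):
sample = [
    'data: {"text": " Paris"}',
    'data: {"text": " Paris."}',
    'data: [DONE]',
]
for chunk in parse_sse_chunks(sample):
    print(chunk["text"])
```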
&lt;h3 id="52-openai-compatible-api-v1chatcompletions">5.2 OpenAI Compatible API: &lt;code>/v1/chat/completions&lt;/code>&lt;/h3>
&lt;p>For easy migration and integration, SGLang provides a chat completion API fully compatible with OpenAI. You can seamlessly use OpenAI's official client library.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Endpoint&lt;/strong>: &lt;code>POST /v1/chat/completions&lt;/code>&lt;/li>
&lt;li>&lt;strong>Description&lt;/strong>: Perform chat-style text generation.&lt;/li>
&lt;li>&lt;strong>Core Parameters&lt;/strong>:
&lt;ul>
&lt;li>&lt;code>model&lt;/code> (string, required): The name of the model.&lt;/li>
&lt;li>&lt;code>messages&lt;/code> (list[object], required): List of conversation messages.&lt;/li>
&lt;li>&lt;code>temperature&lt;/code>, &lt;code>max_tokens&lt;/code>, &lt;code>stream&lt;/code>, etc.&lt;/li>
&lt;li>&lt;code>response_format&lt;/code> (object, optional): For specifying structured output, such as &lt;code>{&amp;quot;type&amp;quot;: &amp;quot;json_schema&amp;quot;, &amp;quot;json_schema&amp;quot;: ...}&lt;/code>.&lt;/li>
&lt;li>&lt;code>extra_body&lt;/code> (object, optional): SGLang-specific extension parameters, such as &lt;code>{&amp;quot;regex&amp;quot;: &amp;quot;...&amp;quot;}&lt;/code> or &lt;code>{&amp;quot;ebnf&amp;quot;: &amp;quot;...&amp;quot;}&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example (using the &lt;code>openai&lt;/code> library)&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-python">import openai
client = openai.Client(base_url=&amp;quot;http://127.0.0.1:30000/v1&amp;quot;, api_key=&amp;quot;EMPTY&amp;quot;)
response = client.chat.completions.create(
model=&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;List 3 countries and their capitals.&amp;quot;}],
temperature=0,
max_tokens=64,
)
print(response.choices[0].message.content)
&lt;/code>&lt;/pre>
&lt;h2 id="6-advanced-usage-function-callingtool-usage">6. Advanced Usage: Function Calling/Tool Usage&lt;/h2>
&lt;p>SGLang's powerful programming model makes it very suitable for building AI agents capable of calling external tools. This is typically achieved through structured output, where the model is guided to generate text in a specific format (usually JSON) describing a function call.&lt;/p>
&lt;p>Here are the steps to build a simple weather query agent:&lt;/p>
&lt;p>&lt;strong>1. Define Tool Schema&lt;/strong>&lt;/p>
&lt;p>First, use JSON Schema to define your tool. This tells the model the name of the tool, its purpose, and what parameters it needs.&lt;/p>
&lt;pre>&lt;code class="language-python">tools = [
{
&amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;,
&amp;quot;function&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;get_current_weather&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;Get the current weather in a given location&amp;quot;,
&amp;quot;parameters&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
&amp;quot;properties&amp;quot;: {
&amp;quot;city&amp;quot;: {&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;, &amp;quot;description&amp;quot;: &amp;quot;The city name&amp;quot;},
&amp;quot;unit&amp;quot;: {&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;, &amp;quot;enum&amp;quot;: [&amp;quot;celsius&amp;quot;, &amp;quot;fahrenheit&amp;quot;]},
},
&amp;quot;required&amp;quot;: [&amp;quot;city&amp;quot;, &amp;quot;unit&amp;quot;],
},
},
}
]
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>2. Guide the Model to Make Function Calls&lt;/strong>&lt;/p>
&lt;p>In the &lt;code>messages&lt;/code> sent to the model, include a system prompt indicating that the model can use these tools. Then, pass &lt;code>tools&lt;/code> and &lt;code>tool_choice=&amp;quot;auto&amp;quot;&lt;/code> in the API call.&lt;/p>
&lt;pre>&lt;code class="language-python">import json
messages = [
{&amp;quot;role&amp;quot;: &amp;quot;system&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;You are a helpful assistant that can access external tools.&amp;quot;},
{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;What's the weather like in Boston in fahrenheit?&amp;quot;}
]
response = client.chat.completions.create(
model=&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;,
messages=messages,
tools=tools,
tool_choice=&amp;quot;auto&amp;quot;,
)
# Check if the model decided to call a tool
response_message = response.choices[0].message
tool_calls = response_message.tool_calls
if tool_calls:
# Model decided to call a tool
for tool_call in tool_calls:
function_name = tool_call.function.name
function_args = json.loads(tool_call.function.arguments)
print(f&amp;quot;Function Call: {function_name}&amp;quot;)
print(f&amp;quot;Arguments: {function_args}&amp;quot;)
# Here, you could actually execute the function call
# e.g., result = get_current_weather(**function_args)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Output:&lt;/strong>&lt;/p>
&lt;pre>&lt;code>Function Call: get_current_weather
Arguments: {'city': 'Boston', 'unit': 'fahrenheit'}
&lt;/code>&lt;/pre>
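To close the loop shown in the sequence diagram, the application executes the function locally and sends the result back in a `tool` role message so the model can compose the final natural-language reply. A sketch of that step; the `get_current_weather` body is a stub, and in the full OpenAI protocol the assistant message containing `tool_calls` is also appended before the tool result (omitted here for brevity):

```python
import json

def get_current_weather(city: str, unit: str) -> str:
    """Stub implementation; replace with a real weather API call."""
    return json.dumps({"city": city, "temperature": "68", "unit": unit})

def append_tool_result(messages, tool_call_id, function_name, result):
    """Build the follow-up message list for the second chat completion request."""
    return messages + [
        {"role": "tool", "tool_call_id": tool_call_id,
         "name": function_name, "content": result},
    ]

messages = [{"role": "user", "content": "What's the weather like in Boston in fahrenheit?"}]
result = get_current_weather(city="Boston", unit="fahrenheit")
followup = append_tool_result(messages, "call_0", "get_current_weather", result)

# `followup` is then passed as `messages` in a second
# client.chat.completions.create(...) call to obtain the final reply.
print(followup[-1]["role"])  # tool
```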
&lt;p>In this way, you can build powerful AI applications capable of interacting with the external world.&lt;/p></description></item><item><title>Llama.cpp Technical Guide: Lightweight LLM Inference Engine</title><link>https://ziyanglin.netlify.app/en/post/llama-cpp-documentation/</link><pubDate>Thu, 26 Jun 2025 01:06:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/llama-cpp-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>Llama.cpp is a high-performance, lightweight inference framework for large language models (LLMs) written in C/C++. It focuses on efficiently running LLMs on consumer-grade hardware, making local inference possible on ordinary laptops and even smartphones.&lt;/p>
&lt;p>&lt;strong>Core Advantages:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High Performance:&lt;/strong> Achieves extremely fast inference speeds through optimized C/C++ code, quantization techniques, and hardware acceleration support (such as Apple Metal, CUDA, OpenCL, SYCL).&lt;/li>
&lt;li>&lt;strong>Lightweight:&lt;/strong> Extremely low memory and computational resource consumption, eliminating the need for expensive GPUs.&lt;/li>
&lt;li>&lt;strong>Cross-Platform:&lt;/strong> Supports multiple platforms including macOS, Linux, Windows, Docker, Android, and iOS.&lt;/li>
&lt;li>&lt;strong>Open Ecosystem:&lt;/strong> Features an active community and rich ecosystem, including Python bindings, UI tools, and OpenAI-compatible servers.&lt;/li>
&lt;li>&lt;strong>Continuous Innovation:&lt;/strong> Quickly follows and implements the latest model architectures and inference optimization techniques.&lt;/li>
&lt;/ul>
&lt;h2 id="2-core-concepts">2. Core Concepts&lt;/h2>
&lt;h3 id="21-gguf-model-format">2.1. GGUF Model Format&lt;/h3>
&lt;p>GGUF (Georgi Gerganov Universal Format) is the core model file format used by &lt;code>llama.cpp&lt;/code>, an evolution of its predecessor GGML. GGUF is a binary format designed for fast loading and memory mapping.&lt;/p>
&lt;p>&lt;strong>Key Features:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Unified File:&lt;/strong> Packages model metadata, vocabulary, and all tensors (weights) in a single file.&lt;/li>
&lt;li>&lt;strong>Extensibility:&lt;/strong> Allows adding new metadata without breaking compatibility.&lt;/li>
&lt;li>&lt;strong>Backward Compatibility:&lt;/strong> Guarantees compatibility with older versions of GGUF models.&lt;/li>
&lt;li>&lt;strong>Memory Efficiency:&lt;/strong> Supports memory mapping (mmap), allowing multiple processes to share the same model weights, thereby saving memory.&lt;/li>
&lt;/ul>
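The fixed-size header is simple enough to inspect by hand. Per the GGUF specification, a file begins with the 4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata-entry counts, all little-endian. A minimal header reader over a synthetic header (real files continue with metadata key-value pairs, which are not parsed here):

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed 24-byte GGUF header: magic, version, counts."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}

# Synthetic header: version 3, 291 tensors, 24 metadata entries.
header = b"GGUF" + struct.pack("<IQQ", 3, 291, 24)
print(read_gguf_header(header))
```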
&lt;h3 id="22-quantization">2.2. Quantization&lt;/h3>
&lt;p>Quantization is one of the core advantages of &lt;code>llama.cpp&lt;/code>. It is a technique that converts model weights from high-precision floating-point numbers (such as 32-bit or 16-bit) to low-precision integers (such as 4-bit, 5-bit, or 8-bit).&lt;/p>
&lt;p>&lt;strong>Main Benefits:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Reduced Model Size:&lt;/strong> Significantly reduces the size of model files, making them easier to distribute and store.&lt;/li>
&lt;li>&lt;strong>Lower Memory Usage:&lt;/strong> Reduces the RAM required to load the model into memory.&lt;/li>
&lt;li>&lt;strong>Faster Inference:&lt;/strong> Low-precision calculations are typically faster than high-precision ones, especially on CPUs.&lt;/li>
&lt;/ul>
&lt;p>&lt;code>llama.cpp&lt;/code> supports various quantization methods, particularly &lt;strong>k-quants&lt;/strong>, an advanced quantization technique that achieves extremely high compression rates while maintaining high model performance.&lt;/p>
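The savings are easy to estimate from the GGML block layouts: an FP16 weight costs 16 bits, while a Q4_0 block packs 32 weights into 18 bytes (16 bytes of 4-bit values plus a 2-byte FP16 scale), i.e. 4.5 bits per weight, and Q8_0 uses 34 bytes per 32 weights (8.5 bits per weight). For a 7B-parameter model:

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9
print(f"FP16 : {model_size_gb(n, 16):.1f} GB")   # ~14.0 GB
print(f"Q8_0 : {model_size_gb(n, 8.5):.1f} GB")  # ~7.4 GB
print(f"Q4_0 : {model_size_gb(n, 4.5):.1f} GB")  # ~3.9 GB
```

These figures cover weights only; the KV cache and activations add further runtime memory on top.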
&lt;h3 id="23-multimodal-support">2.3. Multimodal Support&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> is not limited to text models; it has evolved into a powerful multimodal inference engine that supports processing text, images, and even audio simultaneously.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Supported Models:&lt;/strong> Supports various mainstream multimodal models such as LLaVA, MobileVLM, Granite, Qwen2.5 Omni, InternVL, SmolVLM, etc.&lt;/li>
&lt;li>&lt;strong>Working Principle:&lt;/strong> Typically converts images into embedding vectors through a vision encoder (such as CLIP), and then inputs these vectors along with text embedding vectors into the LLM.&lt;/li>
&lt;li>&lt;strong>Tools:&lt;/strong> &lt;code>llama-mtmd-cli&lt;/code> and &lt;code>llama-server&lt;/code> provide native support for multimodal models.&lt;/li>
&lt;/ul>
&lt;h2 id="3-usage-methods">3. Usage Methods&lt;/h2>
&lt;h3 id="31-compilation">3.1. Compilation&lt;/h3>
&lt;p>Compiling &lt;code>llama.cpp&lt;/code> from source is very simple.&lt;/p>
&lt;pre>&lt;code class="language-bash">git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
make
&lt;/code>&lt;/pre>
&lt;p>For specific hardware acceleration (such as CUDA or Metal), use the corresponding compilation options:&lt;/p>
&lt;pre>&lt;code class="language-bash"># For CUDA
make LLAMA_CUDA=1
# For Metal (on macOS)
make LLAMA_METAL=1
&lt;/code>&lt;/pre>
&lt;h3 id="32-basic-inference">3.2. Basic Inference&lt;/h3>
&lt;p>After compilation, you can use the &lt;code>llama-cli&lt;/code> tool for inference.&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m ./models/7B/ggml-model-q4_0.gguf -p &amp;quot;Building a website can be done in 10 simple steps:&amp;quot; -n 400
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;code>-m&lt;/code>: Specifies the path to the GGUF model file.&lt;/li>
&lt;li>&lt;code>-p&lt;/code>: Specifies the prompt.&lt;/li>
&lt;li>&lt;code>-n&lt;/code>: Specifies the maximum number of tokens to generate.&lt;/li>
&lt;/ul>
&lt;h3 id="33-openai-compatible-server">3.3. OpenAI Compatible Server&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> provides a built-in HTTP server with an API compatible with OpenAI's API. This makes it easy to integrate with existing tools like LangChain and LlamaIndex.&lt;/p>
&lt;p>Starting the server:&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-server -m models/7B/ggml-model-q4_0.gguf -c 4096
&lt;/code>&lt;/pre>
&lt;p>You can then send requests to &lt;code>http://localhost:8080/v1/chat/completions&lt;/code> just like you would with the OpenAI API.&lt;/p>
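As a hedged sketch, such a request can be built with only the standard library. The model name is a placeholder (a single-model `llama-server` serves whatever GGUF it loaded); adjust the host and port to your setup:

```python
import json
import urllib.request

payload = {
    "model": "local-model",  # placeholder; the server uses its loaded GGUF
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.7,
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the server running, send it and read the reply:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.get_method())  # POST (urllib infers POST when a body is attached)
```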
&lt;h2 id="4-advanced-features">4. Advanced Features&lt;/h2>
&lt;h3 id="41-speculative-decoding">4.1. Speculative Decoding&lt;/h3>
&lt;p>This is an advanced inference optimization technique that significantly accelerates generation speed by using a small &amp;ldquo;draft&amp;rdquo; model to predict the output of the main model.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Working Principle:&lt;/strong> The draft model quickly generates a draft token sequence, which is then validated all at once by the main model. If validated, it saves the time of generating tokens one by one.&lt;/li>
&lt;li>&lt;strong>Usage:&lt;/strong> Use the &lt;code>--model-draft&lt;/code> (&lt;code>-md&lt;/code>) parameter in &lt;code>llama-cli&lt;/code> or &lt;code>llama-server&lt;/code> to specify a small, fast draft model.&lt;/li>
&lt;/ul>
&lt;h3 id="42-lora-support">4.2. LoRA Support&lt;/h3>
&lt;p>LoRA (Low-Rank Adaptation) allows fine-tuning a model's behavior by training a small adapter without modifying the original model weights. &lt;code>llama.cpp&lt;/code> supports loading one or more LoRA adapters during inference.&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m base-model.gguf --lora lora-adapter.gguf
&lt;/code>&lt;/pre>
&lt;p>You can even set different weights for different LoRA adapters:&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m base.gguf --lora-scaled lora_A.gguf 0.5 --lora-scaled lora_B.gguf 0.5
&lt;/code>&lt;/pre>
&lt;h3 id="43-grammars">4.3. Grammars&lt;/h3>
&lt;p>Grammars are a very powerful feature that allows you to force the model's output to follow a specific format, such as a strict JSON schema.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Format:&lt;/strong> Uses a format called GBNF (GGML BNF) to define grammar rules.&lt;/li>
&lt;li>&lt;strong>Application:&lt;/strong> By providing GBNF rules through the &lt;code>grammar&lt;/code> parameter in API requests, you can ensure that the model returns correctly formatted, directly parsable JSON data, avoiding output format errors and tedious post-processing.&lt;/li>
&lt;/ul>
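&lt;p>As a minimal sketch, a request body carrying a tiny GBNF grammar might look like this (the grammar and field values are illustrative; the &lt;code>grammar&lt;/code> field is the one the server reads):&lt;/p>

```python
import json

# A tiny GBNF grammar: the model may only answer "yes" or "no".
grammar = 'root ::= "yes" | "no"'

# Request body for the llama.cpp server's completion endpoint.
payload = {
    "prompt": "Is water wet? Answer yes or no.",
    "grammar": grammar,
    "n_predict": 4,
}
body = json.dumps(payload)
print(body)
```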
&lt;p>&lt;strong>Example:&lt;/strong> Using a Pydantic model to generate a JSON Schema, then converting it to GBNF to ensure the model output conforms to the expected Python object structure.&lt;/p>
&lt;pre>&lt;code class="language-python">import json
from typing import List
from pydantic import BaseModel
class QAPair(BaseModel):
question: str
answer: str
class Summary(BaseModel):
key_facts: List[str]
qa_pairs: List[QAPair]
# Generate JSON Schema and print
schema = Summary.model_json_schema()
print(json.dumps(schema, indent=2))
&lt;/code>&lt;/pre>
&lt;h2 id="5-ecosystem">5. Ecosystem&lt;/h2>
&lt;p>The success of &lt;code>llama.cpp&lt;/code> has spawned a vibrant ecosystem:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://github.com/abetlen/llama-cpp-python">llama-cpp-python&lt;/a>:&lt;/strong> The most popular Python binding, providing interfaces to almost all features of &lt;code>llama.cpp&lt;/code> and deeply integrated with frameworks like LangChain and LlamaIndex.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://ollama.com/">Ollama&lt;/a>:&lt;/strong> A tool for packaging, distributing, and running models, using &lt;code>llama.cpp&lt;/code> under the hood, greatly simplifying the process of running LLMs locally.&lt;/li>
&lt;li>&lt;strong>Numerous UI Tools:&lt;/strong> The community has developed a large number of graphical interface tools, allowing non-technical users to easily interact with local models.&lt;/li>
&lt;/ul>
&lt;h2 id="6-conclusion">6. Conclusion&lt;/h2>
&lt;p>&lt;code>llama.cpp&lt;/code> is not just an inference engine; it has become a key force in driving the localization and popularization of LLMs. Through its excellent performance, highly optimized resource usage, and continuously expanding feature set (such as multimodality and grammar constraints), &lt;code>llama.cpp&lt;/code> provides developers and researchers with a powerful and flexible platform, enabling them to explore and deploy AI applications on various devices, ushering in a new era of low-cost, privacy-protecting local AI.&lt;/p></description></item><item><title>vLLM Technical Guide: High-Performance LLM Inference Engine</title><link>https://ziyanglin.netlify.app/en/post/vllm-documentation/</link><pubDate>Thu, 26 Jun 2025 01:05:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/vllm-documentation/</guid><description>&lt;h2 id="1-introduction-to-vllm">1. Introduction to vLLM&lt;/h2>
&lt;p>vLLM is an open-source inference and serving engine designed for large language models (LLMs), renowned for its high throughput and memory efficiency. In the field of LLM serving, vLLM addresses a core pain point: traditional inference systems are inefficient when handling the key-value cache (KV Cache) in Transformer models&amp;rsquo; attention mechanism, resulting in significant memory waste and limited inference speed.&lt;/p>
&lt;p>The memory bottleneck in LLM inference primarily stems from the KV Cache. This cache stores the attention keys and values of every previous token in a sequence to accelerate the generation of subsequent tokens. However, the size of the KV Cache is dynamic and hard to predict, which makes memory management challenging. Traditional systems (like HuggingFace Transformers) typically pre-allocate a large contiguous memory region for the KV Cache, leading to severe memory fragmentation and waste.&lt;/p>
&lt;p>vLLM fundamentally solves this problem by introducing its core innovation: the &lt;strong>PagedAttention&lt;/strong> mechanism.&lt;/p>
&lt;h2 id="2-core-features-and-advantages">2. Core Features and Advantages&lt;/h2>
&lt;p>vLLM stands out among numerous LLM inference frameworks thanks to several key features:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Extremely High Throughput&lt;/strong>: Through PagedAttention and Continuous Batching, vLLM significantly improves GPU utilization. Its throughput is several times higher than HuggingFace Transformers and outperforms other mainstream inference libraries.&lt;/li>
&lt;li>&lt;strong>Efficient Memory Management&lt;/strong>: The PagedAttention mechanism divides the KV Cache into fixed-size blocks stored in non-contiguous memory, greatly reducing internal and external memory fragmentation. According to official data, it can save up to 55% of memory, meaning you can load larger models or serve more concurrent requests on the same hardware.&lt;/li>
&lt;li>&lt;strong>Flexible Decoding Strategies&lt;/strong>: vLLM supports various complex decoding algorithms, including Parallel Sampling, Beam Search, and Top-K/Top-P sampling, meeting the needs of different application scenarios.&lt;/li>
&lt;li>&lt;strong>OpenAI API Compatibility&lt;/strong>: vLLM provides a service endpoint that is fully compatible with the OpenAI API. This means you can seamlessly integrate vLLM into existing application ecosystems built on the OpenAI API with just a few configuration changes.&lt;/li>
&lt;li>&lt;strong>Distributed Inference&lt;/strong>: For ultra-large models that cannot fit on a single GPU, vLLM supports Tensor Parallelism, distributing model weights and computational load across multiple GPUs for efficient distributed inference.&lt;/li>
&lt;li>&lt;strong>Streaming and Structured Output&lt;/strong>: Supports streaming of generated tokens and can produce structured outputs in specific formats (such as JSON Schema or regular expressions) through Guided Generation.&lt;/li>
&lt;/ul>
&lt;h2 id="3-core-architecture-deep-dive-into-pagedattention">3. Core Architecture: Deep Dive into PagedAttention&lt;/h2>
&lt;p>PagedAttention is the soul of vLLM, with its design inspiration coming from the paging technique used in modern operating systems to manage virtual memory.&lt;/p>
&lt;h3 id="31-working-principle">3.1 Working Principle&lt;/h3>
&lt;p>In traditional methods, the KV Cache for each sequence is stored in contiguous memory. While this approach seems simple, the vast differences in sequence lengths lead to severe memory fragmentation.&lt;/p>
&lt;p>PagedAttention divides each sequence's KV Cache into fixed-size &lt;strong>blocks&lt;/strong>. Each block can store keys and values for a fixed number of tokens. During inference, vLLM's core scheduler dynamically allocates these blocks to sequences as needed.&lt;/p>
&lt;p>The advantages of this design include:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Minimal Internal Fragmentation&lt;/strong>: Since blocks are of fixed size, only a sequence's last block may contain unused space, and this waste is far less than that caused by reserving contiguous memory for the entire sequence up front.&lt;/li>
&lt;li>&lt;strong>Flexible Memory Allocation&lt;/strong>: Blocks are stored in non-contiguous memory, making memory management more flexible, similar to how operating systems manage physical memory pages.&lt;/li>
&lt;li>&lt;strong>Efficient Memory Sharing&lt;/strong>: PagedAttention makes sharing KV Cache between different sequences exceptionally simple and efficient. For example, in parallel sampling or beam search, multiple candidate sequences originate from the same prompt. vLLM allows these sequences to share KV blocks storing the prompt portion, only needing to allocate new, independent blocks for each sequence when generating new tokens. This &amp;ldquo;Copy-on-Write&amp;rdquo; mechanism greatly reduces the memory overhead of complex decoding algorithms.&lt;/li>
&lt;/ol>
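&lt;p>The bookkeeping behind this sharing can be sketched in a few lines of Python (a simplified model of block tables and reference counts, not vLLM's actual implementation):&lt;/p>

```python
# Simplified PagedAttention bookkeeping: each sequence maps logical block
# indices to physical blocks; shared prompt blocks are copied on write.

class BlockManager:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # free physical block ids
        self.ref_count = {}                  # physical block -> #owners

    def allocate(self):
        block = self.free.pop(0)
        self.ref_count[block] = 1
        return block

    def fork(self, block_table):
        # Share all blocks of a parent sequence (e.g. parallel sampling).
        for b in block_table:
            self.ref_count[b] += 1
        return list(block_table)

    def copy_on_write(self, block_table, i):
        # Before a shared block is written, give the writer a private copy.
        b = block_table[i]
        if self.ref_count[b] > 1:
            self.ref_count[b] -= 1
            block_table[i] = self.allocate()
        return block_table[i]

mgr = BlockManager(num_blocks=8)
parent = [mgr.allocate(), mgr.allocate()]  # prompt occupies blocks 0 and 1
child = mgr.fork(parent)                   # a sampling branch shares both
mgr.copy_on_write(child, 1)                # child writes -> private copy
print(parent, child)
```

&lt;p>The fork shares the prompt blocks for free; only when a branch writes does it receive a private block, which is exactly the copy-on-write behavior illustrated for the shared block B3 in the diagram below.&lt;/p>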
&lt;p>Below is a Mermaid diagram that more intuitively illustrates PagedAttention's memory management approach:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph Physical_Memory [KV Cache Physical Memory]
direction LR
B1(Block 1)
B2(Block 2)
B3(Block 3)
B4(Block 4)
B5(Block 5)
B6(Block 6)
B7(Block 7)
B8(Block 8)
end
subgraph Logical_View [Sequence Logical View]
direction TB
subgraph Seq1 [Sequence 1]
P1(Prompt) --&amp;gt; T1(Token 1)
end
subgraph Seq2 [Sequence 2]
P2(Prompt) --&amp;gt; T2(Token 1) --&amp;gt; T3(Token 2)
end
subgraph Seq3 [Parallel Sampling]
P3(Prompt) --&amp;gt; T4(Token 1a)
P3 --&amp;gt; T5(Token 1b)
end
end
subgraph Block_Table [Block Table]
direction TB
Map1[&amp;quot;Seq 1: [B1, B5]&amp;quot;]
Map2[&amp;quot;Seq 2: [B2, B6, B8]&amp;quot;]
Map3[&amp;quot;Seq 3a: [B3, B7]&amp;quot;]
Map4[&amp;quot;Seq 3b: [B3, B4]&amp;quot;]
end
Seq1 --&amp;gt; Map1
Seq2 --&amp;gt; Map2
Seq3 --&amp;gt; Map3
Seq3 --&amp;gt; Map4
Map1 --&amp;gt; B1
Map1 --&amp;gt; B5
Map2 --&amp;gt; B2
Map2 --&amp;gt; B6
Map2 --&amp;gt; B8
Map3 --&amp;gt; B3
Map3 --&amp;gt; B7
Map4 --&amp;gt; B3
Map4 --&amp;gt; B4
style B3 fill:#f9f,stroke:#333,stroke-width:2px
linkStyle 8 stroke-width:2px,stroke:green,fill:none;
linkStyle 11 stroke-width:2px,stroke:green,fill:none;
linkStyle 12 stroke-width:2px,stroke:green,fill:none;
&lt;/code>&lt;/pre>
&lt;p>&lt;em>Diagram explanation:&lt;/em>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>KV Cache Physical Memory&lt;/strong>: Represents non-contiguous physical memory blocks on the GPU.&lt;/li>
&lt;li>&lt;strong>Sequence Logical View&lt;/strong>: Represents multiple requests (sequences) being processed.&lt;/li>
&lt;li>&lt;strong>Block Table&lt;/strong>: vLLM's core component that maps logical token positions to physical memory blocks.&lt;/li>
&lt;li>&lt;strong>Memory Sharing&lt;/strong>: Note that the two branches in &amp;ldquo;Parallel Sampling&amp;rdquo; (3a and 3b) share the same Prompt block (B3), demonstrating PagedAttention's efficient memory sharing.&lt;/li>
&lt;/ul>
&lt;h3 id="32-continuous-batching">3.2 Continuous Batching&lt;/h3>
&lt;p>Based on PagedAttention, vLLM implements a more advanced batching strategy: continuous batching. Traditional static batching must wait for every sequence in a batch to finish before the next batch starts. Continuous batching instead admits new requests the moment a sequence in the batch completes, avoiding idle GPU time and further improving throughput.&lt;/p>
&lt;p>Below is a comparison of the two batching methods using a Mermaid sequence diagram:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant C as Client
participant S as Server
participant G as GPU
note over C, G: --- Static Batching ---
C-&amp;gt;&amp;gt;S: Request [R1, R2, R3, R4]
S-&amp;gt;&amp;gt;G: Process Batch 1 [R1, R2, R3, R4]
note right of G: All requests process in parallel
G--&amp;gt;&amp;gt;S: Batch 1 Finished
note right of S: Wait for the entire batch to complete
S--&amp;gt;&amp;gt;C: Response [O1, O2, O3, O4]
C-&amp;gt;&amp;gt;S: Request [R5, R6]
S-&amp;gt;&amp;gt;G: Process Batch 2 [R5, R6]
note over C, G: --- Continuous Batching ---
C-&amp;gt;&amp;gt;S: Request [R1, R2, R3, R4]
S-&amp;gt;&amp;gt;G: Process [R1, R2, R3, R4]
G--&amp;gt;&amp;gt;S: R2 Finished
S--&amp;gt;&amp;gt;C: Response O2
C-&amp;gt;&amp;gt;S: New Request R5
S-&amp;gt;&amp;gt;G: Add R5 to queue (GPU is not idle)
note right of G: R1, R3, R4, R5 are now processing
G--&amp;gt;&amp;gt;S: R4 Finished
S--&amp;gt;&amp;gt;C: Response O4
&lt;/code>&lt;/pre>
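&lt;p>The difference can also be quantified with a toy scheduler (illustrative only; a real scheduler must also respect KV-cache memory limits):&lt;/p>

```python
# Toy comparison: total GPU steps needed to finish all requests.
# Each request needs `length` decode steps; at most `batch` sequences
# run per step.

def static_batching_steps(lengths, batch):
    steps = 0
    for i in range(0, len(lengths), batch):
        steps += max(lengths[i:i + batch])  # whole batch waits for the longest
    return steps

def continuous_batching_steps(lengths, batch):
    pending, running, steps = list(lengths), [], 0
    while pending or running:
        while pending and len(running) != batch:
            running.append(pending.pop(0))  # join as soon as a slot frees up
        running = [r - 1 for r in running]
        running = [r for r in running if r]  # finished sequences leave at once
        steps += 1
    return steps

lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print(static_batching_steps(lengths, batch=4))      # -> 16
print(continuous_batching_steps(lengths, batch=4))  # -> 10
```

&lt;p>For the same workload, continuous batching finishes in 10 GPU steps instead of 16, because short requests never hold a slot while waiting for long ones to complete.&lt;/p>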
&lt;h2 id="4-quick-start-guide">4. Quick Start Guide&lt;/h2>
&lt;p>Below, we'll demonstrate how to install and use vLLM through a few simple steps.&lt;/p>
&lt;h3 id="41-installation">4.1 Installation&lt;/h3>
&lt;p>You can install vLLM using either &lt;code>pip&lt;/code> or &lt;code>uv&lt;/code> (a faster package installation tool). Using &lt;code>uv&lt;/code> is recommended as it can automatically detect your CUDA version and install the matching PyTorch backend.&lt;/p>
&lt;p>&lt;strong>Using uv (recommended):&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash"># Create and activate a virtual environment
uv venv
source .venv/bin/activate
# Install vLLM
uv pip install vllm --torch-backend=auto
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Using pip:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install vllm
&lt;/code>&lt;/pre>
&lt;h3 id="42-offline-inference">4.2 Offline Inference&lt;/h3>
&lt;p>The &lt;code>vllm.LLM&lt;/code> class makes offline inference very convenient.&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM, SamplingParams
# Define input prompts
prompts = [
&amp;quot;Hello, my name is&amp;quot;,
&amp;quot;The capital of France is&amp;quot;,
&amp;quot;The future of AI is&amp;quot;,
]
# Define sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Initialize the LLM engine (model will be automatically downloaded from Hugging Face)
llm = LLM(model=&amp;quot;facebook/opt-125m&amp;quot;)
# Generate text
outputs = llm.generate(prompts, sampling_params)
# Print results
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f&amp;quot;Prompt: {prompt!r}, Generated text: {generated_text!r}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h3 id="43-launching-an-openaicompatible-server">4.3 Launching an OpenAI-Compatible Server&lt;/h3>
&lt;p>One of vLLM's most powerful features is its built-in API server. With just one command, you can start a service compatible with the OpenAI API.&lt;/p>
&lt;pre>&lt;code class="language-bash">vllm serve Qwen/Qwen2.5-1.5B-Instruct
&lt;/code>&lt;/pre>
&lt;p>By default, the server will run on &lt;code>http://localhost:8000&lt;/code>.&lt;/p>
&lt;h3 id="44-interacting-with-the-server">4.4 Interacting with the Server&lt;/h3>
&lt;p>You can interact with the server using &lt;code>curl&lt;/code> or the &lt;code>openai&lt;/code> Python client.&lt;/p>
&lt;p>&lt;strong>Using curl:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">curl http://localhost:8000/v1/completions \
-H &amp;quot;Content-Type: application/json&amp;quot; \
-d '{
&amp;quot;model&amp;quot;: &amp;quot;Qwen/Qwen2.5-1.5B-Instruct&amp;quot;,
&amp;quot;prompt&amp;quot;: &amp;quot;San Francisco is a&amp;quot;,
&amp;quot;max_tokens&amp;quot;: 7,
&amp;quot;temperature&amp;quot;: 0
}'
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Using the OpenAI Python client:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from openai import OpenAI
client = OpenAI(
base_url=&amp;quot;http://localhost:8000/v1&amp;quot;,
api_key=&amp;quot;not-used&amp;quot; # API key is not required
)
completion = client.chat.completions.create(
model=&amp;quot;Qwen/Qwen2.5-1.5B-Instruct&amp;quot;,
messages=[
{&amp;quot;role&amp;quot;: &amp;quot;system&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;You are a helpful assistant.&amp;quot;},
{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Who won the world series in 2020?&amp;quot;}
]
)
print(completion.choices[0].message)
&lt;/code>&lt;/pre>
&lt;h2 id="5-model-serving">5. Model Serving&lt;/h2>
&lt;h3 id="51-distributed-serving">5.1 Distributed Serving&lt;/h3>
&lt;p>If a model is too large to fit on a single GPU, you can distribute it across multiple GPUs using tensor parallelism.&lt;/p>
&lt;pre>&lt;code class="language-bash"># Start a service on 4 GPUs
vllm serve facebook/opt-13b --tensor-parallel-size 4
&lt;/code>&lt;/pre>
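&lt;p>The mechanism behind tensor parallelism can be sketched without GPU code: each &amp;ldquo;device&amp;rdquo; holds a column slice of a weight matrix, computes its partial output, and the slices are gathered back together (a toy model of the idea, not vLLM's implementation):&lt;/p>

```python
def matmul(x, w_cols):
    # x: input vector; w_cols: weight matrix stored column by column.
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in w_cols]

# A 3x4 weight matrix stored as 4 columns of length 3.
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
x = [2.0, 3.0, 4.0]

full = matmul(x, W)  # single-device reference result

# Tensor parallelism: two "devices" each hold half of the columns,
# compute their partial outputs, and the results are concatenated
# (the all-gather step in a real multi-GPU setup).
dev0, dev1 = W[:2], W[2:]
parallel = matmul(x, dev0) + matmul(x, dev1)
print(full, parallel)
```

&lt;p>Each device only needs to store its own column slice, which is why a model too large for one GPU's memory can still be served across several.&lt;/p>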
&lt;h3 id="52-docker-deployment">5.2 Docker Deployment&lt;/h3>
&lt;p>vLLM provides official Docker images for convenient containerized deployment.&lt;/p>
&lt;pre>&lt;code class="language-bash">docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env &amp;quot;HUGGING_FACE_HUB_TOKEN=&amp;lt;your-hf-token&amp;gt;&amp;quot; \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-v0.1
&lt;/code>&lt;/pre>
&lt;h2 id="6-advanced-features">6. Advanced Features&lt;/h2>
&lt;h3 id="61-structured-outputs">6.1 Structured Outputs&lt;/h3>
&lt;p>vLLM supports various ways to constrain the model's output format, which is crucial for applications requiring reliable, parsable outputs.&lt;/p>
&lt;p>&lt;strong>Generating JSON using Pydantic models:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from pydantic import BaseModel
from openai import OpenAI
client = OpenAI(base_url=&amp;quot;http://localhost:8000/v1&amp;quot;, api_key=&amp;quot;dummy&amp;quot;)
model = client.models.list().data[0].id
class People(BaseModel):
name: str
age: int
completion = client.chat.completions.create(
model=model,
messages=[
{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Generate a JSON with the name and age of one random person.&amp;quot;}
],
response_format={
&amp;quot;type&amp;quot;: &amp;quot;json_schema&amp;quot;,
&amp;quot;json_schema&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;people&amp;quot;,
&amp;quot;schema&amp;quot;: People.model_json_schema()
}
},
)
print(completion.choices[0].message.content)
&lt;/code>&lt;/pre>
&lt;h3 id="62-lora-support">6.2 LoRA Support&lt;/h3>
&lt;p>vLLM can efficiently serve multiple LoRA adapters on the same base model. This is particularly useful for scenarios requiring customized models for different customers or tasks.&lt;/p>
&lt;p>&lt;strong>Enabling LoRA support (shown here for the offline engine; &lt;code>vllm serve&lt;/code> accepts the equivalent &lt;code>--enable-lora&lt;/code> flag):&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM
llm = LLM(model=&amp;quot;meta-llama/Llama-2-7b-hf&amp;quot;, enable_lora=True)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Specifying a LoRA adapter in a request:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">curl http://localhost:8000/v1/completions \
-H &amp;quot;Content-Type: application/json&amp;quot; \
-d '{
&amp;quot;model&amp;quot;: &amp;quot;sql-lora&amp;quot;, # Specify the LoRA model ID
&amp;quot;prompt&amp;quot;: &amp;quot;San Francisco is a&amp;quot;,
&amp;quot;max_tokens&amp;quot;: 7
}'
&lt;/code>&lt;/pre>
&lt;h3 id="63-quantization">6.3 Quantization&lt;/h3>
&lt;p>Quantization is a technique to reduce model size and memory usage by lowering the precision of model weights. vLLM supports various quantization schemes, such as AWQ and FP8 KV cache.&lt;/p>
&lt;p>&lt;strong>Enabling FP8 KV cache:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM
llm = LLM(
model=&amp;quot;meta-llama/Llama-2-7b-chat-hf&amp;quot;,
kv_cache_dtype=&amp;quot;fp8&amp;quot;,
calculate_kv_scales=True # Dynamically calculate quantization scales
)
&lt;/code>&lt;/pre>
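&lt;p>The principle is easy to demonstrate with a symmetric int8 quantization of a small weight vector (illustrative only; vLLM's FP8 KV-cache quantization uses a different number format and granularity, but the scale-and-round idea is the same):&lt;/p>

```python
# Symmetric int8 quantization: store values as small integers plus one
# floating-point scale; dequantize by multiplying the scale back in.

def quantize(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.12, -0.50, 0.33, 1.27]
q, scale = quantize(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(max_err, 6))
```

&lt;p>Each value now occupies one byte instead of four (or two), at the cost of a bounded rounding error of at most half a quantization step.&lt;/p>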
&lt;h2 id="7-framework-integration">7. Framework Integration&lt;/h2>
&lt;p>vLLM can be easily integrated with popular LLM application frameworks like Langchain and LlamaIndex for building complex systems such as Retrieval-Augmented Generation (RAG). Typically, vLLM serves as a backend providing fast LLM inference and embedding generation services.&lt;/p>
&lt;p>&lt;strong>Installing related dependencies:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install -U vllm langchain_openai langchain_community
&lt;/code>&lt;/pre>
&lt;p>Afterward, in Langchain, you can point the &lt;code>base_url&lt;/code> of &lt;code>ChatOpenAI&lt;/code> or &lt;code>OpenAIEmbeddings&lt;/code> to your vLLM server's address to complete the integration.&lt;/p>
&lt;h2 id="8-conclusion">8. Conclusion&lt;/h2>
&lt;p>Through its innovative PagedAttention architecture, vLLM successfully addresses memory management and performance bottlenecks in LLM inference, providing developers with an extremely efficient, flexible, and easy-to-use inference serving engine. Whether conducting quick offline experiments or deploying production-grade, high-concurrency LLM services, vLLM demonstrates excellent performance and powerful functionality. As the community continues to develop, vLLM is becoming one of the standard tools in the field of LLM serving.&lt;/p></description></item></channel></rss>