<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>PagedAttention | Ziyang Lin</title><link>https://ziyanglin.netlify.app/en/tags/pagedattention/</link><atom:link href="https://ziyanglin.netlify.app/en/tags/pagedattention/index.xml" rel="self" type="application/rss+xml"/><description>PagedAttention</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Thu, 26 Jun 2025 01:05:00 +0000</lastBuildDate><image><url>https://ziyanglin.netlify.app/img/icon-192.png</url><title>PagedAttention</title><link>https://ziyanglin.netlify.app/en/tags/pagedattention/</link></image><item><title>vLLM Technical Guide: High-Performance LLM Inference Engine</title><link>https://ziyanglin.netlify.app/en/post/vllm-documentation/</link><pubDate>Thu, 26 Jun 2025 01:05:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/vllm-documentation/</guid><description>&lt;h2 id="1-introduction-to-vllm">1. Introduction to vLLM&lt;/h2>
&lt;p>vLLM is an open-source inference and serving engine designed for large language models (LLMs), renowned for its high throughput and memory efficiency. In the field of LLM serving, vLLM addresses a core pain point: traditional inference systems are inefficient when handling the key-value cache (KV Cache) in Transformer models&amp;rsquo; attention mechanism, resulting in significant memory waste and limited inference speed.&lt;/p>
&lt;p>The memory bottleneck in LLM inference primarily stems from the KV Cache. This cache stores the attention keys and values of every previous token in a sequence so that they do not have to be recomputed when generating subsequent tokens. However, the size of the KV Cache grows with the sequence and is difficult to predict, which makes memory management hard. Traditional systems (like HuggingFace Transformers) typically pre-allocate a large contiguous memory space to store the KV Cache, leading to severe memory fragmentation and waste.&lt;/p>
&lt;p>vLLM fundamentally solves this problem by introducing its core innovation: the &lt;strong>PagedAttention&lt;/strong> mechanism.&lt;/p>
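&lt;p>To get a feel for why the KV Cache dominates memory, its per-token footprint can be estimated from the model shape. The sketch below assumes a 13B-class decoder with 40 layers, a hidden size of 5120, standard multi-head attention, and 16-bit values. These numbers and the helper name &lt;code>kv_cache_bytes_per_token&lt;/code> are illustrative assumptions, not values read from vLLM.&lt;/p>
&lt;pre>&lt;code class="language-python"># Rough KV Cache sizing. The model shape is an assumption (a 13B-class decoder);
# substitute the values from your own model's config.
def kv_cache_bytes_per_token(num_layers, hidden_size, bytes_per_value=2):
    # Each layer stores one key vector and one value vector per token.
    return 2 * num_layers * hidden_size * bytes_per_value

per_token = kv_cache_bytes_per_token(num_layers=40, hidden_size=5120)
per_sequence = per_token * 2048  # one 2048-token sequence

print(per_token / 1024)          # KiB per token
print(per_sequence / 1024 ** 3)  # GiB per sequence
&lt;/code>&lt;/pre>
&lt;p>Under these assumptions each token costs 800 KiB, so a single 2048-token sequence needs about 1.6 GiB of cache. Models that use grouped-query attention store fewer key/value heads and would come out smaller.&lt;/p>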
&lt;h2 id="2-core-features-and-advantages">2. Core Features and Advantages&lt;/h2>
&lt;p>vLLM stands out among numerous LLM inference frameworks thanks to several key features:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Extremely High Throughput&lt;/strong>: Through PagedAttention and Continuous Batching, vLLM significantly improves GPU utilization. In the benchmarks published with its initial release, the vLLM team reported up to 24x higher throughput than HuggingFace Transformers, and higher throughput than other mainstream inference libraries.&lt;/li>
&lt;li>&lt;strong>Efficient Memory Management&lt;/strong>: The PagedAttention mechanism divides the KV Cache into blocks that need not be contiguous in memory, greatly reducing internal and external memory fragmentation. According to the vLLM authors, sharing blocks between sequences can cut the memory used by parallel sampling and beam search by up to 55%, meaning you can serve more concurrent requests or longer sequences with the same hardware.&lt;/li>
&lt;li>&lt;strong>Flexible Decoding Strategies&lt;/strong>: vLLM supports various complex decoding algorithms, including Parallel Sampling, Beam Search, and Top-K/Top-P sampling, meeting the needs of different application scenarios.&lt;/li>
&lt;li>&lt;strong>OpenAI API Compatibility&lt;/strong>: vLLM provides a service endpoint that is compatible with the OpenAI API. This means you can integrate vLLM into existing applications built on the OpenAI API with just a few configuration changes, typically the base URL and the model name.&lt;/li>
&lt;li>&lt;strong>Distributed Inference&lt;/strong>: For ultra-large models that cannot fit on a single GPU, vLLM supports Tensor Parallelism, distributing model weights and computational load across multiple GPUs for efficient distributed inference.&lt;/li>
&lt;li>&lt;strong>Streaming and Structured Output&lt;/strong>: Supports streaming of generated tokens and can produce structured outputs in specific formats (such as JSON Schema or regular expressions) through Guided Generation.&lt;/li>
&lt;/ul>
&lt;h2 id="3-core-architecture-deep-dive-into-pagedattention">3. Core Architecture: Deep Dive into PagedAttention&lt;/h2>
&lt;p>PagedAttention is the core of vLLM. Its design is inspired by the paging technique that modern operating systems use to manage virtual memory.&lt;/p>
&lt;h3 id="31-working-principle">3.1 Working Principle&lt;/h3>
&lt;p>In traditional methods, the KV Cache for each sequence is stored in one contiguous memory region. This approach is simple, but because sequence lengths vary widely and are not known in advance, it leads to severe memory fragmentation.&lt;/p>
&lt;p>PagedAttention divides each sequence's KV Cache into fixed-size &lt;strong>blocks&lt;/strong>. Each block can store keys and values for a fixed number of tokens. During inference, vLLM's core scheduler dynamically allocates these blocks to sequences as needed.&lt;/p>
&lt;p>The advantages of this design include:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Minimal Internal Fragmentation&lt;/strong>: Because blocks have a fixed size, only a sequence's last block can contain unused slots. This waste is bounded by one block per sequence, far less than that caused by reserving contiguous memory for the entire sequence.&lt;/li>
&lt;li>&lt;strong>Flexible Memory Allocation&lt;/strong>: Blocks need not be contiguous in physical memory, and because every block has the same size, any free block can serve any sequence, which avoids external fragmentation. This is similar to how operating systems manage physical memory pages.&lt;/li>
&lt;li>&lt;strong>Efficient Memory Sharing&lt;/strong>: PagedAttention makes sharing KV Cache between different sequences exceptionally simple and efficient. For example, in parallel sampling or beam search, multiple candidate sequences originate from the same prompt. vLLM allows these sequences to share KV blocks storing the prompt portion, only needing to allocate new, independent blocks for each sequence when generating new tokens. This &amp;ldquo;Copy-on-Write&amp;rdquo; mechanism greatly reduces the memory overhead of complex decoding algorithms.&lt;/li>
&lt;/ol>
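&lt;p>The sharing behaviour described above can be sketched with a reference-counted block allocator. This is a toy model, not vLLM's implementation: the class &lt;code>BlockAllocator&lt;/code> and its methods are hypothetical names, and only block ids are tracked, not the KV data itself.&lt;/p>
&lt;pre>&lt;code class="language-python">class BlockAllocator:
    '''Toy allocator: hands out block ids and tracks how many sequences use each.'''

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref_count = {}

    def allocate(self):
        block = self.free.pop(0)
        self.ref_count[block] = 1
        return block

    def share(self, block):
        # Another sequence now points at the same physical block.
        self.ref_count[block] += 1
        return block

    def release(self, block):
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free.append(block)

    def writable(self, block):
        # Copy-on-write: a shared block is replaced by a private one before writing.
        if self.ref_count[block] == 1:
            return block
        self.ref_count[block] -= 1
        return self.allocate()  # a real engine would also copy the KV data

alloc = BlockAllocator(num_blocks=8)
prompt_blocks = [alloc.allocate(), alloc.allocate()]  # last block is partly filled
seq_a = list(prompt_blocks)
seq_b = [alloc.share(block) for block in prompt_blocks]  # fork without copying

# seq_b is about to append a token to the shared last block, so it gets its own copy.
seq_b[-1] = alloc.writable(seq_b[-1])
print(seq_a, seq_b)
&lt;/code>&lt;/pre>
&lt;p>After the write, both sequences still share the first prompt block, and only the last block has been duplicated.&lt;/p>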
&lt;p>Below is a Mermaid diagram that more intuitively illustrates PagedAttention's memory management approach:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph Physical_Memory [KV Cache Physical Memory]
direction LR
B1(Block 1)
B2(Block 2)
B3(Block 3)
B4(Block 4)
B5(Block 5)
B6(Block 6)
B7(Block 7)
B8(Block 8)
end
subgraph Logical_View [Sequence Logical View]
direction TB
subgraph Seq1 [Sequence 1]
P1(Prompt) --&amp;gt; T1(Token 1)
end
subgraph Seq2 [Sequence 2]
P2(Prompt) --&amp;gt; T2(Token 1) --&amp;gt; T3(Token 2)
end
subgraph Seq3 [Parallel Sampling]
P3(Prompt) --&amp;gt; T4(Token 1a)
P3 --&amp;gt; T5(Token 1b)
end
end
subgraph Block_Table [Block Table]
direction TB
Map1[&amp;quot;Seq 1: [B1, B5]&amp;quot;]
Map2[&amp;quot;Seq 2: [B2, B6, B8]&amp;quot;]
Map3[&amp;quot;Seq 3a: [B3, B7]&amp;quot;]
Map4[&amp;quot;Seq 3b: [B3, B4]&amp;quot;]
end
Seq1 --&amp;gt; Map1
Seq2 --&amp;gt; Map2
Seq3 --&amp;gt; Map3
Seq3 --&amp;gt; Map4
Map1 --&amp;gt; B1
Map1 --&amp;gt; B5
Map2 --&amp;gt; B2
Map2 --&amp;gt; B6
Map2 --&amp;gt; B8
Map3 --&amp;gt; B3
Map3 --&amp;gt; B7
Map4 --&amp;gt; B3
Map4 --&amp;gt; B4
style B3 fill:#f9f,stroke:#333,stroke-width:2px
linkStyle 14 stroke-width:2px,stroke:green,fill:none;
linkStyle 16 stroke-width:2px,stroke:green,fill:none;
&lt;/code>&lt;/pre>
&lt;p>&lt;em>Diagram explanation:&lt;/em>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>KV Cache Physical Memory&lt;/strong>: Represents physical memory blocks on the GPU, which need not be contiguous.&lt;/li>
&lt;li>&lt;strong>Sequence Logical View&lt;/strong>: Represents multiple requests (sequences) being processed.&lt;/li>
&lt;li>&lt;strong>Block Table&lt;/strong>: vLLM's core component that maps logical token positions to physical memory blocks.&lt;/li>
&lt;li>&lt;strong>Memory Sharing&lt;/strong>: Note that the two branches in &amp;ldquo;Parallel Sampling&amp;rdquo; (3a and 3b) share the same Prompt block (B3), demonstrating PagedAttention's efficient memory sharing.&lt;/li>
&lt;/ul>
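&lt;p>The lookup that a block table performs can be illustrated in a few lines. The block size of 16 tokens and the table contents below are assumed values chosen for the example, and &lt;code>locate&lt;/code> is a hypothetical helper.&lt;/p>
&lt;pre>&lt;code class="language-python">BLOCK_SIZE = 16  # tokens per block; an assumed value for illustration

# Hypothetical block table for one sequence: logical block index to physical block id.
block_table = [7, 2, 5]

def locate(token_position):
    # Which logical block holds the token, and which slot inside that block?
    logical_block, offset = divmod(token_position, BLOCK_SIZE)
    return block_table[logical_block], offset

print(locate(0))   # first token of the sequence
print(locate(20))  # a token in the second logical block
print(locate(47))  # last slot of the third logical block
&lt;/code>&lt;/pre>
&lt;p>The sequence sees positions 0 to 47 as one continuous range, while the physical blocks 7, 2, and 5 can sit anywhere in GPU memory.&lt;/p>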
&lt;h3 id="32-continuous-batching">3.2 Continuous Batching&lt;/h3>
&lt;p>Building on PagedAttention, vLLM implements a more advanced batching strategy: continuous batching. Traditional static batching waits for every sequence in a batch to finish generating before the next batch starts, so short sequences leave GPU capacity idle while the longest one completes. Continuous batching schedules at the granularity of a single decoding step: as soon as one sequence finishes, a waiting request can take its place in the running batch, which avoids idle GPU time and further improves throughput.&lt;/p>
&lt;p>Below is a comparison of the two batching methods using a Mermaid sequence diagram:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant C as Client
participant S as Server
participant G as GPU
note over C, G: --- Static Batching ---
C-&amp;gt;&amp;gt;S: Request [R1, R2, R3, R4]
S-&amp;gt;&amp;gt;G: Process Batch 1 [R1, R2, R3, R4]
note right of G: All requests process in parallel
G--&amp;gt;&amp;gt;S: Batch 1 Finished
note right of S: Wait for the entire batch to complete
S--&amp;gt;&amp;gt;C: Response [O1, O2, O3, O4]
C-&amp;gt;&amp;gt;S: Request [R5, R6]
S-&amp;gt;&amp;gt;G: Process Batch 2 [R5, R6]
note over C, G: --- Continuous Batching ---
C-&amp;gt;&amp;gt;S: Request [R1, R2, R3, R4]
S-&amp;gt;&amp;gt;G: Process [R1, R2, R3, R4]
G--&amp;gt;&amp;gt;S: R2 Finished
S--&amp;gt;&amp;gt;C: Response O2
C-&amp;gt;&amp;gt;S: New Request R5
S-&amp;gt;&amp;gt;G: Add R5 to running batch (GPU stays busy)
note right of G: R1, R3, R4, R5 are now processing
G--&amp;gt;&amp;gt;S: R4 Finished
S--&amp;gt;&amp;gt;C: Response O4
&lt;/code>&lt;/pre>
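&lt;p>The difference between the two strategies can also be put into numbers with a small simulation. It counts decoding steps only and assumes every request needs at least one token. The workload and function names are hypothetical, and the real vLLM scheduler also accounts for memory and prefill, which this sketch ignores.&lt;/p>
&lt;pre>&lt;code class="language-python">def static_batching_steps(lengths, batch_size):
    # Every batch runs until its longest request has finished.
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    # After each decoding step, finished requests leave and waiting ones join.
    waiting = list(lengths)
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) != batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Requests with one token left finish in this step and free their slot.
        running = [left - 1 for left in running if left != 1]
    return steps

# Number of tokens each request has to generate (hypothetical workload).
lengths = [2, 8, 3, 8, 4, 4]
print(static_batching_steps(lengths, batch_size=4))
print(continuous_batching_steps(lengths, batch_size=4))
&lt;/code>&lt;/pre>
&lt;p>For this workload, static batching needs 12 decoding steps and continuous batching needs 8, because the two late requests start as soon as the short ones free their slots.&lt;/p>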
&lt;h2 id="4-quick-start-guide">4. Quick Start Guide&lt;/h2>
&lt;p>Below, we'll demonstrate how to install and use vLLM through a few simple steps.&lt;/p>
&lt;h3 id="41-installation">4.1 Installation&lt;/h3>
&lt;p>You can install vLLM using either &lt;code>pip&lt;/code> or &lt;code>uv&lt;/code> (a faster package installation tool). Using &lt;code>uv&lt;/code> is recommended as it can automatically detect your CUDA version and install the matching PyTorch backend.&lt;/p>
&lt;p>&lt;strong>Using uv (recommended):&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash"># Create and activate a virtual environment
uv venv
source .venv/bin/activate
# Install vLLM
uv pip install vllm --torch-backend=auto
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Using pip:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install vllm
&lt;/code>&lt;/pre>
&lt;h3 id="42-offline-inference">4.2 Offline Inference&lt;/h3>
&lt;p>The &lt;code>vllm.LLM&lt;/code> class makes offline inference very convenient.&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM, SamplingParams
# Define input prompts
prompts = [
&amp;quot;Hello, my name is&amp;quot;,
&amp;quot;The capital of France is&amp;quot;,
&amp;quot;The future of AI is&amp;quot;,
]
# Define sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Initialize the LLM engine (model will be automatically downloaded from Hugging Face)
llm = LLM(model=&amp;quot;facebook/opt-125m&amp;quot;)
# Generate text
outputs = llm.generate(prompts, sampling_params)
# Print results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f&amp;quot;Prompt: {prompt!r}, Generated text: {generated_text!r}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h3 id="43-launching-an-openaicompatible-server">4.3 Launching an OpenAI-Compatible Server&lt;/h3>
&lt;p>One of vLLM's most powerful features is its built-in API server. With just one command, you can start a service compatible with the OpenAI API.&lt;/p>
&lt;pre>&lt;code class="language-bash">vllm serve Qwen/Qwen2.5-1.5B-Instruct
&lt;/code>&lt;/pre>
&lt;p>By default, the server will run on &lt;code>http://localhost:8000&lt;/code>.&lt;/p>
&lt;h3 id="44-interacting-with-the-server">4.4 Interacting with the Server&lt;/h3>
&lt;p>You can interact with the server using &lt;code>curl&lt;/code> or the &lt;code>openai&lt;/code> Python client.&lt;/p>
&lt;p>&lt;strong>Using curl:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">curl http://localhost:8000/v1/completions \
-H &amp;quot;Content-Type: application/json&amp;quot; \
-d '{
&amp;quot;model&amp;quot;: &amp;quot;Qwen/Qwen2.5-1.5B-Instruct&amp;quot;,
&amp;quot;prompt&amp;quot;: &amp;quot;San Francisco is a&amp;quot;,
&amp;quot;max_tokens&amp;quot;: 7,
&amp;quot;temperature&amp;quot;: 0
}'
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Using the OpenAI Python client:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from openai import OpenAI
client = OpenAI(
base_url=&amp;quot;http://localhost:8000/v1&amp;quot;,
api_key=&amp;quot;not-used&amp;quot; # API key is not required
)
completion = client.chat.completions.create(
model=&amp;quot;Qwen/Qwen2.5-1.5B-Instruct&amp;quot;,
messages=[
{&amp;quot;role&amp;quot;: &amp;quot;system&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;You are a helpful assistant.&amp;quot;},
{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Who won the world series in 2020?&amp;quot;}
]
)
print(completion.choices[0].message)
&lt;/code>&lt;/pre>
&lt;h2 id="5-model-serving">5. Model Serving&lt;/h2>
&lt;h3 id="51-distributed-serving">5.1 Distributed Serving&lt;/h3>
&lt;p>If a model is too large to fit on a single GPU, you can distribute it across multiple GPUs using tensor parallelism.&lt;/p>
&lt;pre>&lt;code class="language-bash"># Start a service on 4 GPUs
vllm serve facebook/opt-13b --tensor-parallel-size 4
&lt;/code>&lt;/pre>
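&lt;p>As a rough mental model of what tensor parallelism does to a single linear layer, the weight matrix can be split by columns so that each GPU computes a slice of the output. The pure-Python sketch below simulates the shards in a loop. It is a conceptual illustration with hypothetical function names, not how vLLM implements it, and it assumes the number of columns is divisible by the number of shards.&lt;/p>
&lt;pre>&lt;code class="language-python">def matmul(x, w):
    # x: vector of length k; w: k rows by n columns. Returns a vector of length n.
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def column_parallel_matmul(x, w, num_shards):
    # Split the weight matrix by columns, one slice per simulated GPU.
    cols = len(w[0]) // num_shards
    outputs = []
    for shard in range(num_shards):
        w_shard = [row[shard * cols:(shard + 1) * cols] for row in w]
        # Concatenating the partial outputs plays the role of an all-gather.
        outputs.extend(matmul(x, w_shard))
    return outputs

x = [1.0, 2.0]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]
print(matmul(x, w))
print(column_parallel_matmul(x, w, num_shards=2))
&lt;/code>&lt;/pre>
&lt;p>Both calls print the same vector: each shard holds only half of the weights, yet the gathered result matches the single-device computation.&lt;/p>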
&lt;h3 id="52-docker-deployment">5.2 Docker Deployment&lt;/h3>
&lt;p>vLLM provides official Docker images for convenient containerized deployment.&lt;/p>
&lt;pre>&lt;code class="language-bash">docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env &amp;quot;HUGGING_FACE_HUB_TOKEN=&amp;lt;your-hf-token&amp;gt;&amp;quot; \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-v0.1
&lt;/code>&lt;/pre>
&lt;h2 id="6-advanced-features">6. Advanced Features&lt;/h2>
&lt;h3 id="61-structured-outputs">6.1 Structured Outputs&lt;/h3>
&lt;p>vLLM supports various ways to constrain the model's output format, which is crucial for applications requiring reliable, parsable outputs.&lt;/p>
&lt;p>&lt;strong>Generating JSON using Pydantic models:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from pydantic import BaseModel
from openai import OpenAI
client = OpenAI(base_url=&amp;quot;http://localhost:8000/v1&amp;quot;, api_key=&amp;quot;dummy&amp;quot;)
model = client.models.list().data[0].id
class People(BaseModel):
    name: str
    age: int
completion = client.chat.completions.create(
model=model,
messages=[
{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Generate a JSON with the name and age of one random person.&amp;quot;}
],
response_format={
&amp;quot;type&amp;quot;: &amp;quot;json_schema&amp;quot;,
&amp;quot;json_schema&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;people&amp;quot;,
&amp;quot;schema&amp;quot;: People.model_json_schema()
}
},
)
print(completion.choices[0].message.content)
&lt;/code>&lt;/pre>
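&lt;p>The constrained output still arrives as a string, so it is worth parsing and checking it before use. The sketch below does this with the standard library only. The field table mirrors the &lt;code>People&lt;/code> model above, &lt;code>parse_person&lt;/code> is a hypothetical helper, and the &lt;code>raw&lt;/code> value is a stand-in for &lt;code>completion.choices[0].message.content&lt;/code> rather than real model output.&lt;/p>
&lt;pre>&lt;code class="language-python">import json

EXPECTED_FIELDS = {'name': str, 'age': int}  # mirrors the People model above

def parse_person(raw):
    # Parse the response text and check that each field has the expected type.
    data = json.loads(raw)
    for field, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f'field {field!r} is missing or has the wrong type')
    return data

# Stand-in for completion.choices[0].message.content (hypothetical value).
raw = json.dumps({'name': 'Ada', 'age': 36})
print(parse_person(raw))
&lt;/code>&lt;/pre>
&lt;p>With Pydantic already imported, &lt;code>People.model_validate_json(...)&lt;/code> performs the same kind of check and returns a typed object.&lt;/p>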
&lt;h3 id="62-lora-support">6.2 LoRA Support&lt;/h3>
&lt;p>vLLM can efficiently serve multiple LoRA adapters on the same base model. This is particularly useful for scenarios requiring customized models for different customers or tasks.&lt;/p>
&lt;p>&lt;strong>Enabling LoRA when creating the engine:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM
llm = LLM(model=&amp;quot;meta-llama/Llama-2-7b-hf&amp;quot;, enable_lora=True)
&lt;/code>&lt;/pre>
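&lt;p>&lt;strong>Specifying a LoRA adapter in a request&lt;/strong> (this assumes the server was started with &lt;code>--enable-lora&lt;/code> and an adapter registered under the name &lt;code>sql-lora&lt;/code> through &lt;code>--lora-modules&lt;/code>):&lt;/p>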
&lt;pre>&lt;code class="language-bash"># &amp;quot;model&amp;quot; is the name of the LoRA adapter registered with the server
curl http://localhost:8000/v1/completions \
  -H &amp;quot;Content-Type: application/json&amp;quot; \
  -d '{
    &amp;quot;model&amp;quot;: &amp;quot;sql-lora&amp;quot;,
    &amp;quot;prompt&amp;quot;: &amp;quot;San Francisco is a&amp;quot;,
    &amp;quot;max_tokens&amp;quot;: 7
  }'
&lt;/code>&lt;/pre>
&lt;h3 id="63-quantization">6.3 Quantization&lt;/h3>
&lt;p>Quantization reduces memory usage by storing values at lower precision. It can be applied to model weights (for example, AWQ) or to the KV Cache (for example, FP8 KV cache), and vLLM supports both kinds of schemes.&lt;/p>
&lt;p>&lt;strong>Enabling FP8 KV cache:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM
llm = LLM(
model=&amp;quot;meta-llama/Llama-2-7b-chat-hf&amp;quot;,
kv_cache_dtype=&amp;quot;fp8&amp;quot;,
calculate_kv_scales=True # Dynamically calculate quantization scales
)
&lt;/code>&lt;/pre>
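&lt;p>The role of a quantization scale can be illustrated with a simpler scheme than FP8. The sketch below uses symmetric 8-bit integer quantization with one scale per tensor as a stand-in. It is not FP8 and not vLLM's implementation, the function names are hypothetical, and it assumes the input contains at least one non-zero value.&lt;/p>
&lt;pre>&lt;code class="language-python">def quantize(values, num_levels=127):
    # One scale per tensor: the largest magnitude maps to the largest integer.
    scale = max(abs(v) for v in values) / num_levels
    return [round(v / scale) for v in values], scale

def dequantize(quantized, scale):
    # Multiply back by the scale to recover approximate original values.
    return [q * scale for q in quantized]

values = [0.5, -1.27, 0.03, 1.0]
quantized, scale = quantize(values)
restored = dequantize(quantized, scale)
print(quantized)
print(max(abs(a - b) for a, b in zip(values, restored)))  # worst-case rounding error
&lt;/code>&lt;/pre>
&lt;p>Each value is stored as a small integer plus a shared scale, and the rounding error stays within half a quantization step. Computing that scale from the data at runtime is the general idea behind dynamic scale calculation.&lt;/p>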
&lt;h2 id="7-framework-integration">7. Framework Integration&lt;/h2>
&lt;p>vLLM integrates easily with popular LLM application frameworks like LangChain and LlamaIndex for building complex systems such as Retrieval-Augmented Generation (RAG). Typically, vLLM serves as the backend that provides fast LLM inference and embedding generation.&lt;/p>
&lt;p>&lt;strong>Installing related dependencies:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install -U vllm langchain_openai langchain_community
&lt;/code>&lt;/pre>
&lt;p>Afterward, in LangChain, you can point the &lt;code>base_url&lt;/code> of &lt;code>ChatOpenAI&lt;/code> or &lt;code>OpenAIEmbeddings&lt;/code> to your vLLM server's address (for example &lt;code>http://localhost:8000/v1&lt;/code>) to complete the integration.&lt;/p>
&lt;h2 id="8-conclusion">8. Conclusion&lt;/h2>
&lt;p>Through its PagedAttention architecture, vLLM addresses the memory-management and throughput bottlenecks of LLM inference, giving developers an efficient, flexible, and easy-to-use inference serving engine. It suits both quick offline experiments and production-grade, high-concurrency LLM services. As the community continues to develop it, vLLM is becoming one of the standard tools in the field of LLM serving.&lt;/p></description></item></channel></rss>