<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Ziyang Lin</title><link>https://ziyanglin.netlify.app/en/</link><atom:link href="https://ziyanglin.netlify.app/en/index.xml" rel="self" type="application/rss+xml"/><description>Ziyang Lin</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Mon, 30 Jun 2025 11:00:00 +0000</lastBuildDate><image><url>https://ziyanglin.netlify.app/img/icon-192.png</url><title>Ziyang Lin</title><link>https://ziyanglin.netlify.app/en/</link></image><item><title>LLM Agent Multi-Turn Dialogue: Architecture Design and Implementation Strategies</title><link>https://ziyanglin.netlify.app/en/post/llm-agent-multi-turn-dialogue/</link><pubDate>Mon, 30 Jun 2025 11:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/llm-agent-multi-turn-dialogue/</guid><description>&lt;h2 id="1-introduction-why-multiturn-dialogue-is-the-core-lifeline-of-agents">1. Introduction: Why Multi-Turn Dialogue is the Core Lifeline of Agents&lt;/h2>
&lt;p>As human-machine interaction evolves, Large Language Model (LLM) driven Agents are maturing from simple &amp;ldquo;question-answer&amp;rdquo; tools into &amp;ldquo;intelligent assistants&amp;rdquo; that can reason, plan, and execute complex tasks. The core of this evolution is &lt;strong>Multi-turn Dialogue&lt;/strong> capability.&lt;/p>
&lt;p>Single-turn dialogue resembles a one-time query, while multi-turn dialogue is a continuous, memory-driven, goal-oriented exchange. Users may not provide all information at once, requiring Agents to understand evolving needs, clarify ambiguous instructions, call external tools, and ultimately achieve the user's goals through continuous interaction.&lt;/p>
&lt;p>This document will thoroughly analyze the core challenges faced by LLM Agents in implementing efficient and reliable multi-turn dialogues, and provide a detailed explanation of current mainstream technical architectures and implementation details.&lt;/p>
&lt;h2 id="2-core-challenges-thorny-issues-in-multiturn-dialogues">2. Core Challenges: &amp;ldquo;Thorny Issues&amp;rdquo; in Multi-Turn Dialogues&lt;/h2>
&lt;p>To build a powerful multi-turn dialogue Agent, we must address several fundamental challenges:&lt;/p>
&lt;h3 id="21-context-window-limitation">2.1 Context Window Limitation&lt;/h3>
&lt;p>This is the most fundamental physical constraint. LLMs can only process a limited length of text (tokens). As conversation turns increase, the complete dialogue history quickly exceeds the model's context window.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Macro Issue&lt;/strong>: Leads to &amp;ldquo;memory loss,&amp;rdquo; where the Agent cannot recall early critical information, causing dialogue coherence to break.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>: Directly truncating early dialogue history is the simplest but crudest method, potentially losing important premises. For example, preferences set by the user at the beginning of a conversation (&amp;ldquo;I prefer window seats&amp;rdquo;) might be forgotten during subsequent booking steps.&lt;/li>
&lt;/ul>
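&lt;p>The naive truncation strategy just described can be sketched in a few lines (a toy sketch: token counting is approximated by word count here; a real system would use the model's tokenizer):&lt;/p>
&lt;pre>&lt;code class="language-python">def truncate_history(messages, budget=3000):
    '''Keep only the most recent messages that fit the token budget.
    Crude: early messages (e.g., stated preferences) are silently lost.'''
    kept, used = [], 0
    for msg in reversed(messages):
        cost = len(msg['content'].split())  # rough token estimate
        if used + cost &amp;gt; budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
&lt;/code>&lt;/pre>
&lt;p>Note how the oldest messages are dropped first, which is exactly how an early preference like &amp;ldquo;I prefer window seats&amp;rdquo; gets forgotten.&lt;/p>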
&lt;h3 id="22-state-maintenance-complexity">2.2 State Maintenance Complexity&lt;/h3>
&lt;p>Agents need to precisely track the dialogue state, such as: What stage is the current task at? What information has the user provided? What information is still needed?&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Macro Issue&lt;/strong>: If the state is confused, the Agent appears &amp;ldquo;muddled,&amp;rdquo; repeatedly asking for known information or getting &amp;ldquo;lost&amp;rdquo; in the task flow.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>: State is more than just dialogue history. It's a structured data collection that may include user intent, extracted entities (like dates, locations), API call results, current task nodes, etc. Designing a robust, scalable state management mechanism is a significant engineering challenge.&lt;/li>
&lt;/ul>
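&lt;p>The structured state described above can be sketched as a plain data object (field names here are illustrative, not tied to any particular framework):&lt;/p>
&lt;pre>&lt;code class="language-python">from dataclasses import dataclass, field

@dataclass
class DialogueState:
    '''Explicit state beyond raw history: what is known and what is missing.'''
    intent: str = ''                                   # e.g. 'book_flight'
    entities: dict = field(default_factory=dict)       # extracted slots: dates, locations...
    tool_results: list = field(default_factory=list)   # raw API call results
    current_node: str = 'start'                        # position in the task flow

    def missing_slots(self, required):
        return [slot for slot in required if slot not in self.entities]
&lt;/code>&lt;/pre>
&lt;p>A check like &lt;code>missing_slots&lt;/code> is what lets the Agent ask only for information it does not already have, instead of re-asking known details.&lt;/p>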
&lt;h3 id="23-intent-drifting--goal-forgetting">2.3 Intent Drifting &amp;amp; Goal Forgetting&lt;/h3>
&lt;p>In long conversations, user intent may change, or a large goal may be broken down into multiple subtasks.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Macro Issue&lt;/strong>: Agents need to understand and adapt to these dynamic changes rather than rigidly adhering to the initial goal. If a user checks the weather and then says, &amp;ldquo;Book me a flight there,&amp;rdquo; the Agent must recognize this as a new, related intent.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>: This requires the Agent to have strong intent recognition and reasoning capabilities to determine whether the current user input is continuing, modifying, or starting a completely new task.&lt;/li>
&lt;/ul>
&lt;h3 id="24-error-handling--selfcorrection">2.4 Error Handling &amp;amp; Self-Correction&lt;/h3>
&lt;p>When tool calls fail (e.g., API timeout), information extraction errors occur, or understanding deviates, the Agent cannot simply crash or give up.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Macro Issue&lt;/strong>: A reliable Agent should be able to identify failures and proactively initiate correction processes, such as retrying, clarifying with the user, or finding alternatives.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>: This requires designing fault tolerance and retry mechanisms at the architectural level. The Agent needs to &amp;ldquo;understand&amp;rdquo; error messages returned by tools and generate new &amp;ldquo;thoughts&amp;rdquo; based on these to plan the next corrective action.&lt;/li>
&lt;/ul>
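&lt;p>One way to realize this correction loop (a sketch; the &lt;code>llm_plan&lt;/code> callable that &amp;ldquo;reads&amp;rdquo; the error and proposes fixed parameters is a hypothetical stand-in for an LLM call):&lt;/p>
&lt;pre>&lt;code class="language-python">def call_with_correction(tool, params, llm_plan, max_retries=2):
    '''Run a tool; on failure, feed the error text back to the planner
    so it can propose corrected parameters instead of giving up.'''
    for attempt in range(max_retries + 1):
        try:
            return tool(**params)
        except Exception as err:
            if attempt == max_retries:
                raise  # out of retries: escalate (e.g., clarify with the user)
            params = llm_plan('Tool failed with: ' + str(err), params)
&lt;/code>&lt;/pre>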
&lt;h2 id="3-technical-architecture-evolution-and-analysis">3. Technical Architecture Evolution and Analysis&lt;/h2>
&lt;p>To address the above challenges, the industry has explored various solutions, from simple history compression to complex Agentic architectures.&lt;/p>
&lt;h3 id="31-early-attempts-dialogue-history-compression">3.1 Early Attempts: Dialogue History Compression&lt;/h3>
&lt;p>This is the most direct approach to solving context window limitations.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Summary Memory&lt;/strong>: After each round of dialogue, or when the history length approaches a threshold, another LLM call summarizes the existing conversation.
&lt;ul>
&lt;li>&lt;strong>Advantage&lt;/strong>: Effectively reduces length.&lt;/li>
&lt;li>&lt;strong>Disadvantage&lt;/strong>: The summarization process may lose details and adds additional LLM call costs and latency.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="32-react-architecture-giving-agents-the-ability-to-think">3.2 ReAct Architecture: Giving Agents the Ability to &amp;ldquo;Think&amp;rdquo;&lt;/h3>
&lt;p>ReAct (Reason + Act) is the cornerstone of today's mainstream Agent architectures. Through an elegant &amp;ldquo;think-act-observe&amp;rdquo; cycle, it transforms an LLM from a mere text generator into an entity with reasoning and execution capabilities.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Macro Concept&lt;/strong>: Mimics the human problem-solving pattern—first analyze (Reason), then take action (Act), and finally observe results (Observation) and adjust approach.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Underlying Implementation&lt;/strong>: Carefully designed prompts guide the LLM to generate text with specific markers.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Thought&lt;/strong>: The LLM performs an &amp;ldquo;inner monologue&amp;rdquo; at this step, analyzing the current situation and planning the next action. This content is invisible to users.&lt;/li>
&lt;li>&lt;strong>Action&lt;/strong>: The LLM decides which tool to call and what parameters to pass. For example, &lt;code>search(&amp;quot;Beijing weather today&amp;quot;)&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Observation&lt;/strong>: Feeds back the results of tool execution (such as API returned data, database query results) to the LLM.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>This cycle repeats until the Agent considers the task complete.&lt;/p>
&lt;h4 id="react-work-cycle">ReAct Work Cycle&lt;/h4>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;User Input&amp;quot;] --&amp;gt; B{&amp;quot;LLM Generates Thought and Action&amp;quot;};
B -- Thought --&amp;gt; C[&amp;quot;Inner Monologue: What should I do?&amp;quot;];
C --&amp;gt; D{&amp;quot;Action: Call Tool&amp;quot;};
D -- &amp;quot;Tool Input&amp;quot; --&amp;gt; E[&amp;quot;External Tool (API, DB)&amp;quot;];
E -- &amp;quot;Tool Output&amp;quot; --&amp;gt; F[&amp;quot;Observation: Get Result&amp;quot;];
F --&amp;gt; G{&amp;quot;LLM Generates New Thought Based on Observation&amp;quot;};
G -- &amp;quot;Thought&amp;quot; --&amp;gt; H[&amp;quot;Inner Monologue: ...&amp;quot;];
H --&amp;gt; I{&amp;quot;Is Task Complete?&amp;quot;};
I -- &amp;quot;No&amp;quot; --&amp;gt; D;
I -- &amp;quot;Yes&amp;quot; --&amp;gt; J[&amp;quot;Final Answer&amp;quot;];
J --&amp;gt; K[&amp;quot;Respond to User&amp;quot;];
&lt;/code>&lt;/pre>
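&lt;p>In code, the cycle in the diagram reduces to a simple loop (a sketch: &lt;code>llm&lt;/code> is assumed to return its Thought/Action output already parsed into a dict, and &lt;code>tools&lt;/code> is a name-to-function map; both are hypothetical stand-ins):&lt;/p>
&lt;pre>&lt;code class="language-python">def react_loop(llm, tools, user_input, max_steps=8):
    '''Think-act-observe until the model emits a final answer.'''
    transcript = ['User: ' + user_input]
    for _ in range(max_steps):
        step = llm('\n'.join(transcript))          # Thought + Action, parsed
        transcript.append('Thought: ' + step['thought'])
        if step['action'] == 'finish':
            return step['answer']                  # task judged complete
        observation = tools[step['action']](step['action_input'])
        transcript.append('Observation: ' + str(observation))
    return 'Step limit reached without a final answer.'
&lt;/code>&lt;/pre>
&lt;p>The &lt;code>max_steps&lt;/code> guard matters in practice: without it, a confused model can repeat the same failing Action forever.&lt;/p>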
&lt;h3 id="33-finite-state-machine-fsm-building-tracks-for-dialogue-flow">3.3 Finite State Machine (FSM): Building &amp;ldquo;Tracks&amp;rdquo; for Dialogue Flow&lt;/h3>
&lt;p>For tasks with clear goals and relatively fixed processes (such as food ordering, customer service), Finite State Machines (FSM) are an extremely powerful and reliable architecture.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Macro Concept&lt;/strong>: Abstracts the complex dialogue flow into a series of discrete &amp;ldquo;states&amp;rdquo; and the &amp;ldquo;transition conditions&amp;rdquo; between them. The Agent is in exactly one state at any moment and can only transition to the next state through predefined paths.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Underlying Implementation&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>States&lt;/strong>: Define possible nodes in the dialogue, such as &lt;code>AskLocation&lt;/code>, &lt;code>AskCuisine&lt;/code>, &lt;code>ConfirmOrder&lt;/code>, &lt;code>OrderPlaced&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Transitions&lt;/strong>: Define rules for state switching, typically triggered by user input or tool output. For example, in the &lt;code>AskLocation&lt;/code> state, if location information is successfully extracted from user input, transition to the &lt;code>AskCuisine&lt;/code> state.&lt;/li>
&lt;li>&lt;strong>State Handler&lt;/strong>: Each state is associated with a handler function responsible for executing specific logic in that state (such as asking the user questions, calling APIs).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
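&lt;p>These three pieces (states, transitions, handlers) wire together in very little code (a minimal sketch; the toy &lt;code>extract_location&lt;/code> stands in for a real NER model or LLM extractor):&lt;/p>
&lt;pre>&lt;code class="language-python">class FSM:
    '''States map to handler functions; each handler returns the next state.'''
    def __init__(self, initial, handlers):
        self.state = initial
        self.handlers = handlers   # state name to handler(user_input)

    def step(self, user_input):
        self.state = self.handlers[self.state](user_input)
        return self.state

def extract_location(text):
    # Toy extractor; a real system would use an LLM or NER model.
    known = ['Beijing', 'Shanghai']
    return next((city for city in known if city in text), None)

def ask_location(text):
    # Transition only when a location was actually extracted.
    return 'AskCuisine' if extract_location(text) else 'AskLocation'
&lt;/code>&lt;/pre>
&lt;p>For example, &lt;code>FSM('AskLocation', {'AskLocation': ask_location})&lt;/code> stays in &lt;code>AskLocation&lt;/code> until the user's message actually contains a location.&lt;/p>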
&lt;h4 id="a-simple-food-ordering-agent">A Simple Food Ordering Agent&lt;/h4>
&lt;pre>&lt;code class="language-mermaid">stateDiagram-v2
[*] --&amp;gt; Awaiting_Order
Awaiting_Order: User initiates food order
Awaiting_Order --&amp;gt; Collect_Cuisine: Identify ordering intent
Collect_Cuisine: &amp;quot;What cuisine would you like?&amp;quot;
Collect_Cuisine --&amp;gt; Collect_Headcount: User provides cuisine
Collect_Headcount: &amp;quot;How many people dining?&amp;quot;
Collect_Headcount --&amp;gt; Confirmation: User provides headcount
state Confirmation {
direction LR
[*] --&amp;gt; Show_Summary
Show_Summary: &amp;quot;Booking [headcount] for [cuisine], confirm?&amp;quot;
Show_Summary --&amp;gt; Finalize: User confirms
Finalize --&amp;gt; [*]
}
Confirmation --&amp;gt; Collect_Cuisine: User modifies
&lt;/code>&lt;/pre>
&lt;h4 id="modern-evolution-of-fsm-dynamic-and-hierarchical">Modern Evolution of FSM: Dynamic and Hierarchical&lt;/h4>
&lt;p>Traditional FSMs rely on hardcoded rules for state transitions, which can be rigid when facing complex, changing real-world scenarios. Modern Agent design deeply integrates FSM with LLM capabilities, giving rise to more intelligent and flexible architectures.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>LLM-Driven State Transitions&lt;/strong>: Rather than using fixed &lt;code>if-else&lt;/code> rules to determine state changes, let the LLM make decisions. In each cycle, pass the dialogue history, current user input, and a list of all possible target states to the LLM, allowing it to determine the most appropriate next state based on its powerful context understanding. This upgrades state transitions from &amp;ldquo;rule-driven&amp;rdquo; to &amp;ldquo;intelligence-driven.&amp;rdquo;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>State-Specific Prompts&lt;/strong>: This is a powerful application of dynamic prompting. For each core state node in the FSM, a highly optimized set of dedicated prompts can be pre-designed. When the Agent enters a certain state (such as &lt;code>Collect_Cuisine&lt;/code>), the system immediately activates the prompt corresponding to that state. This prompt not only guides the LLM on how to interact with users at that node but can also define tools that can be called in that state, rules to follow, etc. This allows the Agent to &amp;ldquo;wear different hats&amp;rdquo; at different task stages, exhibiting high professionalism and task relevance.&lt;/p>
&lt;/li>
&lt;/ul>
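&lt;p>An LLM-driven transition can be sketched as one extra model call per turn (the &lt;code>llm&lt;/code> callable and the prompt shape are illustrative assumptions):&lt;/p>
&lt;pre>&lt;code class="language-python">def choose_next_state(llm, history, user_input, candidate_states):
    '''Let the model pick the next FSM state instead of hardcoded rules.'''
    prompt = (
        'Conversation so far:\n' + '\n'.join(history) +
        '\nLatest user message: ' + user_input +
        '\nReply with exactly one of: ' + ', '.join(candidate_states)
    )
    choice = llm(prompt).strip()
    # Guard rail: fall back to a safe default if the model goes off-script.
    return choice if choice in candidate_states else candidate_states[0]
&lt;/code>&lt;/pre>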
&lt;h5 id="example-statespecific-prompt-for-queryflights-state-in-flight-booking-subprocess">Example: State-Specific Prompt for &lt;code>Query_Flights&lt;/code> State in Flight Booking Sub-Process&lt;/h5>
&lt;pre>&lt;code># IDENTITY
You are a world-class flight booking assistant AI.
# STATE &amp;amp; GOAL
You are currently in the &amp;quot;Query_Flights&amp;quot; state.
Your SOLE GOAL is to collect the necessary information to search for flights.
The necessary information is: origin city, destination city, and departure date.
# AVAILABLE TOOLS
- `flight_search_api(origin: str, destination: str, date: str)`: Use this tool to search for flights.
# CONTEXT
- Conversation History:
{conversation_history}
- User Profile:
{user_profile}
- Current State Data:
{state_data} # e.g., {&amp;quot;origin&amp;quot;: &amp;quot;Shanghai&amp;quot;, &amp;quot;destination&amp;quot;: &amp;quot;Beijing&amp;quot;, &amp;quot;date&amp;quot;: null}
# RULES
1. Analyze the Current State Data first.
2. If any necessary information (origin, destination, date) is missing, you MUST ask the user for it clearly.
3. Phrase your questions to sound helpful and natural.
4. Once all information is collected, your FINAL ACTION MUST be to call the `flight_search_api` tool with the correct parameters.
5. Do not make up information. Do not ask for information that is not required (e.g., return date, unless specified by the user).
# OUTPUT FORMAT
Your output must be a single JSON object.
- To ask a question: {&amp;quot;action&amp;quot;: &amp;quot;ask_user&amp;quot;, &amp;quot;question&amp;quot;: &amp;quot;Your question here.&amp;quot;}
- To call a tool: {&amp;quot;action&amp;quot;: &amp;quot;call_tool&amp;quot;, &amp;quot;tool_name&amp;quot;: &amp;quot;flight_search_api&amp;quot;, &amp;quot;tool_params&amp;quot;: {&amp;quot;origin&amp;quot;: &amp;quot;...&amp;quot;, &amp;quot;destination&amp;quot;: &amp;quot;...&amp;quot;, &amp;quot;date&amp;quot;: &amp;quot;...&amp;quot;}}
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;strong>Hierarchical FSM&lt;/strong>: For large complex tasks, a single flat state diagram is difficult to manage. Hierarchical FSMs introduce the concept of &amp;ldquo;SOP nesting&amp;rdquo; or &amp;ldquo;sub-state diagrams.&amp;rdquo; A high-level FSM (main SOP) is responsible for planning the macro business process (such as &amp;ldquo;complete a travel booking&amp;rdquo;), and when the process reaches a certain macro state (such as &amp;ldquo;book flight&amp;rdquo;), it can activate an embedded, more detailed sub-FSM (sub-SOP) that specifically handles a series of refined operations like &amp;ldquo;query flights -&amp;gt; select seats -&amp;gt; confirm payment.&amp;rdquo; This pattern greatly enhances the modularity and manageability of task decomposition.&lt;/li>
&lt;/ul>
&lt;h5 id="hierarchical-state-machine-sop-nesting-example">Hierarchical State Machine (SOP Nesting) Example&lt;/h5>
&lt;pre>&lt;code class="language-mermaid">stateDiagram-v2
direction LR
[*] --&amp;gt; MainSOP
state &amp;quot;Main Process: Travel Planning (Main SOP)&amp;quot; as MainSOP {
[*] --&amp;gt; Collect_Trip_Info
note right of Collect_Trip_Info
User: &amp;quot;Help me plan a trip to Beijing&amp;quot;
end note
Collect_Trip_Info --&amp;gt; Book_Flight_Sub_SOP : &amp;quot;OK, let's book flights first&amp;quot;
state &amp;quot;Sub-Process: Flight Booking&amp;quot; as Book_Flight_Sub_SOP {
direction LR
[*] --&amp;gt; Query_Flights: &amp;quot;When do you want to depart?&amp;quot;
Query_Flights --&amp;gt; Select_Seat: &amp;quot;Found flights, please select seat&amp;quot;
Select_Seat --&amp;gt; Confirm_Payment: &amp;quot;Seat selected, please pay&amp;quot;
Confirm_Payment --&amp;gt; [*]: Payment successful
}
Book_Flight_Sub_SOP --&amp;gt; Book_Hotel: &amp;quot;Flight booked, now for hotel&amp;quot;
Book_Hotel --&amp;gt; Finalize_Trip: &amp;quot;Hotel booked, final confirmation&amp;quot;
Finalize_Trip --&amp;gt; [*]
}
&lt;/code>&lt;/pre>
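&lt;p>The nesting in the diagram can be implemented with a stack of active machines: entering a sub-process pushes its FSM, completing it pops back to the parent (a minimal sketch; the string labels are placeholders for full FSM objects):&lt;/p>
&lt;pre>&lt;code class="language-python">class SOPStack:
    '''The top of the stack is the machine currently driving the dialogue.'''
    def __init__(self, main_sop):
        self.stack = [main_sop]

    def enter_sub_sop(self, sub_sop):
        self.stack.append(sub_sop)      # e.g. Book_Flight activates its sub-FSM

    def complete_current(self):
        finished = self.stack.pop()     # sub-process done, parent resumes
        return finished, self.stack[-1] if self.stack else None

    def active(self):
        return self.stack[-1]
&lt;/code>&lt;/pre>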
&lt;p>&lt;strong>FSM vs. ReAct&lt;/strong>: FSM is structured, predictable, and easy to debug, making it very suitable for task-oriented dialogues. ReAct is more flexible and versatile, suitable for handling open-ended tasks requiring complex reasoning and dynamic planning. In practice, the two are often combined (for example, using ReAct to handle an open-ended subtask within an FSM state, or as mentioned above, using an LLM to drive FSM state transitions).&lt;/p>
&lt;h2 id="4-core-components-agents-memory-system">4. Core Components: Agent's &amp;ldquo;Memory&amp;rdquo; System&lt;/h2>
&lt;p>Regardless of the architecture used, a powerful memory system is the cornerstone of effective multi-turn dialogue.&lt;/p>
&lt;h3 id="41-shortterm-memory">4.1 Short-term Memory&lt;/h3>
&lt;p>Also known as working memory, short-term memory stores the most recent dialogue history.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Typical Implementation&lt;/strong>: &lt;code>ConversationBufferMemory&lt;/code> or &lt;code>ConversationBufferWindowMemory&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>:
&lt;ul>
&lt;li>&lt;code>ConversationBufferMemory&lt;/code>: Stores complete dialogue history. Simple and direct, but quickly exhausts the context window in long conversations.&lt;/li>
&lt;li>&lt;code>ConversationBufferWindowMemory&lt;/code>: Only keeps the most recent &lt;code>k&lt;/code> turns of dialogue. This sliding window mechanism effectively controls length but risks losing important early information.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
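&lt;p>The sliding-window behavior of &lt;code>ConversationBufferWindowMemory&lt;/code> amounts to keeping only the last &lt;code>k&lt;/code> turns (a simplified re-implementation of the idea, not the library's actual code):&lt;/p>
&lt;pre>&lt;code class="language-python">from collections import deque

class WindowMemory:
    '''Keep only the most recent k (user, assistant) turns.'''
    def __init__(self, k=5):
        self.turns = deque(maxlen=k)   # older turns fall off automatically

    def add_turn(self, user_msg, ai_msg):
        self.turns.append((user_msg, ai_msg))

    def as_prompt(self):
        lines = []
        for user_msg, ai_msg in self.turns:
            lines.append('User: ' + user_msg)
            lines.append('Assistant: ' + ai_msg)
        return '\n'.join(lines)
&lt;/code>&lt;/pre>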
&lt;h3 id="42-longterm-memory">4.2 Long-term Memory&lt;/h3>
&lt;p>Long-term memory stores persistent knowledge and information that survives across dialogues.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Typical Implementation&lt;/strong>: Retrieval-Augmented Generation (RAG) based on &lt;strong>vector databases&lt;/strong>.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>:
&lt;ol>
&lt;li>Chunk external documents (such as product manuals, knowledge base articles) or key information from past conversations.&lt;/li>
&lt;li>Use an Embedding model to convert text blocks into vectors.&lt;/li>
&lt;li>Store vectors in a vector database (such as Chroma, Pinecone, FAISS).&lt;/li>
&lt;li>When a user asks a question, convert their question into a vector as well.&lt;/li>
&lt;li>Perform similarity search in the vector database to find the most relevant text blocks.&lt;/li>
&lt;li>Inject these text blocks as context along with the user's question into the LLM's prompt, guiding it to generate more precise answers.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
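&lt;p>The six steps above can be sketched end to end with an in-memory index (a toy sketch: the &lt;code>embed&lt;/code> callable stands in for a real embedding model, and brute-force cosine similarity replaces a real vector database):&lt;/p>
&lt;pre>&lt;code class="language-python">import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    '''Steps 1-3: chunk, embed, store. Steps 4-6: embed query, search, build prompt.'''
    def __init__(self, embed):
        self.embed = embed
        self.items = []   # (vector, chunk_text) pairs

    def add(self, chunk):
        self.items.append((self.embed(chunk), chunk))

    def retrieve(self, question, top_k=2):
        query_vec = self.embed(question)
        ranked = sorted(self.items, key=lambda it: cosine(it[0], query_vec), reverse=True)
        return [text for _, text in ranked[:top_k]]

def build_prompt(store, question):
    # Step 6: inject the retrieved chunks as context alongside the question.
    context = '\n'.join(store.retrieve(question))
    return 'Context:\n' + context + '\nQuestion: ' + question
&lt;/code>&lt;/pre>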
&lt;h3 id="43-structured-memory">4.3 Structured Memory&lt;/h3>
&lt;p>Stores and retrieves information in a structured way, especially key entities and their relationships from conversations.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Typical Implementation&lt;/strong>: Entity-relationship storage based on knowledge graphs, such as the &lt;code>Graphiti&lt;/code> project using Neo4j.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>:
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Knowledge Graph Advantages&lt;/strong>: Unlike simple key-value storage, knowledge graphs can capture complex relationship networks between entities. For example, not just recording a person named &amp;ldquo;John,&amp;rdquo; but also recording &amp;ldquo;John is Mary's manager,&amp;rdquo; &amp;ldquo;John is responsible for Project A,&amp;rdquo; and other relationship information.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Graphiti Project Analysis&lt;/strong>: &lt;a href="https://github.com/getzep/graphiti">Graphiti&lt;/a> is a knowledge graph memory system designed specifically for LLM Agents, seamlessly integrating Neo4j's graph database capabilities with LLM's natural language processing abilities.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Workflow&lt;/strong>:
&lt;ol>
&lt;li>&lt;strong>Entity and Relationship Extraction&lt;/strong>: LLM analyzes conversation content, identifying key entities and their relationships&lt;/li>
&lt;li>&lt;strong>Graph Construction&lt;/strong>: Transforms identified entities and relationships into Cypher query statements, dynamically updating the Neo4j graph database&lt;/li>
&lt;li>&lt;strong>Context Enhancement&lt;/strong>: In subsequent conversations, retrieves relevant entity networks through graph queries, injecting them as context into the LLM's prompt&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>&lt;strong>Technical Highlights&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Automatic Schema Inference&lt;/strong>: No need to predefine entity types and relationships; the system can automatically infer appropriate graph structures from conversations&lt;/li>
&lt;li>&lt;strong>Incremental Updates&lt;/strong>: As conversations progress, the graph is continuously enriched and corrected, forming an increasingly complete knowledge network&lt;/li>
&lt;li>&lt;strong>Relationship Reasoning&lt;/strong>: Supports multi-hop queries, able to discover indirectly associated information (e.g., &amp;ldquo;Who are the colleagues of John's manager?&amp;rdquo;)&lt;/li>
&lt;li>&lt;strong>Temporal Awareness&lt;/strong>: Graphiti/Zep's core feature is its Temporal Knowledge Graph architecture, where each node and relationship carries timestamp attributes, enabling the system to:
&lt;ul>
&lt;li>Track how entity states change over time (e.g., &amp;ldquo;John was a developer last year, promoted to project manager this year&amp;rdquo;)&lt;/li>
&lt;li>Perform temporal reasoning (e.g., &amp;ldquo;What was B's status before event A occurred?&amp;rdquo;)&lt;/li>
&lt;li>Resolve time-related queries (e.g., &amp;ldquo;How is the project mentioned last month progressing now?&amp;rdquo;)&lt;/li>
&lt;li>Automatically identify and handle outdated information, ensuring answers are based on the latest factual state&lt;/li>
&lt;li>Build event timelines, helping the Agent understand causal relationships and event sequences&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Practical Application Example&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-python">from graphiti import GraphMemory
# Initialize graph memory
graph_memory = GraphMemory(
    neo4j_uri=&amp;quot;neo4j://localhost:7687&amp;quot;,
    neo4j_user=&amp;quot;neo4j&amp;quot;,
    neo4j_password=&amp;quot;password&amp;quot;
)
# Update graph in conversation
user_message = &amp;quot;My project manager John said we're starting a new project next week&amp;quot;
graph_memory.update_from_text(user_message)
# Retrieve relevant information in subsequent conversations
query = &amp;quot;Who is the project manager?&amp;quot;
context = graph_memory.retrieve_relevant_context(query)
# Returns: &amp;quot;John is the project manager, responsible for a new project starting next week.&amp;quot;
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Comparison with Traditional Entity Memory&lt;/strong>: Traditional methods can only store flat entity-attribute pairs, while knowledge graph methods can express and query complex multi-level relationship networks, providing Agents with richer, more insightful contextual information.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Essentially a Form of Long-term Memory&lt;/strong>: Although we discuss structured memory as a separate category, knowledge graph systems like Graphiti/Zep are essentially an advanced form of long-term memory. They not only persistently store information across conversations but also organize this information in a more structured, queryable, and reasoning-friendly way. Compared to semantic similarity retrieval in vector databases, knowledge graphs provide more precise relationship navigation and reasoning capabilities.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="graphitizep-temporal-knowledge-graph-architecture-and-workflow">Graphiti/Zep Temporal Knowledge Graph Architecture and Workflow&lt;/h4>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;User Conversation History&amp;quot;
A1[&amp;quot;Conversation 1: 'I'm John, a software engineer'&amp;quot;] --&amp;gt; A2[&amp;quot;Conversation 2: 'I'm responsible for Project A'&amp;quot;]
A2 --&amp;gt; A3[&amp;quot;Conversation 3: 'I was a developer last year, promoted to project manager this year'&amp;quot;]
A3 --&amp;gt; A4[&amp;quot;Conversation 4: 'Mary is a member of my team'&amp;quot;]
end
subgraph &amp;quot;Entity and Relationship Extraction&amp;quot;
B[&amp;quot;LLM Analyzer&amp;quot;] --&amp;gt; C[&amp;quot;Entity Recognition: John, Project A, Mary&amp;quot;]
B --&amp;gt; D[&amp;quot;Relationship Extraction: John-responsible for-Project A, John-manages-Mary&amp;quot;]
B --&amp;gt; E[&amp;quot;Temporal Attributes: John.role(2024)=project manager, John.role(2023)=developer&amp;quot;]
end
subgraph &amp;quot;Temporal Knowledge Graph&amp;quot;
F[&amp;quot;John (Person)&amp;quot;] -- &amp;quot;role(2023)&amp;quot; --&amp;gt; G[&amp;quot;Developer&amp;quot;]
F -- &amp;quot;role(2024)&amp;quot; --&amp;gt; H[&amp;quot;Project Manager&amp;quot;]
F -- &amp;quot;responsible for(2024)&amp;quot; --&amp;gt; I[&amp;quot;Project A&amp;quot;]
F -- &amp;quot;manages(2024)&amp;quot; --&amp;gt; J[&amp;quot;Mary (Person)&amp;quot;]
end
subgraph &amp;quot;Query and Reasoning&amp;quot;
K[&amp;quot;User Question: 'What was John's position last year?'&amp;quot;]
L[&amp;quot;Graph Query: MATCH (p:Person {name:'John'})-[r:role {year:2023}]-&amp;gt;(role) RETURN role&amp;quot;]
M[&amp;quot;Result: 'Developer'&amp;quot;]
N[&amp;quot;Temporal Reasoning: 'John's career progression is from developer to project manager'&amp;quot;]
end
A4 --&amp;gt; B
E --&amp;gt; F
K --&amp;gt; L
L --&amp;gt; M
M --&amp;gt; N
style F fill:#f9f,stroke:#333,stroke-width:2px
style I fill:#bbf,stroke:#333,stroke-width:2px
style J fill:#f9f,stroke:#333,stroke-width:2px
style G fill:#bfb,stroke:#333,stroke-width:2px
style H fill:#bfb,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;p>This diagram shows how Graphiti/Zep transforms conversation history into a knowledge graph with a temporal dimension, supporting time-based queries and reasoning. Timestamps enable the system to track the evolution of entity attributes and relationships, answering &amp;ldquo;when&amp;rdquo; and &amp;ldquo;how changed&amp;rdquo; types of questions, capabilities that traditional knowledge graphs and vector stores struggle to achieve.&lt;/p>
&lt;h3 id="44-summary-memory">4.4 Summary Memory&lt;/h3>
&lt;p>As mentioned earlier, summary memory saves context space by maintaining rolling summaries of the dialogue history.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Typical Implementation&lt;/strong>: &lt;code>ConversationSummaryMemory&lt;/code> or &lt;code>ConversationSummaryBufferMemory&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>:
&lt;ul>
&lt;li>&lt;code>ConversationSummaryMemory&lt;/code>: Summarizes the entire dialogue history each time, which is costly.&lt;/li>
&lt;li>&lt;code>ConversationSummaryBufferMemory&lt;/code>: A hybrid strategy. It keeps the most recent &lt;code>k&lt;/code> turns of complete dialogue while maintaining a rolling summary of earlier conversations. This achieves a good balance between cost and information fidelity.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
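&lt;p>The hybrid strategy of &lt;code>ConversationSummaryBufferMemory&lt;/code> can be sketched as follows (simplified: &lt;code>summarize&lt;/code> stands in for the extra LLM call, and token counting is approximated by words):&lt;/p>
&lt;pre>&lt;code class="language-python">class SummaryBufferMemory:
    '''Keep recent turns verbatim; fold overflow into a rolling summary.'''
    def __init__(self, summarize, max_words=50):
        self.summarize = summarize     # LLM call: (old_summary, dropped_turn) to new summary
        self.max_words = max_words
        self.summary = ''
        self.recent = []

    def add_turn(self, turn):
        self.recent.append(turn)
        while sum(len(t.split()) for t in self.recent) &amp;gt; self.max_words:
            dropped = self.recent.pop(0)
            self.summary = self.summarize(self.summary, dropped)

    def as_prompt(self):
        return 'Summary: ' + self.summary + '\nRecent:\n' + '\n'.join(self.recent)
&lt;/code>&lt;/pre>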
&lt;h3 id="45-user-profile-memory">4.5 User Profile Memory&lt;/h3>
&lt;p>This is a more proactive, advanced form of structured memory, aimed at going beyond single conversations to establish a persistent, dynamically updated &amp;ldquo;profile&amp;rdquo; for users. The Agent not only remembers conversation content but also &amp;ldquo;who you are.&amp;rdquo;&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Macro Concept&lt;/strong>: Structurally store user preferences, habits, historical choices, and even demographic information (with user authorization). In each interaction, inject this &amp;ldquo;user profile&amp;rdquo; as key context directly into the prompt, allowing the LLM to &amp;ldquo;understand&amp;rdquo; its conversation partner from the start.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Underlying Implementation&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Data Structure&lt;/strong>: Typically maintains user metadata in the form of key-value pairs (such as JSON objects). For example: &lt;code>{&amp;quot;user_id&amp;quot;: &amp;quot;123&amp;quot;, &amp;quot;preferred_language&amp;quot;: &amp;quot;English&amp;quot;, &amp;quot;dietary_restrictions&amp;quot;: [&amp;quot;vegetarian&amp;quot;], &amp;quot;home_city&amp;quot;: &amp;quot;Shanghai&amp;quot;}&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Prompt Injection&lt;/strong>: When building the final prompt, include the serialized user profile string (such as &lt;code>[UserProfile]...[/UserProfile]&lt;/code>) as a fixed part of the context.&lt;/li>
&lt;li>&lt;strong>Dynamic Maintenance&lt;/strong>: This is the core of the mechanism. After a conversation ends, the Agent or a background process analyzes the interaction to determine if the user profile needs updating. For example, when a user says &amp;ldquo;I recently moved to Beijing,&amp;rdquo; the system needs a mechanism to update the &lt;code>home_city&lt;/code> field. This update process itself may require a separate LLM call for information extraction and decision-making.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Advantages&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High Personalization&lt;/strong>: The Agent can provide forward-looking, highly customized services.&lt;/li>
&lt;li>&lt;strong>Conversation Efficiency&lt;/strong>: Avoids repeatedly asking users for basic preferences, making interactions smoother.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Challenges&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Update Mechanism Complexity&lt;/strong>: How to accurately and safely update user profiles is a technical challenge.&lt;/li>
&lt;li>&lt;strong>Token Consumption&lt;/strong>: User profiles occupy valuable context window space.&lt;/li>
&lt;li>&lt;strong>Data Privacy&lt;/strong>: Must strictly adhere to user privacy policies.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
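&lt;p>The three-step mechanism above can be sketched like this (the &lt;code>extract&lt;/code> callable is shown as a trivial rule; in practice it would be the separate LLM call for information extraction mentioned above):&lt;/p>
&lt;pre>&lt;code class="language-python">import json

def inject_profile(profile, user_message):
    '''Step 2: serialize the profile into a fixed context block of the prompt.'''
    return '[UserProfile]' + json.dumps(profile) + '[/UserProfile]\n' + user_message

def update_profile(profile, user_message, extract):
    '''Step 3: after the turn, apply any field updates the extractor found.'''
    updates = extract(user_message)   # e.g. {'home_city': 'Beijing'} or {}
    profile.update(updates)
    return profile
&lt;/code>&lt;/pre>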
&lt;h2 id="5-summary-and-outlook">5. Summary and Outlook&lt;/h2>
&lt;p>Building an LLM Agent capable of smooth, intelligent multi-turn dialogue is a complex system engineering task. It requires us to:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Face Physical Limitations&lt;/strong>: Overcome context window bottlenecks through clever &lt;strong>memory management mechanisms&lt;/strong> (such as summaries, RAG).&lt;/li>
&lt;li>&lt;strong>Choose Appropriate Architecture&lt;/strong>: Balance &lt;strong>flexibility (ReAct)&lt;/strong> and &lt;strong>structure (FSM)&lt;/strong> based on task complexity, or even combine both.&lt;/li>
&lt;li>&lt;strong>Design Robust Processes&lt;/strong>: Build in &lt;strong>state tracking&lt;/strong>, &lt;strong>intent recognition&lt;/strong>, and &lt;strong>error correction&lt;/strong> capabilities to keep the Agent stable and reliable in complex interactions.&lt;/li>
&lt;/ol>
&lt;p>Future development will focus more on the Agent's autonomous learning and evolution capabilities. Agents will not only execute tasks but also learn new skills from interactions with users, optimize their tool calling strategies, and dynamically adjust their conversation style, ultimately becoming truly personalized intelligent partners.&lt;/p></description></item><item><title>Retrieval-Augmented Generation (RAG): A Comprehensive Technical Analysis</title><link>https://ziyanglin.netlify.app/en/post/rag-technical-documentation/</link><pubDate>Mon, 30 Jun 2025 10:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/rag-technical-documentation/</guid><description>&lt;h2 id="1-macro-overview-why-rag">1. Macro Overview: Why RAG?&lt;/h2>
&lt;h3 id="11-what-is-rag">1.1 What is RAG?&lt;/h3>
&lt;p>RAG, or Retrieval-Augmented Generation, is a technical framework that combines information retrieval from external knowledge bases with the powerful generative capabilities of large language models (LLMs). In simple terms, when a user asks a question, a RAG system first retrieves the most relevant information snippets from a vast, updatable knowledge base (such as company internal documents, product manuals, or the latest web information), and then &amp;ldquo;feeds&amp;rdquo; this information along with the original question to the language model, enabling it to generate answers based on precise, up-to-date context.&lt;/p>
&lt;p>To use an analogy: Imagine a student taking an open-book exam. This student (the LLM) has already learned a lot of knowledge (pre-training data), but when answering very specific questions or those involving the latest information, they can refer to reference books (external knowledge base). RAG is this &amp;ldquo;open-book&amp;rdquo; process, allowing the LLM to consult the most recent and authoritative materials when answering questions, thus providing more accurate and comprehensive answers.&lt;/p>
&lt;h3 id="12-rags-core-value-solving-llms-inherent-limitations">1.2 RAG's Core Value: Solving LLM's Inherent Limitations&lt;/h3>
&lt;p>Despite their power, large language models have several inherent limitations that RAG technology specifically addresses.&lt;/p>
&lt;p>&lt;strong>Limitation 1: Knowledge Cut-off&lt;/strong>&lt;/p>
&lt;p>An LLM's knowledge is frozen at the time of its last training run. For example, a model whose training data ends in early 2023 cannot answer questions about events that occurred after that point. RAG solves this problem by introducing an external knowledge base that can be updated at any time. Companies can refresh their knowledge bases with the latest product information, financial reports, market dynamics, and so on, and the RAG system can immediately leverage this new knowledge to answer questions.&lt;/p>
&lt;p>&lt;strong>Limitation 2: Hallucination&lt;/strong>&lt;/p>
&lt;p>When LLMs encounter questions outside their knowledge domain or with uncertain answers, they sometimes &amp;ldquo;confidently make things up,&amp;rdquo; fabricating facts and producing what are known as &amp;ldquo;hallucinations.&amp;rdquo; RAG greatly constrains model output by providing clear, fact-based reference materials. The model is required to answer based on the retrieved context, which effectively defines the scope of its response, significantly reducing the probability of hallucinations.&lt;/p>
&lt;p>&lt;strong>Limitation 3: Lack of Domain-Specific Knowledge&lt;/strong>&lt;/p>
&lt;p>General-purpose LLMs often perform poorly when handling specialized questions in specific industries or enterprises. For example, they don't understand a company's internal processes or the technical specifications of particular products. Through RAG, enterprises can build a specialized knowledge base containing internal regulations, technical documentation, customer support records, and more. This equips the LLM with domain expert knowledge, enabling it to handle highly specialized Q&amp;amp;A tasks.&lt;/p>
&lt;p>&lt;strong>Limitation 4: Lack of Transparency &amp;amp; Interpretability&lt;/strong>&lt;/p>
&lt;p>The answer generation process of traditional LLMs is a &amp;ldquo;black box&amp;rdquo; - we cannot know what information they based their conclusions on. This is fatal in fields requiring high credibility, such as finance, healthcare, and law. The RAG architecture naturally enhances transparency because the system can clearly show &amp;ldquo;I derived this answer based on these documents (Source 1, Source 2&amp;hellip;)&amp;rdquo;. Users can trace and verify the sources of information, greatly enhancing trust in the answers.&lt;/p>
&lt;h3 id="13-rags-macro-workflow">1.3 RAG's Macro Workflow&lt;/h3>
&lt;p>At the highest level, RAG's workflow can be depicted as a simple yet elegant architecture.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;User Query&amp;quot;] --&amp;gt; B{RAG System};
B --&amp;gt; C[&amp;quot;Retrieve&amp;quot;];
C --&amp;gt; D[&amp;quot;External Knowledge Base&amp;quot;];
D --&amp;gt; C;
C --&amp;gt; E[&amp;quot;Augment&amp;quot;];
A --&amp;gt; E;
E --&amp;gt; F[&amp;quot;Generate&amp;quot;];
F --&amp;gt; G[LLM];
G --&amp;gt; F;
F --&amp;gt; H[&amp;quot;Final Answer with Sources&amp;quot;];
&lt;/code>&lt;/pre>
&lt;p>This workflow can be interpreted as:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Retrieve&lt;/strong>: After receiving a user's question, the system first converts it into a format suitable for searching (such as a vector), then quickly matches and retrieves the most relevant information snippets from the knowledge base.&lt;/li>
&lt;li>&lt;strong>Augment&lt;/strong>: The system integrates the retrieved information snippets with the user's original question into a richer &amp;ldquo;prompt.&amp;rdquo;&lt;/li>
&lt;li>&lt;strong>Generate&lt;/strong>: This enhanced prompt is sent to the LLM, guiding it to generate a content-rich and accurate answer based on the provided context, along with sources of information.&lt;/li>
&lt;/ol>
&lt;p>Through this process, RAG successfully transforms the LLM from a &amp;ldquo;closed-world scholar&amp;rdquo; into an &amp;ldquo;open-world, verifiable expert.&amp;rdquo;&lt;/p>
&lt;h2 id="2-rag-core-architecture-dual-process-analysis">2. RAG Core Architecture: Dual Process Analysis&lt;/h2>
&lt;p>The lifecycle of a RAG system can be clearly divided into two core processes:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Offline Process: Indexing&lt;/strong>: This is a preprocessing stage responsible for transforming raw data sources into a knowledge base ready for quick retrieval. This process typically runs in the background and is triggered whenever the knowledge base content needs updating.&lt;/li>
&lt;li>&lt;strong>Online Process: Retrieval &amp;amp; Generation&lt;/strong>: This is the real-time process of user interaction with the system, responsible for retrieving information from the index based on user input and generating answers.&lt;/li>
&lt;/ol>
&lt;p>Below, we'll analyze these two processes through detailed diagrams and explanations.&lt;/p>
&lt;h3 id="21-offline-process-indexing">2.1 Offline Process: Indexing&lt;/h3>
&lt;p>The goal of this process is to transform unstructured or semi-structured raw data into structured, easily queryable indices.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;Offline Indexing Pipeline&amp;quot;
A[&amp;quot;Data Sources&amp;quot;] --&amp;gt; B[&amp;quot;Load&amp;quot;];
B --&amp;gt; C[&amp;quot;Split/Chunk&amp;quot;];
C --&amp;gt; D[&amp;quot;Embed&amp;quot;];
D --&amp;gt; E[&amp;quot;Store/Index&amp;quot;];
end
A --&amp;gt; A_Details(&amp;quot;e.g.: PDFs, .txt, .md, Notion, Confluence, databases&amp;quot;);
B --&amp;gt; B_Details(&amp;quot;Using data loaders, e.g., LlamaIndex Readers&amp;quot;);
C --&amp;gt; C_Details(&amp;quot;Strategies: fixed size, recursive splitting, semantic chunking&amp;quot;);
D --&amp;gt; D_Details(&amp;quot;Using Embedding models, e.g., BERT, Sentence-BERT, e5-large-v2&amp;quot;);
E --&amp;gt; E_Details(&amp;quot;Store in vector databases, e.g., Chroma, Pinecone, FAISS&amp;quot;);
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Process Details:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Load&lt;/strong>: The system first needs to load original documents from various specified data sources. These sources can be diverse, such as PDF files, Markdown documents, web pages, Notion pages, database records, etc. Modern RAG frameworks (like LlamaIndex, LangChain) provide rich data loader ecosystems to simplify this process.&lt;/li>
&lt;li>&lt;strong>Split/Chunk&lt;/strong>: Due to the limited context window of language models, directly embedding a long document (like a PDF with hundreds of pages) as a single vector performs poorly and loses many details. Therefore, it's essential to split long texts into smaller, semantically complete chunks. The chunking strategy is crucial and directly affects retrieval precision.&lt;/li>
&lt;li>&lt;strong>Embed&lt;/strong>: This is the core step of transforming textual information into machine-understandable mathematical representations. The system uses a pre-trained embedding model to map each text chunk to a high-dimensional vector. This vector captures the semantic information of the text, with semantically similar text chunks being closer to each other in the vector space.&lt;/li>
&lt;li>&lt;strong>Store/Index&lt;/strong>: Finally, the system stores the vector representations of all text chunks along with their metadata (such as source document, chapter, page number, etc.) in a specialized database, typically a vector database. Vector databases are specially optimized to support efficient similarity searches across massive-scale vector data.&lt;/li>
&lt;/ol>
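The four steps above can be sketched end-to-end in a few lines. This is a toy illustration, not a framework API: the function names (`load`, `split`, `embed`, `build_index`) are hypothetical, and a bag-of-words counter stands in for a real embedding model and vector database.

```python
# Toy sketch of the Load -> Chunk -> Embed -> Store pipeline.
# A word-count Counter stands in for a real embedding model; a plain
# list of dicts stands in for a vector database.
from collections import Counter

def load(sources):
    # Step 1: Load -- here the "documents" are already plain strings.
    return list(sources)

def split(doc, chunk_size=40):
    # Step 2: Chunk by fixed character size (see 3.2 for better strategies).
    return [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]

def embed(text):
    # Step 3: Toy "embedding" -- a word-count vector.
    return Counter(text.lower().split())

def build_index(sources):
    # Step 4: Store each chunk's vector alongside its metadata.
    index = []
    for doc_id, doc in enumerate(load(sources)):
        for chunk in split(doc):
            index.append({"doc": doc_id, "text": chunk, "vector": embed(chunk)})
    return index

index = build_index(["RAG retrieves relevant chunks before generation."])
```

A production system replaces each stand-in with the real component (loaders, a splitter, an embedding model, a vector store), but the data flow is the same.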
&lt;h3 id="22-online-process-retrieval--generation">2.2 Online Process: Retrieval &amp;amp; Generation&lt;/h3>
&lt;p>This process is triggered when a user submits a query, with the goal of generating precise, evidence-based answers in real-time.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;User Query&amp;quot;] --&amp;gt; B[&amp;quot;Embed Query&amp;quot;];
B --&amp;gt; C[&amp;quot;Vector Search&amp;quot;];
C &amp;lt;--&amp;gt; D[&amp;quot;Vector Database&amp;quot;];
C --&amp;gt; E[&amp;quot;Get Top-K Chunks&amp;quot;];
E --&amp;gt; F[&amp;quot;(Optional) Re-ranking&amp;quot;];
A &amp;amp; F --&amp;gt; G[&amp;quot;Build Prompt&amp;quot;];
G --&amp;gt; H[&amp;quot;LLM Generation&amp;quot;];
H --&amp;gt; I[&amp;quot;Final Answer&amp;quot;];
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Process Details:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Embed Query&lt;/strong>: When a user inputs a question, the system uses the &lt;strong>same embedding model&lt;/strong> as in the indexing phase to convert this question into a query vector.&lt;/li>
&lt;li>&lt;strong>Vector Search&lt;/strong>: The system takes this query vector and performs a similarity search in the vector database. The standard formulation is &amp;ldquo;k-Nearest Neighbors&amp;rdquo; (k-NN) search: finding the K text chunk vectors closest to the query vector in the vector space. At scale this is usually implemented with approximate algorithms, covered in Section 4.2.1.&lt;/li>
&lt;li>&lt;strong>Get Top-K Chunks&lt;/strong>: Based on the search results, the system retrieves the original content of these K most relevant text chunks from the database. These K text chunks form the core context for answering the question.&lt;/li>
&lt;li>&lt;strong>Re-ranking (Optional)&lt;/strong>: In some advanced RAG systems, there's an additional re-ranking step. This is because high vector similarity doesn't always equate to high relevance to the question. A re-ranker is a lighter-weight model that re-examines the relevance of these Top-K text chunks to the original question and reorders them, selecting the highest quality ones as the final context.&lt;/li>
&lt;li>&lt;strong>Build Prompt&lt;/strong>: The system combines the original question and the filtered context information according to a predefined template into a complete prompt. This prompt typically includes instructions like: &amp;ldquo;Please answer this question based on the following context information. Question: [&amp;hellip;] Context: [&amp;hellip;]&amp;rdquo;.&lt;/li>
&lt;li>&lt;strong>LLM Generation&lt;/strong>: Finally, this enhanced prompt is sent to the large language model (LLM). The LLM, following the instructions, comprehensively utilizes its internal knowledge and the provided context to generate a fluent, accurate, and information-rich answer. The system can also cite the sources of the context, enhancing the credibility of the answer.&lt;/li>
&lt;/ol>
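Step 5 (Build Prompt) is plain string templating and can be sketched directly. The template wording and the `build_prompt` helper are illustrative, not a fixed standard.

```python
# Sketch of the "Build Prompt" step: combine the user's question with
# the retrieved (and optionally re-ranked) chunks into one prompt,
# labeling each chunk so the LLM can cite its sources.
def build_prompt(question, chunks):
    context = "\n\n".join(
        f"[Source {i + 1}] {chunk}" for i, chunk in enumerate(chunks)
    )
    return (
        "Please answer the question based only on the context below, "
        "citing sources.\n"
        f"Question: {question}\n"
        f"Context:\n{context}"
    )

prompt = build_prompt(
    "What is RAG?",
    ["RAG combines retrieval with generation.",
     "Retrieved chunks ground the LLM's answer."],
)
```

The resulting string is what gets sent to the LLM in step 6; the `[Source N]` labels let the model reference specific chunks in its answer.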
&lt;h2 id="3-indexing-deep-dive">3. Indexing Deep Dive&lt;/h2>
&lt;p>Indexing is the cornerstone of RAG systems. The quality of this process directly determines the effectiveness of subsequent retrieval and generation phases. A well-designed indexing process ensures that information in the knowledge base is accurately and completely transformed into retrievable units. Let's explore each component in depth.&lt;/p>
&lt;h3 id="31-data-loading">3.1 Data Loading&lt;/h3>
&lt;p>The first step is to load raw data from various sources into the processing pipeline.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Loaders&lt;/strong>: Modern RAG frameworks provide powerful loader ecosystems. For example, LangChain's &lt;code>Document Loaders&lt;/code> support loading data from over 100 different sources, including:
&lt;ul>
&lt;li>&lt;strong>Files&lt;/strong>: &lt;code>TextLoader&lt;/code> (plain text), &lt;code>PyPDFLoader&lt;/code> (PDF), &lt;code>JSONLoader&lt;/code>, &lt;code>CSVLoader&lt;/code>, &lt;code>UnstructuredFileLoader&lt;/code> (capable of processing Word, PowerPoint, HTML, XML, and other formats).&lt;/li>
&lt;li>&lt;strong>Web Content&lt;/strong>: &lt;code>WebBaseLoader&lt;/code> (web scraping), &lt;code>YoutubeLoader&lt;/code> (loading YouTube video captions).&lt;/li>
&lt;li>&lt;strong>Collaboration Platforms&lt;/strong>: &lt;code>NotionDirectoryLoader&lt;/code>, &lt;code>ConfluenceLoader&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Databases&lt;/strong>: &lt;code>AzureCosmosDBLoader&lt;/code>, &lt;code>PostgresLoader&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Choosing the right loader allows enterprises to easily integrate their existing knowledge assets into RAG systems without complex data format conversions.&lt;/p>
&lt;h3 id="32-text-splitting--chunking">3.2 Text Splitting / Chunking&lt;/h3>
&lt;p>&lt;strong>Why is chunking necessary?&lt;/strong>
Directly vectorizing an entire document (like a PDF with hundreds of pages) is impractical for three reasons:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Context Length Limitations&lt;/strong>: Most embedding models and LLMs have token input limits.&lt;/li>
&lt;li>&lt;strong>Noise Issues&lt;/strong>: A single vector representing a lengthy document contains too many topics and details, diluting the semantic information and making it difficult to precisely match specific user questions during retrieval.&lt;/li>
&lt;li>&lt;strong>Retrieval Cost&lt;/strong>: Feeding an entire document as context to an LLM consumes substantial computational resources and costs.&lt;/li>
&lt;/ol>
&lt;p>Therefore, splitting documents into semantically related chunks is a crucial step. &lt;strong>The quality of chunks determines the ceiling of RAG performance.&lt;/strong>&lt;/p>
&lt;h4 id="321-core-parameters-chunksize-and-chunkoverlap">3.2.1 Core Parameters: &lt;code>chunk_size&lt;/code> and &lt;code>chunk_overlap&lt;/code>&lt;/h4>
&lt;ul>
&lt;li>&lt;code>chunk_size&lt;/code>: Defines the size of each text block, typically calculated in character count or token count. Choosing this value requires balancing &amp;ldquo;information density&amp;rdquo; and &amp;ldquo;context completeness.&amp;rdquo; Too small may fragment complete semantics; too large may introduce excessive noise.&lt;/li>
&lt;li>&lt;code>chunk_overlap&lt;/code>: Defines the number of characters (or tokens) that overlap between adjacent text blocks. Setting overlap can effectively prevent cutting off a complete sentence or paragraph at block boundaries, ensuring semantic continuity.&lt;/li>
&lt;/ul>
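The interaction of the two parameters is easiest to see in a minimal fixed-size chunker (a simplified sketch, not a real splitter implementation): each new chunk starts `chunk_size - chunk_overlap` characters after the previous one, so text near a boundary appears in two adjacent chunks.

```python
# Minimal fixed-size chunker showing how chunk_size and chunk_overlap
# interact: the window advances by (chunk_size - chunk_overlap), so
# adjacent chunks share chunk_overlap characters.
def chunk(text, chunk_size, chunk_overlap):
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk("abcdefghij", chunk_size=4, chunk_overlap=2)
# -> ["abcd", "cdef", "efgh", "ghij", "ij"]; note "cd" appears in both
#    of the first two chunks, preserving continuity across the boundary.
```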
&lt;h4 id="322-mainstream-chunking-strategies">3.2.2 Mainstream Chunking Strategies&lt;/h4>
&lt;p>The choice of chunking strategy depends on the structure and content of the document.&lt;/p>
&lt;p>&lt;strong>Strategy 1: Character Splitting&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Representative&lt;/strong>: &lt;code>CharacterTextSplitter&lt;/code>&lt;/li>
&lt;li>&lt;strong>Principle&lt;/strong>: This is the simplest direct method. It splits text based on a fixed character (like &lt;code>\n\n&lt;/code> newline) and then forcibly chunks according to the preset &lt;code>chunk_size&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Advantages&lt;/strong>: Simple, fast, low computational cost.&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>: Completely ignores the semantics and logical structure of the text, easily breaking sentences in the middle or abruptly cutting off complete concept descriptions.&lt;/li>
&lt;li>&lt;strong>Applicable Scenarios&lt;/strong>: Suitable for texts with no obvious structure or where semantic coherence is not a high requirement.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-python"># Example: CharacterTextSplitter
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
separator=&amp;quot;\n\n&amp;quot;,
chunk_size=1000,
chunk_overlap=200,
length_function=len,
)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Strategy 2: Recursive Character Splitting&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Representative&lt;/strong>: &lt;code>RecursiveCharacterTextSplitter&lt;/code>&lt;/li>
&lt;li>&lt;strong>Principle&lt;/strong>: This is currently the most commonly used and recommended strategy. It attempts to split recursively according to a set of preset separators (like &lt;code>[&amp;quot;\n\n&amp;quot;, &amp;quot;\n&amp;quot;, &amp;quot; &amp;quot;, &amp;quot;&amp;quot;]&lt;/code>). It first tries to split using the first separator (&lt;code>\n\n&lt;/code>, paragraph); if the resulting blocks are still larger than &lt;code>chunk_size&lt;/code>, it continues using the next separator (&lt;code>\n&lt;/code>, line) to split these large blocks, and so on until the block size meets requirements.&lt;/li>
&lt;li>&lt;strong>Advantages&lt;/strong>: Makes the greatest effort to maintain the integrity of paragraphs, sentences, and other semantic units, striking a good balance between universality and effectiveness.&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>: Still based on character rules rather than true semantic understanding.&lt;/li>
&lt;li>&lt;strong>Applicable Scenarios&lt;/strong>: The preferred strategy for the vast majority of scenarios.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-python"># Example: RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Strategy 3: Token-Based Splitting&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Representative&lt;/strong>: &lt;code>TokenTextSplitter&lt;/code>, &lt;code>CharacterTextSplitter.from_tiktoken_encoder&lt;/code>&lt;/li>
&lt;li>&lt;strong>Principle&lt;/strong>: It calculates &lt;code>chunk_size&lt;/code> by token count rather than character count. This is more consistent with how language models process text and allows for more precise control over the length of content input to the model.&lt;/li>
&lt;li>&lt;strong>Advantages&lt;/strong>: More precise control over cost and input length for model API calls.&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>: Computation is slightly more complex than character splitting.&lt;/li>
&lt;li>&lt;strong>Applicable Scenarios&lt;/strong>: When strict control over costs and API call input lengths is needed.&lt;/li>
&lt;/ul>
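The core idea of token-based splitting can be sketched without a real tokenizer. Here a whitespace split stands in for a library like tiktoken (which real implementations would use); what matters is that `chunk_size` is measured in tokens, not characters.

```python
# Token-based chunking sketch. A whitespace split is a crude stand-in
# for a real tokenizer; the logic -- measuring chunk size in tokens
# rather than characters -- is what token-based splitters do.
def split_by_tokens(text, chunk_size):
    tokens = text.split()  # stand-in tokenizer
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]

chunks = split_by_tokens("one two three four five six seven", chunk_size=3)
# -> ["one two three", "four five six", "seven"]
```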
&lt;p>&lt;strong>Strategy 4: Semantic Chunking&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Principle&lt;/strong>: This is a more advanced experimental method. Instead of being based on fixed rules, it's based on understanding the semantics of the text. The splitter calculates embedding similarity between sentences and splits when it detects that the semantic difference between adjacent sentences exceeds a threshold.&lt;/li>
&lt;li>&lt;strong>Advantages&lt;/strong>: Can generate highly semantically consistent text blocks, theoretically the best splitting method.&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>: Very high computational cost, as it requires multiple embedding calculations during the splitting phase.&lt;/li>
&lt;li>&lt;strong>Applicable Scenarios&lt;/strong>: Scenarios requiring extremely high retrieval quality, regardless of computational cost.&lt;/li>
&lt;/ul>
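The splitting rule behind semantic chunking can be illustrated with a toy similarity function. This sketch uses word-overlap (Jaccard) similarity as a cheap stand-in for the embedding similarity a real semantic chunker would compute; the split-on-similarity-drop logic is the same.

```python
# Semantic chunking sketch: start a new chunk wherever similarity
# between adjacent sentences drops below a threshold. Jaccard word
# overlap stands in for real embedding similarity.
def similarity(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def semantic_chunks(sentences, threshold=0.2):
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if similarity(prev, sent) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

sents = [
    "Cats are small pets",
    "Cats are friendly pets",
    "Stock markets fell today",
]
chunks = semantic_chunks(sents)
# The topic shift at the third sentence triggers a split.
```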
&lt;h3 id="33-embedding">3.3 Embedding&lt;/h3>
&lt;p>Embedding is the process of transforming text chunks into high-dimensional numerical vectors, which serve as mathematical representations of the text's semantics.&lt;/p>
&lt;h4 id="331-embedding-model-selection">3.3.1 Embedding Model Selection&lt;/h4>
&lt;p>The choice of embedding model directly affects retrieval quality and system cost.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Closed-Source Commercial Models (e.g., OpenAI)&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Representatives&lt;/strong>: &lt;code>text-embedding-ada-002&lt;/code>, &lt;code>text-embedding-3-small&lt;/code>, &lt;code>text-embedding-3-large&lt;/code>&lt;/li>
&lt;li>&lt;strong>Advantages&lt;/strong>: Powerful performance, typically ranking high in various evaluation benchmarks, simple to use (API calls).&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>: Requires payment, data must be sent to third-party servers, privacy risks exist.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-python"># Example: Using OpenAI Embeddings
from langchain_openai import OpenAIEmbeddings
embeddings_model = OpenAIEmbeddings(model=&amp;quot;text-embedding-3-small&amp;quot;)
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;strong>Open-Source Models (e.g., Hugging Face)&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Representatives&lt;/strong>: &lt;code>sentence-transformers/all-mpnet-base-v2&lt;/code> (English general), &lt;code>bge-large-zh-v1.5&lt;/code> (Chinese), &lt;code>m3e-large&lt;/code> (Chinese-English) etc.&lt;/li>
&lt;li>&lt;strong>Advantages&lt;/strong>: Free, can be deployed locally, no data privacy leakage risk, numerous fine-tuned models available for specific languages or domains.&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>: Requires self-management of model deployment and computational resources, performance may have some gap compared to top commercial models.&lt;/li>
&lt;li>&lt;strong>MTEB Leaderboard&lt;/strong>: The Massive Text Embedding Benchmark (MTEB) is a public leaderboard for evaluating and comparing the performance of different embedding models, an important reference for selecting open-source models.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-python"># Example: Using open-source models from Hugging Face
from langchain_huggingface import HuggingFaceEmbeddings
model_name = &amp;quot;sentence-transformers/all-mpnet-base-v2&amp;quot;
embeddings_model = HuggingFaceEmbeddings(model_name=model_name)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Core Principle&lt;/strong>: Throughout the entire RAG process, &lt;strong>the same embedding model must be used in both the indexing phase and the online retrieval phase&lt;/strong>. Otherwise, the query vectors and document vectors will exist in different vector spaces, making meaningful similarity comparisons impossible.&lt;/p>
&lt;h2 id="4-retrieval-technology-deep-dive">4. Retrieval Technology Deep Dive&lt;/h2>
&lt;p>Retrieval is the &amp;ldquo;heart&amp;rdquo; of RAG systems. Finding the most relevant contextual information is the prerequisite for generating high-quality answers. If the retrieved content is irrelevant or inaccurate, even the most powerful LLM will be ineffective - this is the so-called &amp;ldquo;Garbage In, Garbage Out&amp;rdquo; principle.&lt;/p>
&lt;p>Retrieval technology has evolved from traditional keyword matching to modern semantic vector search, and has now developed various advanced strategies to address complex challenges in different scenarios.&lt;/p>
&lt;h3 id="41-traditional-foundation-sparse-retrieval">4.1 Traditional Foundation: Sparse Retrieval&lt;/h3>
&lt;p>Sparse retrieval is a classic information retrieval method based on word frequency statistics, independent of deep learning models. Its core idea is that the more times a word appears in a specific document and the fewer times it appears across all documents, the more representative that word is for that document.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Representative Algorithms&lt;/strong>: &lt;strong>TF-IDF&lt;/strong> &amp;amp; &lt;strong>BM25 (Best Match 25)&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Principle Brief (using BM25 as an example)&lt;/strong>:
&lt;ol>
&lt;li>&lt;strong>Term Frequency (TF)&lt;/strong>: Calculate the frequency of each query term in the document.&lt;/li>
&lt;li>&lt;strong>Inverse Document Frequency (IDF)&lt;/strong>: Measure the &amp;ldquo;rarity&amp;rdquo; of a term. Rarer terms have higher weights.&lt;/li>
&lt;li>&lt;strong>Document Length Penalty&lt;/strong>: Penalize overly long documents to prevent them from getting artificially high scores just because they contain more words.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>&lt;strong>Advantages&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Precise Keyword Matching&lt;/strong>: Performs excellently for queries containing specific terms, abbreviations, or product models (like &amp;ldquo;iPhone 15 Pro&amp;rdquo;).&lt;/li>
&lt;li>&lt;strong>Strong Interpretability&lt;/strong>: Score calculation logic is clear, easy to understand and debug.&lt;/li>
&lt;li>&lt;strong>Fast Computation&lt;/strong>: No complex model inference required.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Cannot Understand Semantics&lt;/strong>: Unable to handle synonyms, near-synonyms, or conceptual relevance. For example, searching for &amp;ldquo;Apple phone&amp;rdquo; won't match documents containing &amp;ldquo;iPhone&amp;rdquo;.&lt;/li>
&lt;li>&lt;strong>&amp;ldquo;Vocabulary Gap&amp;rdquo; Problem&lt;/strong>: Relies on literal matching between queries and documents.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Applicable Scenarios&lt;/strong>: As part of hybrid retrieval, handling keyword and proper noun matching.&lt;/li>
&lt;/ul>
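The three ingredients above (TF, IDF, and the length penalty) combine into a single score per document. The following is a minimal BM25 sketch over a toy corpus, with `k1` and `b` at their conventional defaults; production systems would use an optimized library rather than this loop.

```python
# Minimal BM25 scoring sketch: for each document, sum over query terms
# IDF(t) * TF saturation with a length penalty (parameters k1, b).
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    docs = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in docs) / len(docs)
    scores = []
    for d in docs:
        score = 0.0
        for term in query.lower().split():
            n = sum(term in doc for doc in docs)  # docs containing the term
            idf = math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)
            tf = d.count(term)
            score += idf * tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores

docs = ["iPhone 15 Pro release date", "apple pie recipe", "iPhone battery tips"]
scores = bm25_scores("iPhone 15", docs)
# The document matching both query terms scores highest; the document
# with no literal match scores zero -- BM25 cannot see that "apple pie"
# is unrelated while "Apple phone" queries would miss "iPhone" entirely.
```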
&lt;h3 id="42-modern-core-dense-retrieval--vector-search">4.2 Modern Core: Dense Retrieval / Vector Search&lt;/h3>
&lt;p>Dense retrieval is the mainstream technology in current RAG systems. It uses deep learning models (the embedding models we discussed earlier) to encode the semantic information of text into dense vectors, enabling retrieval based on &amp;ldquo;semantic similarity&amp;rdquo; rather than &amp;ldquo;literal similarity&amp;rdquo;.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: Semantically similar texts have vectors that are close to each other in multidimensional space.&lt;/li>
&lt;li>&lt;strong>Workflow&lt;/strong>:
&lt;ol>
&lt;li>Offline: Vectorize all document chunks and store them in a vector database.&lt;/li>
&lt;li>Online: Vectorize the user query.&lt;/li>
&lt;li>In the vector database, calculate the distance/similarity between the query vector and all document vectors (such as cosine similarity, Euclidean distance).&lt;/li>
&lt;li>Return the Top-K document chunks with the closest distances.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
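Steps 3 and 4 of the workflow reduce to a similarity computation plus a sort. Below is the dense-retrieval core in miniature, using cosine similarity and tiny hand-made vectors as stand-ins for real embeddings; a vector database performs the same comparison with ANN indexes instead of a brute-force sort.

```python
# Dense retrieval in miniature: cosine similarity between the query
# vector and every document vector, returning the Top-K indices.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=2):
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

doc_vecs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
result = top_k((1.0, 0.1), doc_vecs, k=2)
# -> [1, 0]: the two vectors pointing in nearly the query's direction.
```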
&lt;h4 id="421-approximate-nearest-neighbor-ann-search">4.2.1 Approximate Nearest Neighbor (ANN) Search&lt;/h4>
&lt;p>Since performing exact &amp;ldquo;nearest neighbor&amp;rdquo; searches among millions or even billions of vectors is extremely computationally expensive, the industry widely adopts &lt;strong>Approximate Nearest Neighbor (ANN)&lt;/strong> algorithms. ANN sacrifices minimal precision in exchange for query speed improvements of several orders of magnitude.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Mainstream ANN Algorithm&lt;/strong>: &lt;strong>HNSW (Hierarchical Navigable Small World)&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>HNSW Principle Brief&lt;/strong>: It constructs a hierarchical graph structure. In the higher-level graph, it performs rough, large-step searches to quickly locate the target area; then in the lower-level graph, it performs fine, small-step searches to finally find the nearest neighbor vectors. This is like finding an address in a city - first determining which district (higher level), then which street (lower level).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Advantages&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Powerful Semantic Understanding&lt;/strong>: Can cross literal barriers to understand concepts and intentions.&lt;/li>
&lt;li>&lt;strong>High Recall Rate&lt;/strong>: Can retrieve more semantically relevant documents with different wording.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Disadvantages&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Keyword Insensitivity&lt;/strong>: Sometimes less effective than sparse retrieval for matching specific keywords or proper nouns.&lt;/li>
&lt;li>&lt;strong>Strong Dependence on Embedding Models&lt;/strong>: Effectiveness completely depends on the quality of the embedding model.&lt;/li>
&lt;li>&lt;strong>&amp;ldquo;Black Box&amp;rdquo; Problem&lt;/strong>: The process of generating and matching vectors is less intuitive than sparse retrieval.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="43-powerful-combination-hybrid-search">4.3 Powerful Combination: Hybrid Search&lt;/h3>
&lt;p>Since sparse retrieval and dense retrieval each have their own strengths and weaknesses, the most natural idea is to combine them to leverage their respective advantages. Hybrid search was born for this purpose.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Implementation Method&lt;/strong>:
&lt;ol>
&lt;li>&lt;strong>Parallel Execution&lt;/strong>: Simultaneously process user queries using sparse retrieval (like BM25) and dense retrieval (vector search).&lt;/li>
&lt;li>&lt;strong>Score Fusion&lt;/strong>: Obtain two sets of results and their corresponding scores.&lt;/li>
&lt;li>&lt;strong>Result Re-ranking&lt;/strong>: Use a fusion algorithm (such as &lt;strong>Reciprocal Rank Fusion, RRF&lt;/strong>) to merge the two sets of results and re-rank them based on the fused scores to get the final Top-K results. The RRF algorithm gives higher weight to documents that rank high in different retrieval methods.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;Hybrid Search&amp;quot;
A[&amp;quot;User Query&amp;quot;] --&amp;gt; B[&amp;quot;BM25 Retriever&amp;quot;];
A --&amp;gt; C[&amp;quot;Vector Retriever&amp;quot;];
B --&amp;gt; D[&amp;quot;Sparse Results (Top-K)&amp;quot;];
C --&amp;gt; E[&amp;quot;Dense Results (Top-K)&amp;quot;];
D &amp;amp; E --&amp;gt; F{&amp;quot;Fusion &amp;amp; Reranking (e.g., RRF)&amp;quot;};
F --&amp;gt; G[&amp;quot;Final Ranked Results&amp;quot;];
end
&lt;/code>&lt;/pre>
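The fusion step can be sketched with Reciprocal Rank Fusion: each document's fused score is the sum of 1 / (k + rank) over every result list in which it appears, with k = 60 as the commonly used constant from the original RRF formulation.

```python
# Reciprocal Rank Fusion sketch: documents ranked highly by multiple
# retrievers accumulate the largest fused scores.
def rrf(result_lists, k=60):
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d1", "d2", "d3"]  # BM25 ranking
dense = ["d2", "d4", "d1"]   # vector-search ranking
fused = rrf([sparse, dense])
# "d2" wins: ranked in the top two by both retrievers.
```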
&lt;ul>
&lt;li>&lt;strong>Advantages&lt;/strong>: Balances the precision of keyword matching and the breadth of semantic understanding, achieving better results than single retrieval methods in most scenarios.&lt;/li>
&lt;li>&lt;strong>Applicable Scenarios&lt;/strong>: Almost all RAG applications requiring high-quality retrieval.&lt;/li>
&lt;/ul>
&lt;h3 id="44-frontier-exploration-advanced-retrieval-strategies">4.4 Frontier Exploration: Advanced Retrieval Strategies&lt;/h3>
&lt;p>To address more complex query intentions and data structures, academia and industry have developed a series of advanced retrieval strategies.&lt;/p>
&lt;h4 id="441-contextual-compression--reranking">4.4.1 Contextual Compression &amp;amp; Re-ranking&lt;/h4>
&lt;p>&lt;strong>Problem&lt;/strong>: The Top-K document chunks returned by vector search may only partially contain content truly relevant to the question, and some high-ranking blocks might actually be &amp;ldquo;false positives.&amp;rdquo; Directly feeding this redundant or irrelevant information to the LLM increases noise and cost.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: Add an intermediate &amp;ldquo;filtering&amp;rdquo; and &amp;ldquo;sorting&amp;rdquo; layer between retrieval and generation.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Initial Retrieval&amp;quot;] --&amp;gt; B[&amp;quot;Top-K Documents&amp;quot;];
B --&amp;gt; C{&amp;quot;Compressor / Re-ranker&amp;quot;};
UserQuery --&amp;gt; C;
C --&amp;gt; D[&amp;quot;Filtered &amp;amp; Re-ranked Documents&amp;quot;];
D --&amp;gt; E[&amp;quot;LLM Generation&amp;quot;];
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;strong>Implementation Method&lt;/strong>: Using LangChain's &lt;code>ContextualCompressionRetriever&lt;/code>.
&lt;ul>
&lt;li>&lt;strong>&lt;code>LLMChainExtractor&lt;/code>&lt;/strong>: Uses an LLM to judge whether each document chunk is relevant to the query and only extracts relevant sentences.&lt;/li>
&lt;li>&lt;strong>&lt;code>EmbeddingsFilter&lt;/code>&lt;/strong>: Recalculates the similarity between query vectors and document chunk vectors, filtering out documents below a certain threshold.&lt;/li>
&lt;li>&lt;strong>Re-ranker&lt;/strong>: This is currently the most effective and commonly used approach. It uses a specialized &lt;strong>cross-encoder&lt;/strong> model (far smaller than the generator LLM, though more expensive per document pair than a bi-encoder) trained to calculate relevance scores. Unlike the bi-encoder used in the retrieval phase (which encodes queries and documents separately), a cross-encoder receives the query and document chunk together as a single input, enabling more fine-grained relevance judgment. Common re-rankers include &lt;code>Cohere Rerank&lt;/code>, &lt;code>BAAI/bge-reranker-*&lt;/code>, and models provided by open-source or cloud service vendors.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
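&lt;p>The re-ranking step can be sketched as follows. The relevance scorer here is a deliberate stand-in (word overlap) so the example is self-contained; a real system would replace it with a cross-encoder model call, and the threshold value is an assumption.&lt;/p>

```python
# Minimal sketch of the re-ranking layer: an initial retriever returns Top-K
# chunks, then a scorer re-orders them and drops weak ones before generation.
# relevance_score is a stand-in for a real cross-encoder (e.g. a bge-reranker).

def relevance_score(query, chunk):
    """Stand-in scorer: fraction of query words found in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words.intersection(c_words)) / len(q_words)

def rerank(query, chunks, threshold=0.5):
    scored = [(relevance_score(query, c), c) for c in chunks]
    # Keep only chunks scoring at or above the threshold, best first.
    return [c for s, c in sorted(scored, reverse=True) if s >= threshold]

chunks = [
    "The Eiffel Tower is in Paris.",
    "Paris is the capital of France.",
    "Bananas are rich in potassium.",
]
print(rerank("capital of France", chunks))  # ['Paris is the capital of France.']
```

&lt;p>Only the surviving chunks are passed to the LLM, which is exactly the &amp;ldquo;filter and sort&amp;rdquo; layer the diagram above places between retrieval and generation.&lt;/p>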
&lt;h4 id="442-selfquerying-retriever">4.4.2 Self-Querying Retriever&lt;/h4>
&lt;p>&lt;strong>Problem&lt;/strong>: User queries are typically in natural language but may contain filtering requirements for &lt;strong>metadata&lt;/strong>. For example: &amp;ldquo;Recommend some science fiction movies released after 2000 with ratings above 8.5.&amp;rdquo;&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: Let the LLM itself &amp;ldquo;translate&amp;rdquo; natural language queries into structured query statements containing metadata filtering conditions.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Workflow&lt;/strong>:
&lt;ol>
&lt;li>User inputs a natural language query.&lt;/li>
&lt;li>&lt;code>SelfQueryingRetriever&lt;/code> sends the query to the LLM.&lt;/li>
&lt;li>Based on predefined metadata field information (such as &lt;code>year&lt;/code>, &lt;code>rating&lt;/code>, &lt;code>genre&lt;/code>), the LLM generates a structured query containing:
&lt;ul>
&lt;li>&lt;code>query&lt;/code>: The keyword part for vector search (&amp;ldquo;science fiction movies&amp;rdquo;).&lt;/li>
&lt;li>&lt;code>filter&lt;/code>: Conditions for metadata filtering (&lt;code>year &amp;gt; 2000 AND rating &amp;gt; 8.5&lt;/code>).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>The retriever uses this structured query to perform a &amp;ldquo;filter first, then search&amp;rdquo; operation on the vector database, greatly narrowing the search scope and improving precision.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-python"># Core settings for Self-Querying in LangChain
metadata_field_info = [
AttributeInfo(name=&amp;quot;genre&amp;quot;, ...),
AttributeInfo(name=&amp;quot;year&amp;quot;, ...),
AttributeInfo(name=&amp;quot;rating&amp;quot;, ...),
]
retriever = SelfQueryRetriever.from_llm(
llm,
vectorstore,
document_content_description,
metadata_field_info,
)
&lt;/code>&lt;/pre>
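&lt;p>The &amp;ldquo;translate, then filter-first&amp;rdquo; idea can be illustrated without any framework. The structured query below is the kind of output an LLM might emit for the movie example; the movie records, field names, and filter keys are invented for illustration.&lt;/p>

```python
# Sketch of self-querying: a natural-language query has been "translated"
# into a semantic part plus a metadata filter, and the filter runs first.

structured_query = {
    "query": "science fiction movies",               # semantic part, for vector search
    "filter": {"year_gt": 2000, "rating_gt": 8.5},   # metadata part
}

movies = [
    {"title": "Arrival", "year": 2016, "rating": 7.9, "genre": "sci-fi"},
    {"title": "Interstellar", "year": 2014, "rating": 8.7, "genre": "sci-fi"},
    {"title": "Alien", "year": 1979, "rating": 8.5, "genre": "sci-fi"},
]

def apply_filter(records, flt):
    """Metadata pre-filter: runs before any vector similarity search."""
    return [
        r for r in records
        if r["year"] > flt["year_gt"] and r["rating"] > flt["rating_gt"]
    ]

candidates = apply_filter(movies, structured_query["filter"])
print([m["title"] for m in candidates])  # ['Interstellar']
```

&lt;p>Only the filtered candidates would then be scored semantically against &lt;code>structured_query["query"]&lt;/code>, which is the &amp;ldquo;filter first, then search&amp;rdquo; operation described in the workflow.&lt;/p>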
&lt;h4 id="443-multivector-retriever">4.4.3 Multi-Vector Retriever&lt;/h4>
&lt;p>&lt;strong>Problem&lt;/strong>: A single vector struggles to perfectly summarize a longer document chunk, especially when the chunk contains multiple subtopics.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: Generate &lt;strong>multiple&lt;/strong> vectors representing different aspects for each document chunk, rather than a single vector.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Implementation Methods&lt;/strong>:
&lt;ol>
&lt;li>&lt;strong>Smaller Sub-chunks&lt;/strong>: Further split the original document chunk into smaller sentences or paragraphs, and generate vectors for these small chunks.&lt;/li>
&lt;li>&lt;strong>Summary Vectors&lt;/strong>: Use an LLM to generate a summary for each document chunk, then vectorize the summary.&lt;/li>
&lt;li>&lt;strong>Hypothetical Question Vectors&lt;/strong>: Use an LLM to pose several possible questions about each document chunk, then vectorize these questions.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;p>During querying, the query vector matches with all these sub-vectors (sub-chunks, summaries, questions). Once a match is successful, what's returned is the &lt;strong>complete original document chunk&lt;/strong> it belongs to. This leverages both the precision of fine-grained matching and ensures that the context provided to the final LLM is complete.&lt;/p>
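&lt;p>A toy sketch of this mechanism: several derived texts (here, LLM-generated hypothetical questions and a summary, all invented for this example) are indexed per chunk, and whichever one matches, the complete parent chunk is returned. Real matching would use embeddings; word overlap stands in for it here.&lt;/p>

```python
# Multi-vector sketch: index derived representations, return the parent chunk.

chunk = "Photosynthesis converts light energy into chemical energy in plants."

# Derived representations an LLM might generate for this chunk (assumed).
derived = [
    "What does photosynthesis convert?",
    "How do plants store energy?",
    "Summary: plants turn light into chemical energy.",
]

# The index maps every derived text back to the chunk it came from.
index = {text: chunk for text in derived}

def retrieve(query):
    # Stand-in for vector matching: pick the derived text sharing the most words.
    def overlap(text):
        return len(set(query.lower().split()).intersection(set(text.lower().split())))
    best = max(index, key=overlap)
    return index[best]  # return the COMPLETE parent chunk, not the derived text

print(retrieve("how is light converted by photosynthesis"))  # the full original chunk
```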
&lt;h4 id="444-parent-document-retriever">4.4.4 Parent Document Retriever&lt;/h4>
&lt;p>This is a common implementation of the multi-vector retriever. It splits documents into &amp;ldquo;parent chunks&amp;rdquo; and &amp;ldquo;child chunks.&amp;rdquo; Indexing and retrieval happen on the smaller &amp;ldquo;child chunks,&amp;rdquo; but what's ultimately returned to the LLM is the larger &amp;ldquo;parent chunk&amp;rdquo; that the child belongs to. This solves the &amp;ldquo;context loss&amp;rdquo; problem, ensuring that the LLM sees a more complete linguistic context when generating answers.&lt;/p>
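&lt;p>A minimal sketch of the parent/child mechanics, assuming sentence-level children for readability (a real splitter would work on token counts) and a stand-in word-overlap matcher in place of vector search:&lt;/p>

```python
# Parent Document Retriever sketch: match on small child chunks for precision,
# but return the larger parent chunk so the LLM sees full context.

parent = ("The 2008 financial crisis began in the US housing market. "
          "Subprime mortgage defaults triggered a global credit crunch. "
          "Governments responded with large bank bailouts.")

# Children: split the parent into sentences (illustrative splitting rule).
children = [s.strip() + "." for s in parent.split(".") if s.strip()]
child_to_parent = {child: parent for child in children}

def retrieve(query):
    def overlap(child):
        return len(set(query.lower().split()).intersection(set(child.lower().split())))
    best_child = max(children, key=overlap)     # match on the small chunk...
    return child_to_parent[best_child]          # ...but hand the LLM the parent

result = retrieve("what triggered the credit crunch")
print(result == parent)  # True: the full parent context is returned
```

&lt;p>The query matches only the middle sentence, yet the retriever returns the whole paragraph, which is exactly how the &amp;ldquo;context loss&amp;rdquo; problem is avoided.&lt;/p>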
&lt;h4 id="445-graph-rag">4.4.5 Graph RAG&lt;/h4>
&lt;p>&lt;strong>Problem&lt;/strong>: Traditional RAG views knowledge as independent text blocks, ignoring the complex, web-like relationships between knowledge points.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: Build the knowledge base into a &lt;strong>Knowledge Graph&lt;/strong>, where entities are nodes and relationships are edges.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Workflow&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>During querying, the system first identifies the core entities in the query.&lt;/li>
&lt;li>It then explores neighboring nodes and relationships related to these entities in the graph, forming a subgraph containing rich structured information.&lt;/li>
&lt;li>This subgraph information is linearized (converted to text) and provided to the LLM as context.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Advantages&lt;/strong>: Can answer more complex questions requiring multi-hop reasoning (e.g., &amp;ldquo;Who is A's boss's wife?&amp;rdquo;), providing deeper context than &amp;ldquo;text blocks.&amp;rdquo;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Implementation Case: Graphiti/Zep&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Introduction&lt;/strong>: &lt;a href="https://github.com/getzep/graphiti">Graphiti&lt;/a> is a temporal knowledge graph architecture designed specifically for LLM Agents, seamlessly integrating Neo4j's graph database capabilities with LLM's natural language processing abilities.&lt;/li>
&lt;li>&lt;strong>Core Features&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Temporal Awareness&lt;/strong>: Each node and relationship carries timestamp attributes, enabling tracking of how entity states change over time.&lt;/li>
&lt;li>&lt;strong>Automatic Schema Inference&lt;/strong>: No need to predefine entity types and relationships; the system can automatically infer appropriate graph structures from conversations.&lt;/li>
&lt;li>&lt;strong>Multi-hop Reasoning&lt;/strong>: Supports complex relationship path queries, capable of discovering indirectly associated information.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Application Scenarios&lt;/strong>: Particularly suitable for multi-turn dialogue systems requiring long-term memory and temporal reasoning, such as customer support, personal assistants, and other scenarios needing to &amp;ldquo;remember&amp;rdquo; user historical interactions.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
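&lt;p>The multi-hop traversal at the heart of Graph RAG can be sketched with a tiny in-memory graph. The entities and relations below are invented purely to mirror the &amp;ldquo;A's boss's wife&amp;rdquo; example; a production system would query a graph database such as Neo4j instead.&lt;/p>

```python
# Tiny Graph RAG sketch: entities as nodes, typed relations as edges,
# and multi-hop traversal chaining relations together.

graph = {
    ("A", "boss"): "B",
    ("B", "wife"): "C",
    ("B", "employer"): "AcmeCorp",
}

def hop(entity, relation):
    return graph.get((entity, relation))

def multi_hop(start, relations):
    """Follow a chain of relations, e.g. boss then wife."""
    entity = start
    for rel in relations:
        entity = hop(entity, rel)
        if entity is None:
            return None
    return entity

print(multi_hop("A", ["boss", "wife"]))  # C
```

&lt;p>The traversed subgraph (A &amp;rarr; B &amp;rarr; C with its edge labels) is what would then be linearized into text and handed to the LLM as context.&lt;/p>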
&lt;h4 id="446-agentic-rag--adaptive-rag">4.4.6 Agentic RAG / Adaptive RAG&lt;/h4>
&lt;p>This is the latest evolutionary direction of RAG, endowing RAG systems with certain &amp;ldquo;thinking&amp;rdquo; and &amp;ldquo;decision-making&amp;rdquo; capabilities, allowing them to adaptively select the best retrieval strategy based on the complexity of the question.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: Transform the traditional linear RAG process into a dynamic process driven by an LLM Agent that can loop and iterate.&lt;/li>
&lt;li>&lt;strong>Possible Workflow&lt;/strong>:
&lt;ol>
&lt;li>&lt;strong>Question Analysis&lt;/strong>: The Agent first analyzes the user's question. Is this a simple question or a complex one? Does it need keyword matching or semantic search?&lt;/li>
&lt;li>&lt;strong>Strategy Selection&lt;/strong>:
&lt;ul>
&lt;li>If the question is simple, directly perform vector search.&lt;/li>
&lt;li>If the question contains metadata, switch to Self-Querying.&lt;/li>
&lt;li>If the question is ambiguous, the Agent might first rewrite the question (Query Rewriting), generating several different query variants and executing them separately.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Result Reflection &amp;amp; Iteration&lt;/strong>: The Agent examines the preliminary retrieved results. If the results are not ideal (e.g., low relevance or conflicting information), it can decide to:
&lt;ul>
&lt;li>&lt;strong>Query Again&lt;/strong>: Use different keywords or strategies to retrieve again.&lt;/li>
&lt;li>&lt;strong>Web Search&lt;/strong>: If the internal knowledge base doesn't have an answer, it can call search engine tools to find information online.&lt;/li>
&lt;li>&lt;strong>Multi-step Reasoning&lt;/strong>: Break down complex questions into several sub-questions, retrieving and answering step by step.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
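&lt;p>The strategy-selection step above can be sketched as a lightweight router. In a real Agentic RAG system an LLM would make this decision; the keyword rules, strategy names, and thresholds below are stand-ins chosen for illustration.&lt;/p>

```python
# Sketch of Agentic RAG strategy selection: inspect the question,
# then route it to a retrieval strategy.

def choose_strategy(question):
    q = question.lower()
    if any(tok.isdigit() for tok in q.split()) or "after" in q or "above" in q:
        return "self_query"       # numeric/metadata constraints detected
    if len(q.split()) > 15:
        return "query_rewrite"    # long or ambiguous: rewrite into sub-queries
    return "vector_search"        # default: plain semantic retrieval

print(choose_strategy("What is RAG?"))                               # vector_search
print(choose_strategy("movies released after 2000 rated above 8.5")) # self_query
```

&lt;p>Reflection and iteration wrap around this router: if the chosen strategy returns weak results, the agent loops back and picks another branch, which is what distinguishes Agentic RAG from a fixed pipeline.&lt;/p>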
&lt;p>Agentic RAG is no longer a fixed pipeline but a flexible, intelligent framework, representing the future direction of RAG development.&lt;/p>
&lt;h2 id="5-generation-phase-the-final-touch">5. Generation Phase: The Final Touch&lt;/h2>
&lt;p>The generation phase is the endpoint of the RAG process and the ultimate manifestation of its value. In this phase, the system combines the &amp;ldquo;essence&amp;rdquo; context obtained from previous retrieval, filtering, and re-ranking with the user's original question to form a final prompt, which is then sent to the large language model (LLM) to generate an answer.&lt;/p>
&lt;h3 id="51-core-task-effective-prompt-engineering">5.1 Core Task: Effective Prompt Engineering&lt;/h3>
&lt;p>The core task of this phase is &lt;strong>Prompt Engineering&lt;/strong>. A well-designed prompt template can clearly instruct the LLM on its task, ensuring it thinks and answers along the right track.&lt;/p>
&lt;p>A typical RAG prompt template structure is as follows:&lt;/p>
&lt;pre>&lt;code class="language-text">You are a professional, rigorous Q&amp;amp;A assistant. Please answer the user's question based on the context information provided below.
Your answer must be completely based on the given context, and you are prohibited from using your internal knowledge for any supplementation or imagination.
If there is not enough information in the context to answer the question, please clearly state &amp;quot;Based on the available information, I cannot answer this question.&amp;quot;
At the end of your answer, please list all the context source IDs you referenced.
---
[Context Information]
{context}
---
[User Question]
{question}
---
[Your Answer]
&lt;/code>&lt;/pre>
&lt;h4 id="511-template-key-elements-analysis">5.1.1 Template Key Elements Analysis&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Persona&lt;/strong>: &amp;ldquo;You are a professional, rigorous Q&amp;amp;A assistant.&amp;rdquo; This helps set the tone and style of the LLM's output.&lt;/li>
&lt;li>&lt;strong>Core Instruction&lt;/strong>: &amp;ldquo;Please answer the user's question based on the context information provided below.&amp;rdquo; This is the most critical task instruction.&lt;/li>
&lt;li>&lt;strong>Constraints &amp;amp; Guardrails&lt;/strong>:
&lt;ul>
&lt;li>&amp;ldquo;Must be completely based on the given context, prohibited from&amp;hellip; supplementation or imagination.&amp;rdquo; -&amp;gt; This is key to suppressing model hallucinations.&lt;/li>
&lt;li>&amp;ldquo;If there is not enough information, please clearly state&amp;hellip;&amp;rdquo; -&amp;gt; This defines the model's &amp;ldquo;escape route&amp;rdquo; when information is insufficient, preventing it from guessing.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Attribution/Citation&lt;/strong>: &amp;ldquo;Please list all the context source IDs you referenced.&amp;rdquo; -&amp;gt; This is the foundation for answer explainability and credibility.&lt;/li>
&lt;li>&lt;strong>Placeholders&lt;/strong>:
&lt;ul>
&lt;li>&lt;code>{context}&lt;/code>: This will be filled with the processed content of the multiple document chunks obtained from the retrieval phase.&lt;/li>
&lt;li>&lt;code>{question}&lt;/code>: This will be filled with the user's original question.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="52-context-and-question-fusion">5.2 Context and Question Fusion&lt;/h3>
&lt;p>When the system fills multiple document chunks (e.g., Top-5 chunks) into the &lt;code>{context}&lt;/code> placeholder, these chunks are packaged together with the original question and sent to the LLM. The LLM reads the entire enhanced prompt and then:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Understands the Question&lt;/strong>: Clarifies the user's query intent.&lt;/li>
&lt;li>&lt;strong>Locates Information&lt;/strong>: Searches for sentences and paragraphs directly related to the question within the provided multiple context blocks.&lt;/li>
&lt;li>&lt;strong>Synthesizes &amp;amp; Refines&lt;/strong>: Integrates, understands, and refines scattered information points found from different context blocks.&lt;/li>
&lt;li>&lt;strong>Generates an Answer&lt;/strong>: Based on the refined information, generates a final answer using fluent, coherent natural language.&lt;/li>
&lt;li>&lt;strong>Cites Sources&lt;/strong>: According to instructions, includes the document sources that the answer is based on.&lt;/li>
&lt;/ol>
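&lt;p>The assembly of the final prompt can be sketched as follows. The template is an abbreviated stand-in for the full one shown earlier, and the source-ID numbering scheme is an assumption.&lt;/p>

```python
# Sketch of RAG prompt assembly: number retrieved chunks with source IDs
# and substitute them into the template placeholders.

TEMPLATE = (
    "Answer strictly from the context below. "
    "Cite the source IDs you used.\n"
    "[Context]\n{context}\n[Question]\n{question}\n[Answer]\n"
)

def build_prompt(question, chunks):
    # Tag each chunk with a source ID so the model can cite it.
    context = "\n".join(f"[{i + 1}] {text}" for i, text in enumerate(chunks))
    return TEMPLATE.format(context=context, question=question)

chunks = ["RAG retrieves documents.", "RAG grounds LLM answers."]
prompt = build_prompt("What does RAG do?", chunks)
print("[1] RAG retrieves documents." in prompt)  # True
```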
&lt;p>Through this carefully designed &amp;ldquo;open-book exam&amp;rdquo; process, the RAG system ultimately generates a high-quality answer that combines both the LLM's powerful language capabilities and fact-based information.&lt;/p>
&lt;h2 id="6-rag-evaluation-framework-how-to-measure-system-quality">6. RAG Evaluation Framework: How to Measure System Quality?&lt;/h2>
&lt;p>Building a RAG system is just the first step. Scientifically and quantitatively evaluating its performance, and continuously iterating and optimizing based on this evaluation, is equally important. A good evaluation framework can help us diagnose whether the system's bottleneck is in the retrieval module (&amp;ldquo;not found&amp;rdquo;) or in the generation module (&amp;ldquo;not well expressed&amp;rdquo;).&lt;/p>
&lt;p>Industry-leading RAG evaluation frameworks, such as &lt;strong>RAGAS (RAG Assessment)&lt;/strong> and &lt;strong>TruLens&lt;/strong>, provide a series of metrics to score RAG system performance from different dimensions.&lt;/p>
&lt;h3 id="61-core-evaluation-dimensions">6.1 Core Evaluation Dimensions&lt;/h3>
&lt;p>RAG evaluation can be divided into two levels: &lt;strong>component level&lt;/strong> (evaluating retrieval and generation separately) and &lt;strong>end-to-end level&lt;/strong> (evaluating the quality of the final answer).&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;RAG Evaluation Dimensions&amp;quot;
A(&amp;quot;Evaluation&amp;quot;) --&amp;gt; B[&amp;quot;Component-Level Evaluation&amp;quot;];
A --&amp;gt; C[&amp;quot;End-to-End Evaluation&amp;quot;];
B --&amp;gt; B1[&amp;quot;Retriever Quality Evaluation&amp;quot;];
B --&amp;gt; B2[&amp;quot;Generator Quality Evaluation&amp;quot;];
B1 --&amp;gt; B1_Metrics(&amp;quot;Context Precision, Context Recall&amp;quot;);
B2 --&amp;gt; B2_Metrics(&amp;quot;Faithfulness&amp;quot;);
C --&amp;gt; C_Metrics(&amp;quot;Answer Relevancy, Answer Correctness&amp;quot;);
end
&lt;/code>&lt;/pre>
&lt;h3 id="62-key-evaluation-metrics-using-ragas-as-an-example">6.2 Key Evaluation Metrics (Using RAGAS as an Example)&lt;/h3>
&lt;p>Below we explain in detail several core metrics in the RAGAS framework. Most of these metrics do not require manually annotated reference answers (Reference-Free), greatly reducing evaluation costs; Context Recall, which needs a ground-truth answer, is the exception.&lt;/p>
&lt;h4 id="621-evaluating-generation-quality">6.2.1 Evaluating Generation Quality&lt;/h4>
&lt;p>&lt;strong>Metric 1: Faithfulness&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Definition&lt;/strong>: Measures the extent to which the generated answer is completely based on the provided context. High faithfulness means that every statement in the answer can find evidence in the context.&lt;/li>
&lt;li>&lt;strong>Evaluation Method&lt;/strong>: RAGAS uses an LLM to analyze the answer, breaking it down into a series of statements. Then, for each statement, it verifies in the context whether there is evidence supporting that statement. The final score is (number of statements supported by the context) / (total number of statements).&lt;/li>
&lt;li>&lt;strong>Problem Diagnosed&lt;/strong>: This metric is the &lt;strong>core indicator for measuring &amp;ldquo;model hallucination&amp;rdquo;&lt;/strong>. A low score means the generator (LLM) is freely making up information that doesn't exist in the context.&lt;/li>
&lt;li>&lt;strong>Data Required&lt;/strong>: &lt;code>question&lt;/code>, &lt;code>answer&lt;/code>, &lt;code>context&lt;/code>.&lt;/li>
&lt;/ul>
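&lt;p>The faithfulness computation can be sketched in a few lines. The support check here is a deliberate stand-in (substring matching) so the example runs on its own; RAGAS itself uses an LLM to judge whether the context entails each statement.&lt;/p>

```python
# Faithfulness sketch: split the answer into statements, check each against
# the context, and score supported / total.

def faithfulness(answer, context):
    statements = [s.strip() for s in answer.split(".") if s.strip()]
    # Stand-in support check: a real evaluator asks an LLM about entailment.
    supported = sum(1 for s in statements if s.lower() in context.lower())
    return supported / len(statements)

context = "The Great Wall is in China. It is thousands of kilometers long."
answer = "The Great Wall is in China. It was built in one year."
print(faithfulness(answer, context))  # 0.5: the second statement is unsupported
```

&lt;p>The unsupported second statement is precisely the kind of hallucination this metric is designed to surface.&lt;/p>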
&lt;h4 id="622-evaluating-both-retrieval-and-generation-quality">6.2.2 Evaluating Both Retrieval and Generation Quality&lt;/h4>
&lt;p>&lt;strong>Metric 2: Answer Relevancy&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Definition&lt;/strong>: Measures the relevance of the generated answer to the user's original question. An answer faithful to the context might still be off-topic.&lt;/li>
&lt;li>&lt;strong>Evaluation Method&lt;/strong>: RAGAS uses an Embedding model to measure the semantic similarity between the question and answer. It also uses an LLM to identify &amp;ldquo;noise&amp;rdquo; or irrelevant sentences in the answer and penalizes them.&lt;/li>
&lt;li>&lt;strong>Problem Diagnosed&lt;/strong>: A low score means that although the answer may be based on the context, it doesn't directly or effectively answer the user's question, or it contains too much irrelevant information.&lt;/li>
&lt;li>&lt;strong>Data Required&lt;/strong>: &lt;code>question&lt;/code>, &lt;code>answer&lt;/code>.&lt;/li>
&lt;/ul>
&lt;h4 id="623-evaluating-retrieval-quality">6.2.3 Evaluating Retrieval Quality&lt;/h4>
&lt;p>&lt;strong>Metric 3: Context Precision&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Definition&lt;/strong>: Measures how much of the retrieved context is truly relevant to the question - the &amp;ldquo;signal-to-noise ratio.&amp;rdquo;&lt;/li>
&lt;li>&lt;strong>Evaluation Method&lt;/strong>: RAGAS analyzes the context sentence by sentence and has an LLM judge whether each sentence is necessary for answering the user's question. The final score is (number of sentences deemed useful) / (total number of sentences in the context).&lt;/li>
&lt;li>&lt;strong>Problem Diagnosed&lt;/strong>: A low score indicates that the retriever returned many irrelevant &amp;ldquo;noise&amp;rdquo; documents, which interferes with the generator's judgment and increases costs. This suggests that the &lt;strong>retrieval algorithm needs optimization&lt;/strong>.&lt;/li>
&lt;li>&lt;strong>Data Required&lt;/strong>: &lt;code>question&lt;/code>, &lt;code>context&lt;/code>.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Metric 4: Context Recall&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Definition&lt;/strong>: Measures whether the retrieved context contains all the necessary information to answer the question.&lt;/li>
&lt;li>&lt;strong>Evaluation Method&lt;/strong>: This metric requires a &lt;strong>manually annotated reference answer (Ground Truth)&lt;/strong> as a benchmark. RAGAS has an LLM analyze this reference answer and judge whether each sentence in it can find support in the retrieved context.&lt;/li>
&lt;li>&lt;strong>Problem Diagnosed&lt;/strong>: A low score means the retriever &lt;strong>failed to find&lt;/strong> key information needed to answer the question, indicating &amp;ldquo;missed retrievals.&amp;rdquo; This might suggest that the document chunking strategy is unreasonable, or the Embedding model cannot understand the query well.&lt;/li>
&lt;li>&lt;strong>Data Required&lt;/strong>: &lt;code>question&lt;/code>, &lt;code>ground_truth&lt;/code> (reference answer), &lt;code>context&lt;/code>.&lt;/li>
&lt;/ul>
&lt;h3 id="63-using-evaluation-to-guide-iteration">6.3 Using Evaluation to Guide Iteration&lt;/h3>
&lt;p>By comprehensively evaluating a RAG system using the above metrics, we can get a clear performance profile and make targeted optimizations:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Low Faithfulness Score&lt;/strong>: The problem is in the &lt;strong>generator&lt;/strong>. Need to optimize the Prompt, add stronger constraints, or switch to an LLM with stronger instruction-following capabilities.&lt;/li>
&lt;li>&lt;strong>Low Answer Relevancy Score&lt;/strong>: The problem could be in either the generator or retriever. Need to check if the Prompt is guiding the model off-topic, or if the retrieved content is of poor quality.&lt;/li>
&lt;li>&lt;strong>Low Context Precision Score&lt;/strong>: The problem is in the &lt;strong>retriever&lt;/strong>. Indicates that the recalled documents are of poor quality with much noise. Can try better retrieval strategies, such as adding a Re-ranker to filter irrelevant documents.&lt;/li>
&lt;li>&lt;strong>Low Context Recall Score&lt;/strong>: The problem is in the &lt;strong>retriever&lt;/strong>. Indicates that key information wasn't found. Need to check if the Chunking strategy is fragmenting key information, or try methods like Multi-Query to expand the retrieval scope.&lt;/li>
&lt;/ul>
&lt;p>Through the &amp;ldquo;evaluate-diagnose-optimize&amp;rdquo; closed loop, we can continuously improve the overall performance of the RAG system.&lt;/p>
&lt;h2 id="7-challenges-and-future-outlook">7. Challenges and Future Outlook&lt;/h2>
&lt;p>Although RAG has greatly expanded the capabilities of large language models and has become the de facto standard for building knowledge-intensive applications, it still faces some challenges while also pointing to exciting future development directions.&lt;/p>
&lt;h3 id="71-current-challenges">7.1 Current Challenges&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>&amp;ldquo;Needle-in-a-Haystack&amp;rdquo; Problem&lt;/strong>: As LLM context windows grow larger (e.g., million-level tokens), precisely finding and utilizing key information in lengthy, noisy contexts becomes increasingly difficult. Research shows that LLM performance on long contexts depends on where information sits within them, a phenomenon known as being &amp;ldquo;lost in the middle.&amp;rdquo;&lt;/li>
&lt;li>&lt;strong>Imperfect Chunking&lt;/strong>: How to optimally split documents remains an open question. Existing rule-based or simple semantic splitting methods may damage information integrity or introduce irrelevant context, affecting retrieval and generation quality.&lt;/li>
&lt;li>&lt;strong>Evaluation Complexity and Cost&lt;/strong>: Although frameworks like RAGAS provide automated evaluation metrics, building a comprehensive, reliable evaluation set still requires significant human effort. Especially in domains requiring fine judgment, machine evaluation results may differ from human perception.&lt;/li>
&lt;li>&lt;strong>Integration of Structured and Multimodal Data&lt;/strong>: Knowledge in the real world isn't just text. How to efficiently integrate tables, charts, images, audio, and other multimodal information, and enable RAG systems to understand and utilize them, is an actively explored area.&lt;/li>
&lt;li>&lt;strong>Production Environment Complexity&lt;/strong>: Deploying a RAG prototype to a production environment requires considering data updates, permission management, version control, cost monitoring, low-latency responses, and a series of engineering challenges.&lt;/li>
&lt;/ol>
&lt;h3 id="72-future-outlook">7.2 Future Outlook&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Smarter Indexing&lt;/strong>: Future indexing processes will no longer be simple &amp;ldquo;split-vectorize&amp;rdquo; operations. They will more deeply understand document structures, automatically build knowledge graphs, identify entities and relationships, generate multi-level, multi-perspective representations (such as summaries, questions), creating a richer, more queryable knowledge network.&lt;/li>
&lt;li>&lt;strong>Adaptive Retrieval&lt;/strong>: As demonstrated by Agentic RAG, future RAG systems will have stronger autonomy. They can dynamically decide whether to perform simple vector searches or execute complex multi-step queries, or even call external tools (such as search engines, calculators, APIs) to obtain information based on the specific situation of the question. Retrieval will evolve from a fixed step to a flexible, agent-driven process.&lt;/li>
&lt;li>&lt;strong>LLM as Part of RAG&lt;/strong>: As LLM capabilities strengthen, they will participate more deeply in every aspect of RAG. Not just in the generation phase, but also in indexing (generating metadata, summaries), querying (query rewriting, expansion), retrieval (as a re-ranker), and other phases, playing a core role.&lt;/li>
&lt;li>&lt;strong>End-to-End Optimization&lt;/strong>: Future frameworks may allow end-to-end joint fine-tuning of various RAG components (Embedding models, LLM generators, etc.), making the entire system highly optimized for a specific task or domain, rather than simply piecing together individual components.&lt;/li>
&lt;li>&lt;strong>Native Multimodal RAG&lt;/strong>: RAG will natively support understanding and retrieving content like images, audio, and video. Users can ask questions like &amp;ldquo;Find me that picture of &amp;lsquo;a cat playing piano&amp;rsquo;&amp;rdquo; and the system can directly perform semantic retrieval in multimedia databases and return results.&lt;/li>
&lt;/ol>
&lt;p>In summary, RAG is evolving from a relatively fixed &amp;ldquo;retrieve-augment-generate&amp;rdquo; pipeline to a more dynamic, intelligent, adaptive knowledge processing framework. It will continue to serve as the key bridge connecting large language models with the vast external world, continuously unleashing AI's application potential across various industries in the foreseeable future.&lt;/p></description></item><item><title>Model Context Protocol (MCP): A Standardized Framework for AI Capability Extension</title><link>https://ziyanglin.netlify.app/en/post/mcp-documentation/</link><pubDate>Mon, 30 Jun 2025 08:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/mcp-documentation/</guid><description>&lt;h2 id="1-macro-introduction-why-do-we-need-mcp-beyond-tool-calling">1. Macro Introduction: Why Do We Need MCP Beyond Tool Calling?&lt;/h2>
&lt;p>In our previous document on general LLM tool calling, we revealed how LLMs can break their knowledge boundaries by calling external functions. This is a powerful &lt;strong>programming paradigm&lt;/strong>, but it doesn't define a &lt;strong>standardized set of communication rules&lt;/strong>. Each developer must decide for themselves how to organize APIs, manage tools, and handle data formats, leading to ecosystem fragmentation.&lt;/p>
&lt;p>The &lt;strong>Model Context Protocol (MCP)&lt;/strong> was born precisely to solve this problem. It doesn't aim to replace the general concept of tool calling, but rather builds a layer of &lt;strong>standardized, pluggable, service-oriented protocol&lt;/strong> on top of it.&lt;/p>
&lt;p>If &amp;ldquo;tool calling&amp;rdquo; is teaching a car how to &amp;ldquo;refuel&amp;rdquo; (use external capabilities), then MCP establishes &lt;strong>standardized gas stations and fuel nozzle interfaces&lt;/strong> for the world. No matter what car you drive (different LLMs) or what fuel you need (different tools), as long as you follow the MCP standard, you can connect seamlessly and plug-and-play.&lt;/p>
&lt;p>The core value of MCP lies in:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Standardization&lt;/strong>: Defines unified message formats and interaction patterns for communication between models and external tool services. Developers no longer need to customize tool integration solutions for each model or application.&lt;/li>
&lt;li>&lt;strong>Decoupling&lt;/strong>: Completely separates the &lt;strong>implementation&lt;/strong> of tools (running on MCP servers) from their &lt;strong>use&lt;/strong> (initiated by LLMs). Models don't need to know the internal code of tools, only how to communicate with them through the protocol.&lt;/li>
&lt;li>&lt;strong>Reusability&lt;/strong>: Once a tool or data source is encapsulated as an MCP server, it can be easily reused by any model or application that supports the MCP protocol, greatly improving development efficiency.&lt;/li>
&lt;li>&lt;strong>Discoverability&lt;/strong>: MCP makes tools service-oriented, laying the foundation for building tool marketplaces and enabling automatic discovery and orchestration of tools in the future.&lt;/li>
&lt;/ul>
&lt;p>In simple terms, MCP elevates scattered &amp;ldquo;function calls&amp;rdquo; to the level of &amp;ldquo;distributed service calls,&amp;rdquo; serving as a key infrastructure for building scalable, interoperable AI Agent ecosystems.&lt;/p>
&lt;h2 id="2-mcp-core-architecture-a-trinity-collaboration-model">2. MCP Core Architecture: A Trinity Collaboration Model&lt;/h2>
&lt;p>The MCP architecture consists of three core components that interact through clearly defined protocols, forming a solid &amp;ldquo;trinity&amp;rdquo; collaboration model.&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Model/Agent&lt;/strong>: The decision core. It is responsible for understanding user intent and generating requests that follow the MCP format to call external tools or access external resources.&lt;/li>
&lt;li>&lt;strong>MCP Client&lt;/strong>: The communication hub. It serves as a bridge between the model and MCP servers, parsing MCP requests generated by the model, communicating with the corresponding MCP servers through standardized transmission methods (such as Stdio, HTTP SSE), and handling returned results.&lt;/li>
&lt;li>&lt;strong>MCP Server&lt;/strong>: The capability provider. This is a separate process or service that encapsulates one or more tools or data sources and provides standardized access interfaces through the MCP protocol.&lt;/li>
&lt;/ol>
&lt;p>Below is a visual explanation of this architecture:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph Agent [Model/Agent]
A[LLM] -- Generates Request --&amp;gt; B(MCP XML Request);
end
subgraph Client [MCP Client]
C{Request Parser};
B -- Parse Request --&amp;gt; C;
end
subgraph LocalServer [MCP Server - Local]
D[Stdio Communication];
end
subgraph RemoteServer [MCP Server - Remote]
E[HTTP SSE Communication];
end
subgraph ServerCore [MCP Server Internal]
F[Protocol Processor] -- Execute Tool --&amp;gt; G[Tool/Resource Implementation];
end
C -- Route to Local --&amp;gt; D;
C -- Route to Remote --&amp;gt; E;
D -- Local Transport --&amp;gt; F;
E -- Remote Transport --&amp;gt; F;
G -- Return Result --&amp;gt; F;
F -- Protocol Return --&amp;gt; C;
C -- Submit Result --&amp;gt; A;
style A fill:#cde4ff,stroke:#333;
style B fill:#e6ffc2,stroke:#333;
style C fill:#fce8b2,stroke:#333;
style D fill:#f9c5b4,stroke:#333;
style E fill:#f9c5b4,stroke:#333;
style F fill:#d4a8e3,stroke:#333;
style G fill:#b4f9f2,stroke:#333;
&lt;/code>&lt;/pre>
&lt;h3 id="detailed-architecture-responsibilities">Detailed Architecture Responsibilities:&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Model Generates Request&lt;/strong>: When an LLM needs external capabilities, it no longer generates JSON for specific APIs, but instead generates an XML message that conforms to the MCP specification, such as &lt;code>&amp;lt;use_mcp_tool&amp;gt;&lt;/code>. This message clearly specifies which &lt;code>server_name&lt;/code> to communicate with and which &lt;code>tool_name&lt;/code> to call.&lt;/li>
&lt;li>&lt;strong>Client Parsing and Routing&lt;/strong>: The MCP client (typically part of the model's runtime environment) captures and parses this XML request. It queries a service registry based on the &lt;code>server_name&lt;/code> to determine whether the target server is a local process or a remote service.&lt;/li>
&lt;li>&lt;strong>Selecting Communication Channel&lt;/strong>:
&lt;ul>
&lt;li>If the target is a &lt;strong>local MCP server&lt;/strong> (e.g., a locally running Python script), the client will communicate with that server process through &lt;strong>standard input/output (stdio)&lt;/strong>.&lt;/li>
&lt;li>If the target is a &lt;strong>remote MCP server&lt;/strong> (e.g., a service deployed in the cloud), the client will establish a connection with it through the &lt;strong>HTTP Server-Sent Events (SSE)&lt;/strong> protocol.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Server Processing Request&lt;/strong>: After receiving the request, the protocol processor on the MCP server calls the specific tool function or resource handler that has been registered internally based on the &lt;code>tool_name&lt;/code> or &lt;code>uri&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Execution and Return&lt;/strong>: The server executes the specific logic (calling APIs, querying databases, etc.) and encapsulates the results in the MCP standard format, returning them to the client through the same route.&lt;/li>
&lt;li>&lt;strong>Result Feedback to Model&lt;/strong>: After receiving the server's response, the client organizes and formats it as the execution result of the external tool, and submits it back to the LLM for the LLM to generate the final natural language reply, completing the entire interaction loop.&lt;/li>
&lt;/ol>
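The parse-and-route step described in points 2 and 3 can be sketched in a few lines of Python. The registry contents, server names, and transport labels below are illustrative assumptions for this sketch, not part of any official MCP client implementation:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical service registry: maps server_name to transport details.
SERVER_REGISTRY = {
    "weather-server": {"transport": "stdio", "command": ["python", "weather_server.py"]},
    "internal-docs": {"transport": "sse", "url": "https://docs.example.com/v1/mcp"},
}

def route_request(request_xml: str) -> dict:
    """Parse a use_mcp_tool request and decide which transport to use."""
    root = ET.fromstring(request_xml)
    server_name = root.find("server_name").text
    entry = SERVER_REGISTRY.get(server_name)
    if entry is None:
        raise KeyError(f"Unknown MCP server: {server_name}")
    return {"server": server_name, "transport": entry["transport"]}

# Build a sample request programmatically with ElementTree.
req = ET.Element("use_mcp_tool")
ET.SubElement(req, "server_name").text = "weather-server"
ET.SubElement(req, "tool_name").text = "get_forecast"
ET.SubElement(req, "arguments").text = json.dumps({"city": "San Francisco", "days": 5})
xml_str = ET.tostring(req, encoding="unicode")

print(route_request(xml_str))  # {'server': 'weather-server', 'transport': 'stdio'}
```

The registry lookup is what makes the LLM location-agnostic: the model only ever names a server, and the client decides whether that name resolves to a local process or a remote URL.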
&lt;p>The brilliance of this architecture lies in the fact that the LLM itself is completely decoupled from the physical location and network implementation details of the tools. It only needs to learn to &amp;ldquo;speak&amp;rdquo; the MCP &amp;ldquo;common language&amp;rdquo; to interact with any service in the entire MCP ecosystem.&lt;/p>
&lt;h2 id="3-communication-protocol-deep-dive-mcps-neural-network">3. Communication Protocol Deep Dive: MCP's Neural Network&lt;/h2>
&lt;p>The power of MCP lies in its standardized communication methods. It primarily connects clients and servers through two distinctly different protocols to accommodate different deployment scenarios.&lt;/p>
&lt;h3 id="31-local-communication-standard-inputoutput-stdio">3.1. Local Communication: Standard Input/Output (Stdio)&lt;/h3>
&lt;p>When the MCP server is a local executable file or script (e.g., a Python script, a Go program), the MCP client uses &lt;strong>Standard Input/Output (Stdio)&lt;/strong> for communication. This is a classic and efficient form of inter-process communication (IPC).&lt;/p>
&lt;p>&lt;strong>Workflow Breakdown&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Launch Subprocess&lt;/strong>: The MCP client (such as a VS Code extension) launches the MCP server program as a &lt;strong>subprocess&lt;/strong> (e.g., executing &lt;code>python mcp_server.py&lt;/code>).&lt;/li>
&lt;li>&lt;strong>Pipe Establishment&lt;/strong>: The operating system automatically establishes three pipes between the parent process (client) and child process (server):
&lt;ul>
&lt;li>&lt;code>stdin&lt;/code> (standard input): The channel for the client to send data to the server.&lt;/li>
&lt;li>&lt;code>stdout&lt;/code> (standard output): The channel for the server to send successful results to the client.&lt;/li>
&lt;li>&lt;code>stderr&lt;/code> (standard error): The channel for the server to send error messages to the client.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Message Exchange&lt;/strong>:
&lt;ul>
&lt;li>The client writes the MCP request (e.g., an XML string like &lt;code>&amp;lt;use_mcp_tool&amp;gt;...&lt;/code>) to the server process's &lt;code>stdin&lt;/code>. Because pipes carry a raw byte stream with no built-in message boundaries, messages are typically delimited by specific separators (such as newline &lt;code>\n&lt;/code>) or length prefixes.&lt;/li>
&lt;li>The server reads and parses the request from its &lt;code>stdin&lt;/code> and executes the corresponding logic.&lt;/li>
&lt;li>The server writes the execution result (also an XML string in MCP format) to its own &lt;code>stdout&lt;/code>.&lt;/li>
&lt;li>If any errors occur during the process, error details are written to &lt;code>stderr&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Lifecycle Management&lt;/strong>: The client is responsible for monitoring the lifecycle of the server subprocess and can terminate it when it's no longer needed.&lt;/li>
&lt;/ol>
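Steps 1-3 above can be made concrete with Python's standard `subprocess` module. The child process here is a trivial inline echo script standing in for a real MCP server, purely to show the pipe mechanics and newline framing:

```python
import subprocess
import sys

# The "server" is an inline echo loop: it reads newline-delimited messages
# from stdin and writes them back to stdout, flushing after each one.
echo_server = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line)\n"
    "    sys.stdout.flush()\n"
)

# 1. Launch the server as a subprocess; 2. the OS wires up the three pipes.
proc = subprocess.Popen(
    [sys.executable, "-c", echo_server],
    stdin=subprocess.PIPE,    # client -> server channel
    stdout=subprocess.PIPE,   # server -> client channel (results)
    stderr=subprocess.PIPE,   # server -> client channel (errors)
    text=True,
)

# 3. Messages are newline-delimited to avoid framing issues on the byte stream.
proc.stdin.write("request-placeholder\n")
proc.stdin.flush()
reply = proc.stdout.readline().strip()

# 4. Lifecycle management: the client owns and terminates the subprocess.
proc.terminate()
print(reply)  # request-placeholder
```

In a real client the placeholder message would be an MCP XML request and the reply an MCP-formatted result, but the transport mechanics are exactly these.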
&lt;p>&lt;strong>Advantages&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Extremely Low Latency&lt;/strong>: Since it's local inter-process communication, there's almost no network overhead.&lt;/li>
&lt;li>&lt;strong>Simple and Reliable&lt;/strong>: Simple implementation, not dependent on the network stack.&lt;/li>
&lt;li>&lt;strong>High Security&lt;/strong>: Data doesn't leave the machine, providing natural isolation.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Applicable Scenarios&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Local tools requiring high performance and high-frequency calls.&lt;/li>
&lt;li>Tools that directly operate on the local file system or hardware.&lt;/li>
&lt;li>Development and debugging environments.&lt;/li>
&lt;/ul>
&lt;h3 id="32-remote-communication-serversent-events-http-sse">3.2. Remote Communication: Server-Sent Events (HTTP SSE)&lt;/h3>
&lt;p>When the MCP server is deployed on a remote host or in the cloud, communication is done through the HTTP-based &lt;strong>Server-Sent Events (SSE)&lt;/strong> protocol. SSE is a web technology that allows servers to push events to clients in a one-way fashion.&lt;/p>
&lt;p>&lt;strong>Workflow Breakdown&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>HTTP Connection&lt;/strong>: The MCP client initiates a regular HTTP GET request to a specific endpoint of the MCP server (e.g., &lt;code>https://api.my-mcp-server.com/v1/mcp&lt;/code>). The key is that the client includes &lt;code>Accept: text/event-stream&lt;/code> in the request header, indicating it wants to establish an SSE connection.&lt;/li>
&lt;li>&lt;strong>Long Connection Maintenance&lt;/strong>: Upon receiving the request, the server doesn't immediately close the connection but keeps it open, forming a &lt;strong>long connection&lt;/strong>. The &lt;code>Content-Type&lt;/code> header of the response is set to &lt;code>text/event-stream&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Event Pushing&lt;/strong>:
&lt;ul>
&lt;li>The client sends the MCP request (XML string) as the body of a separate HTTP POST request to another endpoint of the server; the SSE long connection itself is one-way and is used only for receiving pushed events.&lt;/li>
&lt;li>After processing the request, the server encapsulates the response data in the SSE event format and &lt;strong>pushes&lt;/strong> it back to the client through the previously established long connection. Each event consists of fields such as &lt;code>event: &amp;lt;event_name&amp;gt;&lt;/code> and &lt;code>data: &amp;lt;event_data&amp;gt;&lt;/code>.&lt;/li>
&lt;li>MCP typically defines different types of events, such as &lt;code>result&lt;/code> for success, &lt;code>error&lt;/code> for failure, and &lt;code>log&lt;/code> for transmitting logs.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
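The SSE framing described above is simple enough to parse by hand. The sketch below processes a canned event stream to show the framing rules; the event names (`result`, `log`) follow the convention mentioned above, and a real client would read these lines from a streaming HTTP response (`Content-Type: text/event-stream`) instead:

```python
# Each SSE event is a group of "field: value" lines terminated by a blank line.
sample_stream = [
    "event: result",
    'data: {"status": "success", "temperature": "25C"}',
    "",                     # blank line terminates one event
    "event: log",
    "data: request handled",
    "",
]

def parse_sse(lines):
    """Group field lines into events, dispatching on each blank line."""
    events, current = [], {}
    for line in lines:
        if line == "":
            if current:
                events.append(current)
            current = {}
        elif line.startswith("event: "):
            current["event"] = line[len("event: "):]
        elif line.startswith("data: "):
            current["data"] = line[len("data: "):]
    return events

events = parse_sse(sample_stream)
print(events[0]["event"])   # result
```

The `data` payload of a `result` event would then be decoded (here, as JSON) and handed back to the model as the tool's output.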
&lt;p>&lt;strong>Advantages&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Cross-Network Communication&lt;/strong>: Can easily connect to servers anywhere.&lt;/li>
&lt;li>&lt;strong>Firewall Penetration&lt;/strong>: Based on standard HTTP(S) protocol, with good network compatibility.&lt;/li>
&lt;li>&lt;strong>Server-Side Push&lt;/strong>: Suitable for scenarios requiring server-initiated notifications.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Applicable Scenarios&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Encapsulating third-party cloud service APIs (such as weather, maps, payments).&lt;/li>
&lt;li>Shared tools that need centralized management and deployment.&lt;/li>
&lt;li>Building publicly accessible tool service ecosystems.&lt;/li>
&lt;/ul>
&lt;h2 id="4-mcp-message-format-breakdown-the-protocols-common-language">4. MCP Message Format Breakdown: The Protocol's &amp;ldquo;Common Language&amp;rdquo;&lt;/h2>
&lt;p>The core of MCP is its XML-based message format that is both human-readable and machine-parsable. Models express their intentions by generating XML fragments in these specific formats.&lt;/p>
&lt;h3 id="41-usemcptool-calling-a-tool">4.1. &lt;code>&amp;lt;use_mcp_tool&amp;gt;&lt;/code>: Calling a Tool&lt;/h3>
&lt;p>This is the most core message, used to request the execution of a defined tool.&lt;/p>
&lt;p>&lt;strong>Structure Example&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-xml">&amp;lt;use_mcp_tool&amp;gt;
&amp;lt;server_name&amp;gt;weather-server&amp;lt;/server_name&amp;gt;
&amp;lt;tool_name&amp;gt;get_forecast&amp;lt;/tool_name&amp;gt;
&amp;lt;arguments&amp;gt;
{
&amp;quot;city&amp;quot;: &amp;quot;San Francisco&amp;quot;,
&amp;quot;days&amp;quot;: 5
}
&amp;lt;/arguments&amp;gt;
&amp;lt;/use_mcp_tool&amp;gt;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Field Details&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>&amp;lt;server_name&amp;gt;&lt;/code> (Required)&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Purpose&lt;/strong>: Unique identifier of the MCP server.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>: The client uses this name to look up corresponding server information (whether it's a local process or remote URL) in its internal service registry, deciding whether to use Stdio or SSE for communication. This is key to implementing routing.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>&lt;code>&amp;lt;tool_name&amp;gt;&lt;/code> (Required)&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Purpose&lt;/strong>: Name of the tool to call.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>: After receiving the request, the MCP server uses this name to find and execute the corresponding function in its internal tool mapping table.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>&lt;code>&amp;lt;arguments&amp;gt;&lt;/code> (Required)&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Purpose&lt;/strong>: Parameters needed to call the tool.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>: The content is typically a &lt;strong>JSON string&lt;/strong>. The server needs to first parse this string, convert it to a language-native object or dictionary, and then pass it to the specific tool function. This design leverages JSON's powerful data expression capabilities and cross-language universality.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
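The JSON-in-XML design of `&lt;arguments&gt;` is easy to see in code: the server decodes the JSON string into a native dict and expands it into keyword arguments for the registered tool function. The `get_forecast` function below is a stand-in for a real tool:

```python
import json

# Hypothetical tool function; in practice this is looked up by tool_name.
def get_forecast(city: str, days: int = 1):
    return f"{days}-day forecast for {city}"

# The text content of the arguments element, exactly as carried in the XML.
arguments_text = '{"city": "San Francisco", "days": 5}'

# Decode the JSON string, then expand it into keyword arguments.
args = json.loads(arguments_text)
print(get_forecast(**args))   # 5-day forecast for San Francisco
```

Because the parameter payload is plain JSON, the same request can be produced by any model and consumed by a server written in any language.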
&lt;h3 id="42-accessmcpresource-accessing-a-resource">4.2. &lt;code>&amp;lt;access_mcp_resource&amp;gt;&lt;/code>: Accessing a Resource&lt;/h3>
&lt;p>In addition to actively &amp;ldquo;executing&amp;rdquo; tools, MCP also supports passively &amp;ldquo;accessing&amp;rdquo; data sources.&lt;/p>
&lt;p>&lt;strong>Structure Example&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-xml">&amp;lt;access_mcp_resource&amp;gt;
&amp;lt;server_name&amp;gt;internal-docs&amp;lt;/server_name&amp;gt;
&amp;lt;uri&amp;gt;doc://product/specs/version-3.md&amp;lt;/uri&amp;gt;
&amp;lt;/access_mcp_resource&amp;gt;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Field Details&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>&amp;lt;server_name&amp;gt;&lt;/code> (Required)&lt;/strong>: Same as above, used for routing.&lt;/li>
&lt;li>&lt;strong>&lt;code>&amp;lt;uri&amp;gt;&lt;/code> (Required)&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Purpose&lt;/strong>: Uniform Resource Identifier for the resource.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>: The format of the URI (&lt;code>scheme://path&lt;/code>) is defined and interpreted by the server itself. For example:
&lt;ul>
&lt;li>&lt;code>file:///path/to/local/file&lt;/code>: Access a local file.&lt;/li>
&lt;li>&lt;code>db://customers/id/123&lt;/code>: Query a database.&lt;/li>
&lt;li>&lt;code>api://v1/users?active=true&lt;/code>: Access a REST API endpoint.&lt;/li>
&lt;/ul>
The server needs to parse this URI and execute the appropriate resource retrieval logic based on its scheme and path.
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
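A server-side URI dispatcher can be sketched with the standard library's `urlparse`. The schemes and handler functions below are illustrative stand-ins; each MCP server defines and interprets its own URI formats:

```python
from urllib.parse import urlparse

# Stand-in handlers; a real server would hit the file system, a database, or an API.
def handle_file(path):
    return f"file contents of {path}"

def handle_db(path):
    return f"database lookup for {path}"

SCHEME_HANDLERS = {"file": handle_file, "db": handle_db}

def access_resource(uri: str):
    """Dispatch a resource URI to the handler registered for its scheme."""
    parsed = urlparse(uri)
    handler = SCHEME_HANDLERS.get(parsed.scheme)
    if handler is None:
        raise ValueError(f"Unsupported scheme: {parsed.scheme}")
    return handler(parsed.netloc + parsed.path)

print(access_resource("db://customers/id/123"))  # database lookup for customers/id/123
```

Registering a new scheme is just one more entry in the handler table, which mirrors how a server can expose new resource types without any protocol changes.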
&lt;h2 id="5-building-an-mcp-server-from-concept-to-code-skeleton">5. Building an MCP Server: From Concept to Code Skeleton&lt;/h2>
&lt;p>To make the concept more concrete, below is a minimalist Python pseudocode skeleton showing how to implement an MCP server that responds to Stdio communication.&lt;/p>
&lt;pre>&lt;code class="language-python">import sys
import json
import xml.etree.ElementTree as ET
# 1. Define specific tool functions
def get_weather(city: str, days: int = 1):
&amp;quot;&amp;quot;&amp;quot;A simulated weather tool&amp;quot;&amp;quot;&amp;quot;
# In the real world, this would call a weather API
return {&amp;quot;city&amp;quot;: city, &amp;quot;forecast&amp;quot;: f&amp;quot;Sunny for the next {days} days&amp;quot;}
# Map tool names to function objects
AVAILABLE_TOOLS = {
&amp;quot;get_weather&amp;quot;: get_weather
}
# 2. MCP protocol processing main loop
def main_loop():
&amp;quot;&amp;quot;&amp;quot;Read requests from stdin, process them, and write results to stdout&amp;quot;&amp;quot;&amp;quot;
for line in sys.stdin:
request_xml = line.strip()
if not request_xml:
continue
try:
# 3. Parse MCP request
root = ET.fromstring(request_xml)
if root.tag == &amp;quot;use_mcp_tool&amp;quot;:
tool_name = root.find(&amp;quot;tool_name&amp;quot;).text
args_str = root.find(&amp;quot;arguments&amp;quot;).text
args = json.loads(args_str)
# 4. Find and execute the tool
tool_function = AVAILABLE_TOOLS.get(tool_name)
if tool_function:
result = tool_function(**args)
# 5. Encapsulate successful result and write back to stdout
response = {&amp;quot;status&amp;quot;: &amp;quot;success&amp;quot;, &amp;quot;data&amp;quot;: result}
sys.stdout.write(json.dumps(response) + &amp;quot;\n&amp;quot;)
else:
raise ValueError(f&amp;quot;Tool '{tool_name}' not found.&amp;quot;)
# (Logic for handling access_mcp_resource can be added here)
except Exception as e:
# 6. Write error information back to stderr
error_response = {&amp;quot;status&amp;quot;: &amp;quot;error&amp;quot;, &amp;quot;message&amp;quot;: str(e)}
sys.stderr.write(json.dumps(error_response) + &amp;quot;\n&amp;quot;)
# Flush buffers in real-time to ensure the client receives immediately
sys.stdout.flush()
sys.stderr.flush()
if __name__ == &amp;quot;__main__&amp;quot;:
main_loop()
&lt;/code>&lt;/pre>
&lt;p>This skeleton clearly demonstrates the core responsibilities of an MCP server: listening for input, parsing the protocol, executing logic, and returning results.&lt;/p>
&lt;h2 id="6-practical-exercise-using-the-mcpdriven-context7-server-to-answer-technical-questions">6. Practical Exercise: Using the MCP-Driven context7 Server to Answer Technical Questions&lt;/h2>
&lt;p>After theory and skeleton, let's look at a real, end-to-end example to see how MCP works in practical applications.&lt;/p>
&lt;p>&lt;strong>Scenario&lt;/strong>: We're building an AI programming assistant. When a user asks a specific programming question, we want the AI to provide the most authoritative and accurate answer by querying the latest official documentation, rather than relying on its potentially outdated internal knowledge.&lt;/p>
&lt;p>In this scenario, the &lt;code>context7&lt;/code> MCP server is our &amp;ldquo;external document library.&amp;rdquo;&lt;/p>
&lt;p>Here's the complete interaction flow:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant User
participant Agent as AI Programming Assistant (Model+Client)
participant Context7 as context7 MCP Server
User-&amp;gt;&amp;gt;+Agent: Ask about React Hooks differences
Note over Agent: 1. Analyze question, decide to call tool
Agent--&amp;gt;&amp;gt;+Context7: 2. Send MCP request (get-library-docs)
Note over Context7: 3. Query document library
Context7--&amp;gt;&amp;gt;-Agent: 4. Return document summary (key differences)
Note over Agent: 5. Understand and summarize authoritative material
Agent--&amp;gt;&amp;gt;-User: 6. Generate final answer based on documentation
&lt;/code>&lt;/pre>
&lt;h3 id="process-breakdown-and-mcp-value-demonstration">Process Breakdown and MCP Value Demonstration&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Intent to Protocol Conversion&lt;/strong>: The model (LLM) successfully converts the user's natural language question into a structured, standardized MCP request. It not only identifies the need to call a tool but also accurately fills in the &lt;code>server_name&lt;/code>, &lt;code>tool_name&lt;/code>, and &lt;code>arguments&lt;/code>, which is the core capability of an MCP-driven Agent.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Decoupling Advantage&lt;/strong>: The AI programming assistant (client) doesn't need to know at all how the &lt;code>context7&lt;/code> server is implemented. It could be a complex system connected to multiple data sources. But for the assistant, it's just a service endpoint that follows the MCP protocol and can be accessed through the name &lt;code>context7&lt;/code>. This decoupling makes replacing or upgrading the document source extremely simple without needing to modify the Agent's core logic.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Scalability from Standardization&lt;/strong>: Now, if we want to add the ability to query NPM package dependencies to this AI assistant, we just need to develop or integrate another MCP server named &lt;code>npm-analyzer&lt;/code>. The learning cost for the Agent is almost zero because it only needs to learn to generate a new &lt;code>&amp;lt;use_mcp_tool&amp;gt;&lt;/code> request pointing to the new &lt;code>server_name&lt;/code>. The entire system's capabilities can be infinitely expanded like building with Lego blocks.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>This example clearly demonstrates how MCP evolves from a simple &amp;ldquo;function call&amp;rdquo; concept to a powerful, scalable service-oriented architecture, providing a solid foundation for building complex AI applications.&lt;/p>
&lt;h2 id="7-conclusion-mcps-value-and-futurebuilding-the-internet-of-ai">7. Conclusion: MCP's Value and Future—Building the &amp;ldquo;Internet&amp;rdquo; of AI&lt;/h2>
&lt;p>General tool calling gives LLMs the ability to &amp;ldquo;speak&amp;rdquo; and &amp;ldquo;act,&amp;rdquo; while the &lt;strong>Model Context Protocol (MCP) defines the grammar and traffic rules for these abilities&lt;/strong>. Through standardization, decoupling, and service-oriented design principles, MCP transforms isolated AI applications and tools into a potential, interoperable massive network.&lt;/p>
&lt;p>The true value of MCP isn't that it defines another type of RPC (Remote Procedure Call), but that it's specifically tailored for the unique scenario of &lt;strong>AI Agent interaction with the external world&lt;/strong>. It's simple enough for LLMs to easily generate protocol messages, yet powerful enough to support complex, distributed application ecosystems.&lt;/p>
&lt;p>In the future, as the MCP ecosystem matures, we can envision an &amp;ldquo;Internet of AI tools&amp;rdquo;:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Tool Marketplace&lt;/strong>: Developers can publish and sell standardized MCP servers, and other applications can purchase and integrate them as needed.&lt;/li>
&lt;li>&lt;strong>Agent Interoperability&lt;/strong>: Intelligent agents developed by different companies based on different underlying models can call each other's capabilities and collaborate on more complex tasks as long as they all &amp;ldquo;speak&amp;rdquo; the MCP language.&lt;/li>
&lt;li>&lt;strong>Dynamic Service Discovery&lt;/strong>: More advanced Agents might be able to dynamically discover and learn new MCP services, continuously expanding their capability boundaries without requiring reprogramming.&lt;/li>
&lt;/ul>
&lt;p>Therefore, understanding and mastering MCP is not just about learning a specific technology, but a key step in gaining insight into and planning for the next generation of AI application architecture.&lt;/p></description></item><item><title>LLM Tool Calling: The Key Technology Breaking AI Capability Boundaries</title><link>https://ziyanglin.netlify.app/en/post/llm-tool-calling/</link><pubDate>Mon, 30 Jun 2025 07:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/llm-tool-calling/</guid><description>&lt;h2 id="1-macro-overview-why-tool-calling-is-llms-super-plugin">1. Macro Overview: Why Tool Calling is LLM's &amp;ldquo;Super Plugin&amp;rdquo;&lt;/h2>
&lt;p>The emergence of Large Language Models (LLMs) has fundamentally changed how we interact with machines. However, LLMs have an inherent, unavoidable &amp;ldquo;ceiling&amp;rdquo;: they are essentially &amp;ldquo;probability prediction machines&amp;rdquo; trained on massive text data, with their knowledge frozen at the time their training data ends. This means an LLM cannot know &amp;ldquo;what's the weather like today?&amp;rdquo;, cannot access your company's internal database, and cannot book a flight ticket for you.&lt;/p>
&lt;p>The &lt;strong>LLM Tool Calling / Function Calling&lt;/strong> mechanism emerged precisely to break through this ceiling. It gives LLMs an unprecedented ability: &lt;strong>calling external tools (APIs, functions, databases, etc.) to obtain real-time information, perform specific tasks, or interact with the external world&lt;/strong> when needed.&lt;/p>
&lt;p>In simple terms, the tool calling mechanism upgrades LLMs from &amp;ldquo;knowledgeable conversationalists&amp;rdquo; to capable &amp;ldquo;intelligent agents.&amp;rdquo; It allows LLMs to:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Obtain real-time information&lt;/strong>: By calling weather APIs, news APIs, search engines, etc., to get the latest information beyond the model's training data.&lt;/li>
&lt;li>&lt;strong>Operate external systems&lt;/strong>: Connect to enterprise CRM/ERP systems to query data, or connect to IoT devices to control smart home appliances.&lt;/li>
&lt;li>&lt;strong>Execute complex tasks&lt;/strong>: Break down complex user instructions (like &amp;ldquo;help me find and book a cheap flight to Shanghai next week&amp;rdquo;) and complete them by calling multiple APIs in combination.&lt;/li>
&lt;li>&lt;strong>Provide more precise, verifiable answers&lt;/strong>: For queries requiring exact calculations or structured data, LLMs can call calculators or databases instead of relying on their potentially inaccurate internal knowledge.&lt;/li>
&lt;/ul>
&lt;p>Therefore, tool calling is not just a simple extension of LLM functionality, but a core foundation for building truly powerful AI applications that deeply integrate with both the physical and digital worlds.&lt;/p>
&lt;h2 id="2-core-concepts-and-workflow-how-do-llms-learn-to-use-tools">2. Core Concepts and Workflow: How Do LLMs &amp;ldquo;Learn&amp;rdquo; to Use Tools?&lt;/h2>
&lt;p>To understand the underlying logic of tool calling, we need to view it as an elegant process involving three core roles working together:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Large Language Model (LLM)&lt;/strong>: The brain and decision-maker.&lt;/li>
&lt;li>&lt;strong>Tool Definitions&lt;/strong>: A detailed &amp;ldquo;tool instruction manual.&amp;rdquo;&lt;/li>
&lt;li>&lt;strong>Developer/Client-side Code&lt;/strong>: The ultimate &amp;ldquo;executor.&amp;rdquo;&lt;/li>
&lt;/ol>
&lt;p>The LLM itself &lt;strong>never actually executes any code&lt;/strong>. Its only task, after understanding the user's intent and the &amp;ldquo;tool manual&amp;rdquo; it has, is to &lt;strong>generate a JSON data structure that precisely describes which tool should be called and with what parameters&lt;/strong>.&lt;/p>
&lt;p>Below is a visual explanation of this process:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant User
participant Client as Client/Application Layer
participant LLM as Large Language Model
participant Tools as External Tools/APIs
User-&amp;gt;&amp;gt;+Client: &amp;quot;What's the weather in Beijing today?&amp;quot;
Client-&amp;gt;&amp;gt;+LLM: Submit user request + Tool Definitions
Note over LLM: 1. Understand user intent&amp;lt;br/&amp;gt;2. Match most appropriate tool (get_weather)&amp;lt;br/&amp;gt;3. Extract required parameters (location: &amp;quot;Beijing&amp;quot;)
LLM--&amp;gt;&amp;gt;-Client: Return JSON: {&amp;quot;tool_calls&amp;quot;: [{&amp;quot;function&amp;quot;: {&amp;quot;name&amp;quot;: &amp;quot;get_weather&amp;quot;, &amp;quot;arguments&amp;quot;: &amp;quot;{\&amp;quot;location\&amp;quot;: \&amp;quot;Beijing\&amp;quot;}&amp;quot;}}]}
Client-&amp;gt;&amp;gt;+Tools: 2. Based on LLM's JSON, call the actual get_weather(&amp;quot;Beijing&amp;quot;) function
Tools--&amp;gt;&amp;gt;-Client: Return weather data (e.g.: {&amp;quot;temperature&amp;quot;: &amp;quot;25°C&amp;quot;, &amp;quot;condition&amp;quot;: &amp;quot;sunny&amp;quot;})
Client-&amp;gt;&amp;gt;+LLM: 3. Submit tool execution result back to LLM
Note over LLM: 4. Understand the data returned by the tool
LLM--&amp;gt;&amp;gt;-Client: 5. Generate user-friendly natural language response
Client-&amp;gt;&amp;gt;-User: &amp;quot;The weather in Beijing today is sunny with a temperature of 25 degrees Celsius.&amp;quot;
&lt;/code>&lt;/pre>
&lt;h3 id="process-breakdown">Process Breakdown:&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Define &amp;amp; Describe&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Developers first need to define available tools in a structured way (typically using JSON Schema). This &amp;ldquo;manual&amp;rdquo; is crucial to the entire process and must clearly tell the LLM:
&lt;ul>
&lt;li>&lt;strong>Tool name&lt;/strong> (&lt;code>name&lt;/code>): For example, &lt;code>get_weather&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Tool function description&lt;/strong> (&lt;code>description&lt;/code>): For example, &amp;ldquo;Get real-time weather information for a specified city.&amp;rdquo; This is the most important basis for the LLM to understand the tool's purpose.&lt;/li>
&lt;li>&lt;strong>Tool parameters&lt;/strong> (&lt;code>parameters&lt;/code>): Detailed definition of what inputs the tool needs, including each input's name, type (string, number, boolean, etc.), whether it's required, and parameter descriptions.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Intent Recognition &amp;amp; Parameter Extraction&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>When a user makes a request (e.g., &amp;ldquo;Check the weather in Beijing&amp;rdquo;), the developer's application sends the user's original request &lt;strong>along with all the tool definitions from step 1&lt;/strong> to the LLM.&lt;/li>
&lt;li>The LLM's core task is to do two things:
&lt;ul>
&lt;li>&lt;strong>Intent Recognition&lt;/strong>: Among all available tools, determine which tool's function description best matches the user's request. In this example, it would match &lt;code>get_weather&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Parameter Extraction&lt;/strong>: From the user's request, identify and extract values that satisfy the tool's parameter requirements. Here, it would recognize that the &lt;code>location&lt;/code> parameter value is &amp;ldquo;Beijing&amp;rdquo;.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>After completing these two steps, the LLM generates one or more &lt;code>tool_calls&lt;/code> objects, essentially saying &amp;ldquo;I suggest you call the function named &lt;code>get_weather&lt;/code> and pass in the parameter &lt;code>{ &amp;quot;location&amp;quot;: &amp;quot;Beijing&amp;quot; }&lt;/code>&amp;rdquo;.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Execute &amp;amp; Observe&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>The developer's application code receives the JSON returned by the LLM and parses this &amp;ldquo;call suggestion.&amp;rdquo;&lt;/li>
&lt;li>The application code &lt;strong>actually executes&lt;/strong> the &lt;code>get_weather(&amp;quot;Beijing&amp;quot;)&lt;/code> function locally or on the server side.&lt;/li>
&lt;li>After execution, it gets a real return result, such as a JSON object containing weather information.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Summarize &amp;amp; Respond&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>To complete the loop, the application layer needs to submit the actual execution result from the previous step back to the LLM.&lt;/li>
&lt;li>This time, the LLM's task is to understand this raw data returned by the tool (e.g., &lt;code>{&amp;quot;temperature&amp;quot;: &amp;quot;25°C&amp;quot;, &amp;quot;condition&amp;quot;: &amp;quot;sunny&amp;quot;}&lt;/code>) and convert it into a fluent, natural, user-friendly response.&lt;/li>
&lt;li>Finally, the user receives the reply &amp;ldquo;The weather in Beijing today is sunny with a temperature of 25 degrees Celsius,&amp;rdquo; and the entire process is complete.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
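The four-step loop above can be condensed into a client-side sketch. The &amp;ldquo;LLM response&amp;rdquo; is hard-coded here in the `tool_calls` shape shown in the diagram, since the point is the application-layer handling (parse, execute, feed back), not the model call itself; `get_weather` is a stand-in for a real API:

```python
import json

# Stand-in for a real weather API call.
def get_weather(location: str):
    return {"temperature": "25C", "condition": "sunny"}

# Registry mapping tool names (as declared to the LLM) to local functions.
LOCAL_FUNCTIONS = {"get_weather": get_weather}

# A hard-coded stand-in for the model's structured output.
llm_response = {
    "tool_calls": [
        {"function": {"name": "get_weather", "arguments": '{"location": "Beijing"}'}}
    ]
}

# The model only *proposes* calls; the client is the one that executes them.
for call in llm_response["tool_calls"]:
    fn_name = call["function"]["name"]
    fn_args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
    result = LOCAL_FUNCTIONS[fn_name](**fn_args)
    # In the real loop, `result` is sent back to the LLM as a tool message
    # so it can generate the final natural language reply.
    print(result["condition"])   # sunny
```

Note that `arguments` is itself a JSON-encoded string inside the response JSON, so it must be decoded separately before the function can be invoked.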
&lt;p>This process elegantly combines the LLM's powerful natural language understanding ability with the external tool's powerful functional execution capability, achieving a 1+1&amp;gt;2 effect.&lt;/p>
&lt;h2 id="3-technical-deep-dive-analyzing-the-industry-standard-openai-tool-calling">3. Technical Deep Dive: Analyzing the Industry Standard (OpenAI Tool Calling)&lt;/h2>
&lt;p>OpenAI's API is currently the de facto standard in the field of LLM tool calling, and its design is widely emulated. Understanding its implementation details is crucial for any developer looking to integrate LLM tool calling into their applications.&lt;/p>
&lt;h3 id="31-core-api-parameters">3.1. Core API Parameters&lt;/h3>
&lt;p>When calling OpenAI's Chat Completions API, there are two main parameters related to tool calling: &lt;code>tools&lt;/code> and &lt;code>tool_choice&lt;/code>.&lt;/p>
&lt;h4 id="tools-parameter-your-toolbox">&lt;code>tools&lt;/code> Parameter: Your &amp;ldquo;Toolbox&amp;rdquo;&lt;/h4>
&lt;p>The &lt;code>tools&lt;/code> parameter is an array where you can define one or more tools. Each tool follows a fixed structure, with the core being a &lt;code>function&lt;/code> object defined based on the &lt;strong>JSON Schema&lt;/strong> specification.&lt;/p>
&lt;p>&lt;strong>Example: Defining a weather tool and a flight booking tool&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-json">[
{
&amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;,
&amp;quot;function&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;get_current_weather&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;Get real-time weather information for a specified location&amp;quot;,
&amp;quot;parameters&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
&amp;quot;properties&amp;quot;: {
&amp;quot;location&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;City and state/province name, e.g., 'San Francisco, CA'&amp;quot;
},
&amp;quot;unit&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;,
&amp;quot;enum&amp;quot;: [&amp;quot;celsius&amp;quot;, &amp;quot;fahrenheit&amp;quot;],
&amp;quot;description&amp;quot;: &amp;quot;Temperature unit&amp;quot;
}
},
&amp;quot;required&amp;quot;: [&amp;quot;location&amp;quot;]
}
}
},
{
&amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;,
&amp;quot;function&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;book_flight&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;Book a flight ticket for the user from departure to destination&amp;quot;,
&amp;quot;parameters&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
&amp;quot;properties&amp;quot;: {
&amp;quot;departure&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;Departure airport or city&amp;quot;
},
&amp;quot;destination&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;Destination airport or city&amp;quot;
},
&amp;quot;date&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;Desired departure date in YYYY-MM-DD format&amp;quot;
}
},
&amp;quot;required&amp;quot;: [&amp;quot;departure&amp;quot;, &amp;quot;destination&amp;quot;, &amp;quot;date&amp;quot;]
}
}
}
]
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Key Points Analysis&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>type&lt;/code>&lt;/strong>: Currently fixed as &lt;code>&amp;quot;function&amp;quot;&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;code>function.name&lt;/code>&lt;/strong>: Function name. Must be a combination of letters, numbers, and underscores, not exceeding 64 characters. This is the key for your code to identify which function to call.&lt;/li>
&lt;li>&lt;strong>&lt;code>function.description&lt;/code>&lt;/strong>: &lt;strong>Critically important&lt;/strong>. This is the main basis for the LLM to decide whether to select this tool. The description should clearly, accurately, and unambiguously explain what the function does. A good description can greatly improve the LLM's call accuracy.&lt;/li>
&lt;li>&lt;strong>&lt;code>function.parameters&lt;/code>&lt;/strong>: A standard JSON Schema object.
&lt;ul>
&lt;li>&lt;strong>&lt;code>type&lt;/code>&lt;/strong>: Must be &lt;code>&amp;quot;object&amp;quot;&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;code>properties&lt;/code>&lt;/strong>: Defines each parameter's name, type (&lt;code>string&lt;/code>, &lt;code>number&lt;/code>, &lt;code>boolean&lt;/code>, &lt;code>array&lt;/code>, &lt;code>object&lt;/code>), and description. The parameter description is equally important as it helps the LLM understand what information to extract from user input to fill this parameter.&lt;/li>
&lt;li>&lt;strong>&lt;code>required&lt;/code>&lt;/strong>: An array of strings listing which parameters are mandatory. If the user request lacks necessary information, the LLM might ask follow-up questions or choose not to call the tool.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
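&lt;p>Before executing a suggested call, it is also worth validating the parsed arguments against the schema — at minimum checking the &lt;code>required&lt;/code> list and any &lt;code>enum&lt;/code> constraints. A minimal pure-Python sketch (the helper name &lt;code>validate_args&lt;/code> is illustrative, not part of any API):&lt;/p>

```python
def validate_args(schema: dict, args: dict) -> list:
    """Return a list of problems found; an empty list means the arguments look valid."""
    problems = []
    # Every name listed in 'required' must be present
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required parameter: {name}")
    # Values constrained by 'enum' must be one of the allowed literals
    for name, spec in schema.get("properties", {}).items():
        if name in args and "enum" in spec and args[name] not in spec["enum"]:
            problems.append(f"invalid value for {name}: {args[name]!r}")
    return problems

weather_schema = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

print(validate_args(weather_schema, {"location": "London", "unit": "kelvin"}))
```

&lt;p>For production use, a full JSON Schema validator library would cover types, formats, and nested objects as well.&lt;/p>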
&lt;h4 id="toolchoice-parameter-controlling-the-llms-choice">&lt;code>tool_choice&lt;/code> Parameter: Controlling the LLM's Choice&lt;/h4>
&lt;p>By default, the LLM decides on its own whether to respond with text or call one or more tools based on the user's input. The &lt;code>tool_choice&lt;/code> parameter allows you to control this behavior more precisely.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>&amp;quot;none&amp;quot;&lt;/code>&lt;/strong>: Forces the LLM not to call any tools and directly return a text response.&lt;/li>
&lt;li>&lt;strong>&lt;code>&amp;quot;auto&amp;quot;&lt;/code>&lt;/strong> (default): The LLM can freely choose whether to respond with text or call tools.&lt;/li>
&lt;li>&lt;strong>&lt;code>{&amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;, &amp;quot;function&amp;quot;: {&amp;quot;name&amp;quot;: &amp;quot;my_function&amp;quot;}}&lt;/code>&lt;/strong>: Forces the LLM to call this specific tool named &lt;code>my_function&lt;/code>.&lt;/li>
&lt;/ul>
&lt;p>This parameter is very useful in scenarios where you need to enforce a specific process or limit the LLM's capabilities.&lt;/p>
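&lt;p>The three forms can be written down as plain Python values and passed directly as the &lt;code>tool_choice&lt;/code> argument — a small sketch (the tool name &lt;code>book_flight&lt;/code> is just the example from above):&lt;/p>

```python
import json

# The three tool_choice forms, exactly as they would be passed to the API
choice_none = "none"    # never call a tool; always reply with text
choice_auto = "auto"    # the model decides (the default)
choice_forced = {       # force a call to one specific named tool
    "type": "function",
    "function": {"name": "book_flight"},
}

print(json.dumps(choice_forced, sort_keys=True))
```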
&lt;h3 id="32-requestresponse-lifecycle">3.2. Request-Response Lifecycle&lt;/h3>
&lt;p>A complete tool calling interaction involves at least two API requests.&lt;/p>
&lt;p>&lt;strong>First Request: From User to LLM&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python"># request
response = client.chat.completions.create(
model=&amp;quot;gpt-4o&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Please book me a flight from New York to London tomorrow&amp;quot;}],
tools=my_tools, # The tool list defined above
tool_choice=&amp;quot;auto&amp;quot;
)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>First Response: LLM's &amp;ldquo;Call Suggestion&amp;rdquo;&lt;/strong>&lt;/p>
&lt;p>If the LLM decides to call a tool, the API response's &lt;code>finish_reason&lt;/code> will be &lt;code>tool_calls&lt;/code>, and the &lt;code>message&lt;/code> object will contain a &lt;code>tool_calls&lt;/code> array.&lt;/p>
&lt;pre>&lt;code class="language-json">{
&amp;quot;choices&amp;quot;: [
{
&amp;quot;finish_reason&amp;quot;: &amp;quot;tool_calls&amp;quot;,
&amp;quot;message&amp;quot;: {
&amp;quot;role&amp;quot;: &amp;quot;assistant&amp;quot;,
&amp;quot;content&amp;quot;: null,
&amp;quot;tool_calls&amp;quot;: [
{
&amp;quot;id&amp;quot;: &amp;quot;call_abc123&amp;quot;,
&amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;,
&amp;quot;function&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;book_flight&amp;quot;,
&amp;quot;arguments&amp;quot;: &amp;quot;{\&amp;quot;departure\&amp;quot;:\&amp;quot;New York\&amp;quot;,\&amp;quot;destination\&amp;quot;:\&amp;quot;London\&amp;quot;,\&amp;quot;date\&amp;quot;:\&amp;quot;2025-07-01\&amp;quot;}&amp;quot;
}
}
]
}
}
],
...
}
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Key Points Analysis&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>finish_reason&lt;/code>&lt;/strong>: A value of &lt;code>&amp;quot;tool_calls&amp;quot;&lt;/code> indicates that the LLM wants you to execute a tool call, rather than ending the conversation.&lt;/li>
&lt;li>&lt;strong>&lt;code>message.role&lt;/code>&lt;/strong>: &lt;code>assistant&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;code>message.tool_calls&lt;/code>&lt;/strong>: This is an array, meaning the LLM can request multiple tool calls at once.
&lt;ul>
&lt;li>&lt;strong>&lt;code>id&lt;/code>&lt;/strong>: A unique call ID. In subsequent requests, you'll need to use this ID to associate the tool's execution results.&lt;/li>
&lt;li>&lt;strong>&lt;code>function.name&lt;/code>&lt;/strong>: The function name the LLM suggests calling.&lt;/li>
&lt;li>&lt;strong>&lt;code>function.arguments&lt;/code>&lt;/strong>: &lt;strong>A JSON object in string form&lt;/strong>. You need to parse this string to get the specific parameters needed to call the function.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
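&lt;p>Because &lt;code>function.arguments&lt;/code> is a JSON object in string form, it must be parsed with &lt;code>json.loads&lt;/code> — and since models occasionally emit malformed JSON, the parse should be guarded. A minimal sketch using the arguments string from the response above:&lt;/p>

```python
import json

# The 'arguments' field arrives as a string, exactly as in the response above
raw_arguments = '{"departure":"New York","destination":"London","date":"2025-07-01"}'

try:
    args = json.loads(raw_arguments)
except json.JSONDecodeError:
    # Treat unparseable arguments as a failed call rather than crashing
    args = None

print(args["departure"], args["date"])
```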
&lt;p>&lt;strong>Second Request: Returning Tool Results to the LLM&lt;/strong>&lt;/p>
&lt;p>After executing the tool in your code, you need to send the results back to the LLM to complete the conversation. At this point, you need to construct a new &lt;code>messages&lt;/code> list that includes:&lt;/p>
&lt;ol>
&lt;li>The original user message.&lt;/li>
&lt;li>The &lt;code>assistant&lt;/code> message returned by the LLM in the previous step (containing &lt;code>tool_calls&lt;/code>).&lt;/li>
&lt;li>A new message with the &lt;code>tool&lt;/code> role, containing the tool's execution results.&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-python"># message history
messages = [
{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Please book me a flight from New York to London tomorrow&amp;quot;},
response.choices[0].message, # Assistant's 'tool_calls' message
{
&amp;quot;tool_call_id&amp;quot;: &amp;quot;call_abc123&amp;quot;, # Must match the ID from the previous step
&amp;quot;role&amp;quot;: &amp;quot;tool&amp;quot;,
&amp;quot;name&amp;quot;: &amp;quot;book_flight&amp;quot;,
&amp;quot;content&amp;quot;: &amp;quot;{\&amp;quot;status\&amp;quot;: \&amp;quot;success\&amp;quot;, \&amp;quot;ticket_id\&amp;quot;: \&amp;quot;TICKET-45678\&amp;quot;}&amp;quot; # Actual return value from the tool
}
]
# second request
second_response = client.chat.completions.create(
model=&amp;quot;gpt-4o&amp;quot;,
messages=messages
)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Second Response: LLM's Final Reply&lt;/strong>&lt;/p>
&lt;p>This time, the LLM will generate a natural language response for the user based on the tool's returned results.&lt;/p>
&lt;pre>&lt;code class="language-json">{
&amp;quot;choices&amp;quot;: [
{
&amp;quot;finish_reason&amp;quot;: &amp;quot;stop&amp;quot;,
&amp;quot;message&amp;quot;: {
&amp;quot;role&amp;quot;: &amp;quot;assistant&amp;quot;,
&amp;quot;content&amp;quot;: &amp;quot;Great! I've booked your flight from New York to London for tomorrow. Your ticket ID is TICKET-45678.&amp;quot;
}
}
],
...
}
&lt;/code>&lt;/pre>
&lt;p>With this, a complete tool calling cycle is finished.&lt;/p>
&lt;h2 id="4-code-implementation-a-complete-python-example">4. Code Implementation: A Complete Python Example&lt;/h2>
&lt;p>Below is an end-to-end Python example using OpenAI's Python library to demonstrate how to implement a weather query feature.&lt;/p>
&lt;pre>&lt;code class="language-python">import os
import json
from openai import OpenAI
from dotenv import load_dotenv
# --- 1. Initial Setup ---
load_dotenv() # Load environment variables from .env file
client = OpenAI(api_key=os.getenv(&amp;quot;OPENAI_API_KEY&amp;quot;))
# --- 2. Define Our Local Tool Functions ---
# This is a mock function; in a real application, it would call an actual weather API
def get_current_weather(location, unit=&amp;quot;celsius&amp;quot;):
&amp;quot;&amp;quot;&amp;quot;Get real-time weather information for a specified location&amp;quot;&amp;quot;&amp;quot;
if &amp;quot;New York&amp;quot; in location:
return json.dumps({
&amp;quot;location&amp;quot;: &amp;quot;New York&amp;quot;,
&amp;quot;temperature&amp;quot;: &amp;quot;10&amp;quot;,
&amp;quot;unit&amp;quot;: unit,
&amp;quot;forecast&amp;quot;: [&amp;quot;sunny&amp;quot;, &amp;quot;light breeze&amp;quot;]
})
elif &amp;quot;London&amp;quot; in location:
return json.dumps({
&amp;quot;location&amp;quot;: &amp;quot;London&amp;quot;,
&amp;quot;temperature&amp;quot;: &amp;quot;15&amp;quot;,
&amp;quot;unit&amp;quot;: unit,
&amp;quot;forecast&amp;quot;: [&amp;quot;light rain&amp;quot;, &amp;quot;northeast wind&amp;quot;]
})
else:
return json.dumps({&amp;quot;location&amp;quot;: location, &amp;quot;temperature&amp;quot;: &amp;quot;unknown&amp;quot;})
# --- 3. Main Execution Flow ---
def run_conversation(user_prompt: str):
print(f&amp;quot;👤 User: {user_prompt}&amp;quot;)
# Step 1: Send the user's message and tool definitions to the LLM
messages = [{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: user_prompt}]
tools = [
{
&amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;,
&amp;quot;function&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;get_current_weather&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;Get real-time weather information for a specified city&amp;quot;,
&amp;quot;parameters&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
&amp;quot;properties&amp;quot;: {
&amp;quot;location&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;City name, e.g., New York City&amp;quot;,
},
&amp;quot;unit&amp;quot;: {&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;, &amp;quot;enum&amp;quot;: [&amp;quot;celsius&amp;quot;, &amp;quot;fahrenheit&amp;quot;]},
},
&amp;quot;required&amp;quot;: [&amp;quot;location&amp;quot;],
},
},
}
]
response = client.chat.completions.create(
model=&amp;quot;gpt-4o&amp;quot;,
messages=messages,
tools=tools,
tool_choice=&amp;quot;auto&amp;quot;,
)
response_message = response.choices[0].message
tool_calls = response_message.tool_calls
# Step 2: Check if the LLM decided to call a tool
if tool_calls:
print(f&amp;quot;🤖 LLM decided to call tool: {tool_calls[0].function.name}&amp;quot;)
# Add the LLM's reply to the message history
messages.append(response_message)
# Step 3: Execute the tool call
# Note: This example only handles the first tool call
tool_call = tool_calls[0]
function_name = tool_call.function.name
function_to_call = globals().get(function_name) # Get the function from the global scope
if not function_to_call:
print(f&amp;quot;❌ Error: Function {function_name} is not defined&amp;quot;)
return
function_args = json.loads(tool_call.function.arguments)
# Call the function and get the result
function_response = function_to_call(
location=function_args.get(&amp;quot;location&amp;quot;),
unit=function_args.get(&amp;quot;unit&amp;quot;),
)
print(f&amp;quot;🛠️ Tool '{function_name}' returned: {function_response}&amp;quot;)
# Step 4: Return the tool's execution result to the LLM
messages.append(
{
&amp;quot;tool_call_id&amp;quot;: tool_call.id,
&amp;quot;role&amp;quot;: &amp;quot;tool&amp;quot;,
&amp;quot;name&amp;quot;: function_name,
&amp;quot;content&amp;quot;: function_response,
}
)
print(&amp;quot;🗣️ Submitting tool result back to LLM, generating final response...&amp;quot;)
second_response = client.chat.completions.create(
model=&amp;quot;gpt-4o&amp;quot;,
messages=messages,
)
final_response = second_response.choices[0].message.content
print(f&amp;quot;🤖 LLM final response: {final_response}&amp;quot;)
return final_response
else:
# If the LLM didn't call any tools, directly return its text content
final_response = response_message.content
print(f&amp;quot;🤖 LLM direct response: {final_response}&amp;quot;)
return final_response
# --- Run Examples ---
if __name__ == &amp;quot;__main__&amp;quot;:
run_conversation(&amp;quot;What's the weather like in London today?&amp;quot;)
print(&amp;quot;\n&amp;quot; + &amp;quot;=&amp;quot;*50 + &amp;quot;\n&amp;quot;)
run_conversation(&amp;quot;How are you?&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>This example clearly demonstrates the entire process from defining tools, sending requests, handling &lt;code>tool_calls&lt;/code>, executing local functions, to sending results back to the model to get the final answer.&lt;/p>
&lt;h2 id="5-advanced-topics-and-best-practices">5. Advanced Topics and Best Practices&lt;/h2>
&lt;p>After mastering the basic process, we need to understand some advanced usage and design principles to build more robust and reliable tool calling systems.&lt;/p>
&lt;h3 id="51-parallel-tool-calling">5.1. Parallel Tool Calling&lt;/h3>
&lt;p>Newer models (like &lt;code>gpt-4o&lt;/code>) support parallel tool calling. This means the model can request multiple different, independent tools to be called in a single response.&lt;/p>
&lt;p>&lt;strong>Scenario Example&lt;/strong>: User asks: &amp;ldquo;What's the weather like in New York and London today?&amp;rdquo;&lt;/p>
&lt;p>The model might return a response containing two &lt;code>tool_calls&lt;/code>:&lt;/p>
&lt;ol>
&lt;li>&lt;code>get_current_weather(location=&amp;quot;New York&amp;quot;)&lt;/code>&lt;/li>
&lt;li>&lt;code>get_current_weather(location=&amp;quot;London&amp;quot;)&lt;/code>&lt;/li>
&lt;/ol>
&lt;p>Your code needs to be able to iterate through each &lt;code>tool_call&lt;/code> object in the &lt;code>message.tool_calls&lt;/code> array, execute them separately, collect all results, and then submit these results together in a new request to the model.&lt;/p>
&lt;p>&lt;strong>Code Handling Logic&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-python"># ... (received response_message containing multiple tool_calls)
messages.append(response_message) # Add assistant's reply to messages
# Execute functions for each tool call and collect results
tool_outputs = []
for tool_call in tool_calls:
function_name = tool_call.function.name
function_to_call = available_functions[function_name]
function_args = json.loads(tool_call.function.arguments)
output = function_to_call(**function_args)
tool_outputs.append({
&amp;quot;tool_call_id&amp;quot;: tool_call.id,
&amp;quot;role&amp;quot;: &amp;quot;tool&amp;quot;,
&amp;quot;name&amp;quot;: function_name,
&amp;quot;content&amp;quot;: output,
})
# Add all tool outputs to the message history
messages.extend(tool_outputs)
# Call the model again
second_response = client.chat.completions.create(
model=&amp;quot;gpt-4o&amp;quot;,
messages=messages
)
&lt;/code>&lt;/pre>
&lt;h3 id="52-error-handling">5.2. Error Handling&lt;/h3>
&lt;p>Tool calls are not always successful. APIs might time out, databases might be unreachable, or the function execution itself might throw exceptions. Gracefully handling these errors is crucial.&lt;/p>
&lt;p>When a tool execution fails, you should catch the exception and return structured information describing the error as the result of the tool call to the LLM.&lt;/p>
&lt;p>&lt;strong>Example&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-python">try:
# Try to call the API
result = some_flaky_api()
content = json.dumps({&amp;quot;status&amp;quot;: &amp;quot;success&amp;quot;, &amp;quot;data&amp;quot;: result})
except Exception as e:
# If it fails, return error information
content = json.dumps({&amp;quot;status&amp;quot;: &amp;quot;error&amp;quot;, &amp;quot;message&amp;quot;: f&amp;quot;API call failed: {str(e)}&amp;quot;})
# Return the result (whether successful or failed) to the LLM
messages.append({
&amp;quot;tool_call_id&amp;quot;: tool_call.id,
&amp;quot;role&amp;quot;: &amp;quot;tool&amp;quot;,
&amp;quot;name&amp;quot;: function_name,
&amp;quot;content&amp;quot;: content,
})
&lt;/code>&lt;/pre>
&lt;p>When the LLM receives error information, it typically responds to the user with an apologetic answer that reflects the problem (e.g., &amp;ldquo;Sorry, I'm currently unable to retrieve weather information. Please try again later.&amp;rdquo;) rather than causing the entire application to crash.&lt;/p>
&lt;h3 id="53-designing-effective-tool-descriptions">5.3. Designing Effective Tool Descriptions&lt;/h3>
&lt;p>&lt;strong>The quality of the tool description (&lt;code>description&lt;/code>) directly determines the LLM's call accuracy.&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Clear and Specific&lt;/strong>: Avoid using vague terms.
&lt;ul>
&lt;li>&lt;strong>Bad&lt;/strong>: &amp;ldquo;Get data&amp;rdquo;&lt;/li>
&lt;li>&lt;strong>Good&lt;/strong>: &amp;ldquo;Query the user's order history from the company's CRM system based on user ID&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Include Key Information and Limitations&lt;/strong>: If the tool has specific limitations, be sure to mention them in the description.
&lt;ul>
&lt;li>&lt;strong>Example&lt;/strong>: &amp;ldquo;Query flight information. Note: This tool can only query flights within the next 30 days and cannot query historical flights.&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Start with a Verb&lt;/strong>: Use a clear verb to describe the core functionality of the function.&lt;/li>
&lt;li>&lt;strong>Clear Parameter Descriptions&lt;/strong>: The &lt;code>description&lt;/code> of parameters is equally important; it guides the LLM on how to correctly extract information from user conversations.
&lt;ul>
&lt;li>&lt;strong>Bad&lt;/strong>: &lt;code>&amp;quot;date&amp;quot;: &amp;quot;A date&amp;quot;&lt;/code>&lt;/li>
&lt;li>&lt;strong>Good&lt;/strong>: &lt;code>&amp;quot;date&amp;quot;: &amp;quot;Booking date, must be a string in YYYY-MM-DD format&amp;quot;&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="54-security-considerations">5.4. Security Considerations&lt;/h3>
&lt;p>Giving LLMs the ability to call code is a double-edged sword and must be handled with caution.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Never Execute Code Generated by LLMs&lt;/strong>: The LLM's output is a &amp;ldquo;call suggestion,&amp;rdquo; not executable code. Never use &lt;code>eval()&lt;/code> or similar methods to directly execute strings generated by LLMs. You should parse the suggested function name and parameters, then call your pre-defined, safe, and trusted local functions.&lt;/li>
&lt;li>&lt;strong>Confirmation and Authorization&lt;/strong>: For operations with serious consequences (like deleting data, sending emails, making payments), implement a confirmation mechanism before execution. This could be forcing user confirmation at the code level or having the LLM generate a confirmation message after generating the call suggestion.&lt;/li>
&lt;li>&lt;strong>Principle of Least Privilege&lt;/strong>: Only provide the LLM with the minimum tools necessary to complete its task. Don't expose your entire codebase or irrelevant APIs.&lt;/li>
&lt;/ul>
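&lt;p>A concrete way to apply these principles is to dispatch tool calls through an explicit allowlist of trusted callables, instead of reaching into &lt;code>globals()&lt;/code> (as the Section 4 example does for brevity) or ever using &lt;code>eval()&lt;/code>. A minimal sketch — the names &lt;code>ALLOWED_TOOLS&lt;/code> and &lt;code>dispatch&lt;/code> are illustrative:&lt;/p>

```python
import json

def get_current_weather(location, unit="celsius"):
    # Mock tool, standing in for a real, trusted local function
    return json.dumps({"location": location, "unit": unit})

# Explicit allowlist: only these functions may ever be invoked by the model
ALLOWED_TOOLS = {
    "get_current_weather": get_current_weather,
}

def dispatch(function_name: str, raw_arguments: str) -> str:
    func = ALLOWED_TOOLS.get(function_name)
    if func is None:
        # Unknown tool name: refuse instead of looking it up dynamically
        return json.dumps({"status": "error", "message": f"unknown tool: {function_name}"})
    return func(**json.loads(raw_arguments))

print(dispatch("get_current_weather", '{"location": "London"}'))
print(dispatch("os_system", '{"cmd": "rm -rf /"}'))  # refused
```

&lt;p>The second call is rejected outright, which is exactly the behavior you want when a model hallucinates a tool name or an attacker injects one.&lt;/p>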
&lt;h2 id="6-conclusion-and-future-outlook">6. Conclusion and Future Outlook&lt;/h2>
&lt;p>LLM tool calling is one of the most breakthrough advances in artificial intelligence in recent years. It transforms LLMs from closed &amp;ldquo;language brains&amp;rdquo; into open, extensible &amp;ldquo;intelligent agent&amp;rdquo; cores capable of interacting with the world. By combining the powerful natural language understanding capabilities of LLMs with the unlimited functionality of external tools, we can build unprecedented intelligent applications.&lt;/p>
&lt;p>From querying weather and booking hotels to controlling smart homes, analyzing corporate financial reports, and automating software development processes, tool calling is unlocking countless possibilities. As model capabilities continue to strengthen, tool description understanding will become more precise, multi-tool coordination will become more complex and intelligent, and error handling and self-correction capabilities will become stronger.&lt;/p>
&lt;p>In the future, we may see more complex Agentic architectures where LLMs not only call tools but can dynamically create, combine, and even optimize tools. Mastering the principles and practices of LLM tool calling is not only an essential skill to keep up with the current AI technology wave but also a key to future intelligent application development.&lt;/p></description></item><item><title>TensorRT In-Depth: High-Performance Deep Learning Inference Engine</title><link>https://ziyanglin.netlify.app/en/post/tensorrt-documentation/</link><pubDate>Mon, 30 Jun 2025 06:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/tensorrt-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>NVIDIA® TensorRT™ is a software development kit (SDK) for high-performance deep learning inference on NVIDIA GPUs. It is designed to optimize and accelerate trained neural networks, enabling them to run in production environments with low latency and high throughput. TensorRT takes models from mainstream deep learning frameworks (such as TensorFlow, PyTorch, ONNX, etc.), applies a series of sophisticated optimization techniques, and generates a highly optimized runtime engine.&lt;/p>
&lt;p>This document will provide an in-depth yet accessible introduction to TensorRT's core concepts, key features, workflow, and latest functionalities (including TensorRT-LLM specifically designed for accelerating large language models), helping developers fully leverage its powerful performance advantages.&lt;/p>
&lt;h2 id="2-core-concepts">2. Core Concepts&lt;/h2>
&lt;p>Understanding TensorRT's core components is the first step to using it effectively.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Engine&lt;/strong>: The core of TensorRT. It is an optimized model representation that includes a computation graph and weights generated for a specific GPU architecture and configuration (such as batch size, precision). The Engine is immutable and is the final product for deployment.&lt;/li>
&lt;li>&lt;strong>Builder (&lt;code>IBuilder&lt;/code>)&lt;/strong>: This is the main interface for creating an Engine. The Builder takes a network definition and applies various optimizations, ultimately generating an optimized plan for the target GPU, which can be serialized into an Engine.&lt;/li>
&lt;li>&lt;strong>Network Definition (&lt;code>INetworkDefinition&lt;/code>)&lt;/strong>: This is where you define the model structure. You can build the network manually from scratch or import it from a model file using a Parser.&lt;/li>
&lt;li>&lt;strong>Parser&lt;/strong>: Used to parse models from different frameworks (primarily ONNX format) and convert them into TensorRT's network definition. TensorRT provides a powerful ONNX parser.&lt;/li>
&lt;li>&lt;strong>Profiler (&lt;code>IProfiler&lt;/code>)&lt;/strong>: An optional interface that allows you to collect and query information about layer performance during the build process. This helps with debugging and understanding which layers are performance bottlenecks.&lt;/li>
&lt;li>&lt;strong>Execution Context (&lt;code>IExecutionContext&lt;/code>)&lt;/strong>: This is the main interface for executing inference. An Engine can have multiple Execution Contexts, allowing concurrent execution of inference tasks. Each context maintains its own inputs, outputs, and state.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Model Building Offline&amp;quot;
A[Original Model&amp;lt;br&amp;gt;TensorFlow/PyTorch] --&amp;gt; B{ONNX Parser};
B --&amp;gt; C[Network Definition];
C --&amp;gt; D[Builder];
D -- Optimization Config --&amp;gt; E[Optimized Plan];
E --&amp;gt; F((Engine));
end
subgraph &amp;quot;Inference Deployment Online&amp;quot;
F --&amp;gt; G[Execution Context];
H[Input Data] --&amp;gt; G;
G --&amp;gt; I[Output Results];
end
style F fill:#f9f,stroke:#333,stroke-width:2px
style G fill:#ccf,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;h2 id="3-key-features-and-optimization-techniques">3. Key Features and Optimization Techniques&lt;/h2>
&lt;p>TensorRT's high performance stems from its advanced optimization techniques.&lt;/p>
&lt;h3 id="31-precision-calibration--quantization">3.1. Precision Calibration &amp;amp; Quantization&lt;/h3>
&lt;p>TensorRT supports multiple precisions for inference, including FP32, FP16, INT8, and the latest FP8. Among these, INT8 quantization is a key technology for improving performance and reducing memory usage.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Post-Training Quantization (PTQ)&lt;/strong>: Determines the scaling factors needed to convert FP32 weights and activation values to INT8 through a calibration dataset, without retraining the model.&lt;/li>
&lt;li>&lt;strong>Quantization-Aware Training (QAT)&lt;/strong>: Simulates quantization operations during training, making the model more robust to quantization errors, thus achieving higher accuracy when converted to INT8.&lt;/li>
&lt;/ul>
&lt;p>You can use &lt;code>QuantizationSpec&lt;/code> to precisely control which layers or types of layers need to be quantized.&lt;/p>
&lt;pre>&lt;code class="language-python"># Example: Only quantize 'Conv2D' type layers
q_spec = QuantizationSpec()
q_spec.add(name='Conv2D', is_keras_class=True)
q_model = quantize_model(model, quantization_mode='partial', quantization_spec=q_spec)
&lt;/code>&lt;/pre>
&lt;h3 id="32-layer--tensor-fusion">3.2. Layer &amp;amp; Tensor Fusion&lt;/h3>
&lt;p>TensorRT intelligently merges multiple independent layers into a single, more complex layer. This reduces the number of CUDA kernel launches and memory reads/writes, significantly lowering latency.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Vertical Fusion&lt;/strong>: Merges consecutive layers with the same data dependencies (such as Conv, Bias, ReLU) into a single CBR layer.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Before Fusion&amp;quot;
A[Input] --&amp;gt; B(Conv);
B --&amp;gt; C(Bias);
C --&amp;gt; D(ReLU);
D --&amp;gt; E[Output];
end
subgraph &amp;quot;After Fusion&amp;quot;
A2[Input] --&amp;gt; F((Conv + Bias + ReLU));
F --&amp;gt; E2[Output];
end
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Horizontal Fusion&lt;/strong>: Merges parallel layers that have the same input but perform different operations.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Before Fusion&amp;quot;
A[Input] --&amp;gt; B(Conv A);
A --&amp;gt; C(Conv B);
B --&amp;gt; D[Output A];
C --&amp;gt; E[Output B];
end
subgraph &amp;quot;After Fusion&amp;quot;
A2[Input] --&amp;gt; F((Conv A + Conv B));
F --&amp;gt; D2[Output A];
F --&amp;gt; E2[Output B];
end
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h3 id="33-kernel-autotuning">3.3. Kernel Auto-Tuning&lt;/h3>
&lt;p>For specific target GPU architectures, TensorRT selects the optimal CUDA kernel for each layer from a library containing multiple implementations. It tests different algorithms and implementations based on the current batch size, input dimensions, and parameters to find the fastest one.&lt;/p>
&lt;h3 id="34-dynamic-shapes">3.4. Dynamic Shapes&lt;/h3>
&lt;p>TensorRT can handle models with input tensor dimensions that vary at runtime. When building an Engine, you can specify an optimization profile that includes minimum, optimal, and maximum dimensions for inputs. TensorRT will generate an Engine that can efficiently handle any input dimensions within the specified range.&lt;/p>
&lt;h3 id="35-plugins">3.5. Plugins&lt;/h3>
&lt;p>For custom or special layers not natively supported by TensorRT, you can implement your own logic through the plugin API (&lt;code>IPluginV2&lt;/code>). This provides great extensibility for TensorRT.&lt;/p>
&lt;p>The latest versions of TensorRT have greatly simplified the plugin registration process through decorators, especially for the Python API.&lt;/p>
&lt;pre>&lt;code class="language-python"># Example: Register a simple element-wise addition plugin
import tensorrt.plugin as trtp
@trtp.register(&amp;quot;sample::elemwise_add_plugin&amp;quot;)
def add_plugin_desc(inp0: trtp.TensorDesc, block_size: int) -&amp;gt; trtp.TensorDesc:
return inp0.like()
&lt;/code>&lt;/pre>
&lt;h3 id="36-sparsity">3.6. Sparsity&lt;/h3>
&lt;p>TensorRT supports leveraging structured sparsity features on NVIDIA Ampere and higher architecture GPUs. If your model weights have a 2:4 sparsity pattern, TensorRT can utilize sparse tensor cores to further accelerate computation, nearly doubling performance.&lt;/p>
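&lt;p>Concretely, the 2:4 pattern means that in every aligned group of four consecutive weights, at most two are nonzero. A pure-Python check of this property, as a sketch (real workflows would use NVIDIA's pruning tools on the framework side):&lt;/p>

```python
def is_2_4_sparse(weights):
    """Check the 2:4 structured-sparsity pattern: in every aligned group of
    four consecutive weights, at most two may be nonzero."""
    if len(weights) % 4 != 0:
        return False  # weights must tile evenly into groups of four
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        nonzeros = sum(1 for w in group if w != 0)
        if nonzeros > 2:
            return False
    return True

print(is_2_4_sparse([0.5, 0.0, -1.2, 0.0, 0.0, 0.3, 0.0, 0.7]))  # True
print(is_2_4_sparse([0.5, 0.1, -1.2, 0.0]))                      # False: three nonzeros in one group
```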
&lt;h2 id="4-workflow">4. Workflow&lt;/h2>
&lt;p>A typical TensorRT deployment workflow is as follows:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant D as Developer
participant TF as TensorFlow/PyTorch
participant ONNX
participant Poly as Polygraphy
participant TRT as TensorRT (trtexec/API)
participant App as Application
D-&amp;gt;&amp;gt;TF: Train Model
TF--&amp;gt;&amp;gt;D: Generate Trained Model
D-&amp;gt;&amp;gt;ONNX: Export to ONNX Format
ONNX--&amp;gt;&amp;gt;D: .onnx File
D-&amp;gt;&amp;gt;Poly: Use Polygraphy to Check and Optimize
Poly--&amp;gt;&amp;gt;D: Optimized .onnx File
D-&amp;gt;&amp;gt;TRT: Build Engine (FP16/INT8)
TRT--&amp;gt;&amp;gt;D: Generate .engine File
D-&amp;gt;&amp;gt;App: Deploy Engine
App-&amp;gt;&amp;gt;App: Load Engine and Create Execution Context
loop Inference Loop
App-&amp;gt;&amp;gt;App: Prepare Input Data
App-&amp;gt;&amp;gt;App: Execute Inference
App-&amp;gt;&amp;gt;App: Get Output Results
end
&lt;/code>&lt;/pre>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Model Export&lt;/strong>: Export your trained model from your training framework (such as PyTorch or TensorFlow) to ONNX format. ONNX is an open model exchange format that serves as a bridge between training and inference.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Model Inspection and Optimization (Polygraphy)&lt;/strong>: Before building an Engine, it is strongly recommended to use the &lt;strong>Polygraphy&lt;/strong> toolkit to inspect, modify, and optimize your ONNX model. Polygraphy is a powerful tool that can:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Inspect Models&lt;/strong>: Display information about the model's layers, inputs, outputs, etc.&lt;/li>
&lt;li>&lt;strong>Constant Folding&lt;/strong>: Pre-compute constant expressions in the model, simplifying the computation graph.
&lt;pre>&lt;code class="language-bash">polygraphy surgeon sanitize model.onnx -o folded.onnx --fold-constants
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Compare Outputs from Different Frameworks&lt;/strong>: Verify that TensorRT's output is consistent with the original framework (such as ONNX Runtime) to troubleshoot precision issues.
&lt;pre>&lt;code class="language-bash">polygraphy run model.onnx --trt --onnxrt
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Handle Data-Dependent Shapes (DDS)&lt;/strong>: Identify and set upper bounds for tensors with data-dependent shapes.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Build Engine&lt;/strong>: Use the &lt;code>trtexec&lt;/code> command-line tool or TensorRT's C++/Python API to build an Engine.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>trtexec&lt;/code>&lt;/strong>: A convenient command-line tool for quickly building an Engine from an ONNX file and conducting performance benchmarking.
&lt;pre>&lt;code class="language-bash">trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>API&lt;/strong>: Provides more flexible control, such as defining optimization profiles for dynamic shapes, configuring plugins, etc.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Deployment and Inference&lt;/strong>: Load the serialized Engine file into your application and use an Execution Context to perform inference.&lt;/p>
&lt;pre>&lt;code class="language-python"># Using Polygraphy's TrtRunner for inference
from polygraphy.backend.trt import TrtRunner, EngineFromBytes
# Load Engine
engine = EngineFromBytes(open(&amp;quot;model.engine&amp;quot;, &amp;quot;rb&amp;quot;).read())
with TrtRunner(engine) as runner:
# Prepare input data
feed_dict = {&amp;quot;input_name&amp;quot;: input_data}
# Execute inference
outputs = runner.infer(feed_dict=feed_dict)
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ol>
&lt;h2 id="5-latest-feature-highlights">5. Latest Feature Highlights&lt;/h2>
&lt;p>TensorRT is rapidly iterating, and here are some of the latest important features:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Polygraphy Tool Enhancements&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Simplified CLI Syntax&lt;/strong>: Allows specifying both script and function name in a single parameter (&lt;code>my_script.py:my_func&lt;/code>).&lt;/li>
&lt;li>&lt;strong>Improved Input Specification&lt;/strong>: Uses a new list-style syntax (&lt;code>--input-shapes input0:[x,y,z]&lt;/code>) to avoid ambiguity.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Quickly Deployable Plugins&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>The Python API has introduced the &lt;code>@trtp.register&lt;/code> and &lt;code>@trt.plugin.autotune&lt;/code> decorators, making it unprecedentedly simple to define, register, and auto-tune plugins without writing C++ code.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>CUDA Graphs&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Through the &lt;code>--use-cuda-graph&lt;/code> flag, TensorRT can leverage CUDA Graphs to capture the entire inference process, further reducing CPU overhead and kernel launch latency, particularly suitable for scenarios with fixed model structures.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>FP8 Support&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>On Hopper and higher architecture GPUs, TensorRT supports FP8 inference, providing higher performance and lower memory usage for large language models and other applications.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="6-appendix-common-commands">6. Appendix: Common Commands&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Install Polygraphy&lt;/strong>:
&lt;pre>&lt;code class="language-bash">python3 -m pip install polygraphy --extra-index-url https://pypi.ngc.nvidia.com
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Build and Install TensorRT Open Source Components&lt;/strong>:
&lt;pre>&lt;code class="language-bash"># From source directory
make install
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Run pytest Tests&lt;/strong>:
&lt;pre>&lt;code class="language-bash">pytest --verbose
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h2 id="7-tensorrtllm-born-for-large-language-model-inference">7. TensorRT-LLM: Born for Large Language Model Inference&lt;/h2>
&lt;p>As the scale and complexity of large language models (LLMs) grow exponentially, traditional inference optimization methods face unprecedented challenges. To address these challenges, NVIDIA has introduced TensorRT-LLM, an open-source library specifically designed to accelerate and optimize LLM inference. It is built on top of TensorRT and encapsulates a series of cutting-edge optimization techniques for LLMs.&lt;/p>
&lt;h3 id="71-what-is-tensorrtllm">7.1. What is TensorRT-LLM?&lt;/h3>
&lt;p>TensorRT-LLM can be thought of as an &amp;ldquo;LLM expert version&amp;rdquo; of TensorRT. It provides a Python API that allows developers to easily define LLM models and automatically apply various state-of-the-art optimizations. Ultimately, it generates a high-performance TensorRT engine that can be directly deployed.&lt;/p>
&lt;p>Unlike general TensorRT which mainly handles static graphs, TensorRT-LLM specifically addresses the dynamic characteristics in LLM inference, such as:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Autoregressive Generation&lt;/strong>: Each newly generated token depends on the previous tokens, resulting in dynamically changing input sequence lengths.&lt;/li>
&lt;li>&lt;strong>Enormous Model Scale&lt;/strong>: Model parameters often number in the billions or even hundreds of billions, making it impossible to deploy on a single GPU.&lt;/li>
&lt;li>&lt;strong>Massive KV Cache&lt;/strong>: The inference process requires storing a large number of key-value pairs (Key-Value Cache), placing extremely high demands on memory bandwidth and capacity.&lt;/li>
&lt;/ul>
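To get a feel for the last point, the KV cache footprint can be estimated directly from the model shapes. The back-of-the-envelope sketch below uses illustrative, roughly Llama-2-7B-like dimensions (32 layers, 32 KV heads, head dimension 128); the formula, not the exact numbers, is the point:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Factor of 2 accounts for keys AND values; bytes_per_elem=2 assumes FP16.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative Llama-2-7B-like shapes.
gib = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8) / 2**30
print(gib)  # 16.0 GiB of cache alone at batch 8, sequence length 4096
```

At longer sequences or larger batches this quickly dominates GPU memory, which is what motivates the Paged KV Cache design described below in 7.3.2.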
&lt;h3 id="72-core-architecture-and-components">7.2. Core Architecture and Components&lt;/h3>
&lt;p>TensorRT-LLM's architecture is divided into frontend and backend:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Python API (&lt;code>tensorrt_llm&lt;/code>)&lt;/strong>: This is the main interface for user interaction. It defines models in a declarative way (similar to PyTorch), allowing developers to avoid dealing with the complex underlying TensorRT C++ API.&lt;/li>
&lt;li>&lt;strong>C++ Backend&lt;/strong>: This is the core that actually performs the optimization, containing pre-written, highly optimized CUDA kernels, LLM-specific optimization passes, and a runtime that can efficiently handle LLM tasks.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Frontend (Python API)&amp;quot;
A[Hugging Face / Custom Model] --&amp;gt;|Weights| B(Model Definition&amp;lt;br&amp;gt;tensorrt_llm.Module);
B --&amp;gt; C{Builder};
C -- Generate Network and Config --&amp;gt; D[Network Definition];
end
subgraph &amp;quot;Backend (C++ Runtime)&amp;quot;
D --&amp;gt; E[TensorRT-LLM Optimization];
E --&amp;gt; F((LLM Optimized Engine));
end
subgraph &amp;quot;Inference&amp;quot;
F --&amp;gt; G[C++/Python Runtime];
H[Input Prompts] --&amp;gt; G;
G --&amp;gt; I[Output Tokens];
end
style F fill:#c9f,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;h3 id="73-key-optimization-techniques-llmspecific">7.3. Key Optimization Techniques (LLM-Specific)&lt;/h3>
&lt;p>The magic of TensorRT-LLM lies in its optimization techniques specifically designed for LLMs.&lt;/p>
&lt;h4 id="731-inflight-batching-also-known-as-continuous-batching">7.3.1. In-Flight Batching (also known as Continuous Batching)&lt;/h4>
&lt;p>&lt;strong>Problem&lt;/strong>: Traditional static batching requires all requests to wait until a batch is formed before processing them together. Due to the varying generation lengths of each request, this leads to significant GPU idle time (&amp;ldquo;bubbles&amp;rdquo;), as the batch must wait for the slowest request to complete.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: In-Flight Batching allows the server to dynamically add new requests while the GPU is running. Once a request completes, its computational resources are immediately released and allocated to new requests in the waiting queue. This greatly improves GPU utilization and overall system throughput.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">gantt
title GPU Utilization Comparison
dateFormat X
axisFormat %S
section Static Batching
Request A: 0, 6
Request B: 0, 3
Request C: 0, 5
GPU Waiting : 3, 3
GPU Waiting : 5, 1
section In-Flight Batching
Request A : 0, 6
Request B : 0, 3
Request C : 0, 5
New Request D : 3, 4
&lt;/code>&lt;/pre>
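The throughput difference can be illustrated with a toy scheduler model. This is a deliberate simplification (requests are reduced to fixed decode-step counts, and prefill cost is ignored), not how the real TensorRT-LLM batch manager is implemented:

```python
import heapq

def finish_time(durations, slots, in_flight):
    """Toy makespan model: `slots` parallel slots serve requests with the given
    step counts. Static batching admits fixed-size batches and waits for the
    slowest member; in-flight batching refills a slot the moment it frees up."""
    if not in_flight:
        t, queue = 0, list(durations)
        while queue:
            batch, queue = queue[:slots], queue[slots:]
            t += max(batch)  # the whole batch waits for its slowest request
        return t
    t, queue, running = 0, list(durations), []
    while queue or running:
        while queue and len(running) < slots:
            heapq.heappush(running, t + queue.pop(0))  # admit a waiting request now
        t = heapq.heappop(running)                     # advance to the next freed slot
    return t

# Requests A, B, C, D need 6, 3, 5 and 4 decode steps; 3 slots are available.
print(finish_time([6, 3, 5, 4], 3, in_flight=False))  # 10 steps
print(finish_time([6, 3, 5, 4], 3, in_flight=True))   # 7 steps: D reuses B's slot at step 3
```

Even in this tiny example, filling the "bubble" left by request B cuts total time from 10 steps to 7.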
&lt;h4 id="732-paged-kv-cache--attention">7.3.2. Paged KV Cache &amp;amp; Attention&lt;/h4>
&lt;p>&lt;strong>Problem&lt;/strong>: In the autoregressive generation process, the KV cache grows linearly with sequence length, consuming large amounts of GPU memory. The traditional approach is to pre-allocate a continuous memory block for each request that can accommodate the maximum sequence length, leading to severe memory fragmentation and waste.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: Inspired by operating system virtual memory paging, TensorRT-LLM introduced Paged KV Cache. It divides the KV cache into fixed-size &amp;ldquo;blocks&amp;rdquo; and allocates them as needed.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Non-contiguous Storage&lt;/strong>: KV caches for logically continuous tokens can be stored in physically non-contiguous blocks.&lt;/li>
&lt;li>&lt;strong>Memory Sharing&lt;/strong>: For complex scenarios (such as parallel sampling, Beam Search), different sequences can share the same KV cache blocks (e.g., sharing the cache for the prompt portion), significantly saving memory.&lt;/li>
&lt;li>&lt;strong>Optimized Attention Kernels&lt;/strong>: TensorRT-LLM uses specially optimized Attention kernels such as FlashAttention and MQA/GQA that can directly operate on these non-contiguous cache blocks, avoiding data copy overhead.&lt;/li>
&lt;/ul>
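The block-table bookkeeping behind Paged KV Cache can be sketched in a few lines. This toy allocator (names and sizes are made up for illustration; copy-on-write and eviction are omitted) shows on-demand, non-contiguous allocation and block sharing between sequences:

```python
class PagedKVCache:
    """Toy paged KV-cache allocator: KV blocks are grabbed on demand, may be
    physically non-contiguous, and can be shared between sequences."""
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}   # seq_id -> block table (logical position -> physical block)
        self.lengths = {}  # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # current block is full: allocate a new one
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def fork(self, src, dst):
        # Share src's blocks instead of copying them (copy-on-write omitted).
        self.tables[dst] = list(self.tables[src])
        self.lengths[dst] = self.lengths[src]

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(9):                # decode 9 tokens for one sequence
    cache.append_token("seq_a")   # only ceil(9/4) = 3 blocks actually allocated
cache.fork("seq_a", "seq_b")      # e.g. beam search sharing the prompt cache
```

Note that forking costs no new blocks: both sequences point at the same physical pages until one of them diverges.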
&lt;h4 id="733-tensor--pipeline-parallelism">7.3.3. Tensor &amp;amp; Pipeline Parallelism&lt;/h4>
&lt;p>For large models that cannot fit on a single GPU, TensorRT-LLM has built-in seamless support for tensor parallelism and pipeline parallelism. Developers only need to specify the parallelism degree (&lt;code>tp_size&lt;/code>, &lt;code>pp_size&lt;/code>) during building, and TensorRT-LLM will automatically handle model splitting and cross-GPU communication.&lt;/p>
&lt;pre>&lt;code class="language-bash"># Example: Build a Llama model with 2-way tensor parallelism
python3 examples/llama/convert_checkpoint.py \
--model_dir ./llama-7b-hf \
--output_dir ./tllm_checkpoint_tp2 \
--dtype float16 \
--tp_size 2
&lt;/code>&lt;/pre>
&lt;h4 id="734-advanced-quantization-support-fp8int4int8">7.3.4. Advanced Quantization Support (FP8/INT4/INT8)&lt;/h4>
&lt;p>The enormous parameter count of LLMs makes them ideal candidates for quantization. TensorRT-LLM supports various advanced quantization schemes:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>FP8&lt;/strong>: On NVIDIA Hopper and higher architecture GPUs, FP8 provides precision close to FP16 while significantly improving performance and reducing memory usage.&lt;/li>
&lt;li>&lt;strong>INT8 SmoothQuant&lt;/strong>: A technique that quantizes both activations and weights, achieving INT8 acceleration while maintaining high precision.&lt;/li>
&lt;li>&lt;strong>INT4/INT8 Weight-Only Quantization (W4A16/W8A16)&lt;/strong>: This is a very popular technique that only quantizes model weights (the largest part of parameters) to INT4 or INT8, while keeping activations in FP16. This greatly reduces memory usage with minimal impact on accuracy.&lt;/li>
&lt;/ul>
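The core idea of weight-only quantization can be seen in a toy symmetric INT4 round-trip. This uses per-tensor scaling for brevity; real implementations quantize per-channel or per-group and run the dequantize-and-multiply in fused GPU kernels:

```python
def quantize_int4(weights):
    # Symmetric per-tensor scheme: map weights onto integers in [-7, 7].
    scale = max(abs(w) for w in weights) / 7.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    # At inference time the INT4 weights are expanded back to floats (A16).
    return [x * scale for x in q]

w = [0.7, -0.35, 0.1, -0.05]
q, s = quantize_int4(w)
w_hat = dequantize(q, s)
# Round-to-nearest guarantees each reconstructed weight is within half a
# quantization step (s / 2) of the original.
err = max(abs(a - b) for a, b in zip(w, w_hat))
```

Storage drops from 16 bits to 4 bits per weight, which is why W4A16 is so attractive for memory-bound LLM decoding.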
&lt;pre>&lt;code class="language-bash"># Example: Build a model with INT4 weight-only quantization
python convert_checkpoint.py --model_dir ./gpt-j-6b \
--dtype float16 \
--use_weight_only \
--weight_only_precision int4 \
--output_dir ./trt_ckpt/gptj_int4wo_tp1/
&lt;/code>&lt;/pre>
&lt;h3 id="74-tensorrtllm-workflow">7.4. TensorRT-LLM Workflow&lt;/h3>
&lt;p>A typical TensorRT-LLM workflow is as follows:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant D as Developer
participant HF as Hugging Face Hub
participant Conv as convert_checkpoint.py
participant Build as trtllm-build
participant App as Inference Application (Python/C++)
D-&amp;gt;&amp;gt;HF: Download Model Weights
HF--&amp;gt;&amp;gt;D: model_dir
D-&amp;gt;&amp;gt;Conv: Run Conversion Script (Specify Precision, Parallelism, etc.)
Conv--&amp;gt;&amp;gt;D: Generate TensorRT-LLM Checkpoint
D-&amp;gt;&amp;gt;Build: Run Build Command (Specify Plugins, BatchSize, etc.)
Build--&amp;gt;&amp;gt;D: Generate Optimized .engine File
D-&amp;gt;&amp;gt;App: Load Engine and Run Inference
App--&amp;gt;&amp;gt;D: Return Generation Results
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>End-to-End Example (Using Llama-7B)&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Convert Weights&lt;/strong>:
&lt;pre>&lt;code class="language-bash">git clone https://huggingface.co/meta-llama/Llama-2-7b-hf
python3 examples/llama/convert_checkpoint.py \
--model_dir ./Llama-2-7b-hf \
--output_dir ./tllm_checkpoint_1gpu \
--dtype float16
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Build Engine&lt;/strong>:
&lt;pre>&lt;code class="language-bash">trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu \
--output_dir ./trt_engines/llama_7b \
--gpt_attention_plugin float16 \
--gemm_plugin float16
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Run Inference&lt;/strong>:
&lt;pre>&lt;code class="language-bash">python3 examples/run.py --max_output_len=100 \
--tokenizer_dir ./Llama-2-7b-hf \
--engine_dir=./trt_engines/llama_7b
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ol>
&lt;h3 id="75-convenient-highlevel-api-llm">7.5. Convenient High-Level API (&lt;code>LLM&lt;/code>)&lt;/h3>
&lt;p>To further simplify the development process, TensorRT-LLM provides a high-level API called &lt;code>LLM&lt;/code>. This interface encapsulates model loading, building, saving, and inference into a simple class, allowing developers to complete all operations in just a few lines of code.&lt;/p>
&lt;pre>&lt;code class="language-python">from tensorrt_llm import LLM
# 1. Initialize LLM object, if the engine doesn't exist, it will automatically build from HuggingFace model
# All optimizations like In-Flight Batching, Paged KV-Cache will be applied here
llm = LLM(
model=&amp;quot;meta-llama/Llama-2-7b-hf&amp;quot;,
tensor_parallel_size=1,
)
# 2. (Optional) Save the built engine for later use
llm.save(&amp;quot;llama_engine_dir&amp;quot;)
# 3. Run inference
prompt = &amp;quot;NVIDIA TensorRT-LLM is&amp;quot;
for output in llm.generate([prompt], max_new_tokens=50):
print(output)
&lt;/code>&lt;/pre>
&lt;p>This high-level API is ideal for rapid prototyping and deployment.&lt;/p>
&lt;h3 id="76-conclusion">7.6. Conclusion&lt;/h3>
&lt;p>TensorRT-LLM is not simply applying TensorRT to LLMs, but a comprehensive solution fundamentally redesigned for LLM inference, containing multiple state-of-the-art optimizations. Through In-Flight Batching, Paged KV-Cache, native parallel support, and advanced quantization schemes, it can maximize the hardware performance of NVIDIA GPUs, providing a solid foundation for deploying high-performance, high-throughput LLM services.&lt;/p></description></item><item><title>RAG Data Augmentation Techniques: Key Methods for Bridging the Semantic Gap</title><link>https://ziyanglin.netlify.app/en/post/rag-data-augmentation/</link><pubDate>Sat, 28 Jun 2025 16:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/rag-data-augmentation/</guid><description>&lt;h2 id="1-introduction-why-rag-needs-data-augmentation">1. Introduction: Why RAG Needs Data Augmentation?&lt;/h2>
&lt;h3 id="11-understanding-the-semantic-gap">1.1 Understanding the &amp;ldquo;Semantic Gap&amp;rdquo;&lt;/h3>
&lt;p>The core of Retrieval-Augmented Generation (RAG) lies in the &amp;ldquo;retrieval&amp;rdquo; component. However, in practical applications, the retrieval step often becomes the bottleneck of the entire system. The root cause is the &lt;strong>&amp;ldquo;Semantic Gap&amp;rdquo;&lt;/strong> or &lt;strong>&amp;ldquo;Retrieval Mismatch&amp;rdquo;&lt;/strong>.&lt;/p>
&lt;p>Specifically, this problem manifests in:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Diversity and Uncertainty of User Queries&lt;/strong>: Users ask questions in countless ways, potentially using colloquial language, abbreviations, typos, or describing the same issue from different angles.&lt;/li>
&lt;li>&lt;strong>Fixed and Formal Nature of Knowledge Base Documents&lt;/strong>: Documents in knowledge bases are typically structured and formal, with relatively fixed terminology.&lt;/li>
&lt;/ul>
&lt;p>This leads to a situation where the user's query vector and the document chunk vectors in the knowledge base may be far apart in vector space, even when they are semantically related.&lt;/p>
&lt;p>&lt;strong>For example:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Knowledge Base Document&lt;/strong>: &lt;code># ThinkPad X1 Carbon Cooling Guide\n\nIf your ThinkPad X1 Carbon is experiencing overheating issues, you can try cleaning the fan, updating the BIOS, or selecting balanced mode in power management...&lt;/code>&lt;/li>
&lt;li>&lt;strong>Possible User Queries&lt;/strong>:
&lt;ul>
&lt;li>&amp;ldquo;My laptop is too hot, what should I do?&amp;rdquo;&lt;/li>
&lt;li>&amp;ldquo;Is my Lenovo laptop fan noise due to overheating?&amp;rdquo; (Even though the brand doesn't exactly match, the issue is essentially similar)&lt;/li>
&lt;li>&amp;ldquo;Computer gets very hot, games are lagging&amp;rdquo;&lt;/li>
&lt;li>&amp;ldquo;How can I cool down my ThinkPad?&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>In a standard RAG workflow, these queries might fail to accurately retrieve the cooling guide mentioned above because their literal expressions and vector representations are too different.&lt;/p>
&lt;h3 id="12-standard-rag-workflow">1.2 Standard RAG Workflow&lt;/h3>
&lt;p>To better understand the problem, let's first look at the standard RAG workflow.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[User Input Query] --&amp;gt; B{Encoder};
B --&amp;gt; C[Query Vector];
C --&amp;gt; D{Vector Database};
E[Knowledge Base Documents] --&amp;gt; F{Encoder};
F --&amp;gt; G[Document Chunk Vectors];
G --&amp;gt; D;
D -- Vector Similarity Search --&amp;gt; H[Top-K Relevant Document Chunks];
A --&amp;gt; I((LLM));
H --&amp;gt; I;
I --&amp;gt; J[Generate Final Answer];
style A fill:#f9f,stroke:#333,stroke-width:2px
style J fill:#ccf,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;p>&lt;em>Figure 1: Standard RAG System Workflow&lt;/em>&lt;/p>
&lt;p>As shown above, the entire retrieval process heavily relies on the similarity between the &lt;code>Query Vector&lt;/code> and &lt;code>Chunk Vectors&lt;/code>. If there is a &amp;ldquo;semantic gap&amp;rdquo; between them, the retrieval effectiveness will be significantly reduced.&lt;/p>
&lt;p>The core objective of &lt;strong>Data Augmentation/Generalization&lt;/strong> is to proactively generate a large number of potential, semantically equivalent but expressively diverse &amp;ldquo;virtual queries&amp;rdquo; or &amp;ldquo;equivalent descriptions&amp;rdquo; for each document chunk in the knowledge base, thereby preemptively bridging this gap on the knowledge base side.&lt;/p>
&lt;h2 id="2-llmbased-data-augmentationgeneralization-techniques-deep-dive-into-details">2. LLM-Based Data Augmentation/Generalization Techniques: Deep Dive into Details&lt;/h2>
&lt;p>Leveraging the powerful language understanding and generation capabilities of Large Language Models (LLMs) is the most efficient and mainstream approach to data augmentation/generalization. The core idea is: &lt;strong>Let the LLM play the role of users and generate various possible questions and expressions for each knowledge chunk.&lt;/strong>&lt;/p>
&lt;p>There are two main technical implementation paths: &lt;strong>Hypothetical Questions Generation&lt;/strong> and &lt;strong>Summarization &amp;amp; Paraphrasing&lt;/strong>.&lt;/p>
&lt;h3 id="21-technical-path-one-hypothetical-questions-generation">2.1 Technical Path One: Hypothetical Questions Generation&lt;/h3>
&lt;p>This is the most direct and effective method. For each document chunk in the knowledge base, we have the LLM generate a set of questions that can be answered by this document chunk.&lt;/p>
&lt;h4 id="technical-implementation-details">Technical Implementation Details:&lt;/h4>
&lt;ol>
&lt;li>&lt;strong>Document Chunking&lt;/strong>: First, split the original document into meaningful, appropriately sized knowledge chunks. This is the foundation of all RAG systems.&lt;/li>
&lt;li>&lt;strong>Generate Questions for Each Chunk&lt;/strong>:
&lt;ul>
&lt;li>Iterate through each chunk.&lt;/li>
&lt;li>Feed the content of the chunk as context to an LLM.&lt;/li>
&lt;li>Use a carefully designed prompt (see Chapter 3) to instruct the LLM to generate N questions closely related to the chunk's content.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Data Organization and Indexing&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Key Step&lt;/strong>: Associate the N generated questions with the original chunk. One approach is not to index each question in isolation, but to process each generated &amp;ldquo;question-original text pair&amp;rdquo;: concatenate the question with the original text before vectorizing, or attach the question as metadata to the original chunk's vector during indexing.&lt;/li>
&lt;li>A more common practice is to store &lt;strong>both the vectors of the generated questions&lt;/strong> and &lt;strong>the vector of the original chunk&lt;/strong> in the vector database, all pointing to the same original chunk ID. This way, when a user queries, whether they match the original chunk or one of the generated questions, they can ultimately locate the correct original text.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Store in Vector Database&lt;/strong>: Store the processed data (original chunk vectors, generated question vectors) and their metadata (such as original ID) in a vector database (like ChromaDB, Milvus, Qdrant, etc.).&lt;/li>
&lt;/ol>
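The indexing scheme in steps 3 and 4 can be sketched as follows. The letter-frequency "embedding" below is a deliberately crude stand-in for a real embedding model, and the chunk texts and questions are invented; the point is the bookkeeping, namely that question vectors and the chunk vector all resolve to the same chunk ID:

```python
def embed(text):
    # Crude letter-frequency "embedding", purely for illustration.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    norm = sum(x * x for x in vec) ** 0.5 or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

chunks = {
    "c1": "ThinkPad X1 Carbon cooling guide: clean the fan, update the BIOS...",
    "c2": "SIP INVITE message flow and VoIP call setup explained...",
}
generated = {  # hypothetical LLM-generated questions per chunk
    "c1": ["How can I cool down my ThinkPad?", "Laptop overheating, what should I do?"],
    "c2": ["How does a SIP call get established?"],
}

# Index BOTH the chunk vector and every question vector under the chunk's ID.
index = []
for cid, text in chunks.items():
    index.append((embed(text), cid))
    for q in generated[cid]:
        index.append((embed(q), cid))

def retrieve(query):
    qv = embed(query)
    return max(index, key=lambda entry: cosine(qv, entry[0]))[1]
```

Whether the user's query matches the original text or one of the generated questions, the lookup ends at the same chunk ID, so the LLM always receives the original passage as context.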
&lt;h4 id="workflow-diagram">Workflow Diagram:&lt;/h4>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;Offline Processing&amp;quot;
A[Original Document] --&amp;gt; B(Chunking);
B --&amp;gt; C{Iterate Each Chunk};
C --&amp;gt; D[LLM Generator];
D -- &amp;quot;Generate for Chunk n&amp;quot; --&amp;gt; E[Generated Multiple Questions];
Chunk_n --&amp;gt; F{Encoder};
F --&amp;gt; G[Vector of Chunk_n];
G -- &amp;quot;Points to Chunk_n ID&amp;quot; --&amp;gt; H((Vector Database));
E --&amp;gt; I{Encoder};
I --&amp;gt; J[Vectors of All Generated Questions];
J -- &amp;quot;All Point to Chunk_n ID&amp;quot; --&amp;gt; H;
subgraph &amp;quot;Original Knowledge&amp;quot;
direction LR
Chunk_n(Chunk n);
end
end
subgraph &amp;quot;Online Retrieval&amp;quot;
K[User Query] --&amp;gt; L{Encoder};
L --&amp;gt; M[Query Vector];
M --&amp;gt; H;
H -- &amp;quot;Vector Retrieval&amp;quot; --&amp;gt; N{Top-K Results};
N --&amp;gt; O[Get Original Chunk by ID];
end
style D fill:#c7f4c8,stroke:#333,stroke-width:2px;
style H fill:#f8d7da,stroke:#333,stroke-width:2px;
style E fill:#f9e79f,stroke:#333,stroke-width:2px;
&lt;/code>&lt;/pre>
&lt;p>&lt;em>Figure 2: Data-Augmented RAG Workflow with Hypothetical Questions Generation&lt;/em>&lt;/p>
&lt;p>This method greatly enriches the &amp;ldquo;retrievability&amp;rdquo; of each knowledge chunk, essentially creating multiple different &amp;ldquo;entry points&amp;rdquo; for each piece of knowledge.&lt;/p>
&lt;h3 id="22-technical-path-two-summarization--paraphrasing">2.2 Technical Path Two: Summarization &amp;amp; Paraphrasing&lt;/h3>
&lt;p>Besides generating questions, we can also generate summaries of knowledge chunks or rewrite them in different ways.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Summarization&lt;/strong>: For a relatively long knowledge chunk, an LLM can generate a concise core summary. This summary can serve as a &amp;ldquo;coarse-grained&amp;rdquo; retrieval entry point. When a user's query is relatively broad, it might more easily match with the summary.&lt;/li>
&lt;li>&lt;strong>Paraphrasing&lt;/strong>: Have the LLM rewrite the core content of the same knowledge chunk using different sentence structures and vocabulary. This also creates new vectors that are different from the original text vector but semantically consistent.&lt;/li>
&lt;/ul>
&lt;h4 id="technical-implementation-details1">Technical Implementation Details:&lt;/h4>
&lt;p>The implementation method is similar to hypothetical question generation, except that the prompt's goal changes from &amp;ldquo;generating questions&amp;rdquo; to &amp;ldquo;generating summaries&amp;rdquo; or &amp;ldquo;paraphrasing&amp;rdquo;. The generated data is similarly associated with the original chunk, and its vector is stored in the database.&lt;/p>
&lt;p>In practice, &lt;strong>hypothetical question generation is usually more popular than summarization/paraphrasing&lt;/strong> because it more directly simulates the user's &amp;ldquo;questioning&amp;rdquo; behavior, aligning better with the essence of the retrieval task.&lt;/p>
&lt;h2 id="3-prompt-engineering-for-data-generalization-an-excellent-example">3. Prompt Engineering for Data Generalization: An Excellent Example&lt;/h2>
&lt;p>The quality of the prompt directly determines the quality of the generated data. A good prompt should be like a precise scalpel, guiding the LLM to generate the data we want.&lt;/p>
&lt;p>Below is a well-considered prompt example designed for the &amp;ldquo;hypothetical questions generation&amp;rdquo; task:&lt;/p>
&lt;pre>&lt;code class="language-text">### Role and Goal
You are an advanced AI assistant tasked with generating a set of high-quality, diverse questions for a given knowledge text (Context). These questions should be fully answerable by the provided text. Your goal is to help build a smarter Q&amp;amp;A system that can find answers regardless of how users phrase their questions, as long as they relate to the text content.
### Instructions
Based on the `[Original Text]` provided below, please generate **5** different questions.
### Requirements
1. **Diversity**: The generated questions must differ in sentence structure, wording, and intent. Try to ask from different angles, for example:
* **How-to type**: How to operate...?
* **Why type**: Why does...happen?
* **What is type**: What does...mean?
* **Comparison type**: What's the difference between...and...?
* **What-if type**: What if...?
2. **Persona**: Imagine you are different types of users asking questions:
* A **Beginner** who knows nothing about this field.
* An **Expert** seeking in-depth technical details.
* A **Student** looking for answers for an assignment.
3. **Fully Answerable**: Ensure each generated question can be fully and only answered using information from the `[Original Text]`. Don't ask questions that require external knowledge.
4. **Language Style**: Questions should be natural, clear, and conform to conversational English.
### Output Format
Please output strictly in the following JSON format, without any additional explanations or text:
```json
{
&amp;quot;generated_questions&amp;quot;: [
{
&amp;quot;persona&amp;quot;: &amp;quot;beginner&amp;quot;,
&amp;quot;question&amp;quot;: &amp;quot;First question here&amp;quot;
},
{
&amp;quot;persona&amp;quot;: &amp;quot;expert&amp;quot;,
&amp;quot;question&amp;quot;: &amp;quot;Second question here&amp;quot;
},
{
&amp;quot;persona&amp;quot;: &amp;quot;student&amp;quot;,
&amp;quot;question&amp;quot;: &amp;quot;Third question here&amp;quot;
},
// ... more questions
]
}
&lt;/code>&lt;/pre>
&lt;h3 id="original-text">[Original Text]&lt;/h3>
&lt;p>{context_chunk}&lt;/p>
&lt;pre>&lt;code>
#### Prompt Design Analysis:
* **Role and Goal**: Gives the LLM a clear positioning, helping it understand the significance of the task, rather than just mechanically executing it.
* **Diversity Requirements**: This is the most critical part. It guides the LLM to think from different dimensions, avoiding generating a large number of homogeneous questions (e.g., simply turning statements into questions).
* **Persona Role-Playing**: This instruction greatly enriches the diversity of questions. A beginner's questions might be broader and more colloquial, while an expert's questions might be more specific and technical.
* **Fully Answerable**: This is an important constraint, ensuring the strong relevance of generated questions to the original text, avoiding introducing noise.
* **JSON Output Format**: Forced structured output makes the LLM's return results easily parsable and processable by programs, an essential element in automated workflows.
## 4. Effect Validation: How to Measure the Effectiveness of Data Augmentation?
Data augmentation is not a process that is &amp;quot;automatically good once done&amp;quot;; a scientific evaluation system must be established to verify its effectiveness. Evaluation should be conducted from two aspects: **recall rate** and **final answer quality**.
### 4.1 Retrieval Evaluation
This is the core metric for evaluating improvements in the retrieval component.
#### Steps:
1. **Build an Evaluation Dataset**: This is the most critical step. You need to create a test set containing `(question, corresponding correct original Chunk_ID)` pairs. The questions in this test set should be as diverse as possible, simulating real user queries.
2. **Conduct Two Tests**:
* **Experimental Group A (Without Data Augmentation)**: Use the standard RAG process to retrieve with questions from the test set, recording the Top-K Chunk IDs recalled.
* **Experimental Group B (With Data Augmentation)**: Use a knowledge base integrated with data augmentation, retrieve with the same questions, and record the Top-K Chunk IDs recalled.
3. **Calculate Evaluation Metrics**:
* **Recall@K**: What proportion of questions in the test set had their corresponding correct Chunk_ID appear in the top K of the recall results? This is the most important metric. `Recall@K = (Number of correctly recalled questions) / (Total number of questions)`.
* **Precision@K**: How many of the top K results recalled are correct? For a single question, if there is only one correct answer, then Precision@K is either 1/K or 0.
* **MRR (Mean Reciprocal Rank)**: The average of the reciprocal of the rank of the correct answer in the recall list. This metric not only cares about whether it was recalled but also how high it was ranked. The higher the ranking, the higher the score. `MRR = (1/N) * Σ(1 / rank_i)`, where `N` is the total number of questions, and `rank_i` is the rank of the correct answer for the i-th question.
By comparing the `Recall@K` and `MRR` metrics of experimental groups A and B, you can quantitatively determine whether data augmentation has improved recall performance.
### 4.2 Generation Quality Evaluation
Improved recall rate is a prerequisite, but it doesn't completely equate to improved user experience. We also need to evaluate the final answers generated by the RAG system end-to-end.
#### Method One: Human Evaluation
This is the most reliable but most costly method.
1. **Design Evaluation Dimensions**:
* **Relevance**: Does the generated answer get to the point and address the user's question?
* **Accuracy/Factuality**: Is the information in the answer accurate and based on the retrieved knowledge?
* **Fluency**: Is the language of the answer natural and smooth?
2. **Conduct Blind Evaluation**: Have evaluators score (e.g., 1-5 points) or compare (A is better/B is better/tie) two sets of answers without knowing which answer comes from which system (before/after enhancement).
3. **Statistical Analysis**: Determine whether data augmentation has a positive impact on the final answer quality through statistical scores or win rates.
#### Method Two: LLM-based Automatic Evaluation
This is a more efficient alternative, using a more powerful, advanced LLM (such as GPT-4o, Claude 3.5 Sonnet) as a &amp;quot;judge&amp;quot;.
1. **Design Evaluation Prompt**: Create a prompt asking the judge LLM to compare answers generated by different systems.
* **Input**: User question, retrieved context, System A's answer, System B's answer.
* **Instructions**: Ask the LLM to analyze from dimensions such as relevance and accuracy, determine which answer is better, and output scores and reasons in JSON format.
2. **Batch Execution and Analysis**: Run this evaluation process for all questions in the test set, then calculate win rates.
This method allows for large-scale, low-cost evaluation, making rapid iteration possible.
## 5. Conclusion and Future Outlook
**In summary, LLM-based data augmentation/generalization is a key technology for enhancing RAG system performance, especially for solving the &amp;quot;semantic gap&amp;quot; problem.** By pre-generating a large number of &amp;quot;virtual questions&amp;quot; or equivalent descriptions in the offline phase, it greatly enriches the retrievability of the knowledge base, making the system more adaptable to the diversity of user queries in the real world.
**Practical Considerations:**
* **Balance Between Cost and Quality**: Generating data incurs LLM API call costs and index storage costs. How many questions to generate per chunk should be decided based on budget and the expected performance gain.
* **Cleaning Generated Data**: LLM generation is not 100% perfect and may produce low-quality or irrelevant questions. Consider adding a validation step to filter out poor-quality data.
**Future Outlook:**
* **Combination with Rerankers**: Data augmentation aims to improve &amp;quot;recall,&amp;quot; while reranker models aim to optimize &amp;quot;ranking.&amp;quot; Combining the two—ensuring relevant content is recalled through data augmentation, then fine-ranking through reranker models—is the golden combination for RAG optimization.
* **Multimodal Data Augmentation**: With the development of multimodal large models, future RAG will process more than just text. How to perform data augmentation for image and audio/video knowledge (e.g., generating text questions about image content) will be an interesting research direction.
* **Adaptive Data Augmentation**: Future systems might automatically discover recall failure cases based on real user queries online, and perform targeted data augmentation for relevant knowledge chunks, forming a continuously optimizing closed loop.&lt;/code>&lt;/pre></description></item><item><title>SIP and VoIP Communication Technology: A Comprehensive Guide from Principles to Practice</title><link>https://ziyanglin.netlify.app/en/post/sip-voip-technical-analysis/</link><pubDate>Sat, 28 Jun 2025 14:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/sip-voip-technical-analysis/</guid><description>&lt;h2 id="1-introduction-the-world-of-voip-and-sip">1. Introduction: The World of VoIP and SIP&lt;/h2>
&lt;h3 id="11-what-is-voip">1.1 What is VoIP?&lt;/h3>
&lt;p>VoIP (Voice over Internet Protocol) is a revolutionary technology that transmits voice communications over IP networks. Essentially, it digitizes, compresses, and packages human voice (analog signals), transmits them through IP networks (like the internet), and then unpacks, decompresses, and converts them back to sound at the receiving end.&lt;/p>
&lt;p>&lt;strong>Core Concept&lt;/strong>: Treating voice as data, transmitting it over networks just like sending emails or browsing websites.&lt;/p>
&lt;p>This breaks the dependency on physical telephone lines that traditional telephone systems (PSTN - Public Switched Telephone Network) rely on, bringing tremendous flexibility and cost advantages.&lt;/p>
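&lt;p>The &amp;ldquo;voice as data&amp;rdquo; idea has concrete, well-known arithmetic behind it. The following sketch (the function is illustrative, not from any VoIP library) reproduces the classic bandwidth estimate for a G.711 stream with 20 ms packetization:&lt;/p>

```python
def voip_bandwidth_kbps(codec_bitrate_kbps, ptime_ms, overhead_bytes=40):
    """Estimate one-way IP-layer bandwidth for a VoIP stream.
    overhead_bytes: RTP(12) + UDP(8) + IPv4(20) headers per packet;
    link-layer overhead (e.g. Ethernet) is not included."""
    payload_bytes = codec_bitrate_kbps * 1000 / 8 * (ptime_ms / 1000.0)
    packets_per_sec = 1000.0 / ptime_ms
    total_bytes_per_sec = (payload_bytes + overhead_bytes) * packets_per_sec
    return total_bytes_per_sec * 8 / 1000.0

# G.711 (64 kbit/s) at 20 ms: 160-byte payload, 50 packets/s
print(voip_bandwidth_kbps(64, 20))  # 80.0 kbit/s at the IP layer
```

&lt;p>Note how the per-packet header overhead turns a 64 kbit/s codec into an 80 kbit/s IP stream; shorter packetization intervals lower latency but raise this overhead further.&lt;/p>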
&lt;h3 id="12-sip-the-traffic-director-of-voip">1.2 SIP: The &amp;ldquo;Traffic Director&amp;rdquo; of VoIP&lt;/h3>
&lt;p>If VoIP is a complete communication system, then SIP (Session Initiation Protocol) is its brain and traffic director.&lt;/p>
&lt;p>SIP itself doesn't transmit voice data. Its core responsibility is &lt;strong>signaling&lt;/strong>, handling the &lt;strong>creation (Setup), management, and termination (Teardown)&lt;/strong> of communication sessions.&lt;/p>
&lt;p>It can be understood this way:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>You want to call a friend&lt;/strong>: SIP is responsible for finding where your friend is (address resolution), telling their phone &amp;ldquo;someone's looking for you,&amp;rdquo; making their phone ring (session invitation).&lt;/li>
&lt;li>&lt;strong>Your friend answers the call&lt;/strong>: SIP confirms both parties are ready and the conversation can begin.&lt;/li>
&lt;li>&lt;strong>The call ends, you hang up&lt;/strong>: SIP notifies both parties that the call has ended and resources can be released.&lt;/li>
&lt;/ul>
&lt;p>SIP is an application layer protocol deeply influenced by HTTP and SMTP, using text format, easy to understand and extend. Due to its flexibility and powerful functionality, SIP has become the mainstream signaling protocol in modern VoIP systems.&lt;/p>
&lt;h3 id="13-voip-vs-pstn-a-communication-revolution">1.3 VoIP vs. PSTN: A Communication Revolution&lt;/h3>
&lt;p>To more intuitively understand the disruptive nature of VoIP, we can compare it with traditional PSTN.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Feature&lt;/th>
&lt;th align="left">PSTN (Traditional Telephone)&lt;/th>
&lt;th align="left">VoIP (Network Telephone)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">&lt;strong>Network Foundation&lt;/strong>&lt;/td>
&lt;td align="left">Dedicated, circuit-switched network&lt;/td>
&lt;td align="left">Common, packet-switched IP network&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Connection Method&lt;/strong>&lt;/td>
&lt;td align="left">Establishes a physical exclusive line before calling&lt;/td>
&lt;td align="left">Data packets are independently routed in the network, sharing bandwidth&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Core Principle&lt;/strong>&lt;/td>
&lt;td align="left">Circuit switching&lt;/td>
&lt;td align="left">Packet switching&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Functionality&lt;/strong>&lt;/td>
&lt;td align="left">Mainly limited to voice calls&lt;/td>
&lt;td align="left">Integrates voice, video, messaging, presence display, etc.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Cost&lt;/strong>&lt;/td>
&lt;td align="left">Depends on distance and call duration, expensive long-distance calls&lt;/td>
&lt;td align="left">Mainly depends on network bandwidth cost, no difference between long-distance and local calls&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Flexibility&lt;/strong>&lt;/td>
&lt;td align="left">Number bound to physical line&lt;/td>
&lt;td align="left">Number (address) bound to user, can be used anywhere with network access&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Phone A] -- Analog Signal --&amp;gt; B(PSTN Switch)
B -- Establish Physical Circuit --&amp;gt; C(PSTN Switch)
C -- Analog Signal --&amp;gt; D[Phone B]
subgraph Traditional PSTN Call
A &amp;amp; B &amp;amp; C &amp;amp; D
end
E[VoIP Terminal A] -- Digital Packets --&amp;gt; F{Internet / IP Network}
F -- Digital Packets --&amp;gt; G[VoIP Terminal B]
subgraph VoIP Call
E &amp;amp; F &amp;amp; G
end
&lt;/code>&lt;/pre>
&lt;p>In the following chapters, we will delve into the technology stack that makes up VoIP systems and analyze every detail of the SIP protocol.&lt;/p>
&lt;h2 id="2-voip-core-technology-stack-macro-perspective">2. VoIP Core Technology Stack (Macro Perspective)&lt;/h2>
&lt;p>From a macro perspective, VoIP is not a single technology but a complex yet orderly technological system composed of multiple protocols working together. Understanding its layered architecture is key to grasping the global view of VoIP.&lt;/p>
&lt;h3 id="21-layered-architecture">2.1 Layered Architecture&lt;/h3>
&lt;p>The VoIP technology stack can be roughly divided into four layers, each depending on the services provided by the layer below it.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;&amp;lt;b&amp;gt;Application Layer&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;SIP, SDP, Voice/Video Applications&amp;quot;]
B[&amp;quot;&amp;lt;b&amp;gt;Transport Layer&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;UDP, TCP, RTP, RTCP&amp;quot;]
C[&amp;quot;&amp;lt;b&amp;gt;Network Layer&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;IP&amp;quot;]
D[&amp;quot;&amp;lt;b&amp;gt;Data Link &amp;amp; Physical Layer&amp;lt;/b&amp;gt;&amp;lt;br/&amp;gt;Ethernet, Wi-Fi, 4G/5G&amp;quot;]
A --&amp;gt; B --&amp;gt; C --&amp;gt; D
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Application Layer&lt;/strong>: This is the layer closest to users.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Signaling Protocols&lt;/strong>: Such as &lt;strong>SIP&lt;/strong>, which we focus on, and its predecessor &lt;strong>H.323&lt;/strong>. They are responsible for control operations like &amp;ldquo;making calls&amp;rdquo; and &amp;ldquo;hanging up.&amp;rdquo;&lt;/li>
&lt;li>&lt;strong>Media Description Protocol&lt;/strong>: &lt;strong>SDP (Session Description Protocol)&lt;/strong> plays a crucial role. It doesn't transmit media but is used to describe media stream attributes in detail, such as: What codec to use (G.711, Opus)? What are the IP address and port? Is it audio or video? SDP content is typically &amp;ldquo;carried&amp;rdquo; in the body of SIP messages.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Transport Layer&lt;/strong>: Responsible for end-to-end data transmission.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>UDP (User Datagram Protocol)&lt;/strong>: Due to its real-time, low-overhead characteristics, it is the &lt;strong>preferred choice&lt;/strong> for transmitting VoIP media data (voice packets). It doesn't guarantee reliability, allowing packet loss, which is acceptable for real-time voice (losing a packet or two might just cause a momentary noise, while waiting for retransmission would cause serious delay and jitter). &lt;strong>RTP (Real-time Transport Protocol)&lt;/strong> is built on top of UDP.&lt;/li>
&lt;li>&lt;strong>TCP (Transmission Control Protocol)&lt;/strong>: For signaling messages (like SIP) that require absolute reliability, TCP is typically chosen. It ensures critical commands like &amp;ldquo;INVITE&amp;rdquo; or &amp;ldquo;BYE&amp;rdquo; are not lost. Of course, SIP can also run on UDP and ensure reliability through its own retransmission mechanism.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Network Layer&lt;/strong>: The core is &lt;strong>IP (Internet Protocol)&lt;/strong>, responsible for packet routing and addressing, ensuring data packets can travel from the source through complex networks to reach their destination.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Data Link &amp;amp; Physical Layer&lt;/strong>: This is the most fundamental infrastructure, including Ethernet, Wi-Fi, fiber optics, etc., responsible for transmitting data bit streams over physical media.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="22-key-protocols-overview">2.2 Key Protocols Overview&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Protocol&lt;/th>
&lt;th align="left">Full Name&lt;/th>
&lt;th align="left">Layer&lt;/th>
&lt;th align="left">Core Function&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">&lt;strong>SIP&lt;/strong>&lt;/td>
&lt;td align="left">Session Initiation Protocol&lt;/td>
&lt;td align="left">Application Layer&lt;/td>
&lt;td align="left">Establish, manage, and terminate multimedia sessions (signaling control).&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>SDP&lt;/strong>&lt;/td>
&lt;td align="left">Session Description Protocol&lt;/td>
&lt;td align="left">Application Layer&lt;/td>
&lt;td align="left">Describe media session parameters, such as IP address, port, codec, etc.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>RTP&lt;/strong>&lt;/td>
&lt;td align="left">Real-time Transport Protocol&lt;/td>
&lt;td align="left">Transport Layer&lt;/td>
&lt;td align="left">Carry real-time data (such as voice, video), provide timestamps and sequence numbers.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>RTCP&lt;/strong>&lt;/td>
&lt;td align="left">Real-time Transport Control Protocol&lt;/td>
&lt;td align="left">Transport Layer&lt;/td>
&lt;td align="left">Used in conjunction with RTP, providing Quality of Service (QoS) monitoring and feedback.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>UDP&lt;/strong>&lt;/td>
&lt;td align="left">User Datagram Protocol&lt;/td>
&lt;td align="left">Transport Layer&lt;/td>
&lt;td align="left">Provide low-latency, unreliable datagram transmission for RTP.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>TCP&lt;/strong>&lt;/td>
&lt;td align="left">Transmission Control Protocol&lt;/td>
&lt;td align="left">Transport Layer&lt;/td>
&lt;td align="left">Provide reliable, connection-oriented transmission for signaling like SIP.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>STUN/TURN/ICE&lt;/strong>&lt;/td>
&lt;td align="left">(See NAT chapter)&lt;/td>
&lt;td align="left">Application Layer&lt;/td>
&lt;td align="left">Used to solve connectivity issues brought by Network Address Translation (NAT).&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>SRTP&lt;/strong>&lt;/td>
&lt;td align="left">Secure Real-time Transport Protocol&lt;/td>
&lt;td align="left">Transport/Application Layer&lt;/td>
&lt;td align="left">Secure version of RTP, providing encryption and authentication for media streams.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>TLS&lt;/strong>&lt;/td>
&lt;td align="left">Transport Layer Security&lt;/td>
&lt;td align="left">Transport Layer&lt;/td>
&lt;td align="left">Used to encrypt SIP signaling (SIPS), ensuring confidentiality and integrity of signaling.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>With this macro picture in mind, we can now delve into the most important protocol, SIP, and explore how it elegantly orchestrates communication.&lt;/p>
&lt;h2 id="3-sip-protocol-indepth-analysis-micro-details">3. SIP Protocol In-Depth Analysis (Micro Details)&lt;/h2>
&lt;p>Now, we formally enter the world of SIP. SIP's design philosophy is &amp;ldquo;simplicity&amp;rdquo; and &amp;ldquo;extensibility,&amp;rdquo; borrowing heavily from HTTP design concepts. If you understand HTTP, learning SIP will feel very familiar.&lt;/p>
&lt;h3 id="31-sip-core-components">3.1 SIP Core Components&lt;/h3>
&lt;p>A typical SIP network consists of the following logical components:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph User A
UAC_A[User Agent Client UAC]
UAS_A[User Agent Server UAS]
UAC_A &amp;lt;--&amp;gt; UAS_A
end
subgraph SIP Network Infrastructure
Proxy[Proxy Server]
Registrar[Registrar Server]
Redirect[Redirect Server]
Proxy --- Registrar
end
subgraph User B
UAC_B[User Agent Client UAC]
UAS_B[User Agent Server UAS]
UAC_B &amp;lt;--&amp;gt; UAS_B
end
UAC_A -- SIP Request --&amp;gt; Proxy;
Proxy -- SIP Request --&amp;gt; UAS_B;
UAS_B -- SIP Response --&amp;gt; Proxy;
Proxy -- SIP Response --&amp;gt; UAC_A;
UAC_A -- Register --&amp;gt; Registrar;
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>User Agent (UA)&lt;/strong>: This is the terminal device in the SIP world. It can be:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Hardware Phone&lt;/strong>: Looks like a traditional phone but runs the SIP protocol internally.&lt;/li>
&lt;li>&lt;strong>Softphone&lt;/strong>: An application installed on a computer or mobile phone.&lt;/li>
&lt;li>Any device capable of initiating or receiving SIP sessions.&lt;/li>
&lt;/ul>
&lt;p>A UA contains two parts:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>User Agent Client (UAC)&lt;/strong>: Responsible for &lt;strong>initiating&lt;/strong> SIP requests. When you make a call, your device is a UAC.&lt;/li>
&lt;li>&lt;strong>User Agent Server (UAS)&lt;/strong>: Responsible for &lt;strong>receiving&lt;/strong> SIP requests and providing responses. When your phone rings, your device is a UAS.
In a complete two-way call, &lt;strong>each party's device is simultaneously both a UAC and a UAS&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Proxy Server&lt;/strong>: This is the central nervous system of the SIP network. It receives requests from UACs and &lt;strong>forwards&lt;/strong> them to the target UAS. The proxy server itself does not initiate requests, but it may modify certain parts of the request for policy enforcement (such as billing, routing policies). It is the &amp;ldquo;middleman&amp;rdquo; of the call.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Registrar Server&lt;/strong>: It functions like an &amp;ldquo;address book.&amp;rdquo; When a UA starts up and connects to the network, it sends a &lt;code>REGISTER&lt;/code> request to the Registrar, telling the server: &amp;ldquo;I'm Bob, my SIP address is &lt;code>sip:bob@example.com&lt;/code>, and my current IP address is &lt;code>192.168.1.100&lt;/code>&amp;rdquo;. The Registrar is responsible for maintaining this address mapping relationship (i.e., the binding between the user's SIP URI and their actual network location). When someone wants to call Bob, the Proxy server queries the Registrar to find Bob's current location.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Redirect Server&lt;/strong>: It's somewhat similar to a Proxy, but &amp;ldquo;lazier.&amp;rdquo; When it receives a request, it doesn't forward it itself but directly replies to the UAC with a &amp;ldquo;3xx&amp;rdquo; response, telling the UAC: &amp;ldquo;The person you're looking for is at &lt;code>sip:bob@192.168.1.100&lt;/code>, go find him yourself.&amp;rdquo; The UAC needs to initiate a new request based on this new address. This mode is less common in practical applications than the proxy mode.&lt;/p>
&lt;/li>
&lt;/ul>
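&lt;p>The Registrar's &amp;ldquo;address book&amp;rdquo; role described above can be sketched as a toy location service. The class and its method names are hypothetical, not from any SIP library; a real registrar also handles authentication, multiple contacts per user, and NAT.&lt;/p>

```python
import time

class Registrar:
    """Toy location service: maps a SIP URI (address-of-record)
    to its current contact address, with an expiry time."""
    def __init__(self):
        self.bindings = {}

    def register(self, aor, contact, expires=3600):
        # A REGISTER with Expires: 0 removes the binding
        if expires == 0:
            self.bindings.pop(aor, None)
        else:
            self.bindings[aor] = (contact, time.time() + expires)

    def lookup(self, aor):
        """Return the current contact for an AOR, or None if
        unknown or expired (what a Proxy queries before routing)."""
        entry = self.bindings.get(aor)
        if entry is None:
            return None
        contact, expiry = entry
        return contact if expiry >= time.time() else None
```

&lt;p>This is exactly the mapping Bob creates when his UA sends &lt;code>REGISTER&lt;/code>, and what the Proxy consults when someone calls him.&lt;/p>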
&lt;h3 id="32-sip-messages-the-harmony-with-http">3.2 SIP Messages: The Harmony with HTTP&lt;/h3>
&lt;p>SIP messages are plain text and come in two types: &lt;strong>Request&lt;/strong> and &lt;strong>Response&lt;/strong>.&lt;/p>
&lt;p>&lt;strong>A typical SIP request (INVITE):&lt;/strong>&lt;/p>
&lt;pre>&lt;code>INVITE sip:bob@biloxi.com SIP/2.0
Via: SIP/2.0/UDP pc33.atlanta.com;branch=z9hG4bK776asdhds
Max-Forwards: 70
To: Bob &amp;lt;sip:bob@biloxi.com&amp;gt;
From: Alice &amp;lt;sip:alice@atlanta.com&amp;gt;;tag=1928301774
Call-ID: a84b4c76e66710
CSeq: 314159 INVITE
Contact: &amp;lt;sip:alice@pc33.atlanta.com&amp;gt;
Content-Type: application/sdp
Content-Length: 142
(Message body: SDP content here...)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Request message structure analysis:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Request Line&lt;/strong>: &lt;code>Method Request-URI Version&lt;/code>
&lt;ul>
&lt;li>&lt;strong>Method&lt;/strong>: Defines the purpose of the request. Common methods include:
&lt;ul>
&lt;li>&lt;code>INVITE&lt;/code>: Initiates a session invitation.&lt;/li>
&lt;li>&lt;code>ACK&lt;/code>: Confirms a final response to an &lt;code>INVITE&lt;/code>.&lt;/li>
&lt;li>&lt;code>BYE&lt;/code>: Terminates an established session.&lt;/li>
&lt;li>&lt;code>CANCEL&lt;/code>: Cancels an incomplete &lt;code>INVITE&lt;/code> request.&lt;/li>
&lt;li>&lt;code>REGISTER&lt;/code>: Registers user location with a Registrar server.&lt;/li>
&lt;li>&lt;code>OPTIONS&lt;/code>: Queries the capabilities of a server or UA.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Request-URI&lt;/strong>: The target address of the request, i.e., &lt;code>sip:user@domain&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Header Fields&lt;/strong>: Key-value pairs in the form of &lt;code>Field Name: Field Value&lt;/code>, providing detailed information about the message.
&lt;ul>
&lt;li>&lt;code>Via&lt;/code>: Records the path the request has taken. Each hop proxy adds its own address at the top. Response messages will return along the path specified by the &lt;code>Via&lt;/code> header. The &lt;code>branch&lt;/code> parameter is a key part of the transaction ID.&lt;/li>
&lt;li>&lt;code>From&lt;/code> / &lt;code>To&lt;/code>: Represent the initiator and recipient of the call, respectively. The &lt;code>tag&lt;/code> parameter uniquely identifies a party in a call and is key to the dialog.&lt;/li>
&lt;li>&lt;code>Call-ID&lt;/code>: Uniquely identifies a complete call globally. All requests and responses related to this call use the same &lt;code>Call-ID&lt;/code>.&lt;/li>
&lt;li>&lt;code>CSeq&lt;/code>: Command Sequence, containing a number and a method name, used to order and distinguish multiple transactions under the same &lt;code>Call-ID&lt;/code>.&lt;/li>
&lt;li>&lt;code>Contact&lt;/code>: Provides a direct contact address (URI) for the request initiator. In an &lt;code>INVITE&lt;/code>, it tells the other party where subsequent requests (like &lt;code>BYE&lt;/code>) should be sent directly.&lt;/li>
&lt;li>&lt;code>Content-Type&lt;/code>: Describes the media type of the message body, typically &lt;code>application/sdp&lt;/code>.&lt;/li>
&lt;li>&lt;code>Content-Length&lt;/code>: The length of the message body.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
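&lt;p>Because SIP messages are plain text, the request anatomy above can be illustrated with a minimal parser sketch. This is illustrative only; a real stack must also handle header folding, compact header forms, and multi-valued headers.&lt;/p>

```python
def parse_sip_request(raw):
    """Split a SIP request into start line, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Request line: Method Request-URI Version
    method, request_uri, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {"method": method, "uri": request_uri,
            "version": version, "headers": headers, "body": body}

msg = ("INVITE sip:bob@biloxi.com SIP/2.0\r\n"
       "Call-ID: a84b4c76e66710\r\n"
       "CSeq: 314159 INVITE\r\n"
       "\r\n")
req = parse_sip_request(msg)
print(req["method"], req["headers"]["call-id"])
```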
&lt;p>&lt;strong>A typical SIP response (200 OK):&lt;/strong>&lt;/p>
&lt;pre>&lt;code>SIP/2.0 200 OK
Via: SIP/2.0/UDP pc33.atlanta.com;branch=z9hG4bK776asdhds;received=192.0.2.4
To: Bob &amp;lt;sip:bob@biloxi.com&amp;gt;;tag=a6c85cf
From: Alice &amp;lt;sip:alice@atlanta.com&amp;gt;;tag=1928301774
Call-ID: a84b4c76e66710
CSeq: 314159 INVITE
Contact: &amp;lt;sip:bob@198.51.100.3&amp;gt;
Content-Type: application/sdp
Content-Length: 131
(Message body: SDP content here...)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Response message structure analysis:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Status Line&lt;/strong>: &lt;code>Version Status-Code Reason-Phrase&lt;/code>, e.g. &lt;code>SIP/2.0 200 OK&lt;/code>. Note that the &lt;code>To&lt;/code> header now carries its own &lt;code>tag&lt;/code>; together with the &lt;code>From&lt;/code> tag and the &lt;code>Call-ID&lt;/code>, it uniquely identifies the dialog.&lt;/li>
&lt;/ul>
&lt;h3 id="33-a-complete-call-sip-session-flow-explained">3.3 A Complete Call: SIP Session Flow Explained&lt;/h3>
&lt;p>Below, we use a Mermaid sequence diagram to break down a typical SIP call flow, from user registration, to A calling B, and finally hanging up.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant Alice as Alice's UA
participant Proxy as SIP Proxy Server
participant Registrar as Registrar Server
participant Bob as Bob's UA
Alice-&amp;gt;&amp;gt;Registrar: REGISTER
Registrar-&amp;gt;&amp;gt;Alice: 200 OK
Bob-&amp;gt;&amp;gt;Registrar: REGISTER
Registrar-&amp;gt;&amp;gt;Bob: 200 OK
Alice-&amp;gt;&amp;gt;Proxy: INVITE
Proxy-&amp;gt;&amp;gt;Bob: INVITE
Bob-&amp;gt;&amp;gt;Proxy: 180 Ringing
Proxy-&amp;gt;&amp;gt;Alice: 180 Ringing
Bob-&amp;gt;&amp;gt;Proxy: 200 OK
Proxy-&amp;gt;&amp;gt;Alice: 200 OK
Alice-&amp;gt;&amp;gt;Proxy: ACK
Proxy-&amp;gt;&amp;gt;Bob: ACK
Alice-&amp;gt;&amp;gt;Bob: RTP Media Stream
Bob-&amp;gt;&amp;gt;Alice: RTP Media Stream
Bob-&amp;gt;&amp;gt;Proxy: BYE
Proxy-&amp;gt;&amp;gt;Alice: BYE
Alice-&amp;gt;&amp;gt;Proxy: 200 OK
Proxy-&amp;gt;&amp;gt;Bob: 200 OK
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Flow Breakdown&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Registration (1-4)&lt;/strong>: After coming online, Alice and Bob each register their locations with the Registrar. This is the prerequisite for others to find them.&lt;/li>
&lt;li>&lt;strong>Call (5-12)&lt;/strong>: This is the famous &amp;ldquo;three-way handshake&amp;rdquo; process (&lt;code>INVITE&lt;/code> -&amp;gt; &lt;code>200 OK&lt;/code> -&amp;gt; &lt;code>ACK&lt;/code>).
&lt;ul>
&lt;li>&lt;strong>INVITE&lt;/strong>: Alice initiates the call, carrying her prepared media information (SDP) in the request, describing the media types, codecs, and IP/port she can receive.&lt;/li>
&lt;li>&lt;strong>1xx Provisional Responses&lt;/strong>: The Proxy and Bob return &lt;code>100 Trying&lt;/code> (not shown in the diagram) and &lt;code>180 Ringing&lt;/code>, telling Alice that the request is being processed and that the remote phone is ringing. This effectively prevents the UAC from resending &lt;code>INVITE&lt;/code> due to timeout.&lt;/li>
&lt;li>&lt;strong>200 OK&lt;/strong>: When Bob answers the call, his UA sends a &lt;code>200 OK&lt;/code> response containing &lt;strong>his own SDP information&lt;/strong>. At this point, media negotiation is complete, and both parties know each other's media capabilities and receiving addresses.&lt;/li>
&lt;li>&lt;strong>ACK&lt;/strong>: After receiving the &lt;code>200 OK&lt;/code>, Alice must send an &lt;code>ACK&lt;/code> request to confirm. &lt;code>ACK&lt;/code> is an independent transaction used to confirm the final response. When Bob receives the &lt;code>ACK&lt;/code>, a complete SIP dialog is formally established.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Media Transmission&lt;/strong>: After the dialog is established, Alice and Bob can bypass the Proxy server and &lt;strong>directly&lt;/strong> send RTP voice packets to each other based on the IP and port information obtained from each other's SDP. &lt;strong>The path taken by signaling (through Proxy) and media (P2P) can be different&lt;/strong>, which is an important feature of SIP.&lt;/li>
&lt;li>&lt;strong>Termination (13-16)&lt;/strong>: Either party can end the call by sending a &lt;code>BYE&lt;/code> request. Upon receiving it, the other party replies with a &lt;code>200 OK&lt;/code>, and the call is cleanly terminated.&lt;/li>
&lt;/ol>
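&lt;p>The &lt;code>INVITE&lt;/code> handshake in step 2 can be summarized as a tiny caller-side state table. This is a deliberate simplification of the RFC 3261 INVITE client transaction, which additionally defines retransmission timers and error branches:&lt;/p>

```python
# Allowed transitions for the INVITE three-way handshake, seen from
# the caller's (UAC) side. Unknown events leave the state unchanged.
TRANSITIONS = {
    ("calling", "1xx"): "proceeding",      # 100 Trying / 180 Ringing
    ("calling", "2xx"): "send_ack",
    ("proceeding", "1xx"): "proceeding",
    ("proceeding", "2xx"): "send_ack",
    ("send_ack", "ack_sent"): "confirmed",  # dialog established
}

def advance(state, event):
    return TRANSITIONS.get((state, event), state)

state = "calling"
for event in ["1xx", "2xx", "ack_sent"]:
    state = advance(state, event)
print(state)  # confirmed
```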
&lt;h3 id="34-sdp-blueprint-for-media-sessions">3.4 SDP: Blueprint for Media Sessions&lt;/h3>
&lt;p>SDP (Session Description Protocol) is a perfect match for SIP, but it is an independent protocol (RFC 4566). It doesn't transmit any media data itself but is used to &lt;strong>describe&lt;/strong> media sessions. It's like a blueprint, detailing the specifications of the &amp;ldquo;communication building&amp;rdquo; to be constructed.&lt;/p>
&lt;p>&lt;strong>A typical SDP example (in an INVITE request):&lt;/strong>&lt;/p>
&lt;pre>&lt;code>v=0
o=alice 2890844526 2890844526 IN IP4 pc33.atlanta.com
s=SIP Call
c=IN IP4 192.0.2.4
t=0 0
m=audio 49170 RTP/AVP 0 8 97
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:97 iLBC/8000
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Key fields analysis&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;code>v=0&lt;/code>: Protocol version.&lt;/li>
&lt;li>&lt;code>o=&lt;/code>: (owner/creator) Describes the session initiator's information, including username, session ID, version number, etc.&lt;/li>
&lt;li>&lt;code>s=&lt;/code>: Session name.&lt;/li>
&lt;li>&lt;code>c=&lt;/code>: (connection data) Connection information. &lt;strong>Very important&lt;/strong>, it specifies the address where media streams should be sent (&lt;code>IN&lt;/code> means Internet, &lt;code>IP4&lt;/code> means IPv4, followed by the IP address).&lt;/li>
&lt;li>&lt;code>t=&lt;/code>: (time) Session start and end times, &lt;code>0 0&lt;/code> means permanent.&lt;/li>
&lt;li>&lt;code>m=&lt;/code>: (media description) Media description. &lt;strong>Crucial&lt;/strong>.
&lt;ul>
&lt;li>&lt;code>audio&lt;/code>: Media type is audio.&lt;/li>
&lt;li>&lt;code>49170&lt;/code>: &lt;strong>Port to which media will be sent&lt;/strong>.&lt;/li>
&lt;li>&lt;code>RTP/AVP&lt;/code>: Transport protocol used is RTP.&lt;/li>
&lt;li>&lt;code>0 8 97&lt;/code>: &lt;strong>Proposed codec list&lt;/strong> (payload type). This is a priority list, meaning &amp;ldquo;I prefer to use 0, then 8, then 97.&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;code>a=rtpmap: ...&lt;/code>: (attribute) Attribute line, mapping the payload type numbers in the &lt;code>m&lt;/code> line to specific codec names and clock frequencies. For example, &lt;code>a=rtpmap:0 PCMU/8000&lt;/code> means payload type 0 corresponds to G.711u (PCMU) with a sampling rate of 8000Hz.&lt;/li>
&lt;/ul>
&lt;p>This model is called the &lt;strong>Offer/Answer Model&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Offer&lt;/strong>: Alice sends her SDP in the &lt;code>INVITE&lt;/code>, which is an Offer, listing all the codecs she supports and her receiving address/port.&lt;/li>
&lt;li>&lt;strong>Answer&lt;/strong>: Upon receiving it, Bob selects a codec he also supports from Alice's list (e.g., PCMA) and returns this selected codec along with &lt;strong>his own&lt;/strong> receiving address/port in the SDP of the &lt;code>200 OK&lt;/code>.&lt;/li>
&lt;/ol>
&lt;p>When Alice receives this Answer, both parties have reached a consensus: use the PCMA codec, Alice sends RTP packets to Bob's IP/port, and Bob sends RTP packets to Alice's IP/port.&lt;/p>
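&lt;p>The codec-selection step of the Offer/Answer model can be sketched as follows. The helper is hypothetical; a real answer must also echo direction attributes, choose ports, and so on:&lt;/p>

```python
def answer_codecs(offer_pts, local_pts):
    """Pick payload types for an SDP answer: keep the offerer's
    preference order, drop anything we don't support."""
    supported = set(local_pts)
    chosen = [pt for pt in offer_pts if pt in supported]
    if not chosen:
        # No overlap: the UAS would reject with 488 Not Acceptable Here
        raise ValueError("no common codec")
    return chosen

# Alice offers PCMU(0), PCMA(8), iLBC(97); Bob supports PCMA and iLBC
print(answer_codecs([0, 8, 97], [8, 97, 18]))  # [8, 97]
```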
&lt;h2 id="4-media-stream-transmission-rtp-and-rtcp">4. Media Stream Transmission: RTP and RTCP&lt;/h2>
&lt;p>We have successfully established the &amp;ldquo;signaling&amp;rdquo; connection through SIP/SDP, like two airport control centers coordinating flight plans between cities. Now, we need the actual &amp;ldquo;airplanes&amp;rdquo;—the RTP protocol—to transport our &amp;ldquo;passengers&amp;rdquo;—voice and video data.&lt;/p>
&lt;h3 id="41-rtp-born-for-realtime-data">4.1 RTP: Born for Real-time Data&lt;/h3>
&lt;p>RTP (Real-time Transport Protocol, RFC 3550) is a network protocol specifically designed for end-to-end transmission of real-time data such as audio and video. It typically runs on top of &lt;strong>UDP&lt;/strong>.&lt;/p>
&lt;p>&lt;strong>Why UDP?&lt;/strong>
TCP provides reliable, ordered transmission, but at a cost: when a packet is lost, TCP stops sending subsequent packets until the lost packet is retransmitted and successfully received. For real-time voice, this delay paid for &amp;ldquo;reliability&amp;rdquo; is fatal. Losing a small piece of voice (perhaps just a fraction of a second of silence or faint noise) is far better than making the entire conversation stutter for several seconds while waiting for it. RTP is based on this principle of &amp;ldquo;tolerating packet loss, not tolerating delay,&amp;rdquo; choosing UDP as its ideal carrier.&lt;/p>
&lt;p>However, pure UDP just throws data packets to the other side on a &amp;ldquo;best effort&amp;rdquo; basis, providing no timing information or knowledge of packet order. RTP adds an additional header on top of UDP, giving data packets &amp;ldquo;life&amp;rdquo;: &lt;strong>timestamps&lt;/strong> and &lt;strong>sequence numbers&lt;/strong>.&lt;/p>
&lt;p>&lt;strong>RTP Header Structure Explained&lt;/strong>&lt;/p>
&lt;p>A standard RTP header is at least 12 bytes, structured as follows:&lt;/p>
&lt;pre>&lt;code> 0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X| CC |M| PT | Sequence Number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Timestamp |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Synchronization Source (SSRC) identifier |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
| Contributing source (CSRC) identifiers |
| .... |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
&lt;/code>&lt;/pre>
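&lt;p>The 12-byte fixed header above can be packed and parsed with Python's &lt;code>struct&lt;/code> module. This sketch assumes the common case V=2, P=0, X=0, CC=0 and ignores extension headers and CSRC lists:&lt;/p>

```python
import struct

def build_rtp_header(pt, seq, timestamp, ssrc, marker=0):
    """Pack the 12-byte fixed RTP header (V=2, P=0, X=0, CC=0).
    Bit fields are composed with arithmetic for XML-safe clarity."""
    byte0 = 0x80                        # version 2 in the top two bits
    byte1 = marker * 0x80 + pt % 128    # M bit, then 7-bit payload type
    return struct.pack("!BBHII", byte0, byte1, seq, timestamp, ssrc)

def parse_rtp_header(data):
    byte0, byte1, seq, ts, ssrc = struct.unpack("!BBHII", data[:12])
    return {"version": byte0 >> 6, "marker": byte1 >> 7,
            "pt": byte1 % 128, "seq": seq, "timestamp": ts, "ssrc": ssrc}

# 20 ms of 8000 Hz PCMU: the timestamp advances by 160 per packet
hdr = build_rtp_header(pt=0, seq=1, timestamp=160, ssrc=0x12345678)
info = parse_rtp_header(hdr)
print(info["version"], info["pt"], info["timestamp"])  # 2 0 160
```

&lt;p>The fields discussed below (PT, sequence number, timestamp, SSRC) are exactly what this round-trip preserves.&lt;/p>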
&lt;ul>
&lt;li>&lt;strong>V (Version, 2 bits)&lt;/strong>: RTP protocol version, currently version 2.&lt;/li>
&lt;li>&lt;strong>P (Padding, 1 bit)&lt;/strong>: Padding bit. If set, indicates there are additional padding bytes at the end of the packet.&lt;/li>
&lt;li>&lt;strong>X (Extension, 1 bit)&lt;/strong>: Extension bit. If set, indicates an extension header follows the standard header.&lt;/li>
&lt;li>&lt;strong>CC (CSRC Count, 4 bits)&lt;/strong>: Contributing source count, indicating the number of CSRC identifiers following the fixed header.&lt;/li>
&lt;li>&lt;strong>M (Marker, 1 bit)&lt;/strong>: Marker bit. Its specific meaning is defined by the particular application profile. For example, in video streams, it can mark the end of a frame. In audio, it typically marks the first packet after a silence period (the start of a talkspurt).&lt;/li>
&lt;li>&lt;strong>PT (Payload Type, 7 bits)&lt;/strong>: &lt;strong>Payload type&lt;/strong>. This is a very critical field, used to identify what format the media data in the RTP packet is. This number corresponds exactly to what we negotiated in the &lt;code>m=&lt;/code> line and &lt;code>a=rtpmap&lt;/code> line of SDP. For example, if SDP negotiation decides to use PCMU (payload type 0), then all RTP packets carrying PCMU data will have their PT field set to 0. When the receiver sees PT=0, it knows to use the PCMU decoder to process the data.&lt;/li>
&lt;li>&lt;strong>Sequence Number (16 bits)&lt;/strong>: &lt;strong>Sequence number&lt;/strong>. Increments by 1 for each RTP packet sent. This field has two core functions:
&lt;ol>
&lt;li>&lt;strong>Detecting packet loss&lt;/strong>: The receiver can determine if packets have been lost by checking if the received sequence numbers are consecutive.&lt;/li>
&lt;li>&lt;strong>Reordering&lt;/strong>: Due to different paths packets may take in the network, packets sent earlier might arrive later. The receiver can restore the original order of packets using the sequence number.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>&lt;strong>Timestamp (32 bits)&lt;/strong>: &lt;strong>Timestamp&lt;/strong>. &lt;strong>This is the soul of RTP&lt;/strong>. It records the sampling moment of the media data in the packet. &lt;strong>Note: This timestamp is not a &amp;ldquo;wall clock&amp;rdquo;&lt;/strong> but is based on the media's sampling clock. For example, for audio sampled at 8000Hz, the clock &amp;ldquo;ticks&amp;rdquo; 8000 times per second. If a packet contains 20 milliseconds of audio data, the timestamp of the next packet will increase by &lt;code>8000 * 0.020 = 160&lt;/code>.
The main functions of the timestamp are:
&lt;ol>
&lt;li>&lt;strong>Synchronizing playback and eliminating jitter&lt;/strong>: Jitter refers to variations in packet arrival delay. The receiver sets up a &amp;ldquo;jitter buffer&amp;rdquo; to play media smoothly based on the timestamps on the packets, rather than playing at varying speeds, thus providing a smooth auditory/visual experience.&lt;/li>
&lt;li>&lt;strong>Multimedia synchronization&lt;/strong>: In a call containing both audio and video, audio and video streams are two separate RTP streams (with different SSRCs), but their timestamps can be based on the same reference clock. This allows the receiver to precisely align audio and video, achieving &amp;ldquo;lip sync.&amp;rdquo;&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>&lt;strong>SSRC (Synchronization Source, 32 bits)&lt;/strong>: &lt;strong>Synchronization source&lt;/strong>. In an RTP session, each media stream source (such as a microphone or a camera) is assigned a randomly generated, globally unique SSRC value. For example, if Alice is sending both audio and video in a call, she will generate two SSRCs, one for the audio stream and one for the video stream. Intermediate devices like Proxies or Mixers can distinguish different streams based on SSRC.&lt;/li>
&lt;li>&lt;strong>CSRC (Contributing Source)&lt;/strong>: Contributing source. When multiple source media streams pass through a mixer and are merged into one stream, this field lists the SSRCs of all original contributors.&lt;/li>
&lt;/ul>
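The fixed 12-byte header described above maps cleanly onto Python's `struct` module. The sketch below is illustrative (the helper names and sample values are ours, not from any RTP library) and covers only the fixed header, with no CSRC list or extension:

```python
import struct

RTP_HDR = struct.Struct("!BBHII")  # 12-byte fixed header, network byte order

def build_rtp_header(seq, timestamp, ssrc, payload_type, marker=False):
    """Pack an RTP fixed header (V=2, no padding, no extension, no CSRC)."""
    byte0 = 2 << 6                                      # version 2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)  # M bit + 7-bit PT
    return RTP_HDR.pack(byte0, byte1, seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)

def parse_rtp_header(data):
    byte0, byte1, seq, ts, ssrc = RTP_HDR.unpack_from(data)
    return {
        "version": byte0 >> 6,
        "cc": byte0 & 0x0F,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": ts,
        "ssrc": ssrc,
    }

# PCMU (PT=0) at 8000 Hz: 20 ms packets advance the timestamp by 160 each
hdr = build_rtp_header(seq=1000, timestamp=160000, ssrc=0x12345678, payload_type=0)
fields = parse_rtp_header(hdr)
```

A receiver would feed `seq` into its loss-detection and reordering logic and `timestamp` into the jitter buffer, exactly as the field descriptions above explain.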
&lt;h3 id="42-rtcp-rtps-control-partner">4.2 RTCP: RTP's &amp;ldquo;Control Partner&amp;rdquo;&lt;/h3>
&lt;p>RTP is only responsible for &amp;ldquo;cargo transport,&amp;rdquo; but it doesn't know the quality of the &amp;ldquo;shipping.&amp;rdquo; RTCP (Real-time Transport Control Protocol) is the accompanying &amp;ldquo;quality supervisor.&amp;rdquo; It works in parallel with RTP, periodically sending control packets between participants to monitor the quality of service (QoS) of data transmission.&lt;/p>
&lt;p>RTCP packets and RTP packets use different UDP ports (typically the RTP port number + 1). RTCP itself carries no media data, only control information, and its bandwidth usage is typically limited to 5% of the session bandwidth.&lt;/p>
&lt;p>&lt;strong>Core RTCP packet types and their functions:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Sender Report (SR)&lt;/strong>: Sent by the &lt;strong>media sender&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Content&lt;/strong>: Contains the sender's SSRC, an &lt;strong>NTP timestamp&lt;/strong> (used for synchronization with the &amp;ldquo;wall clock,&amp;rdquo; enabling absolute time synchronization and cross-media stream synchronization), the RTP timestamp corresponding to the NTP timestamp, and the total number of packets and bytes sent.&lt;/li>
&lt;li>&lt;strong>Function&lt;/strong>: Lets the receiver know how much data has been sent and provides key information needed for cross-media stream synchronization (such as audio-video synchronization).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Receiver Report (RR)&lt;/strong>: Sent by the &lt;strong>media receiver&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Content&lt;/strong>: Contains the SSRC of the source it is receiving from, and since the last report: &lt;strong>fraction lost&lt;/strong>, &lt;strong>cumulative number of packets lost&lt;/strong>, &lt;strong>highest sequence number received&lt;/strong>, and an estimate of &lt;strong>interarrival jitter&lt;/strong>.&lt;/li>
&lt;li>&lt;strong>Function&lt;/strong>: &lt;strong>This is the most important QoS feedback mechanism&lt;/strong>. After receiving an RR, the sender can understand the network conditions. If the report shows a high packet loss rate, the sender's application might make intelligent adjustments, such as switching to a more loss-resistant, lower bitrate codec, or notifying the user of poor network conditions.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Source Description (SDES)&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Content&lt;/strong>: Provides additional information associated with an SSRC, most importantly the &lt;strong>CNAME (Canonical Name)&lt;/strong>. CNAME is a unique, persistent identifier for each endpoint.&lt;/li>
&lt;li>&lt;strong>Function&lt;/strong>: Used to associate different media streams (such as SSRC_audio and SSRC_video) from the same user. When the receiver sees two streams with the same CNAME, it knows they come from the same participant and can synchronize their playback.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>BYE&lt;/strong>: Used to explicitly indicate that a participant is leaving the session, closing a stream.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>APP&lt;/strong>: Used for application-specific extensions.&lt;/p>
&lt;/li>
&lt;/ul>
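The interarrival jitter reported in RR packets is defined by RFC 3550 as a running average of transit-time variation: for each packet, compute the difference D between its transit time and the previous packet's, then update J += (|D| - J)/16. A minimal sketch of that estimator, with made-up arrival times for illustration:

```python
def update_jitter(jitter, prev_transit, transit):
    """One step of the RFC 3550 interarrival-jitter estimator: J += (|D| - J)/16.
    Transit times are in RTP timestamp units (e.g. 1/8000 s for 8 kHz audio)."""
    d = abs(transit - prev_transit)
    return jitter + (d - jitter) / 16.0

# Packets carry timestamps 160 ticks apart; arrival times vary (network jitter)
jitter, prev_transit = 0.0, None
for arrival, rtp_ts in [(1000, 0), (1175, 160), (1318, 320), (1490, 480)]:
    transit = arrival - rtp_ts          # relative transit time of this packet
    if prev_transit is not None:
        jitter = update_jitter(jitter, prev_transit, transit)
    prev_transit = transit
```

The 1/16 gain smooths out single outliers, so the reported jitter reflects a trend rather than one late packet; a sender seeing this value climb in RRs can react before quality collapses.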
&lt;p>Through the collaborative work of RTP and RTCP, VoIP systems can not only efficiently transmit real-time media but also intelligently sense network quality and make adaptive adjustments, which is the technical cornerstone for achieving high-quality call experiences.&lt;/p>
&lt;h2 id="5-nat-traversal-breaking-through-network-barriers">5. NAT Traversal: Breaking Through Network Barriers&lt;/h2>
&lt;p>So far, our discussion of SIP and RTP flows has been based on an ideal assumption: both parties in the call have public IP addresses and can directly access each other. However, in the real world, the vast majority of user devices (computers, phones, IP phones) are behind home or office routers, using private IP addresses (such as &lt;code>192.168.x.x&lt;/code>).&lt;/p>
&lt;p>Network Address Translation (NAT) devices (i.e., what we commonly call routers) play the role of &amp;ldquo;gatekeepers,&amp;rdquo; allowing internal devices to access the internet but, by default, blocking unsolicited connections from the outside. This poses a huge challenge for VoIP communications.&lt;/p>
&lt;h3 id="51-the-nat-challenge">5.1 The NAT Challenge&lt;/h3>
&lt;p>Imagine Alice and Bob are each in their own home networks and, coincidentally, both have the private IP address &lt;code>192.168.1.10&lt;/code>.&lt;/p>
&lt;ol>
&lt;li>Alice initiates a call and honestly fills in her media receiving address in the SDP of her &lt;code>INVITE&lt;/code> request: &lt;code>c=IN IP4 192.168.1.10&lt;/code> and &lt;code>m=audio 49170 ...&lt;/code>.&lt;/li>
&lt;li>This &lt;code>INVITE&lt;/code> request successfully reaches Bob through the SIP proxy.&lt;/li>
&lt;li>Bob's UA sees this SDP and becomes confused. It dutifully tries to send its RTP packets to the address &lt;code>192.168.1.10&lt;/code>. But this address is a private address in Bob's own network (it might even be his neighbor's printer address), not Alice on the public internet!&lt;/li>
&lt;li>The result is: &lt;strong>Media streams (RTP packets) cannot be delivered, and both parties can only hear their own side (or silence)&lt;/strong>.&lt;/li>
&lt;/ol>
&lt;p>This is the core challenge NAT poses to VoIP: &lt;strong>The private address information carried in SDP is useless and misleading to the other party on the public internet&lt;/strong>. To solve this problem, we need a mechanism to discover the device's &amp;ldquo;identity&amp;rdquo; on the public internet and establish a path that can penetrate NAT.&lt;/p>
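As a tiny illustration of why the advertised address is useless, Python's stdlib `ipaddress` module can flag the non-routable ranges a UA should never advertise directly. The helper name below is ours, not from any SIP library:

```python
import ipaddress

def needs_nat_traversal(ip):
    """True if an address advertised in SDP would be unreachable from the
    public internet (RFC 1918 private ranges and other non-global ranges)."""
    return ipaddress.ip_address(ip).is_private

assert needs_nat_traversal("192.168.1.10")   # Alice's LAN address: useless in SDP
assert not needs_nat_traversal("8.8.8.8")    # a genuinely public address
```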
&lt;h3 id="52-the-three-musketeers-of-nat-traversal-stun-turn-ice">5.2 The Three Musketeers of NAT Traversal: STUN, TURN, ICE&lt;/h3>
&lt;p>To solve the connectivity problems brought by NAT, IETF defined a complete solution, with the ICE protocol at its core, while ICE's work depends on two auxiliary protocols: STUN and TURN.&lt;/p>
&lt;h4 id="1-stun-session-traversal-utilities-for-nat">1. STUN (Session Traversal Utilities for NAT)&lt;/h4>
&lt;p>STUN (RFC 5389) is a simple client-server protocol, with its core functionality acting like a &amp;ldquo;mirror.&amp;rdquo;&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Working Principle&lt;/strong>: The UA (client) behind a private network sends a request to a STUN server on the public internet. Upon receiving the request, the STUN server checks which public IP and port the request came from, then packages this address (called the &lt;strong>Server-Reflexive Address&lt;/strong>) in a response and returns it to the client along the original path.&lt;/li>
&lt;li>&lt;strong>Function&lt;/strong>: After receiving the response, the client sees its &amp;ldquo;appearance&amp;rdquo; on the public internet in the &amp;ldquo;mirror.&amp;rdquo; It now knows: &amp;ldquo;Oh, when I send packets outward, my router maps my source address &lt;code>192.168.1.10:49170&lt;/code> to the public address &lt;code>203.0.113.10:8001&lt;/code>.&amp;rdquo; This way, it can fill this public address and port in the SDP and send it to the other party.&lt;/li>
&lt;/ul>
&lt;p>STUN can also be used to detect the type of NAT (e.g., full cone, restricted cone, port restricted cone, symmetric). Understanding the NAT type helps select the optimal traversal strategy.&lt;/p>
&lt;p>&lt;strong>Limitations&lt;/strong>: STUN is powerless against &amp;ldquo;Symmetric NAT.&amp;rdquo; In this strictest type of NAT, the router not only allocates a public port for each outbound session, but this port mapping relationship is &lt;strong>only valid for a specific destination IP and port&lt;/strong>. The public address &lt;code>203.0.113.10:8001&lt;/code> that Alice discovers through the STUN server is a dedicated mapping for her communication with the STUN server; Bob cannot use this address to send data to Alice.&lt;/p>
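The "mirror" reply arrives as a STUN XOR-MAPPED-ADDRESS attribute, in which the address is XORed with STUN's fixed magic cookie so that NATs rewriting literal IP addresses in payloads don't corrupt it. A hedged sketch of decoding that attribute (IPv4 only; the encode step below stands in for a real server's response):

```python
import socket
import struct

MAGIC_COOKIE = 0x2112A442  # fixed STUN magic cookie (RFC 5389)

def decode_xor_mapped_address(value):
    """Decode the value of a STUN XOR-MAPPED-ADDRESS attribute (IPv4 only)."""
    _, family, xport = struct.unpack_from("!BBH", value)
    if family != 0x01:
        raise ValueError("only IPv4 handled in this sketch")
    port = xport ^ (MAGIC_COOKIE >> 16)       # port is XORed with the cookie's top 16 bits
    (xaddr,) = struct.unpack_from("!I", value, 4)
    ip = socket.inet_ntoa(struct.pack("!I", xaddr ^ MAGIC_COOKIE))
    return ip, port

# Encode 203.0.113.10:8001 the way a server would, then decode it back
ip_int = struct.unpack("!I", socket.inet_aton("203.0.113.10"))[0]
value = struct.pack("!BBHI", 0, 0x01, 8001 ^ (MAGIC_COOKIE >> 16), ip_int ^ MAGIC_COOKIE)
print(decode_xor_mapped_address(value))  # ('203.0.113.10', 8001)
```

The decoded pair is exactly the server-reflexive address the UA would then place in its SDP.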
&lt;h4 id="2-turn-traversal-using-relays-around-nat">2. TURN (Traversal Using Relays around NAT)&lt;/h4>
&lt;p>When STUN fails due to symmetric NAT or other firewall policies, TURN (RFC 8656) is needed as the final &amp;ldquo;fallback&amp;rdquo; solution.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Working Principle&lt;/strong>: A TURN server is not just a &amp;ldquo;mirror&amp;rdquo;; it is a fully functional &lt;strong>public media relay&lt;/strong>.
&lt;ol>
&lt;li>The client first &lt;strong>allocates&lt;/strong> a relay address (public IP and port) on the TURN server.&lt;/li>
&lt;li>Then, the client tells the peer (through SIP/SDP) to send its media packets to this relay address.&lt;/li>
&lt;li>At the same time, the client also sends its media packets to the peer through the TURN server.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>&lt;strong>Function&lt;/strong>: All media streams are forwarded through the TURN server. Although this increases latency and consumes server bandwidth, it &lt;strong>guarantees connectivity&lt;/strong> because both communicating parties are actually communicating with the TURN server, which has a public address.&lt;/li>
&lt;/ul>
&lt;h4 id="3-ice-interactive-connectivity-establishment">3. ICE (Interactive Connectivity Establishment)&lt;/h4>
&lt;p>ICE (RFC 8445) is the real &amp;ldquo;commander-in-chief.&amp;rdquo; It doesn't invent new protocols but cleverly integrates STUN and TURN to form a systematic framework, establishing media paths between communicating parties in the most effective way.&lt;/p>
&lt;p>The ICE workflow can be divided into the following stages:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph 1. Gathering Candidates
A[UA-A] --&amp;gt;|STUN Request| B(STUN Server);
B --&amp;gt;|Server-Reflexive Addr| A;
A --&amp;gt;|Allocate Request| C(TURN Server);
C --&amp;gt;|Relayed Addr| A;
A --&amp;gt; D{Host Address};
D &amp;amp; B &amp;amp; C --&amp;gt; E((A's Candidate List));
end
subgraph 2. Exchanging Candidates
E -- via SIP/SDP --&amp;gt; F((B's Candidate List));
F -- via SIP/SDP --&amp;gt; E;
end
subgraph 3. Connectivity Checks
G(Candidate Pairs);
E &amp;amp; F --&amp;gt; G;
G --&amp;gt;|STUN Binding Requests| H{Check All Possible Paths};
end
subgraph 4. Selecting Best Path
H --&amp;gt;|Select Highest Priority&amp;lt;br/&amp;gt;Valid Path| I[Establish RTP/RTCP Streams];
end
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>ICE Process Explained&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Gathering Candidates&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Host Candidates&lt;/strong>: The UA first collects all IP addresses and ports on its local network interfaces.&lt;/li>
&lt;li>&lt;strong>Server-Reflexive Candidates&lt;/strong>: The UA uses a STUN server to discover its public mapping address.&lt;/li>
&lt;li>&lt;strong>Relayed Candidates&lt;/strong>: The UA allocates a relay address using a TURN server.&lt;/li>
&lt;li>In the end, each UA generates a list of candidates of various types with different priorities.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Exchanging Candidates&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Both parties exchange their candidate lists through the signaling channel (typically in the SDP of SIP's &lt;code>INVITE&lt;/code>/&lt;code>200 OK&lt;/code> messages).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Connectivity Checks&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>After receiving the other party's address list, each UA pairs its local candidates with the other party's candidates, forming a &lt;strong>Candidate Pair&lt;/strong> list, sorted by priority (Host &amp;gt; Server-Reflexive &amp;gt; Relayed).&lt;/li>
&lt;li>ICE begins &lt;strong>connectivity checks (STUN Binding Requests)&lt;/strong>. It starts from the highest priority address pair, sending STUN requests to each other. If a request successfully receives a response, that path is considered &lt;strong>valid&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Selecting the Best Path and Starting Media Transmission&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Once a valid path pair is found, the UA can start using it to send media data. But it doesn't stop immediately; it continues to check other possible path pairs.&lt;/li>
&lt;li>When all checks are complete, ICE selects the validated path with the highest priority as the final communication path.&lt;/li>
&lt;li>&lt;strong>Final Result&lt;/strong>:
&lt;ul>
&lt;li>If a Host-to-Host or Host-to-ServerReflexive path works, a P2P (or quasi-P2P) connection is achieved, which is most efficient.&lt;/li>
&lt;li>If all P2P attempts fail, ICE will ultimately choose a path relayed through the TURN server, sacrificing some performance to ensure the successful establishment of the call.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
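The priority ordering that drives these connectivity checks is fully specified by RFC 8445's formulas. A small sketch using the RFC's recommended type-preference values (local preference and component ID are simplified to defaults here):

```python
# RFC 8445 recommended type preferences: host > peer-reflexive > server-reflexive > relayed
TYPE_PREF = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}

def candidate_priority(cand_type, local_pref=65535, component_id=1):
    """Candidate priority (RFC 8445, section 5.1.2.1)."""
    return (TYPE_PREF[cand_type] << 24) | (local_pref << 8) | (256 - component_id)

def pair_priority(controlling, controlled):
    """Candidate-pair priority (RFC 8445, section 6.1.2.3).
    The controlling agent's candidate breaks ties via the +1 term."""
    g, d = controlling, controlled
    return (min(g, d) << 32) + 2 * max(g, d) + (1 if g > d else 0)

host, srflx, relay = (candidate_priority(t) for t in ("host", "srflx", "relay"))
```

Sorting pairs by `pair_priority` is what makes ICE try direct host-to-host paths first and fall back to TURN relays only when everything else fails.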
&lt;p>Through ICE, VoIP systems can intelligently and dynamically adapt to various complex network environments, maximizing attempts to establish efficient P2P connections while gracefully degrading to relay mode when necessary, greatly improving the success rate and quality of VoIP calls.&lt;/p>
&lt;h2 id="6-voip-security-protecting-your-call-privacy">6. VoIP Security: Protecting Your Call Privacy&lt;/h2>
&lt;p>As VoIP becomes more widespread, its security also becomes increasingly important. An unprotected VoIP communication system faces risks of eavesdropping, fraud, and denial of service attacks. Fortunately, we have mature solutions to protect the two key parts of communication: signaling and media.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph UA-A
A[Alice's UA];
end;
subgraph UA-B
B[Bob's UA];
end;
subgraph SIP Proxy
P[Proxy Server];
end;
A -- &amp;quot;SIPS (SIP over TLS)&amp;lt;br&amp;gt;Signaling Encryption&amp;quot; --&amp;gt; P;
P -- &amp;quot;SIPS (SIP over TLS)&amp;lt;br&amp;gt;Signaling Encryption&amp;quot; --&amp;gt; B;
A -- &amp;quot;SRTP&amp;lt;br&amp;gt;Media Encryption&amp;quot; --&amp;gt; B;
style A fill:#D5F5E3;
style B fill:#D5F5E3;
style P fill:#EBF5FB;
&lt;/code>&lt;/pre>
&lt;h3 id="61-signaling-encryption-sips-sip-over-tls">6.1 Signaling Encryption: SIPS (SIP over TLS)&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Problem&lt;/strong>: Ordinary SIP messages are transmitted in plaintext. Attackers can easily sniff these messages on the network, obtaining metadata such as who the parties in the call are (&lt;code>From&lt;/code>/&lt;code>To&lt;/code> headers) and the unique identifier of the call (&lt;code>Call-ID&lt;/code>), and can even tamper with message content, performing call hijacking or fraud.&lt;/li>
&lt;li>&lt;strong>Solution&lt;/strong>: &lt;strong>TLS (Transport Layer Security)&lt;/strong>, the same protocol used by HTTPS to encrypt web traffic.
&lt;ul>
&lt;li>&lt;strong>SIPS (Secure SIP)&lt;/strong>: When SIP runs on top of TLS, it is called SIPS. It encapsulates the entire SIP message (requests and responses) in an encrypted TLS channel for transmission.&lt;/li>
&lt;li>&lt;strong>Working Method&lt;/strong>: The UA and SIP proxy first establish a standard TLS handshake, exchanging certificates and negotiating encryption keys. Once the TLS connection is established, all subsequent SIP messages are transmitted within this encrypted channel, preventing outsiders from peeking at their content.&lt;/li>
&lt;li>&lt;strong>SIPS URI&lt;/strong>: Addresses using SIPS are typically represented as &lt;code>sips:alice@example.com&lt;/code> and use port &lt;code>5061&lt;/code> by default instead of &lt;code>5060&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Through SIPS, we ensure the &lt;strong>confidentiality and integrity of call signaling&lt;/strong>.&lt;/p>
&lt;h3 id="62-media-encryption-srtp">6.2 Media Encryption: SRTP&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Problem&lt;/strong>: Even if signaling is encrypted, the actual voice/video data (RTP packets) are still in plaintext by default! Attackers may not know who is on the call, but if they can intercept the RTP stream, they can still eavesdrop on the conversation content.&lt;/li>
&lt;li>&lt;strong>Solution&lt;/strong>: &lt;strong>SRTP (Secure Real-time Transport Protocol)&lt;/strong>, RFC 3711.
&lt;ul>
&lt;li>&lt;strong>Working Method&lt;/strong>: SRTP is not an entirely new protocol but adds a layer of encryption and authentication on top of the RTP protocol. It &lt;strong>encrypts the payload portion of RTP&lt;/strong> but keeps the RTP header in plaintext (because network devices may need to read header information for QoS processing).&lt;/li>
&lt;li>&lt;strong>Key Exchange&lt;/strong>: SRTP itself does not specify how keys are exchanged. In practice, encryption keys are typically negotiated through a secure signaling channel (i.e., SIP/SDP messages encrypted with SIPS/TLS). This process is usually handled by a mechanism called &lt;strong>SDES (SDP Security Descriptions)&lt;/strong> or the more modern &lt;strong>DTLS-SRTP&lt;/strong>.&lt;/li>
&lt;li>&lt;strong>Functions&lt;/strong>:
&lt;ol>
&lt;li>&lt;strong>Confidentiality&lt;/strong>: Using symmetric encryption algorithms (such as AES) to encrypt RTP payloads, ensuring that only the communicating parties with the key can decrypt the conversation content.&lt;/li>
&lt;li>&lt;strong>Message Authentication&lt;/strong>: Generating an Authentication Tag through algorithms like HMAC-SHA1. The receiver can use this to verify whether the message has been tampered with during transmission.&lt;/li>
&lt;li>&lt;strong>Replay Protection&lt;/strong>: Preventing attackers from capturing packets and resending them to conduct malicious attacks.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Alongside SRTP, there is also &lt;strong>SRTCP&lt;/strong>, which provides the same level of encryption and authentication protection for RTCP control packets.&lt;/p>
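As an illustration of the message-authentication step alone, the default SRTP transform computes an HMAC-SHA1 over the full RTP packet plus the 32-bit rollover counter (ROC) and truncates it to 80 bits. The sketch below shows only this tagging step (not encryption or key derivation) and uses a placeholder key; real session keys come from DTLS-SRTP or SDES negotiation:

```python
import hashlib
import hmac
import struct

def srtp_auth_tag(auth_key, rtp_packet, roc):
    """Default SRTP authentication tag (RFC 3711): HMAC-SHA1 over the packet
    plus the 32-bit rollover counter (ROC), truncated to 80 bits."""
    mac = hmac.new(auth_key, rtp_packet + struct.pack("!I", roc), hashlib.sha1)
    return mac.digest()[:10]

key = bytes(20)  # placeholder 160-bit session auth key, for illustration only
packet = b"\x80\x00\x03\xe8" + bytes(8) + b"voice-payload"  # RTP header + payload
tag = srtp_auth_tag(key, packet, roc=0)

# The receiver recomputes the tag and compares it in constant time
assert hmac.compare_digest(tag, srtp_auth_tag(key, packet, roc=0))
```

Any single flipped bit in the packet yields a different tag, which is how the receiver detects tampering in transit.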
&lt;p>By combining SIPS and SRTP, we can build an end-to-end secure VoIP communication system, ensuring that the entire process from &amp;ldquo;who is calling&amp;rdquo; to &amp;ldquo;what is said on the phone&amp;rdquo; is tightly protected.&lt;/p>
&lt;h2 id="7-conclusion-and-future-outlook">7. Conclusion and Future Outlook&lt;/h2>
&lt;h3 id="conclusion">Conclusion&lt;/h3>
&lt;p>This document has provided an in-depth analysis of the two core technologies supporting modern network voice communications: VoIP and SIP, from macro to micro perspectives.&lt;/p>
&lt;ul>
&lt;li>We started with the &lt;strong>basic concept of VoIP&lt;/strong>, understanding how it transforms voice into data packets on IP networks, revolutionizing the traditional PSTN system.&lt;/li>
&lt;li>At the &lt;strong>macro level&lt;/strong>, we outlined VoIP's layered technology stack, clarifying the positions and collaborative relationships of key protocols such as SIP (signaling), RTP/RTCP (media), SDP (description), and UDP/TCP (transport).&lt;/li>
&lt;li>At the &lt;strong>micro level&lt;/strong>, we thoroughly analyzed the &lt;strong>SIP protocol&lt;/strong>&amp;rsquo;s core components (UA, Proxy, Registrar), its text message structure similar to HTTP, and the detailed signaling flow of a complete call from registration and establishment to termination. We also understood how &lt;strong>SDP&lt;/strong> negotiates media parameters through the Offer/Answer model.&lt;/li>
&lt;li>We delved into the &lt;strong>RTP protocol&lt;/strong> responsible for carrying actual voice data, understanding the critical importance of sequence numbers and timestamps in its header for handling out-of-order packets, jitter, and achieving synchronization, as well as the key role of &lt;strong>RTCP&lt;/strong> in QoS monitoring.&lt;/li>
&lt;li>We faced the biggest obstacle in real-world network deployment—&lt;strong>NAT&lt;/strong>, and detailed how the &amp;ldquo;three musketeers&amp;rdquo; &lt;strong>STUN, TURN, ICE&lt;/strong> work together to intelligently establish a media path that can penetrate routers.&lt;/li>
&lt;li>Finally, we discussed &lt;strong>VoIP security&lt;/strong> mechanisms, protecting signaling through &lt;strong>SIPS (TLS)&lt;/strong> and media through &lt;strong>SRTP&lt;/strong>, building end-to-end secure communications.&lt;/li>
&lt;/ul>
&lt;h3 id="future-outlook">Future Outlook&lt;/h3>
&lt;p>VoIP technology is far from stopping its development; it is evolving towards being more intelligent, integrated, and seamless.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Deep Integration with WebRTC&lt;/strong>: WebRTC (Web Real-Time Communication) has brought high-quality audio and video communication capabilities directly into browsers. Although WebRTC exposes its functionality through JavaScript APIs on the browser side, its underlying core components (ICE, STUN, TURN, (S)RTP, DTLS-SRTP) are the same VoIP technology stack we've discussed. In the future, traditional SIP systems and WebRTC-based applications will be more tightly interconnected, forming a seamless unified communication ecosystem.&lt;/li>
&lt;li>&lt;strong>AI-Empowered Communication Experience&lt;/strong>: Artificial intelligence is reshaping VoIP. For example:
&lt;ul>
&lt;li>&lt;strong>Intelligent Codecs (AI Codec)&lt;/strong>: Using machine learning to reconstruct high-quality voice at extremely low bandwidth.&lt;/li>
&lt;li>&lt;strong>Intelligent Noise Reduction and Echo Cancellation&lt;/strong>: Precisely separating human voice from background noise through AI models, achieving studio-level call quality.&lt;/li>
&lt;li>&lt;strong>Network Path Optimization&lt;/strong>: AI can analyze RTCP data and network telemetry data, predict network congestion, and proactively switch to better servers or network paths.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Immersive Communication&lt;/strong>: With the popularization of 5G and the rise of the metaverse concept, VoIP will no longer be limited to voice and flat video. Spatial Audio, VR/AR calls, and other immersive experiences will place higher demands on VoIP's latency, bandwidth, and synchronization, spurring new technological evolution.&lt;/li>
&lt;/ul>
&lt;p>From electric current on analog telephone lines, to data packets racing through IP networks, to future AI-empowered virtual space conversations, the revolution in communication technology never ceases. A profound understanding of the core technological principles represented by SIP and VoIP will be our solid foundation as we move forward in this wave.&lt;/p></description></item><item><title>Modern ASR Technology Analysis: From Traditional Models to LLM-Driven New Paradigms</title><link>https://ziyanglin.netlify.app/en/post/asr-technology-overview/</link><pubDate>Sat, 28 Jun 2025 13:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/asr-technology-overview/</guid><description>&lt;h2 id="1-background">1. Background&lt;/h2>
&lt;h3 id="11-pain-points-of-traditional-asr-models">1.1 Pain Points of Traditional ASR Models&lt;/h3>
&lt;p>Traditional Automatic Speech Recognition (ASR) models, such as those based on Hidden Markov Models-Gaussian Mixture Models (HMM-GMM) or Deep Neural Networks (DNN), perform well in specific domains and controlled environments but face numerous challenges:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Data Sparsity&lt;/strong>: Heavy dependence on large-scale, high-quality labeled datasets, resulting in poor generalization to low-resource languages or specific accents.&lt;/li>
&lt;li>&lt;strong>Insufficient Robustness&lt;/strong>: Performance drops dramatically in noisy environments, far-field audio capture, multi-person conversations, and other real-world scenarios.&lt;/li>
&lt;li>&lt;strong>Lack of Contextual Understanding&lt;/strong>: Models are typically limited to direct mapping from acoustic features to text, lacking understanding of long-range context, semantics, and speaker intent, leading to recognition errors (such as homophone confusion).&lt;/li>
&lt;li>&lt;strong>Limited Multi-task Capabilities&lt;/strong>: Traditional models are usually single-task oriented, supporting only speech transcription without simultaneously handling speaker diarization, language identification, translation, and other tasks.&lt;/li>
&lt;/ol>
&lt;h3 id="12-large-language-model-llm-driven-asr-new-paradigm">1.2 Large Language Model (LLM) Driven ASR New Paradigm&lt;/h3>
&lt;p>In recent years, end-to-end large ASR models represented by &lt;code>Whisper&lt;/code> have demonstrated unprecedented robustness and generalization capabilities through pretraining on massive, diverse unsupervised or weakly supervised data. These models typically adopt an Encoder-Decoder architecture, treating ASR as a sequence-to-sequence translation problem.&lt;/p>
&lt;p>&lt;strong>Typical Process&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Raw Audio Waveform&amp;quot;] --&amp;gt; B[&amp;quot;Feature Extraction (e.g., Log-Mel Spectrogram)&amp;quot;]
B --&amp;gt; C[&amp;quot;Transformer Encoder&amp;quot;]
C --&amp;gt; D[&amp;quot;Latent Representation&amp;quot;]
D --&amp;gt; E[&amp;quot;Transformer Decoder&amp;quot;]
E --&amp;gt; F[&amp;quot;Text Sequence Output&amp;quot;]
&lt;/code>&lt;/pre>
&lt;p>This approach not only simplifies the complex pipeline of traditional ASR but also learns rich acoustic and linguistic knowledge through large-scale data, enabling excellent performance even in zero-shot scenarios.&lt;/p>
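The feature-extraction step in the diagram above can be sketched in NumPy. The window and hop sizes (25 ms / 10 ms at 16 kHz) and 80 mel bins are commonly used Whisper-style values, but this is a simplified illustration: the real pipeline's padding and normalization details are omitted:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Framing -> windowed FFT -> triangular mel filterbank -> log."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (frames, n_fft//2 + 1)

    # Filter edges spaced evenly on the mel scale, mapped back to FFT bins
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):                        # triangular filters
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    mel = power @ fbank.T
    return np.log10(np.maximum(mel, 1e-10))               # floor avoids log(0)

feats = log_mel_spectrogram(np.random.randn(16000))       # 1 s of 16 kHz audio
```

The resulting (frames, mel-bins) matrix is what the Transformer encoder consumes in place of the raw waveform.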
&lt;h2 id="2-analysis-of-asr-model-solutions">2. Analysis of ASR Model Solutions&lt;/h2>
&lt;h3 id="21-whisperlargev3turbo">2.1 Whisper-large-v3-turbo&lt;/h3>
&lt;p>&lt;code>Whisper&lt;/code> is a pretrained ASR model developed by OpenAI, with its &lt;code>large-v3&lt;/code> and &lt;code>large-v3-turbo&lt;/code> versions being among the industry-leading models.&lt;/p>
&lt;h4 id="211-whisper-design">2.1.1 Whisper Design&lt;/h4>
&lt;p>&lt;strong>Structural Modules&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Audio Input (30s segment)&amp;quot;] --&amp;gt; B[&amp;quot;Log-Mel Spectrogram&amp;quot;]
B --&amp;gt; C[&amp;quot;Transformer Encoder&amp;quot;]
C --&amp;gt; D[&amp;quot;Encoded Representation&amp;quot;]
D --&amp;gt; E[&amp;quot;Transformer Decoder&amp;quot;]
E --&amp;gt; F[&amp;quot;Predicted Text Tokens&amp;quot;]
subgraph &amp;quot;Multi-task Processing&amp;quot;
E --&amp;gt; G[&amp;quot;Transcription&amp;quot;]
E --&amp;gt; H[&amp;quot;Translation&amp;quot;]
E --&amp;gt; I[&amp;quot;Language Identification&amp;quot;]
end
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Features&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Large-scale Weakly Supervised Training&lt;/strong>: Trained on 680,000 hours of multilingual, multi-task data, covering a wide range of accents, background noise, and technical terminology.&lt;/li>
&lt;li>&lt;strong>End-to-end Architecture&lt;/strong>: A unified Transformer model directly maps audio to text, without requiring external language models or alignment modules.&lt;/li>
&lt;li>&lt;strong>Multi-task Capability&lt;/strong>: The model can simultaneously handle multilingual speech transcription, speech translation, and language identification.&lt;/li>
&lt;li>&lt;strong>Robustness&lt;/strong>: Through carefully designed data augmentation and mixing, the model performs excellently under various challenging conditions.&lt;/li>
&lt;li>&lt;strong>Turbo Version&lt;/strong>: &lt;code>large-v3-turbo&lt;/code> is a pruned and fine-tuned variant of &lt;code>large-v3&lt;/code> with a much smaller decoder, trading a slight amount of accuracy for significantly faster inference, with approximately 809M parameters.&lt;/li>
&lt;/ul>
&lt;h4 id="212-problems-solved">2.1.2 Problems Solved&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target Problem&lt;/th>
&lt;th>Whisper's Solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Poor Generalization&lt;/td>
&lt;td>Large-scale pretraining on massive, diverse datasets covering nearly a hundred languages.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Insufficient Robustness&lt;/td>
&lt;td>Training data includes various background noise, accents, and speaking styles, enhancing performance in real-world scenarios.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Weak Contextual Modeling&lt;/td>
&lt;td>Transformer architecture captures long-range dependencies in audio signals.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Complex Deployment&lt;/td>
&lt;td>Provides multiple model sizes (from &lt;code>tiny&lt;/code> to &lt;code>large&lt;/code>), with open-sourced code and model weights, facilitating community use and deployment.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="213-production-defect-analysis">2.1.3 Production Defect Analysis&lt;/h4>
&lt;h5 id="2131-hallucination-issues">2.1.3.1 Hallucination Issues&lt;/h5>
&lt;ul>
&lt;li>In segments with no speech or noise, the model sometimes generates meaningless or repetitive text, a common issue with large autoregressive models.&lt;/li>
&lt;li>This phenomenon is particularly noticeable in long audio processing and may require additional post-processing logic for detection and filtering.&lt;/li>
&lt;/ul>
&lt;h5 id="2132-limited-timestamp-precision">2.1.3.2 Limited Timestamp Precision&lt;/h5>
&lt;ul>
&lt;li>The model predicts word-level timestamps, but their precision may not meet the stringent requirements of certain applications (such as subtitle alignment, speech editing).&lt;/li>
&lt;li>Timestamp accuracy decreases during long periods of silence or rapid speech flow.&lt;/li>
&lt;/ul>
&lt;h5 id="2133-high-computational-resource-requirements">2.1.3.3 High Computational Resource Requirements&lt;/h5>
&lt;ul>
&lt;li>The &lt;code>large-v3&lt;/code> model contains 1.55 billion parameters, and the &lt;code>turbo&lt;/code> version has nearly 800 million parameters, demanding significant computational resources (especially GPU memory), making it unsuitable for direct execution on edge devices.&lt;/li>
&lt;li>Although optimization techniques like quantization exist, balancing performance while reducing resource consumption remains a challenge.&lt;/li>
&lt;/ul>
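&lt;p>The memory figures follow directly from parameter count times bytes per parameter; a rough weights-only sketch (activations, beam-search state, and framework overhead come on top):&lt;/p>
&lt;pre>&lt;code class="language-python">def weight_memory_gb(n_params, bytes_per_param):
    # weights-only estimate; real peak usage is higher
    return n_params * bytes_per_param / 1024**3

for name, n in [("large-v3", 1.55e9), ("turbo", 0.798e9)]:
    fp16 = weight_memory_gb(n, 2)  # float16
    int8 = weight_memory_gb(n, 1)  # 8-bit quantized
    print(f"{name}: fp16 ~{fp16:.1f} GB, int8 ~{int8:.1f} GB")
&lt;/code>&lt;/pre>
&lt;p>This is why even quantized &lt;code>large&lt;/code> variants remain awkward on edge devices with only a gigabyte or two of free memory.&lt;/p>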
&lt;h5 id="2134-realtime-processing-bottlenecks">2.1.3.4 Real-time Processing Bottlenecks&lt;/h5>
&lt;ul>
&lt;li>The model processes 30-second audio windows, requiring complex sliding window and caching mechanisms for real-time streaming ASR scenarios, which introduces additional latency.&lt;/li>
&lt;/ul>
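&lt;p>The sliding-window scheme can be sketched as fixed windows with overlap, so that words at window boundaries are seen twice and can be merged downstream; 16 kHz is Whisper's input sample rate, while the 5 s overlap is an illustrative choice:&lt;/p>
&lt;pre>&lt;code class="language-python">def sliding_windows(n_samples, sr=16000, window_s=30, overlap_s=5):
    # (start, end) sample ranges covering the audio with overlapping windows
    win = window_s * sr
    hop = (window_s - overlap_s) * sr
    starts = range(0, max(n_samples - overlap_s * sr, 1), hop)
    return [(s, min(s + win, n_samples)) for s in starts]

# 70 s of audio yields three overlapping 30 s windows
print(sliding_windows(70 * 16000))
&lt;/code>&lt;/pre>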
&lt;h3 id="22-sensevoice">2.2 SenseVoice&lt;/h3>
&lt;p>&lt;code>SenseVoice&lt;/code> is a next-generation industrial-grade ASR model developed by Alibaba DAMO Academy's speech team. Unlike &lt;code>Whisper&lt;/code>, which focuses on robust general transcription, &lt;code>SenseVoice&lt;/code> emphasizes multi-functionality, real-time processing, and integration with downstream tasks.&lt;/p>
&lt;h4 id="221-sensevoice-design">2.2.1 SenseVoice Design&lt;/h4>
&lt;p>&lt;strong>Structural Modules&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Audio Stream&amp;quot;] --&amp;gt; B[&amp;quot;FSMN-VAD (Voice Activity Detection)&amp;quot;]
B --&amp;gt; C[&amp;quot;Encoder (e.g., SAN-M)&amp;quot;]
C --&amp;gt; D[&amp;quot;Latent Representation&amp;quot;]
D --&amp;gt; E[&amp;quot;Decoder&amp;quot;]
E --&amp;gt; F[&amp;quot;Text Output&amp;quot;]
subgraph &amp;quot;Multi-task and Control&amp;quot;
G[&amp;quot;Speaker Diarization&amp;quot;] --&amp;gt; C
H[&amp;quot;Emotion Recognition&amp;quot;] --&amp;gt; C
I[&amp;quot;Zero-shot TTS Prompt&amp;quot;] --&amp;gt; E
end
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Features&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Unified End-to-end Model&lt;/strong>: Integrates acoustic model, language model, and punctuation prediction, achieving end-to-end output from speech to punctuated text.&lt;/li>
&lt;li>&lt;strong>Multi-task Learning&lt;/strong>: The model not only performs speech recognition but also simultaneously outputs speaker diarization, emotional information, and can even generate acoustic prompts for zero-shot TTS.&lt;/li>
&lt;li>&lt;strong>Streaming and Non-streaming Integration&lt;/strong>: Supports both streaming and non-streaming modes through a unified architecture, meeting the needs of real-time and offline scenarios.&lt;/li>
&lt;li>&lt;strong>TTS Integration&lt;/strong>: One innovation of &lt;code>SenseVoice&lt;/code> is that its output can serve as a prompt for TTS models like &lt;code>CosyVoice&lt;/code>, enabling voice cloning and transfer, closing the loop between ASR and TTS.&lt;/li>
&lt;/ul>
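&lt;p>To illustrate what the extra meta-information buys downstream (an illustrative data shape, not SenseVoice's actual API): segments carrying speaker and emotion labels can be folded into dialogue turns with ordinary grouping logic:&lt;/p>
&lt;pre>&lt;code class="language-python">from itertools import groupby

segments = [
    {"speaker": "A", "emotion": "neutral", "text": "Hi, how can I help?"},
    {"speaker": "B", "emotion": "happy", "text": "I'd like to book a table."},
    {"speaker": "B", "emotion": "happy", "text": "For two, tonight."},
]

# merge consecutive segments from the same speaker into one dialogue turn
turns = [
    {"speaker": spk, "text": " ".join(s["text"] for s in group)}
    for spk, group in groupby(segments, key=lambda s: s["speaker"])
]
print(turns)
&lt;/code>&lt;/pre>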
&lt;h4 id="222-problems-solved">2.2.2 Problems Solved&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target Problem&lt;/th>
&lt;th>SenseVoice's Solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Single-task Limitation, Integration Difficulties&lt;/td>
&lt;td>Designed as a multi-task model, natively supporting speaker diarization, emotion recognition, etc., simplifying dialogue system construction.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Poor Real-time Performance&lt;/td>
&lt;td>Adopts efficient streaming architecture (such as SAN-M), combined with VAD, achieving low-latency real-time recognition.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Lack of Coordination with Downstream Tasks&lt;/td>
&lt;td>Output includes rich meta-information (such as speaker, emotion) and can generate TTS prompts, achieving deep integration between ASR and TTS.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Punctuation Restoration Dependent on Post-processing&lt;/td>
&lt;td>Incorporates punctuation prediction as a built-in task, achieving joint modeling of text and punctuation.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="223-production-defect-analysis">2.2.3 Production Defect Analysis&lt;/h4>
&lt;h5 id="2231-model-complexity-and-maintenance">2.2.3.1 Model Complexity and Maintenance&lt;/h5>
&lt;ul>
&lt;li>As a complex model integrating multiple functions, its training and maintenance costs are relatively high.&lt;/li>
&lt;li>Balancing multiple tasks may require fine-tuning to avoid performance degradation in any single task.&lt;/li>
&lt;/ul>
&lt;h5 id="2232-generalization-of-zeroshot-capabilities">2.2.3.2 Generalization of Zero-shot Capabilities&lt;/h5>
&lt;ul>
&lt;li>Although it supports zero-shot TTS prompt generation, its voice cloning effect and stability when facing unseen speakers or complex acoustic environments may not match specialized voice cloning models.&lt;/li>
&lt;/ul>
&lt;h5 id="2233-opensource-ecosystem-and-community">2.2.3.3 Open-source Ecosystem and Community&lt;/h5>
&lt;ul>
&lt;li>Compared to &lt;code>Whisper&lt;/code>'s strong open-source community and rich ecosystem tools, &lt;code>SenseVoice&lt;/code>, as an industrial-grade model, may have limited open-source availability and community support, affecting its popularity in academic and developer communities.&lt;/li>
&lt;/ul>
&lt;h2 id="3-conclusion">3. Conclusion&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Whisper&lt;/strong>: Through large-scale weakly supervised learning, it has pushed the robustness and generalization capabilities of ASR to new heights. It is a powerful &lt;strong>general-purpose speech recognizer&lt;/strong>, particularly suitable for processing diverse, uncontrolled audio data. Its design philosophy is &amp;ldquo;trading scale for performance,&amp;rdquo; excelling in zero-shot and multilingual scenarios.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>SenseVoice&lt;/strong>: Represents the trend of ASR technology developing towards &lt;strong>multi-functionality and integration&lt;/strong>. It is not just a recognizer but a &lt;strong>perceptual frontend for conversational intelligence&lt;/strong>, aimed at providing richer, more real-time input for downstream tasks (such as dialogue systems, TTS). Its design philosophy is &amp;ldquo;fusion and collaboration,&amp;rdquo; emphasizing ASR's pivotal role in the entire intelligent interaction chain.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>In summary, &lt;code>Whisper&lt;/code> defines the performance baseline for modern ASR, while &lt;code>SenseVoice&lt;/code> explores broader possibilities for ASR in industrial applications. Future ASR technology may develop towards combining the strengths of both: having both the robustness and generalization capabilities of &lt;code>Whisper&lt;/code> and the multi-task collaboration and real-time processing capabilities of &lt;code>SenseVoice&lt;/code>.&lt;/p></description></item><item><title>Modern TTS Architecture Comparison: In-Depth Analysis of Ten Speech Synthesis Models</title><link>https://ziyanglin.netlify.app/en/post/modern-tts-models/</link><pubDate>Fri, 27 Jun 2025 07:02:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/modern-tts-models/</guid><description>&lt;h2 id="1-kokoro-lightweight-efficient-tts">1. Kokoro: Lightweight Efficient TTS&lt;/h2>
&lt;h3 id="11-architecture-design">1.1 Architecture Design&lt;/h3>
&lt;p>Kokoro adopts a concise and efficient architecture design, with its core structure as follows:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text] --&amp;gt; B[G2P Phoneme Processing - misaki]
B --&amp;gt; C[StyleTTS2 Style Decoder]
C --&amp;gt; D[ISTFTNet Vocoder]
D --&amp;gt; E[Waveform - 24kHz]
&lt;/code>&lt;/pre>
&lt;p>Kokoro's features:&lt;/p>
&lt;ul>
&lt;li>No traditional Encoder (directly processes phonemes)&lt;/li>
&lt;li>Decoder uses feed-forward non-recursive structure (Conv1D/FFN)&lt;/li>
&lt;li>Does not use transformer, autoregression, or diffusion&lt;/li>
&lt;li>Style and prosody are injected as conditional vectors in the decoder&lt;/li>
&lt;li>Uses ISTFTNet as vocoder: lightweight, fast, supports ONNX inference&lt;/li>
&lt;/ul>
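&lt;p>The pipeline shape above can be sketched as plain function composition: every stage maps a whole sequence to a whole sequence in one pass, which is why there is no per-token generation loop. All three bodies below are toy placeholders, not Kokoro's real components (misaki, the StyleTTS2-style decoder, ISTFTNet):&lt;/p>
&lt;pre>&lt;code class="language-python">def g2p(text):
    # placeholder for misaki grapheme-to-phoneme conversion
    return list(text.lower().replace(" ", ""))

def decode(phonemes, style_vector):
    # placeholder for the style-conditioned decoder: one "frame" per phoneme
    return [(p, style_vector) for p in phonemes]

def vocode(frames):
    # placeholder for ISTFTNet; say 256 waveform samples per frame
    return [0.0] * (len(frames) * 256)

wave = vocode(decode(g2p("Hello world"), style_vector=0.5))
print(len(wave))  # output length is known up front: no autoregressive loop
&lt;/code>&lt;/pre>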
&lt;h3 id="12-technical-advantages">1.2 Technical Advantages&lt;/h3>
&lt;p>Kokoro provides solutions to multiple pain points of traditional TTS models:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target Issue&lt;/th>
&lt;th>Kokoro's Solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Limited voice style diversity&lt;/td>
&lt;td>Built-in style embedding and multiple speaker options (48+)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>High deployment threshold&lt;/td>
&lt;td>Full Python/PyTorch + ONNX support, one-line pip installation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Slow generation speed&lt;/td>
&lt;td>Uses non-autoregressive structure + lightweight vocoder (ISTFTNet)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Lack of control capability&lt;/td>
&lt;td>Explicitly models pitch/duration/energy and other prosody parameters&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Unclear licensing&lt;/td>
&lt;td>Uses Apache 2.0, commercial-friendly and fine-tunable&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="13-limitation-analysis">1.3 Limitation Analysis&lt;/h3>
&lt;p>Despite Kokoro's excellence in efficiency and deployment convenience, it has some notable limitations:&lt;/p>
&lt;h4 id="131-strong-structural-parallelism-but-weak-context-modeling">1.3.1 Strong Structural Parallelism but Weak Context Modeling&lt;/h4>
&lt;ul>
&lt;li>No encoder → Cannot understand whole-sentence context, e.g., &amp;ldquo;He is happy today&amp;rdquo; vs &amp;ldquo;He is angry today&amp;rdquo; cannot naturally vary in intonation&lt;/li>
&lt;li>Phonemes are sent directly to the decoder, without linguistic hierarchical structure&lt;/li>
&lt;li>In long texts or sentences with strong contextual dependencies, pause rhythm lacks semantic awareness&lt;/li>
&lt;li>Parallel generation produces the whole output in one pass without token-by-token inference, but at the cost of semantic consistency: it cannot model tone progression across a paragraph&lt;/li>
&lt;/ul>
&lt;h4 id="132-limited-acoustic-modeling-capability">1.3.2 Limited Acoustic Modeling Capability&lt;/h4>
&lt;ul>
&lt;li>Sound details (such as breathiness, intonation contour) are not as good as VALL-E, StyleTTS2, Bark&lt;/li>
&lt;li>Uses the classic TTS route of &amp;ldquo;decoder predicts Mel + vocoder synthesis,&amp;rdquo; whose acoustic precision is approaching its upper limit&lt;/li>
&lt;li>Prosody prediction is controllable but limited in quality (model itself is too small)&lt;/li>
&lt;/ul>
&lt;h4 id="133-tradeoff-between-audio-quality-and-model-complexity">1.3.3 Trade-off Between Audio Quality and Model Complexity&lt;/h4>
&lt;ul>
&lt;li>Sacrifices some audio quality to maintain speed&lt;/li>
&lt;li>May produce artifacts in high-frequency bands, nasal sounds, and plosives&lt;/li>
&lt;li>Limited emotional expression intensity, cannot do &amp;ldquo;roaring, crying&amp;rdquo; and other extreme styles&lt;/li>
&lt;/ul>
&lt;h2 id="2-cosyvoice-llmbased-unified-architecture">2. CosyVoice: LLM-Based Unified Architecture&lt;/h2>
&lt;h3 id="21-architecture-design">2.1 Architecture Design&lt;/h3>
&lt;p>CosyVoice adopts a unified architecture design similar to LLMs, integrating text and audio processing into a single framework:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text] --&amp;gt; B[Tokenizer]
B --&amp;gt; C[Text token]
D[Audio] --&amp;gt; E[WavTokenizer]
E --&amp;gt; F[Acoustic token]
C --&amp;gt; G[LLaMA Transformer]
G1[Prosody token] --&amp;gt; G
G2[Speaker prompt] --&amp;gt; G
F --&amp;gt; G
G --&amp;gt; H[Predict Acoustic token]
H --&amp;gt; I[Vocoder]
I --&amp;gt; J[Audio output]
&lt;/code>&lt;/pre>
&lt;p>Main modules and their functions:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Module&lt;/th>
&lt;th>Implementation Details&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Tokenizer&lt;/td>
&lt;td>Uses standard BPE tokenizer, converts text to tokens (supports Chinese-English mixed input)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>WavTokenizer&lt;/td>
&lt;td>Discretizes audio into tokens (replacing traditional Mel), interfaces with Transformer decoder&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Transformer Model&lt;/td>
&lt;td>Multimodal autoregressive Transformer, structure similar to LLaMA, fuses text and audio tokens&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Prosody Token&lt;/td>
&lt;td>Controls &amp;lt;laugh&amp;gt; &amp;lt;pause&amp;gt; &amp;lt;whisper&amp;gt; and other tones through token insertion rather than model structure modeling&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Vocoder&lt;/td>
&lt;td>Supports HiFi-GAN or SNAC: restores waveforms from audio tokens, lightweight, supports low-latency deployment&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
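&lt;p>The unified token stream the table describes can be made concrete with a toy example; the token IDs and vocabularies below are entirely made up. The point is only that text tokens, prosody tokens, and acoustic-prompt tokens share one sequence that a single autoregressive model consumes:&lt;/p>
&lt;pre>&lt;code class="language-python">TEXT = {"hello": 10, "world": 11}       # pretend BPE vocabulary
CONTROL = {"laugh": 500, "pause": 501}  # pretend prosody tokens
SPEAKER_PROMPT = [900, 901]             # pretend acoustic prompt tokens

def build_prompt(words, controls):
    stream = list(SPEAKER_PROMPT)
    for w in words:
        stream.append(TEXT[w])
        if w in controls:
            stream.append(CONTROL[controls[w]])
    return stream

# the model would continue this stream with predicted acoustic tokens
print(build_prompt(["hello", "world"], {"world": "laugh"}))
&lt;/code>&lt;/pre>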
&lt;h3 id="22-technical-advantages">2.2 Technical Advantages&lt;/h3>
&lt;p>CosyVoice provides innovative solutions to multiple issues in traditional TTS architectures:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target Issue&lt;/th>
&lt;th>CosyVoice's Solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Complex traditional structure, slow inference&lt;/td>
&lt;td>Uses unified Transformer architecture, no encoder, direct token input/output, simplified structure&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Lack of prosody control&lt;/td>
&lt;td>Inserts prosody tokens (like &amp;lt;laugh&amp;gt;) for expression control, no need to train dedicated emotion models&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Upstream/downstream inconsistency, uncontrollable TTS&lt;/td>
&lt;td>Both text and audio are discretized into tokens, unified modeling logic, supports prompt guidance and controllable generation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>High difficulty in multilingual modeling&lt;/td>
&lt;td>Supports Chinese-English bilingual training, text tokenizer natively supports multiple languages, unified expression at token layer&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Lack of conversational speech capability&lt;/td>
&lt;td>Generation method compatible with LLMs, can integrate chat context to construct speech dialogue system framework&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="23-limitation-analysis">2.3 Limitation Analysis&lt;/h3>
&lt;p>While CosyVoice has significant advantages in unified architecture and flexibility, it also faces some challenges in practical applications:&lt;/p>
&lt;h4 id="231-autoregressive-structure-leads-to-low-parallelism">2.3.1 Autoregressive Structure Leads to Low Parallelism&lt;/h4>
&lt;ul>
&lt;li>Model uses LLM-like token-by-token autoregressive generation method&lt;/li>
&lt;li>Must generate sequentially, cannot process long sentences in parallel&lt;/li>
&lt;li>Inference speed is significantly slower than non-autoregressive models like FastSpeech2/StyleTTS2&lt;/li>
&lt;li>Fundamental limitation comes from Transformer decoder architecture: must wait for previous token generation before predicting the next one&lt;/li>
&lt;/ul>
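&lt;p>The sequential dependency can be shown in a few lines; &lt;code>next_token&lt;/code> below is a stand-in for a full Transformer forward pass, which is exactly the cost that must be paid once per generated token:&lt;/p>
&lt;pre>&lt;code class="language-python">def next_token(prefix):
    # placeholder for an expensive model call over the whole prefix
    return (sum(prefix) + 1) % 7

def generate(prompt_tokens, n_audio_tokens):
    seq = list(prompt_tokens)
    for _ in range(n_audio_tokens):
        # strictly sequential: token t cannot start before t-1 finishes
        seq.append(next_token(seq))
    return seq[len(prompt_tokens):]

print(generate([1, 2, 3], 5))
&lt;/code>&lt;/pre>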
&lt;h4 id="232-prosody-control-mechanism-relies-on-prompts-not-suitable-for-stable-production">2.3.2 Prosody Control Mechanism Relies on Prompts, Not Suitable for Stable Production&lt;/h4>
&lt;ul>
&lt;li>Style control depends on manual insertion of prosody tokens&lt;/li>
&lt;li>Style output quality highly dependent on &amp;ldquo;prompt crafting techniques&amp;rdquo;&lt;/li>
&lt;li>Compared to StyleTTS2's direct input of style vector/embedding, control is less structured, lacking learnability and robustness&lt;/li>
&lt;li>Difficult to automatically build stable output flow in engineering&lt;/li>
&lt;/ul>
&lt;h4 id="233-lacks-speaker-transfer-capability">2.3.3 Lacks Speaker Transfer Capability&lt;/h4>
&lt;ul>
&lt;li>Relies on acoustic prompt tokens rather than an explicit, trainable speaker embedding&lt;/li>
&lt;li>Voice cloning from short reference audio can be unstable for unseen speakers&lt;/li>
&lt;li>Capability may fall short when highly personalized speech is needed (e.g., virtual characters, customer-customized voices)&lt;/li>
&lt;/ul>
&lt;h2 id="3-chattts-modular-diffusion-model">3. ChatTTS: Modular Diffusion Model&lt;/h2>
&lt;h3 id="31-architecture-design">3.1 Architecture Design&lt;/h3>
&lt;p>ChatTTS adopts a modular design approach, combining the advantages of diffusion models:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text] --&amp;gt; B[Text Encoder]
B --&amp;gt; C[Latent Diffusion Duration Predictor - LDDP]
C --&amp;gt; D[Acoustic Encoder - generates speech tokens]
D --&amp;gt; E[HiFi-GAN vocoder]
E --&amp;gt; F[Audio]
&lt;/code>&lt;/pre>
&lt;p>Main modules and their functions:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Module&lt;/th>
&lt;th>Implementation Details&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Tokenizer&lt;/td>
&lt;td>Uses standard BPE tokenizer, converts text to tokens (supports Chinese-English mixed input)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>WavTokenizer&lt;/td>
&lt;td>Discretizes audio into tokens (replacing Mel), as decoder target&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Text Encoder&lt;/td>
&lt;td>Encodes text tokens, provides context vector representation for subsequent modules&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Duration Predictor (LDDP)&lt;/td>
&lt;td>Uses diffusion model to predict token duration, achieving natural prosody (rhythm modeling)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Acoustic Decoder&lt;/td>
&lt;td>Autoregressively generates speech tokens, constructing speech representation frame by frame&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Prosody Token&lt;/td>
&lt;td>Controls &amp;lt;laugh&amp;gt; &amp;lt;pause&amp;gt; &amp;lt;shout&amp;gt; and other tokens, incorporating sentence expression tone and rhythm&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Vocoder&lt;/td>
&lt;td>Supports HiFi-GAN/EnCodec, restores waveforms from speech tokens, flexible deployment&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
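&lt;p>Once the duration predictor (LDDP) has produced per-token durations, they are typically consumed by a length regulator in the FastSpeech style: each token's representation is repeated for its predicted number of frames. A minimal sketch with integers standing in for hidden states:&lt;/p>
&lt;pre>&lt;code class="language-python">def length_regulate(tokens, durations):
    # expand a token-rate sequence to a frame-rate sequence
    frames = []
    for tok, dur in zip(tokens, durations):
        frames.extend([tok] * dur)
    return frames

# two tokens predicted to last 3 and 5 frames respectively
print(length_regulate([7, 8], [3, 5]))  # [7, 7, 7, 8, 8, 8, 8, 8]
&lt;/code>&lt;/pre>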
&lt;h3 id="32-technical-advantages">3.2 Technical Advantages&lt;/h3>
&lt;p>ChatTTS provides solutions to module dependency and inference pipeline issues in TTS models:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Issue&lt;/th>
&lt;th>ChatTTS's Strategy&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Heavy module dependencies&lt;/td>
&lt;td>Decouples modules for modular training: supports independent training of tokenizer, diffusion-based duration model, vocoder, and connects through intermediate tokens, reducing end-to-end coupling risk&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Long inference pipeline&lt;/td>
&lt;td>Uses unified token expression structure (text token → speech token → waveform), forming standard token flow path, enhancing module collaboration efficiency; supports HiFi-GAN to simplify backend&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>High fine-tuning difficulty&lt;/td>
&lt;td>Explicit control logic: expresses style through prosody token insertion, no need for additional style models, reducing data dependency and fine-tuning complexity&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="33-limitation-analysis">3.3 Limitation Analysis&lt;/h3>
&lt;p>ChatTTS has advantages in modular design but also faces some practical application challenges:&lt;/p>
&lt;h4 id="331-autoregressive-structure-leads-to-low-parallelism">3.3.1 Autoregressive Structure Leads to Low Parallelism&lt;/h4>
&lt;ul>
&lt;li>Uses Transformer Decoder + autoregressive mechanism, generating tokens one by one&lt;/li>
&lt;li>Must wait for the completion of the previous speech token before generating the next one&lt;/li>
&lt;/ul>
&lt;h4 id="332-complex-architecture-multiple-modules-high-maintenance-difficulty">3.3.2 Complex Architecture, Multiple Modules, High Maintenance Difficulty&lt;/h4>
&lt;ul>
&lt;li>Heavy module dependencies: includes tokenizer, diffusion predictor, decoder, vocoder, and other components, difficult to train and optimize uniformly&lt;/li>
&lt;li>Long inference pipeline: errors in any module will affect speech quality and timing control&lt;/li>
&lt;li>High fine-tuning difficulty: control tokens and style embedding effects have strong data dependency&lt;/li>
&lt;/ul>
&lt;h4 id="333-control-tokens-have-weak-interpretability-generation-is-unstable">3.3.3 Control Tokens Have Weak Interpretability, Generation Is Unstable&lt;/h4>
&lt;ul>
&lt;li>Control tokens lack standardization, e.g., [laugh], [pause], [sad] insertions show inconsistent performance, requiring manual parameter tuning&lt;/li>
&lt;li>Token combination effects are complex, multiple control tokens combined may produce unexpected speech effects (such as rhythm disorder)&lt;/li>
&lt;/ul>
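&lt;p>A common engineering guard against this instability is a thin validation layer in front of the model that whitelists known control tokens and caps how many may be combined; the token names follow the examples above, while the whitelist and cap are illustrative:&lt;/p>
&lt;pre>&lt;code class="language-python">import re

KNOWN = {"laugh", "pause", "sad"}

def sanitize(text, max_controls=2):
    kept = 0
    def repl(match):
        nonlocal kept
        name = match.group(1)
        if name in KNOWN and kept != max_controls:
            kept += 1
            return match.group(0)
        return ""  # drop unknown or excess control tokens
    return re.sub(r"\[(\w+)\]", repl, text)

print(sanitize("Hello [laugh] world [dance] again [pause] [sad]"))
&lt;/code>&lt;/pre>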
&lt;h2 id="4-chatterbox-multimodule-fusion-model">4. Chatterbox: Multi-Module Fusion Model&lt;/h2>
&lt;h3 id="41-architecture-design">4.1 Architecture Design&lt;/h3>
&lt;p>Chatterbox adopts a multi-module fusion design approach, combining various advanced technologies:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text] --&amp;gt; B[Semantic token encoding]
B --&amp;gt; C[s3gen generates speech tokens]
C --&amp;gt; D[cosyvoice decoding]
D --&amp;gt; E[HiFi-GAN]
E --&amp;gt; F[Audio output]
&lt;/code>&lt;/pre>
&lt;p>Main modules and their functions:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Module&lt;/th>
&lt;th>Algorithm Approach&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Text Encoder (LLM)&lt;/td>
&lt;td>Uses language model (like LLaMA) to encode text&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>s3gen (Speech Semantic Sequence Generator)&lt;/td>
&lt;td>Mimics VALL-E concept, predicts discrete speech tokens&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>t3_cfg (TTS Config)&lt;/td>
&lt;td>Model structure definition, including vocoder type, tokenizer configuration, etc.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>CosyVoice (Decoder)&lt;/td>
&lt;td>Non-autoregressive decoder&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>HiFi-GAN (Vocoder)&lt;/td>
&lt;td>Convolutional + discriminator generator network&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="42-technical-advantages">4.2 Technical Advantages&lt;/h3>
&lt;p>Chatterbox provides solutions to multiple issues in traditional TTS models:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target Issue&lt;/th>
&lt;th>Chatterbox's Strategy&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Difficult prosody control&lt;/td>
&lt;td>Inserts prosody tokens for expression control, no need for additional labels or gating models&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Text and speech structure separation&lt;/td>
&lt;td>Uses discrete speech tokens to connect to unified token pipeline, enhancing upstream-downstream coordination&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Poor multilingual support&lt;/td>
&lt;td>Supports native Chinese-English mixed input, unified token layer expression structure&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Lack of context/dialogue support&lt;/td>
&lt;td>Integrates LLM output token sequences, laying foundation for dialogue speech framework&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="43-limitation-analysis">4.3 Limitation Analysis&lt;/h3>
&lt;p>Chatterbox has innovations in multi-module fusion but also faces some practical application challenges:&lt;/p>
&lt;h4 id="431-intermediate-tokens-lack-transparency">4.3.1 Intermediate Tokens Lack Transparency&lt;/h4>
&lt;ul>
&lt;li>s3gen's speech tokens lack clear interpretability, not conducive to later debugging and control of tone, emotion, and other attributes&lt;/li>
&lt;/ul>
&lt;h4 id="432-insufficient-context-management-capability">4.3.2 Insufficient Context Management Capability&lt;/h4>
&lt;ul>
&lt;li>The current design is geared toward single-turn inference and does not support long-dialogue caching, making it difficult to use in multi-turn voice dialogue agent scenarios&lt;/li>
&lt;/ul>
&lt;h4 id="433-long-chain-dependent-on-multiple-modules">4.3.3 Long Chain, Dependent on Multiple Modules&lt;/h4>
&lt;ul>
&lt;li>The multi-module combination (LLM + s3gen + CosyVoice + vocoder) reduces overall robustness and is difficult to optimize end to end&lt;/li>
&lt;/ul>
&lt;h2 id="5-dia-lightweight-crossplatform-tts">5. Dia: Lightweight Cross-Platform TTS&lt;/h2>
&lt;h3 id="51-architecture-design">5.1 Architecture Design&lt;/h3>
&lt;p>Dia adopts a lightweight design suitable for cross-platform deployment:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text] --&amp;gt; B[Tokenizer]
B --&amp;gt; C[Text Encoder - GPT-style]
C --&amp;gt; D[Prosody Module]
D --&amp;gt; E[Acoustic Decoder - generates speech tokens]
E --&amp;gt; F{Vocoder}
F --&amp;gt;|HiFi-GAN| G[Audio]
F --&amp;gt;|SNAC| G
&lt;/code>&lt;/pre>
&lt;p>Main modules and their functions:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Module&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Text Encoder&lt;/td>
&lt;td>Mostly GPT-style structures, modeling input text; captures context semantics and intonation cues&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Prosody Module&lt;/td>
&lt;td>Controls tone, rhythm, emotional state (possibly embedding + classifier)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Decoder&lt;/td>
&lt;td>Maps encoded semantics to acoustic tokens (possibly codec representation or Mel features)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Vocoder&lt;/td>
&lt;td>Commonly uses HiFi-GAN, converts acoustic tokens to playable audio (.wav or .mp3)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="52-technical-advantages">5.2 Technical Advantages&lt;/h3>
&lt;p>Dia provides solutions to multiple issues in TTS deployment and cross-platform applications:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target Issue&lt;/th>
&lt;th>dia-gguf's Strategy&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Lack of natural dialogue intonation&lt;/td>
&lt;td>Introduces prosody tokens (like &amp;lt;laugh&amp;gt;, &amp;lt;pause&amp;gt;, etc.) to express tonal changes, building dialogue-aware pronunciation style&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>High inference threshold, complex deployment&lt;/td>
&lt;td>Through GGUF format encapsulation + multi-level quantization (Q2/Q4/Q6/F16), supports offline running on CPU, no need for specialized GPU&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Fragmented model deployment formats&lt;/td>
&lt;td>Uses GGUF standard format to encapsulate model parameters and structure information, compatible with TTS.cpp/gguf-connector and other frameworks, achieving cross-platform operation&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
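&lt;p>The quantization levels trade file size for fidelity roughly in proportion to bits per weight; a back-of-envelope sketch (the nominal bits-per-weight values are approximate, and real GGUF files add metadata and mixed-precision blocks):&lt;/p>
&lt;pre>&lt;code class="language-python">BITS = {"Q2": 2.5, "Q4": 4.5, "Q6": 6.5, "F16": 16}  # nominal bits per weight

def approx_size_mb(n_params, level):
    return n_params * BITS[level] / 8 / 1e6

n = 1.6e9  # illustrative parameter count
for level in ("Q2", "Q4", "Q6", "F16"):
    print(f"{level}: ~{approx_size_mb(n, level):,.0f} MB")
&lt;/code>&lt;/pre>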
&lt;h3 id="53-limitation-analysis">5.3 Limitation Analysis&lt;/h3>
&lt;p>Dia has advantages in lightweight and cross-platform deployment but also faces some practical application challenges:&lt;/p>
&lt;h4 id="531-acoustic-decoder-may-become-a-bottleneck">5.3.1 Acoustic Decoder May Become a Bottleneck&lt;/h4>
&lt;ul>
&lt;li>If using high-fidelity decoders (such as VQ-VAE or GAN-based vocoders), inference phase efficiency depends on the vocoder itself&lt;/li>
&lt;li>The current gguf-connector is mainly implemented in C++ and is not as efficient as GPU-side HiFi-GAN&lt;/li>
&lt;/ul>
&lt;h4 id="532-lacks-flexible-style-transfer-mechanism">5.3.2 Lacks Flexible Style Transfer Mechanism&lt;/h4>
&lt;ul>
&lt;li>Current version mainly targets single dialogue style, does not support style transfer or emotion control in multi-speaker, multi-emotion scenarios&lt;/li>
&lt;li>No encoder-decoder separation structure, limiting style transfer scalability&lt;/li>
&lt;/ul>
&lt;h4 id="533-clear-tradeoff-between-precision-and-naturalness">5.3.3 Clear Trade-off Between Precision and Naturalness&lt;/h4>
&lt;ul>
&lt;li>Low-bit quantization (like Q2) is fast for inference but prone to speech fragmentation and detail loss, not suitable for high-fidelity scenarios&lt;/li>
&lt;li>If deployed in voice assistant or announcer systems, user experience will decline for audio quality-sensitive users&lt;/li>
&lt;/ul>
&lt;h2 id="6-orpheus-llmbased-endtoend-tts">6. Orpheus: LLM-Based End-to-End TTS&lt;/h2>
&lt;h3 id="61-architecture-design">6.1 Architecture Design&lt;/h3>
&lt;p>Orpheus adopts an end-to-end design approach based on LLMs:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text Prompt + Emotion tokens] --&amp;gt; B[LLaMA 3B - finetune]
B --&amp;gt; C[Generate audio tokens - discretized speech representation]
C --&amp;gt; D[SNAC decoder]
D --&amp;gt; E[Reconstruct audio waveform]
&lt;/code>&lt;/pre>
&lt;p>Main modules and their functions:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>LLaMA 3B Structure&lt;/strong>: The foundation is Meta's Transformer architecture, with Orpheus performing SFT (Supervised Finetuning) to learn audio token prediction&lt;/li>
&lt;li>&lt;strong>Tokenization&lt;/strong>: Uses a neural audio codec (SNAC) to discretize audio into tokens (similar to a VQ-VAE), forming the training targets&lt;/li>
&lt;li>&lt;strong>Output Form&lt;/strong>: The model's final stage predicts multiple audio token sequences (token-class level autoregression), which can be concatenated to reconstruct speech&lt;/li>
&lt;li>&lt;strong>Decoder&lt;/strong>: Uses SNAC (Multi-Scale Neural Audio Codec) to decode audio tokens into the final waveform&lt;/li>
&lt;/ul>
&lt;h4 id="snac-decoder-in-detail">SNAC Decoder in Detail&lt;/h4>
&lt;p>SNAC (Multi-Scale Neural Audio Codec) is a neural audio codec used in TTS models to convert audio codes into actual audio waveforms.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Orpheus audio codes] --&amp;gt; B[Code redistribution]
B --&amp;gt; C[SNAC three-layer decoding]
C --&amp;gt; D[PCM audio waveform]
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Basic Concept&lt;/strong>&lt;/p>
&lt;p>SNAC is a neural network audio decoder specifically designed for TTS models. It receives discrete audio codes generated by TTS models (such as Orpheus) and converts these codes into high-quality 24kHz audio waveforms. SNAC's main feature is its ability to efficiently process hierarchically encoded audio information and generate natural, fluent speech.&lt;/p>
&lt;p>&lt;strong>Technical Architecture&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Layered Structure&lt;/strong>: SNAC uses a 3-layer structure to process audio information, while the Orpheus model generates 7-layer audio codes. This requires code redistribution.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Code Redistribution Mapping&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>SNAC layer 0 receives Orpheus layer 0 codes&lt;/li>
&lt;li>SNAC layer 1 receives Orpheus layers 1 and 4 codes (interleaved)&lt;/li>
&lt;li>SNAC layer 2 receives Orpheus layers 2, 3, 5, and 6 codes (interleaved)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Decoding Process&lt;/strong>:&lt;/p>
&lt;pre>&lt;code>Orpheus audio codes → Code redistribution → SNAC three-layer decoding → PCM audio waveform
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ol>
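&lt;p>The redistribution step is mechanical and can be sketched directly from the mapping above: each frame of 7 Orpheus codes contributes 1 code to SNAC layer 0, 2 codes (Orpheus layers 1 and 4) to layer 1, and 4 codes (layers 2, 3, 5, 6) to layer 2:&lt;/p>
&lt;pre>&lt;code class="language-python">ORPHEUS_N_LAYERS = 7

def redistribute(codes):
    # assumes codes were already padded to a multiple of 7
    assert len(codes) % ORPHEUS_N_LAYERS == 0
    l0, l1, l2 = [], [], []
    for i in range(0, len(codes), ORPHEUS_N_LAYERS):
        frame = codes[i:i + ORPHEUS_N_LAYERS]
        l0.append(frame[0])
        l1.extend([frame[1], frame[4]])
        l2.extend([frame[2], frame[3], frame[5], frame[6]])
    return l0, l1, l2

print(redistribute([0, 1, 2, 3, 4, 5, 6]))  # ([0], [1, 4], [2, 3, 5, 6])
&lt;/code>&lt;/pre>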
&lt;p>&lt;strong>Implementation Methods&lt;/strong>&lt;/p>
&lt;p>SNAC has two main implementation methods:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>PyTorch Implementation&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Uses the original PyTorch model for decoding&lt;/li>
&lt;li>Suitable for environments without ONNX support&lt;/li>
&lt;li>Relatively slower decoding speed&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>ONNX Optimized Implementation&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Uses pre-trained models in ONNX (Open Neural Network Exchange) format&lt;/li>
&lt;li>Supports hardware acceleration (CUDA or CPU)&lt;/li>
&lt;li>Provides quantized versions, reducing model size and improving inference speed&lt;/li>
&lt;li>Better real-time performance (higher RTF - Real Time Factor)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Code Processing Flow&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Code Validation&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Checks if codes are within valid range&lt;/li>
&lt;li>Ensures the number of codes is a multiple of ORPHEUS_N_LAYERS (7)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Code Padding&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>If the number of codes is not a multiple of 7, automatic padding is applied&lt;/li>
&lt;li>Uses the last valid code or default code for padding&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Code Redistribution&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Remaps 7-layer Orpheus codes to 3-layer SNAC codes&lt;/li>
&lt;li>Follows specific mapping rules&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Decoding&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Uses the SNAC model (PyTorch or ONNX) to convert redistributed codes into audio waveforms&lt;/li>
&lt;li>Outputs 24kHz sample rate mono PCM audio data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
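The validation, padding, and redistribution steps above can be sketched in plain Python. The interleaving used here (frame code 0 to layer 1; codes 1 and 4 to layer 2; codes 2, 3, 5, 6 to layer 3) is an illustrative assumption consistent with a 1:2:4 layer ratio, not a confirmed Orpheus layout.

```python
ORPHEUS_N_LAYERS = 7  # codes emitted per audio frame

def redistribute(codes, codebook_size=4096):
    """Validate, pad, and remap flat Orpheus codes onto 3 SNAC layers.

    The interleaving below (c0 -> layer 1; c1, c4 -> layer 2;
    c2, c3, c5, c6 -> layer 3) is an illustrative assumption."""
    # 1. Validation: every code must fall inside the codebook range
    if any(not 0 <= c < codebook_size for c in codes):
        raise ValueError("code out of valid range")
    # 2. Padding: repeat the last valid code up to a multiple of 7
    rem = len(codes) % ORPHEUS_N_LAYERS
    if rem:
        codes = codes + [codes[-1]] * (ORPHEUS_N_LAYERS - rem)
    # 3. Redistribution: each 7-code frame feeds the 3 SNAC layers
    layer1, layer2, layer3 = [], [], []
    for i in range(0, len(codes), ORPHEUS_N_LAYERS):
        f = codes[i:i + ORPHEUS_N_LAYERS]
        layer1.append(f[0])
        layer2.extend([f[1], f[4]])
        layer3.extend([f[2], f[3], f[5], f[6]])
    return layer1, layer2, layer3

print(redistribute(list(range(7))))  # ([0], [1, 4], [2, 3, 5, 6])
```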
&lt;p>&lt;strong>Role in TTS Models&lt;/strong>&lt;/p>
&lt;p>SNAC plays a key role in the entire TTS workflow:&lt;/p>
&lt;ol>
&lt;li>The TTS model (Orpheus) generates audio codes&lt;/li>
&lt;li>The SNAC decoder converts these codes into actual audio waveforms&lt;/li>
&lt;li>The audio waveform undergoes post-processing (such as fade in/out, gain adjustment, watermarking, etc.)&lt;/li>
&lt;li>The final audio is encoded in Opus format and transmitted to the client via HTTP or WebSocket&lt;/li>
&lt;/ol>
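As an illustration of the post-processing step, here is a minimal fade-in/fade-out sketch over a PCM sample buffer; a real pipeline would also apply gain adjustment and watermarking, and would typically use smoother ramps than a linear one.

```python
def apply_fades(samples, sample_rate=24000, fade_ms=10.0):
    """Apply a linear fade-in and fade-out to a mono PCM buffer.

    Minimal sketch of one post-processing step; production pipelines
    often prefer raised-cosine ramps to avoid audible discontinuities."""
    n = min(int(sample_rate * fade_ms / 1000), len(samples) // 2)
    out = list(samples)
    if n == 0:
        return out
    for i in range(n):
        g = i / n              # gain ramp 0.0 -> 1.0
        out[i] *= g            # fade in at the start
        out[-1 - i] *= g       # mirrored fade out at the end
    return out

faded = apply_fades([1.0] * 10, sample_rate=1000, fade_ms=3000)
print(faded[0], faded[-1])  # 0.0 0.0
```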
&lt;p>SNAC's efficient decoding capability is one of the key technologies for achieving low-latency, high-quality streaming TTS, enabling the model to respond to user requests in real time.&lt;/p>
&lt;h3 id="62-technical-advantages">6.2 Technical Advantages&lt;/h3>
&lt;p>Orpheus provides innovative solutions to multiple issues in TTS models:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Issue&lt;/th>
&lt;th>Solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Complex multi-module deployment&lt;/td>
&lt;td>Integrates TTS into LLM, builds single-model structure, directly generates audio tokens&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>High inference latency&lt;/td>
&lt;td>Uses low-bit quantization (Q4_K_M), combined with GGUF format, accelerating inference&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Uncontrollable emotions&lt;/td>
&lt;td>Introduces &amp;lt;laugh&amp;gt;, &amp;lt;sigh&amp;gt;, &amp;lt;giggle&amp;gt; and other prompt control tokens&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Cloud service dependency&lt;/td>
&lt;td>Can run locally on llama.cpp/LM Studio, no need for cloud inference&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Separation from LLM&lt;/td>
&lt;td>Compatible with LLM dialogue structure, can directly generate speech responses in multimodal dialogue&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="63-limitation-analysis">6.3 Limitation Analysis&lt;/h3>
&lt;p>Orpheus has innovations in end-to-end design but also faces some practical application challenges:&lt;/p>
&lt;h4 id="631-emotion-control-lacks-structural-modeling">6.3.1 Emotion Control Lacks Structural Modeling&lt;/h4>
&lt;ul>
&lt;li>Emotions are only controlled through &amp;ldquo;prompt token&amp;rdquo; insertion, lacking systematic emotion modeling modules&lt;/li>
&lt;li>The same &amp;lt;laugh&amp;gt; token can behave inconsistently and occasionally have no effect (unstable prompt injection)&lt;/li>
&lt;/ul>
&lt;h4 id="632-strong-decoder-binding">6.3.2 Strong Decoder Binding&lt;/h4>
&lt;ul>
&lt;li>Using SNAC decoder means final sound quality is tightly bound to the audio codec, cannot be freely replaced with alternatives like HiFi-GAN&lt;/li>
&lt;li>If the codec produces artifacts, the entire model struggles to independently optimize the decoding module&lt;/li>
&lt;/ul>
&lt;h4 id="633-difficult-customization">6.3.3 Difficult Customization&lt;/h4>
&lt;ul>
&lt;li>Does not support zero-shot speaker cloning&lt;/li>
&lt;li>Generating user-customized voices still requires &amp;ldquo;fine-tuning,&amp;rdquo; which raises a training barrier for end users&lt;/li>
&lt;/ul>
&lt;h2 id="7-outetts-gguf-format-optimized-tts">7. OuteTTS: GGUF Format Optimized TTS&lt;/h2>
&lt;h3 id="71-architecture-design">7.1 Architecture Design&lt;/h3>
&lt;p>OuteTTS adopts an optimized design suitable for GGUF format deployment:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Prompt input - text and control information] --&amp;gt; B[Prompt Encoder - semantic modeling]
B --&amp;gt; C[Alignment module - automatic position alignment]
C --&amp;gt; D[Codebook Decoder - generates dual codebook tokens]
D --&amp;gt; E[HiFi-GAN Vocoder - restores to speech waveform]
E --&amp;gt; F[Output audio - wav or mp3]
subgraph Control Information
A1[Tone pause emotion tokens]
A2[Pitch duration speaker ID]
end
A1 --&amp;gt; A
A2 --&amp;gt; A
&lt;/code>&lt;/pre>
&lt;p>Main modules and their functions:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Module&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Prompt Encoder&lt;/td>
&lt;td>Input is natural language prompt (with context, speaker, timbre information), similar to instruction-guided model generating speech content&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Alignment Module (internal modeling)&lt;/td>
&lt;td>Embedded alignment capability, no need for external alignment tool, builds position-to-token mapping based on transformer&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Codebook Decoder&lt;/td>
&lt;td>Maps text to dual codebook tokens under DAC encoder (e.g., codec-C1, codec-C2), as latent representation of audio content&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Vocoder (HiFi-GAN)&lt;/td>
&lt;td>Maps DAC codebook or speech features to final playable audio (supports .wav), deployed on CPU/GPU&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="dac-decoder-in-detail">DAC Decoder in Detail&lt;/h4>
&lt;p>DAC (Descript Audio Codec) is a discrete neural audio codec whose primary role in these TTS models is to convert the audio codes generated by OuteTTS into actual audio waveforms. It is an efficient neural-network audio decoder designed for high-quality speech synthesis.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[OuteTTS audio codes] --&amp;gt; B[DAC decoding]
B --&amp;gt; C[PCM audio waveform]
A --&amp;gt; |c1_codes| B
A --&amp;gt; |c2_codes| B
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Technical Architecture&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Encoding Structure&lt;/strong>: DAC uses a 2-layer encoding structure (dual codebook), with each codebook having a size of 1024, which differs from SNAC's 3-layer structure.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Code Format&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>DAC uses two sets of codes: c1_codes and c2_codes&lt;/li>
&lt;li>These two sets of codes have the same length and correspond one-to-one&lt;/li>
&lt;li>Each code has a value range of 0-1023&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Decoding Process&lt;/strong>:&lt;/p>
&lt;pre>&lt;code>OuteTTS audio codes(c1_codes, c2_codes) → DAC decoding → PCM audio waveform
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Sample Rate&lt;/strong>: DAC generates 24kHz sample rate audio, the same as SNAC&lt;/p>
&lt;/li>
&lt;/ol>
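Under the constraints just described, preparing OuteTTS codes for the DAC decoder reduces to validation and stacking; `prepare_dac_input` is a hypothetical helper written for illustration, not part of the DAC API.

```python
def prepare_dac_input(c1_codes, c2_codes):
    """Validate and stack OuteTTS dual-codebook codes for DAC decoding.

    Hypothetical helper reflecting the stated constraints: equal-length
    code sets with values in 0..1023, stacked as [batch, 2, frames].
    No cross-layer redistribution is needed, unlike the 7-to-3
    remapping SNAC requires."""
    if len(c1_codes) != len(c2_codes):
        raise ValueError("c1 and c2 code sets must have the same length")
    for c in list(c1_codes) + list(c2_codes):
        if not 0 <= c <= 1023:
            raise ValueError("code outside the 1024-entry codebook")
    return [[list(c1_codes), list(c2_codes)]]

batch = prepare_dac_input([0, 512, 1023], [1, 2, 3])
print(len(batch), len(batch[0]), len(batch[0][0]))  # 1 2 3
```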
&lt;p>&lt;strong>Implementation Methods&lt;/strong>&lt;/p>
&lt;p>Similar to SNAC, DAC also has two implementation methods:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>PyTorch Implementation&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Uses the original PyTorch model for decoding&lt;/li>
&lt;li>Suitable for environments without ONNX support&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>ONNX Optimized Implementation&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Uses pre-trained models in ONNX format&lt;/li>
&lt;li>Supports hardware acceleration (CUDA or CPU)&lt;/li>
&lt;li>Provides quantized versions, reducing model size and improving inference speed&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>DAC's Advanced Features&lt;/strong>&lt;/p>
&lt;p>The DAC decoder implements several advanced features that make it particularly suitable for streaming TTS applications:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Batch Processing Optimization&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Adaptive batch size (8-64 frames)&lt;/li>
&lt;li>Dynamically adjusts batch size based on performance history&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Streaming Processing&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Supports batch decoding and streaming output&lt;/li>
&lt;li>Adaptively adjusts parameters based on network quality&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Audio Effect Processing&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Supports fade in/out effects&lt;/li>
&lt;li>Supports audio gain adjustment&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
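A hypothetical policy for the adaptive batch sizing described above might look like the following; the 50 ms budget and the doubling/halving rule are illustrative assumptions, only the 8-64 frame range comes from the description.

```python
def next_batch_size(current: int, last_decode_ms: float,
                    target_ms: float = 50.0,
                    lo: int = 8, hi: int = 64) -> int:
    """Adjust the decode batch size from recent timing, clamped to 8-64.

    Illustrative policy: grow when decoding is comfortably under the
    latency budget, shrink when it exceeds the budget."""
    if last_decode_ms < 0.5 * target_ms:
        current *= 2          # plenty of headroom: batch more frames
    elif last_decode_ms > target_ms:
        current //= 2         # over budget: reduce per-call latency
    return max(lo, min(hi, current))

print(next_batch_size(16, 10.0))  # fast decode -> 32
print(next_batch_size(16, 80.0))  # slow decode -> 8
```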
&lt;h4 id="comparison-between-snac-and-dac">Comparison Between SNAC and DAC&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Feature&lt;/th>
&lt;th>DAC&lt;/th>
&lt;th>SNAC&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Encoding Layers&lt;/td>
&lt;td>2 layers&lt;/td>
&lt;td>3 layers&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Code Organization&lt;/td>
&lt;td>Two parallel code sets&lt;/td>
&lt;td>Three hierarchical code layers&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Codebook Size&lt;/td>
&lt;td>1024&lt;/td>
&lt;td>4096&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Input Format&lt;/td>
&lt;td>c1_codes, c2_codes&lt;/td>
&lt;td>7-layer Orpheus codes redistributed to 3 layers&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>Applicable Models&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>DAC&lt;/strong>: Designed specifically for OuteTTS-type models, processes dual codebook format audio codes&lt;/li>
&lt;li>&lt;strong>SNAC&lt;/strong>: Designed specifically for Orpheus-type models, processes 7-layer encoded format audio codes&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Performance Characteristics&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>DAC&lt;/strong>: More focused on streaming processing and low latency, with more adaptive optimizations&lt;/li>
&lt;li>&lt;strong>SNAC&lt;/strong>: More focused on audio quality and accurate code redistribution&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Code Processing Methods&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>DAC&lt;/strong>: Directly processes two sets of codes, no complex redistribution needed&lt;/li>
&lt;li>&lt;strong>SNAC&lt;/strong>: Needs to remap 7-layer Orpheus codes to a 3-layer structure&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Why Different Models Use Different Decoders&lt;/strong>&lt;/p>
&lt;p>OuteTTS and Orpheus use different decoders primarily for the following reasons:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Model Design Differences&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>OuteTTS model was designed with DAC compatibility in mind, directly outputting DAC format dual codebook codes&lt;/li>
&lt;li>Orpheus model is based on a different architecture, outputting 7-layer encoding, requiring SNAC for decoding&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Encoding Format Incompatibility&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>DAC expects to receive two parallel code sets (c1_codes, c2_codes)&lt;/li>
&lt;li>SNAC expects to receive redistributed 3-layer codes, which come from Orpheus's 7-layer output&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Different Optimization Directions&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>OuteTTS+DAC combination focuses more on streaming processing and low latency&lt;/li>
&lt;li>Orpheus+SNAC combination focuses more on audio quality and multi-level encoding&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="72-technical-advantages">7.2 Technical Advantages&lt;/h3>
&lt;p>OuteTTS provides innovative solutions to multiple issues in TTS models:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Target Issue&lt;/th>
&lt;th>Llama-OuteTTS's Strategy&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Multilingual TTS without preprocessing&lt;/td>
&lt;td>Directly supports Chinese, English, Japanese, Arabic and other languages, no need for pinyin conversion or forced spacing&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Difficult alignment, requires external CTC&lt;/td>
&lt;td>Model has built-in alignment mechanism, directly aligns text to generated tokens, no need for external alignment tools&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Audio quality vs. throughput conflict&lt;/td>
&lt;td>DAC + dual codebook improves audio quality; generates 150 tokens per second, speed significantly improved compared to similar diffusion models&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Complex model invocation&lt;/td>
&lt;td>GGUF format encapsulated structure + llama.cpp support, more streamlined local deployment&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="73-limitation-analysis">7.3 Limitation Analysis&lt;/h3>
&lt;p>OuteTTS has innovations in GGUF format optimization but also faces some practical application challenges:&lt;/p>
&lt;h4 id="731-audio-encoding-bottleneck">7.3.1 Audio Encoding Bottleneck&lt;/h4>
&lt;ul>
&lt;li>Currently mainly uses DAC-based dual codebook expression, which improves audio quality, but:
&lt;ul>
&lt;li>Decoder (HiFi-GAN) remains a bottleneck, especially with inference latency on edge devices&lt;/li>
&lt;li>If using more complex models (like VQ-VAE) in the future, their parallelism and efficient inference will become more problematic&lt;/li>
&lt;li>Current gguf-connector is C++-based, does not yet support native mobile deployment (like Android/iOS TensorDelegate)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="732-parallelism-and-context-dependency">7.3.2 Parallelism and Context Dependency&lt;/h4>
&lt;ul>
&lt;li>Model strongly depends on context memory (such as token temporal dependencies), during inference:
&lt;ul>
&lt;li>Cannot be parallelized the way non-autoregressive diffusion models can; inference remains serially dominated&lt;/li>
&lt;li>Sampling stage requires setting repetition penalty window (default 64 tokens)&lt;/li>
&lt;li>High context length (e.g., 8192) is supported but significantly increases memory cost during deployment&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
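The repetition penalty window mentioned above can be sketched as follows; `penalized_logits` is a simplified stand-in for the sampler's real logit processing, using the common convention of dividing positive logits and multiplying negative ones.

```python
def penalized_logits(logits, history, window=64, penalty=1.3):
    """Apply a repetition penalty over the last `window` tokens.

    Sketch of the sampling-stage mechanism (default 64-token window);
    the penalty value 1.3 is an illustrative choice."""
    recent = set(history[-window:])
    out = dict(logits)
    for tok in recent:
        if tok in out:
            # common convention: divide positive logits, multiply negative
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

scores = penalized_logits({7: 2.6, 8: -1.0, 9: 0.5}, history=[7, 8])
print(scores[7] < 2.6, scores[8] < -1.0, scores[9] == 0.5)  # True True True
```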
&lt;h4 id="733-insufficient-style-transfer-and-personality-control">7.3.3 Insufficient Style Transfer and Personality Control&lt;/h4>
&lt;ul>
&lt;li>Current version mainly optimized for &amp;ldquo;single person + tone control,&amp;rdquo; style transfer mechanism not sophisticated enough:
&lt;ul>
&lt;li>Lacks speaker embedding-based control mechanism&lt;/li>
&lt;li>Multi-emotion, multi-style still requires prompt fine-tuning rather than explicit token control&lt;/li>
&lt;li>Future needs to introduce speaker encoder or style/emotion vectors&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="8-f5tts-diffusion-model-optimized-tts">8. F5-TTS: Diffusion Model Optimized TTS&lt;/h2>
&lt;h3 id="81-architecture-design">8.1 Architecture Design&lt;/h3>
&lt;p>F5-TTS adopts an innovative design based on diffusion models:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text - character sequence] --&amp;gt; B[ConvNeXt text encoder]
B --&amp;gt; C[Flow Matching module]
C --&amp;gt; D[DiT diffusion Transformer - non-autoregressive generation]
D --&amp;gt; E[Speech Token]
E --&amp;gt; F[Vocoder - Vocos or BigVGAN]
F --&amp;gt; G[Waveform audio output]
&lt;/code>&lt;/pre>
&lt;p>Main modules and their functions:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Module&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>ConvNeXt text encoder&lt;/td>
&lt;td>Used to extract global features of text, with parallel convolution capability&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Flow Matching&lt;/td>
&lt;td>Used in training process to learn noise → speech token mapping path&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>DiT (Diffusion Transformer)&lt;/td>
&lt;td>Core synthesizer, parallel speech token generator based on diffusion modeling&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Sway Sampling&lt;/td>
&lt;td>Optimizes sampling path during inference, reducing ineffective diffusion steps, improving speed and quality&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Vocoder&lt;/td>
&lt;td>Uses BigVGAN or Vocos to restore speech tokens to waveform audio&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
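The Flow Matching entry in the table can be made concrete with a toy example: sample a point on the straight-line path between noise and data, and regress the constant velocity along that path. This is a generic conditional flow-matching sketch, not F5-TTS's exact parameterization.

```python
def flow_matching_pair(x0, x1, t):
    """Build one conditional flow-matching training example.

    Uses the straight-line path x_t = (1 - t) * x0 + t * x1 with
    constant velocity target v = x1 - x0; the network is trained to
    predict v from (x_t, t, text)."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v = [b - a for a, b in zip(x0, x1)]
    return xt, v

xt, v = flow_matching_pair([0.0, 1.0], [1.0, 3.0], t=0.5)
print(xt, v)  # [0.5, 2.0] [1.0, 2.0]
```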
&lt;h3 id="82-technical-advantages">8.2 Technical Advantages&lt;/h3>
&lt;p>F5-TTS provides innovative solutions to multiple issues in TTS models:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Issue&lt;/th>
&lt;th>F5-TTS's Solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Phoneme alignment, duration dependency&lt;/td>
&lt;td>Input characters are simply padded to fill the target length, removing the dependence on a duration predictor or external aligner&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Unnatural speech quality, weak cloning ability&lt;/td>
&lt;td>Uses diffusion-based speech token synthesis, with sway sampling technology to enhance naturalness&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="83-limitation-analysis">8.3 Limitation Analysis&lt;/h3>
&lt;p>F5-TTS has innovations in diffusion model optimization but also faces some practical application challenges:&lt;/p>
&lt;h4 id="831-inference-requires-multistep-sampling">8.3.1 Inference Requires Multi-Step Sampling&lt;/h4>
&lt;p>Although sway sampling shortens the sampling path, inference still has to run a multi-step diffusion process (about 20 steps)&lt;/p>
&lt;h4 id="832-dependency-on-vocoder">8.3.2 Dependency on Vocoder&lt;/h4>
&lt;p>Final speech quality depends heavily on the vocoder (e.g., Vocos, BigVGAN), which must be deployed separately&lt;/p>
&lt;h4 id="833-weak-audio-length-control">8.3.3 Weak Audio Length Control&lt;/h4>
&lt;p>With no explicit duration predictor, speaking-rate control requires additional prompts or sampling techniques&lt;/p>
&lt;h4 id="834-license-restrictions">8.3.4 License Restrictions&lt;/h4>
&lt;p>Released under the CC-BY-NC-4.0 license, so it cannot be used commercially without separate authorization&lt;/p>
&lt;h2 id="9-indextts-multimodal-conditional-tts">9. Index-TTS: Multimodal Conditional TTS&lt;/h2>
&lt;h3 id="91-architecture-design">9.1 Architecture Design&lt;/h3>
&lt;p>Index-TTS adopts an innovative design with multimodal conditional control:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text input] --&amp;gt; B[Pinyin-enhanced text encoder]
B --&amp;gt; C[GPT-style language model - Decoder-only]
C --&amp;gt; D[Predict speech token sequence]
D --&amp;gt; E[BigVGAN2 - decode to waveform]
F[Reference speech] --&amp;gt; G[Conformer conditional encoder]
G --&amp;gt; C
&lt;/code>&lt;/pre>
&lt;p>Main modules and their functions:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Module Name&lt;/th>
&lt;th>Function Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Text encoder (character + pinyin)&lt;/td>
&lt;td>Chinese supports pinyin input, English directly models characters - Can accurately capture pronunciation features, solving complex reading problems like polyphonic characters and neutral tones&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Neural audio tokenizer&lt;/td>
&lt;td>Uses FSQ encoder to convert audio to discrete tokens - Each frame (25Hz) expressed with multiple codebooks, token utilization rate reaches 98%, far higher than VQ&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>LLM-style Decoder (GPT structure)&lt;/td>
&lt;td>Decoder-only Transformer architecture - Conditional inputs include text tokens and reference audio - Supports multi-speaker migration and zero-shot speech generation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Conditional Conformer encoder&lt;/td>
&lt;td>Encodes implicit features like timbre, rhythm, prosody in reference audio - Provides stable control vector input to GPT, enhancing stability and timbre restoration&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>BigVGAN2&lt;/td>
&lt;td>Decodes final audio waveform - Balances high fidelity and real-time synthesis performance&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
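To see why FSQ reaches such high codebook utilization, here is a minimal per-dimension quantizer; the level counts are made up for illustration, since Index-TTS's actual FSQ configuration is not specified above. Because every combination of per-dimension levels is a usable code, there are no dead codebook entries of the kind VQ suffers from.

```python
def fsq_quantize(z, levels):
    """Finite Scalar Quantization of one latent vector.

    Each dimension is clamped to [-1, 1] and rounded to one of
    `levels[i]` evenly spaced values; the level counts here are
    illustrative assumptions."""
    codes = []
    for x, L in zip(z, levels):
        x = max(-1.0, min(1.0, x))                 # bound the latent
        half = (L - 1) / 2
        codes.append(int(round(x * half + half)))  # index in 0..L-1
    return codes

print(fsq_quantize([0.9, -1.0, 0.0], [8, 5, 5]))  # [7, 0, 2]
```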
&lt;h3 id="92-technical-advantages">9.2 Technical Advantages&lt;/h3>
&lt;p>Index-TTS provides innovative solutions to multiple issues in TTS models:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Issue&lt;/th>
&lt;th>IndexTTS's Solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Polyphonic character control&lt;/td>
&lt;td>Character+pinyin joint modeling, can explicitly specify pronunciation&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Poor speaker consistency&lt;/td>
&lt;td>Introduces Conformer conditional module, uses reference audio to enhance control capability&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Low audio token utilization&lt;/td>
&lt;td>Uses FSQ instead of VQ-VAE, effectively utilizes codebook, enhances expressiveness&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Poor model stability&lt;/td>
&lt;td>Phased training + conditional control, reduces divergence, ensures synthesis quality&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Poor English compatibility&lt;/td>
&lt;td>IndexTTS 1.5 strengthens English token learning, enhances cross-language adaptability&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Slow inference&lt;/td>
&lt;td>GPT decoder + BigVGAN2, balances naturalness and speed, can deploy industrial systems&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="93-limitation-analysis">9.3 Limitation Analysis&lt;/h3>
&lt;p>Index-TTS has innovations in multimodal conditional control but also faces some practical application challenges:&lt;/p>
&lt;h4 id="931-prosody-control-depends-on-reference-audio">9.3.1 Prosody Control Depends on Reference Audio&lt;/h4>
&lt;ul>
&lt;li>Current model's prosody generation mainly relies on implicit guidance from input reference audio
&lt;ul>
&lt;li>Lacks explicit prosody annotation or token control mechanism, cannot manually control pauses, stress, intonation, and other information&lt;/li>
&lt;li>When reference audio is not ideal or style differences are large, prosody transfer effects can easily become unnatural or inconsistent&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Not conducive to template-based large-scale application scenarios (such as customer service, reading) where controllability and stability are needed&lt;/li>
&lt;/ul>
&lt;h4 id="932-generation-uncertainty">9.3.2 Generation Uncertainty&lt;/h4>
&lt;ul>
&lt;li>Uses GPT-style autoregressive generation structure, although speech naturalness is high, there is some uncertainty:
&lt;ul>
&lt;li>The same input in different inference rounds may fluctuate in speech rate, prosody, and slight timbre&lt;/li>
&lt;li>Difficult to completely reproduce generation results, not conducive to audio caching and version management&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>In high-consistency requirement scenarios (such as film post-production, legal synthesis), may affect delivery stability&lt;/li>
&lt;/ul>
&lt;h4 id="933-speaker-migration-not-completely-endtoend">9.3.3 Speaker Migration Not Completely End-to-End&lt;/h4>
&lt;ul>
&lt;li>Current speaker control module still relies on explicit reference audio embedding (such as speaker encoder) as conditional vector input
&lt;ul>
&lt;li>Speaker vectors need external module extraction, not end-to-end integration&lt;/li>
&lt;li>When reference audio quality is low or speaking style varies greatly, cloning effect is unstable&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Does not support completely text-driven speaker specification (such as specifying speaker ID generation), limiting automated deployment flexibility&lt;/li>
&lt;/ul>
&lt;h2 id="10-megatts3-unified-modeling-tts">10. Mega-TTS3: Unified Modeling TTS&lt;/h2>
&lt;h3 id="101-architecture-design">10.1 Architecture Design&lt;/h3>
&lt;p>Mega-TTS3 adopts an innovative design with unified modeling:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text Token - BPE] --&amp;gt; B[Text Encoder - Transformer]
B --&amp;gt; C[Unified Acoustic Model - UAM]
C --&amp;gt; D[Latent Acoustic Token]
subgraph Control Branches
E1[Prosody embedding] --&amp;gt; C
E2[Speaker representation] --&amp;gt; C
E3[Language label] --&amp;gt; C
end
D --&amp;gt; F[Vocoder - HiFi-GAN or FreGAN]
F --&amp;gt; G[Audio output]
&lt;/code>&lt;/pre>
&lt;p>Main modules and their functions:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Module&lt;/th>
&lt;th>Description&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Text Encoder&lt;/td>
&lt;td>Encodes input text tokens into semantic vectors, supports multilingual tokens&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>UAM (Unified Acoustic Model)&lt;/td>
&lt;td>Core module, fuses Text, Prosody, Speaker, Language information, predicts acoustic latent&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Continuous Speaker Modeling&lt;/td>
&lt;td>Models speaker information across time sequence, reducing style drift issues&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Prosody Control Module&lt;/td>
&lt;td>Provides independent prosody controller, can precisely control pauses, rhythm, pitch, etc.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Vocoder&lt;/td>
&lt;td>Finally decodes latent tokens into audio waveforms, using HiFi-GAN / FreGAN&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
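Conceptually, the UAM consumes several aligned conditioning streams at once. The sketch below fuses them by elementwise addition, which is one common scheme; Mega-TTS3's real fusion mechanism may differ (e.g., cross-attention over the control branches).

```python
def fuse_conditions(text, prosody, speaker, language):
    """Fuse aligned conditioning embeddings by elementwise addition.

    One common fusion scheme, shown purely for illustration; the actual
    UAM may combine its branches differently."""
    return [t + p + s + l
            for t, p, s, l in zip(text, prosody, speaker, language)]

fused = fuse_conditions([1.0, 0.0], [0.25, 0.25], [0.0, 0.5], [0.0, 0.25])
print(fused)  # [1.25, 1.0]
```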
&lt;h3 id="102-technical-advantages">10.2 Technical Advantages&lt;/h3>
&lt;p>Mega-TTS3 provides innovative solutions to multiple issues in TTS models:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Issue&lt;/th>
&lt;th>Description&lt;/th>
&lt;th>Mega-TTS3's Solution&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Inconsistent modeling granularity&lt;/td>
&lt;td>Different modules (text, prosody, speech) have inconsistent modeling granularity, causing information fragmentation and style transfer distortion&lt;/td>
&lt;td>Introduces Unified Acoustic Model (UAM), fusing text encoding, prosody information, language labels and audio latent in unified modeling, avoiding staged information loss&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Difficult multi-speaker modeling&lt;/td>
&lt;td>Traditional embedding methods struggle to stably model large numbers of speakers, with insufficient generalization and synthesis consistency&lt;/td>
&lt;td>Proposes Continuous Speaker Embedding, embedding speaker representation as temporal vector into unified modeling process, improving style consistency and transfer stability&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Weak control granularity&lt;/td>
&lt;td>Lacks pluggable independent control mechanisms when controlling emotion, speed, prosody, and other styles&lt;/td>
&lt;td>Designs pluggable control branches (Prosody / Emotion / Language / Speaker Embedding), each control signal independently modeled, can be combined and flexibly plugged in, enhancing control precision&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Cross-language interference&lt;/td>
&lt;td>Sparse language label modeling, multi-language models often interfere with each other, affecting speech quality&lt;/td>
&lt;td>Introduces explicit language label embedding + multilingual shared Transformer parameter mechanism, enhancing language sharing while ensuring language identifiability, alleviating inter-language interference&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="103-limitation-analysis">10.3 Limitation Analysis&lt;/h3>
&lt;p>Mega-TTS3 has innovations in unified modeling but also faces some practical application challenges:&lt;/p>
&lt;h4 id="1031-limited-control-granularity--weak-interpretability">10.3.1 Limited Control Granularity &amp;amp; Weak Interpretability&lt;/h4>
&lt;ul>
&lt;li>Although control dimensions are many (emotion, speed, prosody, etc.), they still rely on end-to-end model implicit modeling:
&lt;ul>
&lt;li>Lacks pluggable independent control modules&lt;/li>
&lt;li>Strong coupling between control variables, difficult to precisely control single dimensions&lt;/li>
&lt;li>Not suitable for &amp;ldquo;controllable interpretable synthesis&amp;rdquo; scenarios oriented toward industrial deployment&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="1032-uneven-multilingual-speech-quality">10.3.2 Uneven Multilingual Speech Quality&lt;/h4>
&lt;ul>
&lt;li>Despite supporting multilingual modeling, actual generation still shows:
&lt;ul>
&lt;li>Heavy dependence on language labels, label errors directly lead to pronunciation disorder&lt;/li>
&lt;li>Inter-language interference issues (such as accent drift in Chinese-English mixed reading)&lt;/li>
&lt;li>Low-resource language generation effects significantly lower than high-resource languages&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="11-summary-and-outlook">11. Summary and Outlook&lt;/h2>
&lt;h3 id="111-modern-tts-model-architecture-trends">11.1 Modern TTS Model Architecture Trends&lt;/h3>
&lt;p>Through in-depth analysis of ten mainstream TTS models, we can observe the following clear technical trends:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Unified Architecture&lt;/strong>: From early multi-module cascades to today's end-to-end unified architectures, TTS models are developing toward more integrated directions&lt;/li>
&lt;li>&lt;strong>Discrete Token Representation&lt;/strong>: Using discrete tokens to represent audio has become mainstream, more suitable for fusion with models like LLMs&lt;/li>
&lt;li>&lt;strong>Coexistence of Diffusion and Autoregression&lt;/strong>: Diffusion models provide high-quality generation capabilities, while autoregressive models have advantages in context modeling&lt;/li>
&lt;li>&lt;strong>Multimodal Conditional Control&lt;/strong>: Controlling speech generation through multimodal inputs such as reference audio and emotion labels, enhancing personalization capabilities&lt;/li>
&lt;li>&lt;strong>Deployment Format Standardization&lt;/strong>: Popularization of formats like GGUF makes TTS models easier to deploy on different platforms&lt;/li>
&lt;/ol>
&lt;h3 id="112-technical-challenges-and-future-directions">11.2 Technical Challenges and Future Directions&lt;/h3>
&lt;p>Despite significant progress in modern TTS models, they still face some key challenges:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Inference Efficiency vs. Audio Quality Balance&lt;/strong>: How to improve inference speed while ensuring high audio quality, especially on edge devices&lt;/li>
&lt;li>&lt;strong>Controllability vs. Naturalness Trade-off&lt;/strong>: Enhancing control capabilities often sacrifices speech naturalness; balancing the two is an ongoing challenge&lt;/li>
&lt;li>&lt;strong>Multilingual Consistency&lt;/strong>: Building truly high-quality multilingual TTS models, ensuring consistency and quality across languages&lt;/li>
&lt;li>&lt;strong>Emotional Expression Depth&lt;/strong>: Current models still have limitations in nuanced emotional expression, requiring deeper emotion modeling in the future&lt;/li>
&lt;li>&lt;strong>Long Text Coherence&lt;/strong>: Improving coherence and consistency in long text generation, especially at paragraph and chapter levels of speech synthesis&lt;/li>
&lt;/ol>
&lt;h3 id="113-application-scenario-matching-recommendations">11.3 Application Scenario Matching Recommendations&lt;/h3>
&lt;p>Different TTS models are suitable for different application scenarios. Here are some matching recommendations:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Application Scenario&lt;/th>
&lt;th>Recommended Models&lt;/th>
&lt;th>Rationale&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Edge devices/Low-resource environments&lt;/td>
&lt;td>Kokoro, Dia&lt;/td>
&lt;td>Lightweight design, supports ONNX/GGUF format, low latency&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>High-quality audio content creation&lt;/td>
&lt;td>Index-TTS, F5-TTS&lt;/td>
&lt;td>High-quality output, supports reference audio cloning, suitable for professional content production&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Multilingual customer service systems&lt;/td>
&lt;td>Mega-TTS3&lt;/td>
&lt;td>Excellent multilingual support, unified modeling architecture, good stability&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Conversational voice assistants&lt;/td>
&lt;td>CosyVoice, Orpheus&lt;/td>
&lt;td>Good compatibility with LLMs, supports dialogue context, natural emotional expression&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Local deployment voice applications&lt;/td>
&lt;td>OuteTTS&lt;/td>
&lt;td>GGUF format optimization, supports CPU inference, no need for cloud services&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>With continued technological advancement, we can expect future TTS models to further break modal boundaries, achieving more natural, personalized, and emotionally rich voice interaction experiences.&lt;/p></description></item><item><title>Speech Synthesis Evolution: From Traditional TTS to Multimodal Voice Models</title><link>https://ziyanglin.netlify.app/en/post/tts-fundamentals/</link><pubDate>Fri, 27 Jun 2025 07:01:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/tts-fundamentals/</guid><description>&lt;h2 id="1-background">1. Background&lt;/h2>
&lt;h3 id="11-pain-points-of-traditional-tts-models">1.1 Pain Points of Traditional TTS Models&lt;/h3>
&lt;p>Traditional Text-to-Speech (TTS) models have excelled in voice cloning and speech synthesis, typically employing a two-stage process:&lt;/p>
&lt;ol>
&lt;li>Acoustic Model (e.g., Tacotron): Converts text into intermediate acoustic representations (such as spectrograms).&lt;/li>
&lt;li>Vocoder (e.g., WaveGlow, HiFi-GAN): Transforms acoustic representations into waveform audio.&lt;/li>
&lt;/ol>
&lt;p>Despite these models&amp;rsquo; ability to produce realistic sounds, their primary focus remains on replicating a speaker's voice, lacking the flexibility to adapt in dynamic, context-sensitive conversations.&lt;/p>
&lt;h3 id="12-initial-integration-of-llms-contextaware-conversational-voice-models">1.2 Initial Integration of LLMs: Context-Aware Conversational Voice Models&lt;/h3>
&lt;p>The emergence of Large Language Models (LLMs) has provided rich reasoning capabilities and contextual understanding. Integrating LLMs into the TTS workflow enables synthesis that goes beyond mere sound production to intelligent conversational responses within context.&lt;/p>
&lt;p>Typical cascade workflow (speech-to-speech model):&lt;/p>
&lt;ul>
&lt;li>STT (Speech-to-Text): e.g., Whisper&lt;/li>
&lt;li>LLM (Contextual Understanding and Generation): e.g., fine-tuned Llama&lt;/li>
&lt;li>TTS (Text-to-Speech): e.g., ElevenLabs&lt;/li>
&lt;/ul>
&lt;p>Example workflow:&lt;/p>
&lt;pre>&lt;code>Speech-to-Text (e.g., Whisper) : &amp;quot;Hello friend, how are you?&amp;quot;
Conversational LLM (e.g., Llama) : &amp;quot;Hi there! I am fine and you?&amp;quot;
Text-to-Speech (e.g., ElevenLabs) : Generates natural speech response
&lt;/code>&lt;/pre>
&lt;p>This pipeline approach combines the strengths of specialized modules but has a key limitation:
the transcribed text received by the LLM loses the rich prosodic and emotional cues of the original speech, so responses lack its nuanced expression.&lt;/p>
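&lt;p>The cascade above can be sketched as plain function composition; the stage functions below are illustrative stand-ins, not real model APIs:&lt;/p>
&lt;pre>&lt;code class="language-python">def cascade_pipeline(audio, stt, llm, tts):
    transcript = stt(audio)       # speech -&amp;gt; text (prosody and emotion are lost at this step)
    reply_text = llm(transcript)  # text -&amp;gt; text, using conversational context
    return tts(reply_text)        # text -&amp;gt; speech

# Stubbed stages to show the data flow only:
reply = cascade_pipeline(
    audio=b'',  # placeholder waveform bytes
    stt=lambda a: 'Hello friend, how are you?',
    llm=lambda t: 'Hi there! I am fine, and you?',
    tts=lambda t: 'synthesized speech for: ' + t,
)
&lt;/code>&lt;/pre>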
&lt;h3 id="13-direct-speech-input-to-llms-audio-encoders-and-neural-codecs">1.3 Direct Speech Input to LLMs: Audio Encoders and Neural Codecs&lt;/h3>
&lt;p>To address the above bottlenecks, researchers have attempted to directly input speech representations into LLMs. Currently, there are two main approaches to converting continuous high-dimensional speech signals into formats that LLMs can process:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Audio Encoders&lt;/strong>: Convert continuous speech into discrete tokens, preserving key information such as rhythm and emotion.&lt;/p>
&lt;blockquote>
&lt;p>New Challenge: Audio encoders must balance between preserving critical information and the need for compact, discrete representations.&lt;/p>
&lt;/blockquote>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Neural Codecs&lt;/strong>: Such as DAC, Encodec, XCodec, which convert audio waveforms into discrete token sequences, bridging the gap between continuous audio and discrete token requirements.&lt;/p>
&lt;blockquote>
&lt;p>New Challenge: Audio tokens are far more numerous than text, and the quantization process may lead to loss of details.&lt;/p>
&lt;/blockquote>
&lt;/li>
&lt;/ul>
&lt;h2 id="2-tts-model-structure">2. TTS Model Structure&lt;/h2>
&lt;p>The basic structural flow of traditional TTS models is typically as follows:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Text] --&amp;gt; B[Encoder]
B --&amp;gt; C[Intermediate Representation]
C --&amp;gt; D[Decoder]
D --&amp;gt; E[Mel Spectrogram]
E --&amp;gt; F[Vocoder]
F --&amp;gt; G[Waveform]
&lt;/code>&lt;/pre>
&lt;p>This workflow includes several key components:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Text Encoder&lt;/strong>: Responsible for converting input text into an intermediate representation, usually a deep learning model such as a Transformer or CNN. The encoder needs to understand the text&amp;rsquo;s semantics and syntactic structure and extract pronunciation-related features.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Intermediate Representation&lt;/strong>: The bridge connecting the encoder and decoder, typically a set of vectors or feature maps containing the semantic information of the text and some preliminary acoustic features.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Decoder&lt;/strong>: Converts the intermediate representation into acoustic features, such as Mel spectrograms. The decoder needs to consider factors like prosody, rhythm, and pauses in speech.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Vocoder&lt;/strong>: Transforms acoustic features (such as Mel spectrograms) into final waveform audio. Modern vocoders like HiFi-GAN and WaveGlow can generate high-quality speech waveforms.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="3-indepth-analysis-of-audio-encoder-technology">3. In-Depth Analysis of Audio Encoder Technology&lt;/h2>
&lt;p>Audio encoders are crucial bridges connecting continuous speech signals with discrete token representations. Below, we delve into several mainstream audio encoding technologies and their working principles.&lt;/p>
&lt;h3 id="31-vqvae-vector-quantized-variational-autoencoder">3.1 VQ-VAE (Vector Quantized Variational Autoencoder)&lt;/h3>
&lt;p>VQ-VAE is an effective method for converting continuous audio signals into discrete codes. Its working principle is as follows:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Encoding Stage&lt;/strong>: Uses an encoder network to convert input audio into continuous latent representations.&lt;/li>
&lt;li>&lt;strong>Quantization Stage&lt;/strong>: Maps continuous latent representations to the nearest discrete codebook vectors.&lt;/li>
&lt;li>&lt;strong>Decoding Stage&lt;/strong>: Uses a decoder network to reconstruct audio signals from quantized latent representations.&lt;/li>
&lt;/ol>
&lt;p>The advantage of VQ-VAE lies in its ability to learn compact discrete representations while preserving key information needed for audio reconstruction. However, it also faces challenges such as low codebook utilization (codebook collapse) and trade-offs between reconstruction quality and compression rate.&lt;/p>
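&lt;p>The quantization stage described above can be sketched in a few lines of numpy (an illustrative sketch with assumed shapes, not a full VQ-VAE):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def vector_quantize(latents, codebook):
    '''latents: (T, D) continuous encoder outputs; codebook: (K, D) code vectors.'''
    # Squared Euclidean distance from every latent to every codebook vector
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)  # discrete codes, shape (T,)
    quantized = codebook[indices]   # quantized latents fed to the decoder
    return indices, quantized

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))                           # K=8 codes, D=4
latents = codebook[[2, 5]] + 0.01 * rng.normal(size=(2, 4))  # near codes 2 and 5
indices, quantized = vector_quantize(latents, codebook)
&lt;/code>&lt;/pre>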
&lt;h3 id="32-encodec">3.2 Encodec&lt;/h3>
&lt;p>Encodec is an efficient neural audio codec proposed by Meta AI, combining the ideas of VQ-VAE with multi-level quantization techniques:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Multi-Resolution Encoding&lt;/strong>: Uses encoders with different time resolutions to capture audio features at different time scales.&lt;/li>
&lt;li>&lt;strong>Residual Quantization&lt;/strong>: Adopts a multi-level quantization strategy, with each level of quantizer processing the residual error from the previous level.&lt;/li>
&lt;li>&lt;strong>Variable Bit Rate&lt;/strong>: Supports different compression levels, allowing for adjustment of the balance between bit rate and audio quality according to needs.&lt;/li>
&lt;/ol>
&lt;p>A significant advantage of Encodec is its ability to maintain good audio quality at extremely low bit rates, making it particularly suitable for speech synthesis and audio transmission applications.&lt;/p>
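&lt;p>The residual quantization idea can be illustrated with a small numpy sketch (assumed shapes; real codebooks are learned, and the all-zero code included here is only a simplification that guarantees later levels never increase the error):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def residual_quantize(vec, codebooks):
    '''vec: (D,) latent vector; codebooks: list of (K, D) arrays, one per level.'''
    residual = vec.copy()
    codes = []
    reconstruction = np.zeros_like(vec)
    for cb in codebooks:
        idx = int(((cb - residual) ** 2).sum(axis=1).argmin())
        codes.append(idx)
        reconstruction = reconstruction + cb[idx]
        residual = vec - reconstruction  # the next level encodes what is still missing
    return codes, reconstruction

rng = np.random.default_rng(1)
vec = rng.normal(size=16)
# Each level: 31 random codes plus an all-zero code (the simplification noted above)
codebooks = [np.vstack([np.zeros(16), rng.normal(size=(31, 16))]) for _ in range(4)]
codes, recon = residual_quantize(vec, codebooks)
err_level1 = np.linalg.norm(vec - codebooks[0][codes[0]])
err_all = np.linalg.norm(vec - recon)  # never larger than err_level1
&lt;/code>&lt;/pre>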
&lt;h3 id="33-dac-discrete-autoencoder-for-audio-compression">3.3 DAC (Discrete Autoencoder for Audio Compression)&lt;/h3>
&lt;p>DAC is a discrete autoencoder designed specifically for audio compression, with features including:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Hierarchical Quantization&lt;/strong>: Uses a multi-level quantization structure, with different levels capturing different levels of audio detail.&lt;/li>
&lt;li>&lt;strong>Context Modeling&lt;/strong>: Utilizes autoregressive models to model quantized token sequences, capturing temporal dependencies.&lt;/li>
&lt;li>&lt;strong>Perceptual Loss Function&lt;/strong>: Combines spectral loss and adversarial loss to optimize audio quality as perceived by the human ear.&lt;/li>
&lt;/ol>
&lt;p>DAC maintains excellent audio quality even at high compression rates, making it particularly suitable for speech synthesis applications requiring efficient storage and transmission.&lt;/p>
&lt;h2 id="4-audio-data-formats-and-transmission-in-tts-systems">4. Audio Data Formats and Transmission in TTS Systems&lt;/h2>
&lt;p>In TTS systems, the choice of audio formats and transmission methods is crucial for practical applications. This chapter details the various audio formats, transmission protocols, and frontend processing techniques used in TTS systems.&lt;/p>
&lt;h3 id="41-common-audio-formats-and-their-characteristics">4.1 Common Audio Formats and Their Characteristics&lt;/h3>
&lt;p>TTS systems support multiple audio formats, each with specific use cases and trade-offs. Here are the most commonly used formats:&lt;/p>
&lt;h4 id="411-pcm-pulse-code-modulation">4.1.1 PCM (Pulse Code Modulation)&lt;/h4>
&lt;p>&lt;strong>Characteristics:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>No Compression&lt;/strong>: Raw audio data without any compression&lt;/li>
&lt;li>&lt;strong>Bit Depth&lt;/strong>: Typically 16-bit (also 8-bit, 24-bit, 32-bit, etc.)&lt;/li>
&lt;li>&lt;strong>Simple Format&lt;/strong>: Directly represents audio waveform as digital samples&lt;/li>
&lt;li>&lt;strong>File Size&lt;/strong>: Large, about 2.8MB for one minute of 24kHz/16-bit mono audio&lt;/li>
&lt;li>&lt;strong>Processing Overhead&lt;/strong>: Low, no decoding required&lt;/li>
&lt;li>&lt;strong>Quality&lt;/strong>: Lossless, preserves all original audio information&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Use Cases:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Internal audio processing pipelines&lt;/li>
&lt;li>Real-time applications requiring low latency&lt;/li>
&lt;li>Intermediate format for further processing&lt;/li>
&lt;/ul>
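&lt;p>A quick back-of-envelope check of the size figure quoted above:&lt;/p>
&lt;pre>&lt;code class="language-python"># One minute of 24 kHz / 16-bit (2 bytes per sample) mono PCM:
sample_rate = 24_000
bytes_per_sample = 2
channels = 1
seconds = 60
size_bytes = sample_rate * bytes_per_sample * channels * seconds
size_mb = size_bytes / 1_000_000  # 2.88 MB, matching the approximate figure above
&lt;/code>&lt;/pre>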
&lt;h4 id="412-opus">4.1.2 Opus&lt;/h4>
&lt;p>&lt;strong>Characteristics:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High Compression Ratio&lt;/strong>: Much smaller than PCM while maintaining high quality&lt;/li>
&lt;li>&lt;strong>Low Latency&lt;/strong>: Encoding/decoding delay as low as 20ms&lt;/li>
&lt;li>&lt;strong>Variable Bitrate&lt;/strong>: 6kbps to 510kbps&lt;/li>
&lt;li>&lt;strong>Adaptive&lt;/strong>: Can adjust based on network conditions&lt;/li>
&lt;li>&lt;strong>Designed for Network Transmission&lt;/strong>: Strong packet loss resistance&lt;/li>
&lt;li>&lt;strong>Open Standard&lt;/strong>: Royalty-free, widely supported&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Use Cases:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Network streaming&lt;/li>
&lt;li>WebRTC applications&lt;/li>
&lt;li>Real-time communication systems&lt;/li>
&lt;li>WebSocket audio transmission&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Opus Encoding Configuration:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Sample Rate&lt;/strong>: 24000 Hz&lt;/li>
&lt;li>&lt;strong>Channels&lt;/strong>: 1 (Mono)&lt;/li>
&lt;li>&lt;strong>Bitrate&lt;/strong>: 32000 bps (32 kbps)&lt;/li>
&lt;li>&lt;strong>Frame Size&lt;/strong>: 480 samples (corresponding to 20ms@24kHz)&lt;/li>
&lt;li>&lt;strong>Complexity&lt;/strong>: 5 (balanced setting)&lt;/li>
&lt;/ul>
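&lt;p>The frame size and per-frame byte budget follow directly from the configuration above:&lt;/p>
&lt;pre>&lt;code class="language-python"># A 20 ms frame at 24 kHz holds 480 samples; at 32 kbps each frame
# is budgeted about 80 bytes on the wire.
sample_rate = 24_000
frame_ms = 20
bitrate_bps = 32_000
frame_samples = sample_rate * frame_ms // 1000         # 480
bytes_per_frame = bitrate_bps // 8 * frame_ms // 1000  # 80
&lt;/code>&lt;/pre>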
&lt;h4 id="413-mp3">4.1.3 MP3&lt;/h4>
&lt;p>&lt;strong>Characteristics:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High Compression Ratio&lt;/strong>: Much smaller than PCM&lt;/li>
&lt;li>&lt;strong>Wide Compatibility&lt;/strong>: Supported by almost all devices and platforms&lt;/li>
&lt;li>&lt;strong>Variable Bitrate&lt;/strong>: Typically 32kbps to 320kbps&lt;/li>
&lt;li>&lt;strong>Lossy Compression&lt;/strong>: Loses some audio information&lt;/li>
&lt;li>&lt;strong>Encoding/Decoding Delay&lt;/strong>: Higher, not suitable for real-time applications&lt;/li>
&lt;li>&lt;strong>File Size&lt;/strong>: Medium, about 1MB for one minute of audio (128kbps)&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Use Cases:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Non-real-time applications&lt;/li>
&lt;li>Scenarios requiring wide compatibility&lt;/li>
&lt;li>Audio storage and distribution&lt;/li>
&lt;/ul>
&lt;h4 id="414-wav">4.1.4 WAV&lt;/h4>
&lt;p>&lt;strong>Characteristics:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Container Format&lt;/strong>: Typically contains PCM data&lt;/li>
&lt;li>&lt;strong>No Compression&lt;/strong>: Large files&lt;/li>
&lt;li>&lt;strong>Metadata Support&lt;/strong>: Contains information about sample rate, channels, etc.&lt;/li>
&lt;li>&lt;strong>Wide Compatibility&lt;/strong>: Supported by almost all audio software&lt;/li>
&lt;li>&lt;strong>Simple Structure&lt;/strong>: Easy to process&lt;/li>
&lt;li>&lt;strong>Quality&lt;/strong>: Typically lossless&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Use Cases:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Audio archiving&lt;/li>
&lt;li>Professional audio processing&lt;/li>
&lt;li>Testing and development environments&lt;/li>
&lt;/ul>
&lt;h3 id="42-tts-audio-transmission-and-processing">4.2 TTS Audio Transmission and Processing&lt;/h3>
&lt;h4 id="421-basic-audio-parameters">4.2.1 Basic Audio Parameters&lt;/h4>
&lt;p>In TTS systems, audio data typically has the following basic parameters:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Sample Rate&lt;/strong>: Typically 24000 Hz (24 kHz)&lt;/li>
&lt;li>&lt;strong>Channels&lt;/strong>: 1 (Mono)&lt;/li>
&lt;li>&lt;strong>Bit Depth&lt;/strong>: 16-bit (Int16)&lt;/li>
&lt;/ul>
&lt;h4 id="422-transmission-protocols">4.2.2 Transmission Protocols&lt;/h4>
&lt;p>&lt;strong>HTTP REST API&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Content-Type&lt;/strong>: &lt;code>audio/opus&lt;/code>&lt;/li>
&lt;li>&lt;strong>Custom Header&lt;/strong>: &lt;code>X-Sample-Rate: 24000&lt;/code>&lt;/li>
&lt;li>&lt;strong>Data Format&lt;/strong>: Raw Opus packets (not wrapped in an Ogg container)&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>WebSocket Protocol&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Subprotocol&lt;/strong>: &lt;code>tts-1.0&lt;/code>&lt;/li>
&lt;li>&lt;strong>Message Structure&lt;/strong>: 1 byte type + 4 bytes length (little-endian) + payload&lt;/li>
&lt;li>&lt;strong>Audio Message Type&lt;/strong>: &lt;code>AUDIO = 0x12&lt;/code>&lt;/li>
&lt;li>&lt;strong>Audio Data&lt;/strong>: Raw Opus encoded data&lt;/li>
&lt;/ul>
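&lt;p>The message framing described above can be packed and parsed with the standard &lt;code>struct&lt;/code> module (an illustrative sketch of the layout only, not the full &lt;code>tts-1.0&lt;/code> protocol):&lt;/p>
&lt;pre>&lt;code class="language-python">import struct

AUDIO = 0x12  # audio message type from the protocol above

def pack_message(msg_type, payload):
    # 1 byte type + 4 bytes little-endian length + payload
    return struct.pack('&amp;lt;BI', msg_type, len(payload)) + payload

def unpack_message(frame):
    msg_type, length = struct.unpack_from('&amp;lt;BI', frame)
    return msg_type, frame[5:5 + length]

frame = pack_message(AUDIO, b'\x01\x02\x03')  # 5-byte header + 3-byte payload
&lt;/code>&lt;/pre>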
&lt;h4 id="423-frontend-processing-techniques">4.2.3 Frontend Processing Techniques&lt;/h4>
&lt;p>The frontend of TTS systems needs to process received audio data, primarily in two ways:&lt;/p>
&lt;p>&lt;strong>WebCodecs API Decoding&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Uses browser hardware acceleration to decode Opus data&lt;/li>
&lt;li>Converts decoded data to Float32Array for Web Audio API&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>PCM Direct Processing&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Converts Int16 PCM data to Float32 audio data (range from -32768~32767 to -1.0~1.0)&lt;/li>
&lt;li>Creates AudioBuffer and plays through Web Audio API&lt;/li>
&lt;/ul>
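&lt;p>The Int16-to-Float32 conversion is a single scaling step, sketched here in numpy (in the browser the same arithmetic runs in JavaScript):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def int16_to_float32(pcm_bytes):
    # Interpret the raw bytes as Int16 samples (native byte order; PCM from
    # the server is little-endian), then scale -32768..32767 down to the
    # -1.0..1.0 range expected by the Web Audio API
    samples = np.frombuffer(pcm_bytes, dtype=np.int16)
    return samples.astype(np.float32) / 32768.0

pcm = np.array([0, 16384, -32768, 32767], dtype=np.int16).tobytes()
floats = int16_to_float32(pcm)
&lt;/code>&lt;/pre>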
&lt;h4 id="424-audio-processing-enhancements">4.2.4 Audio Processing Enhancements&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Fade In/Out Effects&lt;/strong>: Configurable audio fade in/out processing, default 10ms&lt;/li>
&lt;li>&lt;strong>Audio Gain Adjustment&lt;/strong>: Adjustable volume&lt;/li>
&lt;li>&lt;strong>Watermarking&lt;/strong>: Optional audio watermarking functionality&lt;/li>
&lt;li>&lt;strong>Adaptive Batch Processing&lt;/strong>: Dynamically adjusts audio processing batch size based on performance&lt;/li>
&lt;/ul>
&lt;h3 id="43-audio-data-flow-in-tts-systems">4.3 Audio Data Flow in TTS Systems&lt;/h3>
&lt;p>In TTS models, audio data follows this flow from generation to playback:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[Text Input] --&amp;gt; B[TTS Engine]
B --&amp;gt; C[PCM Audio Data]
C --&amp;gt; D[Audio Encoding Opus or MP3]
D --&amp;gt; E[HTTP or WebSocket Transmission]
E --&amp;gt; F[Frontend Reception]
F --&amp;gt; G[Decoding]
G --&amp;gt; H[Web Audio API Playback]
&lt;/code>&lt;/pre>
&lt;h3 id="44-format-selection-in-practical-applications">4.4 Format Selection in Practical Applications&lt;/h3>
&lt;p>In practical TTS applications, format selection is primarily based on the use case:&lt;/p>
&lt;p>&lt;strong>Real-time Streaming TTS Applications&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Opus&lt;/strong> is preferred due to its low latency characteristics and high compression ratio&lt;/li>
&lt;li>Suitable for voice assistants, real-time dialogue systems, online customer service, etc.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Non-real-time TTS Applications&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>MP3&lt;/strong> is more commonly used because it's supported by almost all devices and platforms&lt;/li>
&lt;li>Suitable for audiobooks, pre-recorded announcements, content distribution, etc.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Internal System Processing&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>PCM&lt;/strong> format is commonly used for internal processing, providing highest quality and lowest processing delay&lt;/li>
&lt;li>Suitable for intermediate stages in audio processing pipelines&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Archiving and Professional Applications&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>WAV&lt;/strong> format is suitable for scenarios requiring metadata preservation and highest quality&lt;/li>
&lt;li>Suitable for professional audio editing, archiving, and quality assessment&lt;/li>
&lt;/ul>
&lt;h2 id="5-integration-of-neural-codecs-with-llms">5. Integration of Neural Codecs with LLMs&lt;/h2>
&lt;p>The fusion of neural codecs with LLMs is a key step in achieving end-to-end speech understanding and generation. This fusion faces several technical challenges:&lt;/p>
&lt;h3 id="51-token-rate-mismatch-problem">5.1 Token Rate Mismatch Problem&lt;/h3>
&lt;p>Speech signals have a much higher information density than text, resulting in far more audio tokens than text tokens. For example, one second of speech might require hundreds of tokens to represent, while the corresponding text might only need a few tokens. This mismatch poses challenges for LLM processing.&lt;/p>
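&lt;p>The scale of the mismatch is easy to see with typical (assumed) figures, e.g. a codec producing 75 frames per second with 8 residual codebooks:&lt;/p>
&lt;pre>&lt;code class="language-python">codec_frame_rate = 75   # codec frames per second of audio (typical for Encodec-style codecs)
codebooks = 8           # tokens per frame with residual quantization
audio_tokens_per_sec = codec_frame_rate * codebooks  # 600 tokens per second
text_tokens_per_sec = 3  # rough rate for ordinary spoken text
ratio = audio_tokens_per_sec // text_tokens_per_sec  # about 200x more audio tokens
&lt;/code>&lt;/pre>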
&lt;p>Solutions include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Hierarchical Encoding&lt;/strong>: Using multi-level encoding structures to capture information at different time scales&lt;/li>
&lt;li>&lt;strong>Downsampling Strategies&lt;/strong>: Downsampling in the time dimension to reduce the number of tokens&lt;/li>
&lt;li>&lt;strong>Attention Mechanism Optimization&lt;/strong>: Designing special attention mechanisms to effectively handle long token sequences&lt;/li>
&lt;/ul>
&lt;h3 id="52-crossmodal-representation-alignment">5.2 Cross-Modal Representation Alignment&lt;/h3>
&lt;p>Text and speech are information from two different modalities, with natural differences in their representation spaces. To achieve effective fusion, the representation alignment problem needs to be solved.&lt;/p>
&lt;p>Main methods include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Joint Training&lt;/strong>: Simultaneously training text encoders and audio encoders to align their representation spaces&lt;/li>
&lt;li>&lt;strong>Contrastive Learning&lt;/strong>: Using contrastive loss functions to bring related text and speech representations closer while pushing unrelated representations apart&lt;/li>
&lt;li>&lt;strong>Cross-Modal Transformers&lt;/strong>: Designing specialized Transformer architectures to handle multi-modal inputs and learn relationships between them&lt;/li>
&lt;/ul>
&lt;h3 id="53-contextaware-speech-synthesis">5.3 Context-Aware Speech Synthesis&lt;/h3>
&lt;p>Traditional TTS models often lack understanding of context, resulting in generated speech lacking appropriate emotional and prosodic variations. After fusion with LLMs, models can generate more natural speech based on conversation context.&lt;/p>
&lt;p>Key technologies include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Context Encoding&lt;/strong>: Encoding conversation history into context vectors that influence speech generation&lt;/li>
&lt;li>&lt;strong>Emotion Control&lt;/strong>: Automatically adjusting the emotional color of speech based on context understanding&lt;/li>
&lt;li>&lt;strong>Prosody Modeling&lt;/strong>: Adjusting speech rhythm, pauses, and stress according to semantic importance and conversation state&lt;/li>
&lt;/ul>
&lt;h2 id="6-future-development-directions">6. Future Development Directions&lt;/h2>
&lt;p>As technology continues to advance, TTS models are developing in the following directions:&lt;/p>
&lt;h3 id="61-endtoend-multimodal-models">6.1 End-to-End Multimodal Models&lt;/h3>
&lt;p>Future voice models will break down barriers between modules, achieving true end-to-end training and inference. Such models will be able to generate natural speech outputs directly from raw inputs (text, speech, images, etc.) without explicit conversion of intermediate representations.&lt;/p>
&lt;h3 id="62-personalization-and-adaptability">6.2 Personalization and Adaptability&lt;/h3>
&lt;p>Next-generation TTS models will place greater emphasis on personalization and adaptability, automatically adjusting speech characteristics based on user preferences, conversation history, and environmental factors, providing a more natural and humanized interaction experience.&lt;/p>
&lt;h3 id="63-lowresource-scenario-optimization">6.3 Low-Resource Scenario Optimization&lt;/h3>
&lt;p>For low-resource languages and special application scenarios, researchers are exploring how to leverage transfer learning, meta-learning, and data augmentation techniques to build high-quality TTS models under limited data conditions.&lt;/p>
&lt;h3 id="64-realtime-interactive-speech-synthesis">6.4 Real-Time Interactive Speech Synthesis&lt;/h3>
&lt;p>With the advancement of algorithms and hardware, real-time interactive speech synthesis will become possible, supporting more natural and fluid human-machine dialogue, providing better user experiences for virtual assistants, customer service robots, and metaverse applications.&lt;/p>
&lt;h2 id="7-conclusion">7. Conclusion&lt;/h2>
&lt;p>Speech synthesis technology is undergoing a significant transformation from traditional TTS to multimodal voice models. Through the integration of large language models, neural codecs, and advanced audio processing technologies, modern TTS models can not only generate high-quality speech but also understand context, express emotions, and naturally adapt in dynamic conversations. Despite facing many challenges, with continuous technological advancement, we can expect more intelligent, natural, and personalized voice interaction experiences.&lt;/p></description></item><item><title>CLIP Technology Analysis: Unified Representation Through Image-Text Contrastive Learning</title><link>https://ziyanglin.netlify.app/en/post/clip-documentation/</link><pubDate>Fri, 27 Jun 2025 05:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/clip-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>CLIP (Contrastive Language-Image Pre-training) is an advanced deep learning model developed by OpenAI, designed to understand the relationship between images and the text that describes them. Through pre-training on millions of (image, text) pairs, CLIP learns a shared multimodal embedding space that maps both images and text to vectors within this space.&lt;/p>
&lt;p>The revolutionary aspect of CLIP lies in its powerful &lt;strong>Zero-Shot Learning&lt;/strong> capabilities. Traditional image classification models typically require training for specific tasks and labels, whereas CLIP can classify images into categories it has never explicitly seen during training, greatly enhancing the model's generalization ability and flexibility.&lt;/p>
&lt;h2 id="2-core-concepts">2. Core Concepts&lt;/h2>
&lt;p>To understand CLIP, we first need to grasp several core concepts:&lt;/p>
&lt;h3 id="21-multimodal-learning">2.1 Multimodal Learning&lt;/h3>
&lt;p>Multimodal learning refers to the ability of models to process and associate information from different modalities (such as text, images, audio). Humans understand the world by combining visual, auditory, and linguistic information, and multimodal learning aims to give AI similar capabilities. CLIP is an outstanding example of multimodal learning in the domains of images and text.&lt;/p>
&lt;h3 id="22-contrastive-learning">2.2 Contrastive Learning&lt;/h3>
&lt;p>Contrastive learning is a self-supervised learning method. Its core idea is to &lt;strong>bring similar samples closer together in the representation space while pushing dissimilar samples apart&lt;/strong>.&lt;/p>
&lt;p>Imagine a large collection of &amp;ldquo;image-text&amp;rdquo; pairs. For a given image (e.g., a picture of a cat), its corresponding text description (&amp;ldquo;a photo of a cat&amp;rdquo;) is a positive sample, while all other text descriptions (e.g., &amp;ldquo;a photo of a dog&amp;rdquo;, &amp;ldquo;a photo of a car&amp;rdquo;) are negative samples. CLIP's goal is to learn an encoder that makes the representation of &amp;ldquo;a cat picture&amp;rdquo; and &amp;ldquo;a photo of a cat&amp;rdquo; very close in the vector space, while keeping representations of unrelated text descriptions far apart.&lt;/p>
&lt;h3 id="23-zeroshot-learning">2.3 Zero-Shot Learning&lt;/h3>
&lt;p>Zero-shot learning refers to a model's ability to recognize and classify categories it has never seen during training. CLIP achieves this by transforming image classification into an image-text matching problem.&lt;/p>
&lt;p>For example, to determine if an image is a &amp;ldquo;dog,&amp;rdquo; we don't need a model specifically trained to recognize &amp;ldquo;dogs.&amp;rdquo; We simply encode the image into a vector, encode the text &amp;ldquo;a photo of a dog&amp;rdquo; into another vector, and then calculate the similarity between these two vectors. If the similarity is high, we can consider the image to be a &amp;ldquo;dog.&amp;rdquo; This approach allows CLIP to identify objects of any category, as long as we can describe it in text.&lt;/p>
&lt;h2 id="3-model-architecture">3. Model Architecture&lt;/h2>
&lt;p>The CLIP model consists of two main components: an image encoder and a text encoder.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Image Encoder&lt;/strong>: Responsible for converting input images into feature vectors. CLIP uses two mainstream architectures:
&lt;ul>
&lt;li>&lt;strong>ResNet&lt;/strong>: A classic convolutional neural network.&lt;/li>
&lt;li>&lt;strong>Vision Transformer (ViT)&lt;/strong>: A model that applies the Transformer architecture to image recognition.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Text Encoder&lt;/strong>: Responsible for converting input text into feature vectors. CLIP uses the standard &lt;strong>Transformer&lt;/strong> architecture.&lt;/li>
&lt;/ul>
&lt;p>These two encoders map images and text to the same multi-dimensional embedding space, allowing their vector representations to be directly compared.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;CLIP Model&amp;quot;
direction LR
subgraph &amp;quot;Image Encoder (ViT or ResNet)&amp;quot;
I[Image] --&amp;gt; IE(Encoder) --&amp;gt; IV[Image Feature Vector]
end
subgraph &amp;quot;Text Encoder (Transformer)&amp;quot;
T[Text] --&amp;gt; TE(Encoder) --&amp;gt; TV[Text Feature Vector]
end
end
IV -- &amp;quot;Cosine Similarity&amp;quot; --&amp;gt; S(Similarity Score)
TV -- &amp;quot;Cosine Similarity&amp;quot; --&amp;gt; S
&lt;/code>&lt;/pre>
&lt;h2 id="4-workflow">4. Workflow&lt;/h2>
&lt;p>CLIP's workflow is divided into training and inference phases.&lt;/p>
&lt;h3 id="41-training-phase">4.1 Training Phase&lt;/h3>
&lt;p>During the training phase, CLIP learns from a dataset containing hundreds of millions of (image, text) pairs. For a batch of data containing N (image, text) pairs, CLIP performs the following operations:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Encoding&lt;/strong>: Pass N images through the image encoder to get N image feature vectors, and pass N texts through the text encoder to get N text feature vectors.&lt;/li>
&lt;li>&lt;strong>Calculate Similarity&lt;/strong>: Compute the cosine similarity between each of the N image feature vectors and each of the N text feature vectors, resulting in an N x N similarity matrix.&lt;/li>
&lt;li>&lt;strong>Contrastive Learning&lt;/strong>: In this matrix, the elements on the diagonal correspond to the correct (image, text) pairs, which we want to have high similarity. Elements off the diagonal represent mismatched pairs, which we want to have low similarity. The model is optimized through a contrastive loss function to achieve this goal.&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Input a batch of (image, text) pairs&amp;quot;] --&amp;gt; B{&amp;quot;Encoding&amp;quot;};
B --&amp;gt; C[&amp;quot;Image Encoder&amp;quot;] --&amp;gt; D[&amp;quot;Image Feature Vectors&amp;quot;];
B --&amp;gt; E[&amp;quot;Text Encoder&amp;quot;] --&amp;gt; F[&amp;quot;Text Feature Vectors&amp;quot;];
D &amp;amp; F --&amp;gt; G{&amp;quot;Calculate Cosine Similarity Matrix&amp;quot;};
G --&amp;gt; H[&amp;quot;Contrastive Loss Function&amp;quot;];
H --&amp;gt; I[&amp;quot;Optimize Model Parameters&amp;quot;];
&lt;/code>&lt;/pre>
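&lt;p>The training objective above can be sketched in numpy as a symmetric cross-entropy over the similarity matrix (an illustrative re-implementation, not OpenAI&amp;rsquo;s code; the temperature value is an assumption):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def clip_loss(image_feats, text_feats, temperature=0.07):
    '''Symmetric contrastive loss over a batch of N matched (image, text) pairs.'''
    def log_softmax(m):
        return m - np.log(np.exp(m).sum(axis=1, keepdims=True))
    # Normalize so dot products are cosine similarities
    i = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = i @ t.T / temperature     # N x N similarity matrix
    diag = np.arange(logits.shape[0])  # matched pairs sit on the diagonal
    loss_i2t = -log_softmax(logits)[diag, diag].mean()    # image-to-text direction
    loss_t2i = -log_softmax(logits.T)[diag, diag].mean()  # text-to-image direction
    return (loss_i2t + loss_t2i) / 2

# Perfectly matched unit vectors give a near-zero loss
matched = clip_loss(np.eye(4), np.eye(4))
&lt;/code>&lt;/pre>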
&lt;h3 id="42-inference-phase-zeroshot-classification">4.2 Inference Phase (Zero-Shot Classification)&lt;/h3>
&lt;p>During the inference phase, CLIP can perform zero-shot image classification tasks:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Prepare Text Prompts&lt;/strong>: For all categories you want to classify (e.g., &amp;ldquo;cat&amp;rdquo;, &amp;ldquo;dog&amp;rdquo;, &amp;ldquo;car&amp;rdquo;), create a series of text prompts such as &amp;ldquo;a photo of a cat&amp;rdquo;, &amp;ldquo;a photo of a dog&amp;rdquo;, &amp;ldquo;a photo of a car&amp;rdquo;.&lt;/li>
&lt;li>&lt;strong>Encode Text&lt;/strong>: Convert these text prompts into a series of text feature vectors using the text encoder.&lt;/li>
&lt;li>&lt;strong>Encode Image&lt;/strong>: Convert the image to be classified into an image feature vector using the image encoder.&lt;/li>
&lt;li>&lt;strong>Calculate Similarity&lt;/strong>: Compute the cosine similarity between the image feature vector and all text feature vectors.&lt;/li>
&lt;li>&lt;strong>Prediction&lt;/strong>: The category corresponding to the text prompt with the highest similarity is CLIP's prediction result.&lt;/li>
&lt;/ol>
&lt;h2 id="5-applications">5. Applications&lt;/h2>
&lt;p>CLIP's powerful capabilities make it widely applicable in many fields:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Zero-Shot Image Classification&lt;/strong>: Classify images into arbitrary categories without additional training.&lt;/li>
&lt;li>&lt;strong>Image Retrieval&lt;/strong>: Search for matching images using natural language descriptions.&lt;/li>
&lt;li>&lt;strong>Content Moderation&lt;/strong>: Automatically identify and filter inappropriate image content.&lt;/li>
&lt;li>&lt;strong>Guiding Generative Models&lt;/strong>: CLIP's multimodal understanding ability can guide generative models (like DALL-E 2) to create images that match text descriptions.&lt;/li>
&lt;/ul>
&lt;h2 id="6-code-example">6. Code Example&lt;/h2>
&lt;p>Here's a simple Python code example demonstrating how to use the &lt;code>clip&lt;/code> library to load the model and obtain image feature vectors.&lt;/p>
&lt;p>First, install the necessary libraries:&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install torch clip
&lt;/code>&lt;/pre>
&lt;p>Then, you can use the following code:&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
import clip
from PIL import Image
# Load the model, can run on either CPU or GPU
device = &amp;quot;cuda&amp;quot; if torch.cuda.is_available() else &amp;quot;cpu&amp;quot;
model, preprocess = clip.load(&amp;quot;ViT-B/32&amp;quot;, device=device)
# Load and preprocess the image
image_path = &amp;quot;cat.jpg&amp;quot; # Replace with your image path
image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
# Prepare text descriptions
text_descriptions = [&amp;quot;a photo of a cat&amp;quot;, &amp;quot;a photo of a dog&amp;quot;]
text_tokens = clip.tokenize(text_descriptions).to(device)
with torch.no_grad():
# Encode images and text
image_features = model.encode_image(image)
text_features = model.encode_text(text_tokens)
# Calculate similarity
logits_per_image, logits_per_text = model(image, text_tokens)
probs = logits_per_image.softmax(dim=-1).cpu().numpy()
print(&amp;quot;Label probs:&amp;quot;, probs) # Output the matching probability between the image and each text description
&lt;/code>&lt;/pre>
&lt;h2 id="7-conclusion">7. Conclusion&lt;/h2>
&lt;p>Through its innovative contrastive learning method, CLIP successfully connects text and images in a shared representation space, demonstrating powerful zero-shot learning capabilities. It has not only achieved excellent results in multiple benchmark tests but has also opened new paths for the development of multimodal artificial intelligence.&lt;/p>
&lt;p>&lt;strong>Advantages&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Strong generalization ability and zero-shot performance.&lt;/li>
&lt;li>No need for fine-tuning for specific tasks, saving significant annotation costs.&lt;/li>
&lt;li>Can understand complex and abstract text descriptions.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Limitations&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>May perform poorly on very fine-grained classification tasks (such as identifying specific bird species).&lt;/li>
&lt;li>Limited understanding of abstract or systematic concepts (such as counting).&lt;/li>
&lt;li>The model's performance is highly dependent on the quality and scale of pre-training data.&lt;/li>
&lt;/ul>
&lt;p>Despite some limitations, CLIP remains one of the most important breakthroughs in artificial intelligence in recent years and continues to push the boundaries of multimodal research.&lt;/p></description></item><item><title>Mixture of Experts (MoE): Sparse Activation Architecture for Large-Scale Neural Networks</title><link>https://ziyanglin.netlify.app/en/post/moe-documentation/</link><pubDate>Fri, 27 Jun 2025 04:02:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/moe-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>Mixture of Experts (MoE) is a neural network architecture that dramatically expands model capacity without significantly increasing computational costs by decomposing large models into multiple smaller &amp;ldquo;expert&amp;rdquo; networks and using a &amp;ldquo;gating&amp;rdquo; network to dynamically select the most appropriate subset of experts for each input.&lt;/p>
&lt;p>This approach draws inspiration from expert systems in human society, where specific problems are directed to relevant specialists. In deep learning, this means the model can learn to route different inputs to expert networks specialized in processing that type of data, enabling more efficient and specialized learning.&lt;/p>
&lt;h2 id="2-core-components-macro-and-micro-analysis">2. Core Components: Macro and Micro Analysis&lt;/h2>
&lt;p>From a macro perspective, MoE layers typically serve as efficient alternatives to standard Feed-Forward Network (FFN) layers in Transformer models. While traditional FFN layers apply identical transformations to every token in a sequence, MoE layers introduce the concept of &lt;strong>Conditional Computation&lt;/strong>: for each token, the model dynamically selects a small subset of &amp;ldquo;expert&amp;rdquo; networks to process it, rather than engaging the entire model's parameters. This mechanism allows models to maintain relatively constant computation costs despite having enormous parameter counts.&lt;/p>
&lt;p>An MoE layer consists of two core components: &lt;strong>Expert Networks&lt;/strong> and a &lt;strong>Gating Network&lt;/strong>.
Below is a visualization of the macro architecture of an MoE layer:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph LR
A[Input Token] --&amp;gt; B{Gating Network};
B -- Routing Decision --&amp;gt; C1[Expert 1];
B -- Routing Decision --&amp;gt; C2[Expert 2];
B -- ... --&amp;gt; Cn[Expert n];
C1 --&amp;gt; D[Output];
C2 --&amp;gt; D;
Cn --&amp;gt; D;
&lt;/code>&lt;/pre>
&lt;h3 id="21-expert-networks-specialized-processors">2.1. Expert Networks: Specialized Processors&lt;/h3>
&lt;h4 id="underlying-structure-and-variants">Underlying Structure and Variants&lt;/h4>
&lt;p>At the foundational level, each &amp;ldquo;expert&amp;rdquo; is typically an independent feed-forward neural network (FFN). In standard Transformer architectures, an FFN usually consists of two linear layers and a non-linear activation function (such as GeLU or SwiGLU).&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Homogeneous Experts&lt;/strong>: In most MoE models, all experts share identical network structures. For example, in the Mixtral 8x7B model, each MoE layer contains 8 structurally identical expert FFNs. This design facilitates implementation and optimization.&lt;/li>
&lt;li>&lt;strong>Heterogeneous Experts&lt;/strong>: Though less common, experts can theoretically be heterogeneous, using different activation functions, hidden layer dimensions, or even more complex structures (like convolutional layers). This might allow the model to learn more diverse features but increases implementation complexity.&lt;/li>
&lt;/ul>
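&lt;p>As a concrete illustration, here is a minimal NumPy sketch of one homogeneous expert (a standard two-layer FFN with a GeLU activation). The dimensions are illustrative, not those of any real model:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 64, 256, 8  # illustrative sizes

def gelu(x):
    # tanh approximation of the GeLU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def make_expert():
    # Each expert is a standard two-layer FFN: Linear -> GeLU -> Linear
    return {"w1": rng.normal(0, 0.02, (d_model, d_hidden)),
            "w2": rng.normal(0, 0.02, (d_hidden, d_model))}

def expert_forward(expert, x):
    return gelu(x @ expert["w1"]) @ expert["w2"]

# A homogeneous bank of structurally identical experts, as in Mixtral 8x7B
experts = [make_expert() for _ in range(n_experts)]
tokens = rng.normal(size=(4, d_model))  # 4 tokens
out = expert_forward(experts[0], tokens)
print(out.shape)  # (4, 64)
```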
&lt;h4 id="functional-specialization-from-general-to-specialized">Functional Specialization: From General to Specialized&lt;/h4>
&lt;p>During training, although all experts start identical, the routing mechanism of the gating network guides them to develop different &amp;ldquo;specializations.&amp;rdquo; For example, in natural language processing tasks, after sufficient training, we might observe:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Grammar Experts&lt;/strong>: Specialized in processing tokens related to sentence structure, parts of speech, etc.&lt;/li>
&lt;li>&lt;strong>Semantic Experts&lt;/strong>: Focused on understanding word meanings and contextual relationships.&lt;/li>
&lt;li>&lt;strong>Domain-Specific Knowledge Experts&lt;/strong>: For instance, one expert might specialize in &amp;ldquo;legal&amp;rdquo; text, while another becomes more sensitive to &amp;ldquo;biomedical&amp;rdquo; domain knowledge.&lt;/li>
&lt;/ul>
&lt;p>This functional specialization is a key source of MoE models&amp;rsquo; efficiency, as it allows the model to process specific types of information with dedicated subnetworks rather than using a single large, general network for all information.&lt;/p>
&lt;h3 id="22-gating-network-intelligent-routing-and-dispatch-center">2.2. Gating Network: Intelligent Routing and Dispatch Center&lt;/h3>
&lt;p>The gating network is the core decision-making unit of MoE, responsible for assigning the most appropriate experts to each input token.&lt;/p>
&lt;h4 id="technical-details">Technical Details&lt;/h4>
&lt;p>The gating network implementation is typically concise and efficient. Its workflow is as follows:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Generate Logits&lt;/strong>: For the vector representation &lt;code>x&lt;/code> of an input token (typically the output from a self-attention layer), the gating network calculates routing logits through a simple trainable linear layer &lt;code>W_g&lt;/code>: &lt;code>logits = einsum(&amp;quot;d,de-&amp;gt;e&amp;quot;, x, W_g)&lt;/code>, where &lt;code>d&lt;/code> is the token dimension and &lt;code>e&lt;/code> is the number of experts. This operation produces a vector of length &lt;code>e&lt;/code>, with each element representing the &amp;ldquo;score&amp;rdquo; for the corresponding expert.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Top-K Routing Mechanism&lt;/strong>: To achieve sparse computation, tokens are not sent to all experts. The gating network selects the &lt;code>k&lt;/code> highest scores from the logits vector. This &lt;code>k&lt;/code> value is an important hyperparameter; in Mixtral 8x7B, &lt;code>k=2&lt;/code>. This means each token is processed by only the two most relevant experts.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Calculate Gating Weights (Softmax)&lt;/strong>: The selected &lt;code>k&lt;/code> logits are normalized through a Softmax function, generating &lt;code>k&lt;/code> gating weights that determine how to combine the outputs of these &lt;code>k&lt;/code> experts.
&lt;code>weights = softmax(top_k_logits)&lt;/code>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Calculate Final Output&lt;/strong>: The input token &lt;code>x&lt;/code> is sent to the selected &lt;code>k&lt;/code> experts, producing &lt;code>k&lt;/code> expert outputs. The final output is the weighted sum of these &lt;code>k&lt;/code> expert outputs, with weights being the gating weights calculated in the previous step.
&lt;code>output = sum(weights[i] * expert_i(x) for i in top_k_indices)&lt;/code>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>Below is a visualization of this workflow:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Input Token x] --&amp;gt; B{Multiply by Gating Weight Matrix W_g};
B --&amp;gt; C{Calculate Logits};
C --&amp;gt; D{Top-K Selection};
D -- k highest scores --&amp;gt; E{Softmax};
E -- Normalized weights --&amp;gt; F[Weighted Sum];
A -- Send to Top-K experts --&amp;gt; G1[&amp;quot;Expert i processes x&amp;quot;];
A -- Send to Top-K experts --&amp;gt; G2[&amp;quot;Expert j processes x&amp;quot;];
G1 --&amp;gt; F;
G2 --&amp;gt; F;
F --&amp;gt; H[Final Output];
&lt;/code>&lt;/pre>
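&lt;p>The four steps above can be sketched in a few lines of NumPy. This is a minimal single-token sketch with illustrative dimensions; the experts are stand-in linear maps rather than full FFNs:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, e, k = 16, 8, 2                      # token dim, number of experts, top-k

x = rng.normal(size=d)                  # one token representation
W_g = rng.normal(0, 0.02, (d, e))       # trainable gating weight matrix

# 1. Generate logits: one score per expert
logits = x @ W_g

# 2. Top-k routing: indices of the k highest-scoring experts
top_k_indices = np.argsort(logits)[-k:]

# 3. Softmax over the selected logits only
top_k_logits = logits[top_k_indices]
weights = np.exp(top_k_logits - top_k_logits.max())
weights /= weights.sum()

# 4. Weighted sum of the selected experts' outputs
expert_mats = [rng.normal(0, 0.02, (d, d)) for _ in range(e)]
output = sum(w * (x @ expert_mats[i]) for w, i in zip(weights, top_k_indices))

print(output.shape)  # (16,)
```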
&lt;h4 id="key-challenge-load-balancing">Key Challenge: Load Balancing&lt;/h4>
&lt;p>A critical challenge for the gating network is the &amp;ldquo;Matthew Effect&amp;rdquo;: some experts may receive more training opportunities due to slightly higher initial weights, becoming stronger and subsequently being selected more frequently, causing other experts to be &amp;ldquo;starved.&amp;rdquo; To address this issue, MoE introduces an &lt;strong>Auxiliary Load Balancing Loss&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Principle&lt;/strong>: This loss function encourages the gating network to distribute tokens as evenly as possible across all experts. A common formulation (used in Switch Transformers) multiplies, for each expert, the fraction of tokens in a batch routed to it by the mean routing probability the gating network assigns to it, sums over all experts, and scales the result by an adjustable hyperparameter &lt;code>α&lt;/code>. The loss value increases as the distribution becomes more unbalanced.&lt;/li>
&lt;li>&lt;strong>Optimization&lt;/strong>: This auxiliary loss is added to the model's main task loss (such as cross-entropy loss for language models) to form the final total loss function. By optimizing both losses during backpropagation, the model is incentivized to maintain load balance among experts while completing its main task.&lt;/li>
&lt;/ul>
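&lt;p>As a sketch of one common formulation, the Switch Transformer auxiliary loss &lt;code>alpha * E * sum_i(f_i * P_i)&lt;/code>, where &lt;code>f_i&lt;/code> is the fraction of tokens dispatched to expert &lt;code>i&lt;/code> and &lt;code>P_i&lt;/code> is the mean routing probability it receives. Exact forms vary across papers:&lt;/p>

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignment, n_experts, alpha=0.01):
    """Switch-Transformer-style loss: alpha * n_experts * sum_i f_i * P_i."""
    # f_i: fraction of tokens in the batch dispatched to expert i
    f = np.bincount(expert_assignment, minlength=n_experts) / len(expert_assignment)
    # P_i: mean routing probability the gate assigns to expert i
    P = np.asarray(router_probs).mean(axis=0)
    return alpha * n_experts * float(f @ P)

n_tokens, n_experts = 32, 4
# Perfectly balanced routing: uniform probabilities, tokens spread evenly
uniform = np.full((n_tokens, n_experts), 1 / n_experts)
balanced = np.arange(n_tokens) % n_experts
# Collapsed routing: every token goes to expert 0 with full confidence
one_hot = np.zeros((n_tokens, n_experts))
one_hot[:, 0] = 1.0
collapsed = np.zeros(n_tokens, dtype=int)

loss_balanced = load_balancing_loss(uniform, balanced, n_experts)
loss_collapsed = load_balancing_loss(one_hot, collapsed, n_experts)
print(loss_balanced, loss_collapsed)  # the balanced case attains the minimum, alpha
```

&lt;p>The imbalanced case yields a strictly larger loss, which is what pushes the gating network back toward even dispatch during training.&lt;/p>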
&lt;h2 id="3-moe-model-training-methods-addressing-scale-challenges">3. MoE Model Training Methods: Addressing Scale Challenges&lt;/h2>
&lt;p>Due to the enormous parameter count of MoE models (despite sparse computation), their training poses significant challenges to computational resources, especially memory. To effectively train MoE models, complex parallelization strategies must be employed.&lt;/p>
&lt;h3 id="31-expert-parallelism">3.1. Expert Parallelism&lt;/h3>
&lt;p>This is the core parallelization strategy for training MoE models.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: Distribute different experts across different computing devices (such as GPUs). For example, in a scenario with an MoE layer containing 8 experts and 8 GPUs, each GPU is responsible for storing and computing one expert. Other parts of the model (such as self-attention layers) can be replicated on each GPU.&lt;/li>
&lt;li>&lt;strong>Workflow and Communication Overhead&lt;/strong>: In each forward pass, tokens from various GPUs, after being processed by the gating network, need to be sent to the GPUs storing the corresponding experts based on routing decisions. This process involves a global &lt;strong>All-to-All&lt;/strong> communication operation, where each GPU needs to send and receive data to and from all other GPUs. After computation, results are sent back to the original GPUs through another All-to-All communication. This intensive communication is the main performance bottleneck in expert parallel mode.&lt;/li>
&lt;/ul>
&lt;h3 id="32-combining-with-other-parallelism-strategies">3.2. Combining with Other Parallelism Strategies&lt;/h3>
&lt;p>To address different scales of models and hardware configurations, expert parallelism often needs to be combined with other parallelism strategies:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Data Parallelism&lt;/strong>: This is the most common parallelism approach. When the number of GPUs exceeds the number of experts, multiple GPUs can form a data parallel group, with each group containing a complete set of experts (distributed through expert parallelism). For example, with 64 GPUs and 8 experts, 8 data parallel groups can be created, each with 8 GPUs, with each GPU responsible for one expert.&lt;/li>
&lt;li>&lt;strong>Model Parallelism and Pipeline Parallelism&lt;/strong>: For ultra-large models where even a single expert or non-MoE layer cannot fit into a single GPU, tensor model parallelism and pipeline parallelism need to be introduced to further split the model.&lt;/li>
&lt;/ul>
&lt;p>In summary, training MoE is a complex multi-dimensional parallel engineering task that requires careful design of parallelism strategies based on factors such as model size, number of experts, number of GPUs, and network bandwidth.&lt;/p>
&lt;h2 id="4-advantages-of-moe">4. Advantages of MoE&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Enormous Model Capacity&lt;/strong>: MoE allows models to have massive parameters (e.g., trillions of parameters) without needing to compute all parameters in each forward pass. This enables the model to learn more complex and detailed knowledge.&lt;/li>
&lt;li>&lt;strong>Controllable Computational Cost&lt;/strong>: Due to the sparse activation strategy (activating only a few experts), the training and inference costs of MoE models are comparable to dense models with far fewer total parameters.&lt;/li>
&lt;li>&lt;strong>Faster Training and Inference&lt;/strong>: Under the same computational budget, MoE models typically converge faster and have faster inference speeds compared to dense models.&lt;/li>
&lt;/ul>
&lt;h2 id="5-challenges-of-moe">5. Challenges of MoE&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Training Instability&lt;/strong>: The gating network may tend to always select a few &amp;ldquo;popular&amp;rdquo; experts, preventing other experts from being adequately trained. To address this issue, a &amp;ldquo;load balancing loss&amp;rdquo; is typically introduced to encourage the gating network to distribute inputs evenly across all experts.&lt;/li>
&lt;li>&lt;strong>High Communication Cost&lt;/strong>: In distributed training, since different experts may be distributed across different computing devices, routing input data from the gating network to selected experts incurs significant communication overhead.&lt;/li>
&lt;li>&lt;strong>Complex Implementation&lt;/strong>: Compared to standard dense models, MoE models are more complex to implement and deploy, requiring specialized parallel computing strategies and hardware support.&lt;/li>
&lt;li>&lt;strong>Memory Consumption&lt;/strong>: Although computation is sparse, all parameters of the model (all experts) need to be stored in memory, placing high demands on hardware.&lt;/li>
&lt;/ul>
&lt;h2 id="6-key-technologies-and-recent-advances">6. Key Technologies and Recent Advances&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Switch Transformers&lt;/strong>: This is a simplified MoE architecture proposed by Google that simplifies the top-k strategy to top-1, meaning each token is routed to only one expert. This design greatly simplifies routing logic and reduces communication costs.&lt;/li>
&lt;li>&lt;strong>GShard&lt;/strong>: This is a system for training MoE models on ultra-large-scale clusters. It effectively addresses the communication bottleneck in MoE training through clever data and model parallelism strategies.&lt;/li>
&lt;li>&lt;strong>Expert Capacity Factor&lt;/strong>: To handle load imbalance issues, a &amp;ldquo;capacity&amp;rdquo; can be set for each expert, defining the maximum number of tokens it can process in a batch. If an expert is selected more times than its capacity, excess tokens will be &amp;ldquo;dropped&amp;rdquo; or routed to other experts.&lt;/li>
&lt;li>&lt;strong>Latest Routing Strategies&lt;/strong>: Researchers are exploring more advanced routing strategies, such as allowing tokens to be routed to multiple experts with weighted combination of their outputs, or using more complex gating networks to make smarter routing decisions.&lt;/li>
&lt;li>&lt;strong>Applications in Computer Vision&lt;/strong>: MoE is not limited to NLP; it has also been successfully applied to computer vision tasks such as pose estimation, enhancing model performance by training specialized experts for different datasets or pose types.&lt;/li>
&lt;/ul>
&lt;h2 id="7-summary-and-outlook">7. Summary and Outlook&lt;/h2>
&lt;p>MoE models have successfully achieved massive model scaling at controllable computational costs by introducing sparsely activated expert networks, becoming a key technology for building ultra-large-scale language and vision models.&lt;/p>
&lt;p>Despite challenges in training stability and communication overhead, with the continued maturation of technologies like Switch Transformers and GShard, as well as the emergence of new routing strategies and hardware optimizations, the application prospects for MoE are increasingly broad. In the future, we can expect to see more, larger, and more efficient MoE models playing important roles across various domains.&lt;/p></description></item><item><title>LLM Hyperparameter Tuning Guide: A Comprehensive Analysis from Generation to Deployment</title><link>https://ziyanglin.netlify.app/en/post/llm-hyperparameters-documentation/</link><pubDate>Fri, 27 Jun 2025 03:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/llm-hyperparameters-documentation/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;h2 id="span-stylefontsize-09embehind-the-powerful-capabilities-of-large-language-models-llms-is-a-series-of-complex-hyperparameters-working-silently-whether-youre-deploying-a-local-inference-service-like-vllm-or-calling-openais-api-precisely-tuning-these-parameters-is-crucial-for-achieving-ideal-performance-cost-and-output-quality-this-document-provides-a-detailed-analysis-of-two-key-categories-of-hyperparameters-generation-sampling-parameters-and-deployment-serving-parameters-helping-you-fully-master-their-functions-values-impacts-and-best-practices-across-different-scenariosspan">&lt;span style="font-size: 0.9em;">Behind the powerful capabilities of large language models (LLMs) is a series of complex hyperparameters working silently. Whether you're deploying a local inference service like vLLM or calling OpenAI's API, precisely tuning these parameters is crucial for achieving ideal performance, cost, and output quality. This document provides a detailed analysis of two key categories of hyperparameters: &lt;strong>Generation (Sampling) Parameters&lt;/strong> and &lt;strong>Deployment (Serving) Parameters&lt;/strong>, helping you fully master their functions, values, impacts, and best practices across different scenarios.&lt;/span>&lt;/h2>
&lt;h3 id="part-1-generation-sampling-parameters--controlling-model-creativity-and-determinism">Part 1: Generation (Sampling) Parameters — Controlling Model Creativity and Determinism&lt;/h3>
&lt;p>Generation parameters directly control the model's behavior when generating the next token. They primarily revolve around a core question: how to select from thousands of possible next words in the probability distribution provided by the model.&lt;/p>
&lt;h3 id="1-temperature">1. &lt;code>temperature&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Controls the randomness of generated text. Higher &lt;code>temperature&lt;/code> increases randomness, making responses more creative and diverse; lower &lt;code>temperature&lt;/code> decreases randomness, making responses more deterministic and conservative.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle:&lt;/strong>
When generating the next token, the model calculates &lt;code>logits&lt;/code> (raw, unnormalized prediction scores) for all words in the vocabulary. Typically, we use the &lt;code>Softmax&lt;/code> function to convert these &lt;code>logits&lt;/code> into a probability distribution. The &lt;code>temperature&lt;/code> parameter is introduced before the &lt;code>Softmax&lt;/code> calculation, &amp;ldquo;smoothing&amp;rdquo; or &amp;ldquo;sharpening&amp;rdquo; this probability distribution.&lt;/p>
&lt;p>The standard Softmax formula is: &lt;code>P(i) = exp(logit_i) / Σ_j(exp(logit_j))&lt;/code>&lt;/p>
&lt;p>With &lt;code>temperature&lt;/code> (T) introduced, the formula becomes: &lt;code>P(i) = exp(logit_i / T) / Σ_j(exp(logit_j / T))&lt;/code>&lt;/p>
&lt;ul>
&lt;li>When &lt;code>T&lt;/code> -&amp;gt; 0, the differences in &lt;code>logit_i / T&lt;/code> become dramatically amplified. The token with the highest logit approaches a probability of 1, while all other tokens approach 0. This causes the model to almost always choose the most likely word, behaving very deterministically and &amp;ldquo;greedily.&amp;rdquo;&lt;/li>
&lt;li>When &lt;code>T&lt;/code> = 1, the formula reverts to standard Softmax, and the model behaves in its &amp;ldquo;original&amp;rdquo; state.&lt;/li>
&lt;li>When &lt;code>T&lt;/code> &amp;gt; 1, the differences in &lt;code>logit_i / T&lt;/code> are reduced. Tokens with originally lower probabilities get boosted, making the entire probability distribution &amp;ldquo;flatter.&amp;rdquo; This increases the chance of selecting less common words, introducing more randomness and creativity.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> &lt;code>[0.0, 2.0]&lt;/code> (theoretically can be higher, but OpenAI API typically limits to 2.0).&lt;/li>
&lt;li>&lt;strong>&lt;code>temperature&lt;/code> = 0.0:&lt;/strong> Suitable for scenarios requiring deterministic, reproducible, and highly accurate outputs. Examples: code generation, factual Q&amp;amp;A, text classification, data extraction. With identical inputs, outputs will be almost identical (unless the model itself is updated).&lt;/li>
&lt;li>&lt;strong>Low &lt;code>temperature&lt;/code> (e.g., &lt;code>0.1&lt;/code> - &lt;code>0.4&lt;/code>):&lt;/strong> Suitable for semi-creative tasks requiring rigor and fidelity to source material. Examples: article summarization, translation, customer service bots. Outputs will vary slightly but remain faithful to core content.&lt;/li>
&lt;li>&lt;strong>Medium &lt;code>temperature&lt;/code> (e.g., &lt;code>0.5&lt;/code> - &lt;code>0.8&lt;/code>):&lt;/strong> A good balance between creativity and consistency, recommended as the default for most applications. Examples: writing emails, marketing copy, brainstorming.&lt;/li>
&lt;li>&lt;strong>High &lt;code>temperature&lt;/code> (e.g., &lt;code>0.9&lt;/code> - &lt;code>1.5&lt;/code>):&lt;/strong> Suitable for highly creative tasks. Examples: poetry writing, story creation, dialogue script generation. Outputs will be very diverse and sometimes surprising, but may occasionally produce meaningless or incoherent content.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Note:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>It's generally not recommended to modify both &lt;code>temperature&lt;/code> and &lt;code>top_p&lt;/code> simultaneously; it's better to adjust just one. OpenAI's documentation explicitly states that modifying only one is typically advised.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
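&lt;p>The effect of &lt;code>T&lt;/code> is easy to see numerically. A minimal sketch of temperature-scaled Softmax over toy logits:&lt;/p>

```python
import numpy as np

def softmax_with_temperature(logits, T):
    # P(i) = exp(logit_i / T) / sum_j exp(logit_j / T)
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]
p_std = softmax_with_temperature(logits, 1.0)   # standard Softmax (T = 1)
p_cold = softmax_with_temperature(logits, 0.1)  # near one-hot: almost greedy
p_hot = softmax_with_temperature(logits, 2.0)   # flatter: more diverse sampling
print(p_std, p_cold, p_hot, sep="\n")
```

&lt;p>At &lt;code>T = 0.1&lt;/code> nearly all probability mass collapses onto the top token, while &lt;code>T = 2.0&lt;/code> visibly flattens the distribution.&lt;/p>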
&lt;h3 id="2-topp-nucleus-sampling">2. &lt;code>top_p&lt;/code> (Nucleus Sampling)&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Controls generation diversity by dynamically determining the sampling pool size through a cumulative probability threshold (&lt;code>p&lt;/code>) of the highest probability tokens.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle:&lt;/strong>
&lt;code>top_p&lt;/code> is a more intelligent sampling strategy than &lt;code>temperature&lt;/code>, also known as &lt;strong>Nucleus Sampling&lt;/strong>. Instead of adjusting all token probabilities, it directly defines a &amp;ldquo;core&amp;rdquo; candidate set.&lt;/p>
&lt;p>The specific steps are as follows:&lt;/p>
&lt;ol>
&lt;li>The model calculates the probability distribution for all candidate tokens.&lt;/li>
&lt;li>All tokens are sorted by probability from highest to lowest.&lt;/li>
&lt;li>Starting from the highest probability token, their probabilities are cumulatively added until this sum exceeds the set &lt;code>top_p&lt;/code> threshold.&lt;/li>
&lt;li>All tokens included in this cumulative sum form the &amp;ldquo;nucleus&amp;rdquo; for sampling.&lt;/li>
&lt;li>The model will only sample from this nucleus (typically renormalizing their probabilities), and all other tokens are ignored.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Example:&lt;/strong> Assume &lt;code>top_p&lt;/code> = &lt;code>0.9&lt;/code>.&lt;/p>
&lt;ul>
&lt;li>If the highest probability token &amp;ldquo;the&amp;rdquo; has a probability of &lt;code>0.95&lt;/code>, then the nucleus will contain only &amp;ldquo;the&amp;rdquo;, and the model will choose it 100%.&lt;/li>
&lt;li>If &amp;ldquo;the&amp;rdquo; has a probability of &lt;code>0.5&lt;/code>, &amp;ldquo;a&amp;rdquo; has &lt;code>0.3&lt;/code>, and &amp;ldquo;an&amp;rdquo; has &lt;code>0.1&lt;/code>, then the cumulative probability of these three words is &lt;code>0.9&lt;/code>. The nucleus will contain {&amp;ldquo;the&amp;rdquo;, &amp;ldquo;a&amp;rdquo;, &amp;ldquo;an&amp;rdquo;}. The model will sample from these three words according to their (renormalized) probabilities.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> &lt;code>(0.0, 1.0]&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;code>top_p&lt;/code> = 1.0:&lt;/strong> Means the model considers all tokens without any truncation (equivalent to no &lt;code>top_p&lt;/code>).&lt;/li>
&lt;li>&lt;strong>High &lt;code>top_p&lt;/code> (e.g., &lt;code>0.9&lt;/code> - &lt;code>1.0&lt;/code>):&lt;/strong> Allows for more diverse choices, suitable for creative tasks, similar in effect to higher &lt;code>temperature&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Low &lt;code>top_p&lt;/code> (e.g., &lt;code>0.1&lt;/code> - &lt;code>0.3&lt;/code>):&lt;/strong> Greatly restricts the model's range of choices, making its output very deterministic and conservative, similar in effect to extremely low &lt;code>temperature&lt;/code>.&lt;/li>
&lt;li>&lt;strong>General Recommended Value:&lt;/strong> &lt;code>0.9&lt;/code> is a very common default value as it maintains high quality while allowing for some diversity.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>top_p&lt;/code> vs &lt;code>temperature&lt;/code>:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;code>top_p&lt;/code> is more dynamic and adaptive. When the model is very confident about the next step (sharp probability distribution), &lt;code>top_p&lt;/code> automatically narrows the candidate set, ensuring quality. When the model is less confident (flat distribution), it expands the candidate set, increasing diversity.&lt;/li>
&lt;li>&lt;code>temperature&lt;/code> adjusts the entire distribution &amp;ldquo;equally,&amp;rdquo; regardless of whether the distribution itself is sharp or flat.&lt;/li>
&lt;li>Therefore, &lt;code>top_p&lt;/code> is generally considered a safer and more robust method for controlling diversity than &lt;code>temperature&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
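&lt;p>The five steps above can be sketched as a filtering function over a probability distribution (a minimal sketch; production samplers work on logits and batches):&lt;/p>

```python
import numpy as np

def nucleus_filter(probs, top_p):
    """Restrict to the smallest set of highest-probability tokens whose
    cumulative probability reaches top_p, then renormalize; others get 0."""
    order = np.argsort(probs)[::-1]                 # sort high to low
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, top_p)) + 1   # first index where cum >= top_p
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep] / probs[keep].sum()     # renormalize the nucleus
    return out

# The "the"/"a"/"an" example from the text, plus a long tail
probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
filtered = nucleus_filter(probs, 0.9)
print(filtered)  # only the first three tokens survive
```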
&lt;h3 id="3-topk">3. &lt;code>top_k&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Simply and directly samples only from the &lt;code>k&lt;/code> tokens with the highest probabilities.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle:&lt;/strong> This is the simplest truncation sampling method. It directly selects the &lt;code>k&lt;/code> tokens with the highest probabilities to form the candidate set, then samples from these &lt;code>k&lt;/code> tokens. All other tokens are ignored.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> Integers, such as &lt;code>1&lt;/code>, &lt;code>10&lt;/code>, &lt;code>50&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;code>top_k&lt;/code> = 1:&lt;/strong> Equivalent to greedy search, always choosing the most likely word.&lt;/li>
&lt;li>&lt;strong>Recommendation:&lt;/strong> &lt;code>top_k&lt;/code> is typically not the preferred sampling strategy because it's too &amp;ldquo;rigid.&amp;rdquo; In cases where the probability distribution is very flat, it might accidentally exclude many reasonable words; while in cases where the distribution is very sharp, it might include many extremely low-probability, useless words. &lt;code>top_p&lt;/code> is usually a better choice.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
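&lt;p>For comparison with nucleus sampling, the &lt;code>top_k&lt;/code> truncation is a one-liner (a minimal sketch over a toy distribution):&lt;/p>

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most probable tokens and renormalize."""
    keep = np.argsort(probs)[-k:]
    out = np.zeros_like(probs)
    out[keep] = probs[keep] / probs[keep].sum()
    return out

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
greedy = top_k_filter(probs, 1)   # k = 1: greedy, all mass on the best token
trimmed = top_k_filter(probs, 3)  # k = 3: fixed-size candidate set
print(greedy, trimmed, sep="\n")
```

&lt;p>Note that the candidate set is always exactly &lt;code>k&lt;/code> tokens regardless of how sharp or flat the distribution is, which is precisely the rigidity criticized above.&lt;/p>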
&lt;h3 id="4-repetitionpenalty">4. &lt;code>repetition_penalty&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Applies a penalty to tokens that have already appeared in the context, reducing their probability of being selected again, thereby reducing repetitive content.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle:&lt;/strong> After calculating &lt;code>logits&lt;/code> but before &lt;code>Softmax&lt;/code>, this parameter iterates through all candidate tokens. If a token has already appeared in the previous context, its &lt;code>logit&lt;/code> value is reduced (typically divided by the value of &lt;code>repetition_penalty&lt;/code>).&lt;/p>
&lt;p>&lt;code>new_logit = logit / penalty&lt;/code> (if the token has appeared and its logit is positive; implementations commonly multiply negative logits by the penalty instead, so a repeated token's probability always decreases)
&lt;code>new_logit = logit&lt;/code> (if the token has not appeared)&lt;/p>
&lt;p>This way, the final probability of words that have already appeared decreases.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> &lt;code>1.0&lt;/code> to &lt;code>2.0&lt;/code> is common.&lt;/li>
&lt;li>&lt;strong>&lt;code>1.0&lt;/code>:&lt;/strong> No penalty applied (default value).&lt;/li>
&lt;li>&lt;strong>&lt;code>1.1&lt;/code> - &lt;code>1.3&lt;/code>:&lt;/strong> A relatively safe range that can effectively reduce unnecessary repetition without overly affecting normal language expression (such as necessary articles like &amp;ldquo;the&amp;rdquo;).&lt;/li>
&lt;li>&lt;strong>Too High Values:&lt;/strong> May cause the model to deliberately avoid common words, producing unnatural or even strange sentences.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
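&lt;p>A minimal sketch of this penalty step, following the Hugging Face Transformers convention of dividing positive logits and multiplying negative ones so that a repeated token's probability always goes down:&lt;/p>

```python
import numpy as np

def apply_repetition_penalty(logits, seen_token_ids, penalty=1.2):
    """Penalize tokens that already appeared in the context:
    divide positive logits by `penalty`, multiply negative ones by it."""
    logits = np.asarray(logits, dtype=float).copy()
    for t in set(seen_token_ids):
        logits[t] = logits[t] / penalty if logits[t] > 0 else logits[t] * penalty
    return logits

logits = np.array([3.0, 1.5, -0.5])
penalized = apply_repetition_penalty(logits, seen_token_ids=[0, 2])
print(penalized)  # tokens 0 and 2 are pushed down; token 1 is untouched
```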
&lt;h3 id="5-frequencypenalty--presencepenalty">5. &lt;code>frequency_penalty&lt;/code> &amp;amp; &lt;code>presence_penalty&lt;/code>&lt;/h3>
&lt;p>These two parameters are more refined versions of &lt;code>repetition_penalty&lt;/code>.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>&lt;code>presence_penalty&lt;/code>:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Function:&lt;/strong> Applies a fixed penalty to all tokens that have &lt;strong>appeared at least once&lt;/strong> in the context. It doesn't care how many times the token has appeared; as long as it has appeared, it gets penalized.&lt;/li>
&lt;li>&lt;strong>Underlying Principle:&lt;/strong> &lt;code>new_logit = logit - presence_penalty&lt;/code> (if token has appeared at least once).&lt;/li>
&lt;li>&lt;strong>Scenario:&lt;/strong> This parameter is useful when you want to encourage the model to introduce entirely new concepts and vocabulary, rather than repeatedly discussing topics that have already been mentioned.&lt;/li>
&lt;li>&lt;strong>Range:&lt;/strong> &lt;code>-2.0&lt;/code> to &lt;code>2.0&lt;/code>. Positive values penalize tokens that have already appeared, encouraging the model to move on to new topics; negative values encourage repetition.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>frequency_penalty&lt;/code>:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Function:&lt;/strong> The penalty is proportional to the &lt;strong>frequency&lt;/strong> of the token in the context. The more times a word appears, the heavier the penalty it receives.&lt;/li>
&lt;li>&lt;strong>Underlying Principle:&lt;/strong> &lt;code>new_logit = logit - count(token) * frequency_penalty&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Scenario:&lt;/strong> This parameter is effective when you find the model tends to repeatedly use certain specific high-frequency words (even if they are necessary), leading to monotonous language.&lt;/li>
&lt;li>&lt;strong>Range:&lt;/strong> &lt;code>-2.0&lt;/code> to &lt;code>2.0&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Summary:&lt;/strong> &lt;code>presence_penalty&lt;/code> addresses the question of &amp;ldquo;whether it has appeared,&amp;rdquo; while &lt;code>frequency_penalty&lt;/code> addresses &amp;ldquo;how many times it has appeared.&amp;rdquo;&lt;/p>
&lt;/li>
&lt;/ul>
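The logit adjustments described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the actual implementation used by OpenAI or vLLM:

```python
from collections import Counter

def apply_penalties(logits, generated_tokens,
                    presence_penalty=0.0, frequency_penalty=0.0):
    """Adjust raw logits based on tokens already generated.

    presence_penalty: flat penalty if the token appeared at least once.
    frequency_penalty: penalty scaled by how many times it appeared.
    """
    counts = Counter(generated_tokens)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token in adjusted:
            adjusted[token] -= presence_penalty           # "has it appeared?"
            adjusted[token] -= count * frequency_penalty  # "how many times?"
    return adjusted

logits = {"the": 5.0, "cat": 3.0, "dog": 2.0}
# "the" appeared twice, "cat" once, "dog" never.
out = apply_penalties(logits, ["the", "the", "cat"],
                      presence_penalty=0.5, frequency_penalty=0.3)
print(out)  # "the" penalized most, "dog" untouched
```

Note how "the" receives both the flat presence penalty and twice the frequency penalty, while "dog", which never appeared, keeps its original logit.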
&lt;h3 id="6-seed">6. &lt;code>seed&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> By providing a fixed &lt;code>seed&lt;/code>, you can make the model's output reproducible when other parameters (such as &lt;code>temperature&lt;/code>) remain the same.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Function:&lt;/strong> In machine learning, many operations that seem random are actually &amp;ldquo;pseudo-random,&amp;rdquo; determined by an initial &amp;ldquo;seed.&amp;rdquo; Setting the same seed will produce the same sequence of random numbers. In LLMs, this means the sampling process will be completely deterministic.&lt;/li>
&lt;li>&lt;strong>Scenarios:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Debugging and Testing:&lt;/strong> When you need to verify whether a change has affected the output, fixing the &lt;code>seed&lt;/code> can eliminate randomness interference.&lt;/li>
&lt;li>&lt;strong>Reproducible Research:&lt;/strong> Reproducibility is crucial in academic research.&lt;/li>
&lt;li>&lt;strong>Generating Consistent Content:&lt;/strong> When you need the model to consistently produce outputs in the same style for the same input.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Note:&lt;/strong> For complete reproduction, &lt;strong>all&lt;/strong> generation parameters (&lt;code>prompt&lt;/code>, &lt;code>model&lt;/code>, &lt;code>temperature&lt;/code>, &lt;code>top_p&lt;/code>, etc.) must be identical.&lt;/li>
&lt;/ul>
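The effect of a fixed seed can be demonstrated with Python's own pseudo-random generator standing in for the engine's sampler. This is a conceptual sketch, not how an inference server implements seeding internally:

```python
import random

def sample_tokens(vocab, weights, n, seed=None):
    """Sample n tokens from a weighted vocabulary; a stand-in for LLM sampling."""
    rng = random.Random(seed)  # a fixed seed yields the same pseudo-random sequence
    return rng.choices(vocab, weights=weights, k=n)

vocab = ["sunny", "rainy", "cloudy"]
weights = [0.5, 0.3, 0.2]

a = sample_tokens(vocab, weights, 5, seed=42)
b = sample_tokens(vocab, weights, 5, seed=42)

print(a == b)  # True: same seed and same parameters reproduce the same output
```

As the note above stresses, reproducibility requires that every other input (prompt, model, temperature, and so on) stays identical too; the seed only pins down the randomness.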
&lt;hr>
&lt;h3 id="part-2-deployment-serving-parameters--optimizing-service-performance-and-capacity">Part 2: Deployment (Serving) Parameters — Optimizing Service Performance and Capacity&lt;/h3>
&lt;p>Deployment parameters determine how an LLM inference service manages GPU resources, handles concurrent requests, and optimizes overall throughput and latency. These parameters are particularly important in high-performance inference engines like vLLM.&lt;/p>
&lt;h3 id="1-gpumemoryutilization">1. &lt;code>gpu_memory_utilization&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Controls the proportion of GPU memory that vLLM can use, with the core purpose of reserving space for the &lt;strong>KV Cache&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle (PagedAttention):&lt;/strong>
The core of vLLM is the PagedAttention mechanism. Traditional attention mechanisms pre-allocate a continuous, maximum-length memory space for each request to store the Key-Value (KV) Cache. This leads to severe memory waste, as most requests are far shorter than the maximum length.&lt;/p>
&lt;p>PagedAttention manages the KV Cache like virtual memory in an operating system:&lt;/p>
&lt;ol>
&lt;li>It breaks down each sequence's KV Cache into many small, fixed-size &amp;ldquo;blocks.&amp;rdquo;&lt;/li>
&lt;li>These blocks can be stored non-contiguously in GPU memory.&lt;/li>
&lt;li>A central &amp;ldquo;Block Manager&amp;rdquo; is responsible for allocating and releasing these blocks.&lt;/li>
&lt;/ol>
&lt;p>&lt;code>gpu_memory_utilization&lt;/code> tells vLLM: &amp;ldquo;You may claim this fraction of the total GPU memory for your own management (mainly the model weights and the physical blocks of the KV Cache).&amp;rdquo;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Impact:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> &lt;code>(0.0, 1.0]&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Default Value:&lt;/strong> &lt;code>0.9&lt;/code> (i.e., 90%).&lt;/li>
&lt;li>&lt;strong>Higher Values (e.g., &lt;code>0.95&lt;/code>):&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> vLLM has more memory for KV Cache, supporting longer contexts and larger batch sizes, thereby increasing throughput.&lt;/li>
&lt;li>&lt;strong>Risk:&lt;/strong> If set too high, there might not be enough spare memory for CUDA kernels, drivers, or other system processes, easily leading to &lt;strong>OOM (Out of Memory)&lt;/strong> errors.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Lower Values (e.g., &lt;code>0.8&lt;/code>):&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> Safer, less prone to OOM, reserves more memory for the system and other applications.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> Reduced available space for KV Cache, potentially causing vLLM to struggle with high concurrency or long sequence requests, degrading performance. When KV Cache is insufficient, vLLM triggers &lt;strong>Preemption&lt;/strong>, swapping out some running sequences and waiting to swap them back in when there's enough space, severely affecting latency. vLLM's warning log &lt;code>&amp;quot;there is not enough KV cache space. This can affect the end-to-end performance.&amp;quot;&lt;/code> is reminding you of this issue.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Start with the default value of &lt;code>0.9&lt;/code>.&lt;/li>
&lt;li>If you encounter OOM, gradually lower this value.&lt;/li>
&lt;li>If you encounter many preemption warnings and confirm no other processes are occupying large amounts of GPU memory, you can gradually increase this value.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
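A back-of-envelope calculation makes the trade-off concrete: vLLM claims <code>total &times; utilization</code> memory, loads the weights into it, and (roughly) the remainder becomes KV Cache space. The numbers below (a ~14 GB FP16 7B model on a 24 GB GPU, 1 GB activation overhead) are illustrative assumptions, not vLLM's exact accounting:

```python
def kv_cache_budget_gb(total_gpu_gb, utilization,
                       model_weights_gb, activation_overhead_gb=1.0):
    """Rough estimate of memory left over for KV Cache blocks.

    vLLM claims total_gpu_gb * utilization, places the model weights
    inside it, and uses most of the remainder for KV Cache pages.
    """
    claimed = total_gpu_gb * utilization
    return max(claimed - model_weights_gb - activation_overhead_gb, 0.0)

# Illustrative: 7B model in FP16 (~14 GB of weights) on a 24 GB GPU.
for util in (0.80, 0.90, 0.95):
    print(f"utilization={util}: ~{kv_cache_budget_gb(24, util, 14):.1f} GB for KV Cache")
```

The jump from 0.80 to 0.95 nearly doubles the KV Cache budget here, which is why raising this value is such an effective (if risky) throughput lever.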
&lt;h3 id="2-maxnumseqs">2. &lt;code>max_num_seqs&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Limits the maximum number of sequences (requests) that the vLLM scheduler can process &lt;strong>in one iteration (or one batch)&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle:&lt;/strong>
vLLM's scheduler selects a batch of requests from the waiting queue in each processing cycle. This parameter directly limits the size of this &amp;ldquo;batch.&amp;rdquo; Together with &lt;code>max_num_batched_tokens&lt;/code> (which limits the total number of tokens across all sequences in a batch), it determines the scale of batch processing.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Impact:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> Positive integers, such as &lt;code>16&lt;/code>, &lt;code>64&lt;/code>, &lt;code>256&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Higher Values:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> Allows for higher concurrency, potentially improving GPU utilization and overall throughput.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> Requires more intermediate memory (e.g., for storing &lt;code>logits&lt;/code> and sampling states) and may increase the latency of individual batches. If set too high, even if KV Cache still has space, OOM might occur due to insufficient temporary memory.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Lower Values:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> More memory-friendly, potentially lower latency for individual batches.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> Limits concurrency capability, potentially leading to underutilization of GPU and decreased throughput.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>This value needs to be adjusted based on your GPU memory size, model size, and expected concurrent load.&lt;/li>
&lt;li>For high-concurrency scenarios, try gradually increasing this value while monitoring GPU utilization and memory usage.&lt;/li>
&lt;li>For interactive, low-latency scenarios, consider setting this value lower.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
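How <code>max_num_seqs</code> and <code>max_num_batched_tokens</code> jointly cap a batch can be sketched as a greedy admission loop. This is a simplified illustration of the constraint, not vLLM's actual scheduler:

```python
def admit_batch(waiting, max_num_seqs, max_num_batched_tokens):
    """Greedily admit requests from the waiting queue into one batch.

    waiting: list of (request_id, num_tokens) in arrival order.
    A request is admitted only while both limits still hold.
    """
    batch, total_tokens = [], 0
    for request_id, num_tokens in waiting:
        if len(batch) >= max_num_seqs:
            break  # sequence-count cap (max_num_seqs)
        if total_tokens + num_tokens > max_num_batched_tokens:
            break  # token-count cap (max_num_batched_tokens)
        batch.append(request_id)
        total_tokens += num_tokens
    return batch, total_tokens

queue = [("r1", 512), ("r2", 1024), ("r3", 2048), ("r4", 256)]
print(admit_batch(queue, max_num_seqs=3, max_num_batched_tokens=2048))
# r1 and r2 fit (1536 tokens); r3 would exceed the token cap
```

Either limit can be the binding one: with short requests <code>max_num_seqs</code> dominates, with long prompts <code>max_num_batched_tokens</code> does.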
&lt;h3 id="3-maxmodellen">3. &lt;code>max_model_len&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Sets the &lt;strong>maximum context length&lt;/strong> the model can process (including both prompt and generated tokens).&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle:&lt;/strong>
This parameter directly determines how much logical space vLLM needs to reserve for the KV Cache. For example, if &lt;code>max_model_len&lt;/code> = &lt;code>4096&lt;/code>, vLLM must ensure its memory management mechanism can support storing KV pairs for up to &lt;code>4096&lt;/code> tokens per sequence.
This affects vLLM's memory planning at startup, such as the size of Position Embeddings.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Impact:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> Positive integers, cannot exceed the maximum length the model was originally trained on.&lt;/li>
&lt;li>&lt;strong>Higher Values:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> Can handle longer documents and more complex contexts.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> &lt;strong>Significantly increases&lt;/strong> memory consumption. Each token needs to store KV Cache; doubling the length roughly doubles the memory usage. Even if current requests are short, vLLM needs to prepare for potentially long requests, which occupies more KV Cache blocks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Lower Values:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> &lt;strong>Significantly saves&lt;/strong> GPU memory. If you know your application scenario will never exceed 1024 tokens, setting this value to 1024 instead of the default 4096 or 8192 will free up a large amount of KV Cache space, supporting higher concurrency.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> Any requests exceeding this length will be rejected or truncated.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Set as needed!&lt;/strong> This is one of the most effective parameters for optimizing vLLM memory usage. Based on your actual application scenario, set this value to a reasonable maximum with some margin.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
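The per-sequence cost behind this advice follows from the standard KV Cache formula: two tensors (K and V) per layer per token. The model shapes below (32 layers, 32 KV heads, head dimension 128, FP16 — roughly Llama-2-7B-like) are illustrative assumptions:

```python
def kv_cache_bytes_per_seq(max_model_len, num_layers, num_kv_heads,
                           head_dim, dtype_bytes=2):
    """Worst-case KV Cache for one sequence: 2 tensors (K and V)
    per layer, per token, each num_kv_heads * head_dim wide."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * max_model_len

# Llama-2-7B-like shapes in FP16.
for length in (1024, 4096):
    gb = kv_cache_bytes_per_seq(length, 32, 32, 128) / 1024**3
    print(f"max_model_len={length}: ~{gb:.2f} GB per sequence")
```

With these shapes a single 4096-token sequence reserves about 2 GB of worst-case KV Cache, which is why trimming <code>max_model_len</code> to your real maximum frees so much room for concurrency.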
&lt;h3 id="4-tensorparallelsize--pipelineparallelsize">4. &lt;code>tensor_parallel_size&lt;/code> &amp;amp; &lt;code>pipeline_parallel_size&lt;/code>&lt;/h3>
&lt;p>These two parameters are used for deploying extremely large models across multiple GPUs or nodes.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>&lt;code>tensor_parallel_size&lt;/code>:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Function:&lt;/strong> Divides &lt;strong>each layer&lt;/strong> of the model (such as a large weight matrix) into &lt;code>N&lt;/code> parts (&lt;code>N&lt;/code> = &lt;code>tensor_parallel_size&lt;/code>), placing them on &lt;code>N&lt;/code> different GPUs. During computation, each GPU only processes its own portion of the data, then exchanges necessary results through high-speed interconnects (like NVLink) via All-Reduce operations, finally merging to get the complete output.&lt;/li>
&lt;li>&lt;strong>Scenario:&lt;/strong> Used when a single model's volume exceeds the memory of a single GPU. For example, a 70B model in FP16 needs roughly 140GB for its weights alone, far more than a single 80GB A100 can hold, but it can be deployed across two 80GB A100s by setting &lt;code>tensor_parallel_size=2&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Impact:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> Achieves model parallelism, solving the problem of models not fitting on a single card.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> Introduces significant cross-GPU communication overhead, potentially affecting latency. Requires high-speed interconnects between GPUs.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>pipeline_parallel_size&lt;/code>:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Function:&lt;/strong> Assigns &lt;strong>different layers&lt;/strong> of the model to different GPUs or nodes. For example, placing layers 1-10 on GPU 1, layers 11-20 on GPU 2, and so on. Data flows through these GPUs like a pipeline.&lt;/li>
&lt;li>&lt;strong>Scenario:&lt;/strong> Used when the model is extremely large and needs to be deployed across multiple nodes (machines).&lt;/li>
&lt;li>&lt;strong>Impact:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> Can scale the model to any number of GPUs/nodes.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> Creates &amp;ldquo;pipeline bubbles&amp;rdquo; as additional overhead, where some GPUs are idle during the start and end phases of the pipeline, reducing utilization.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Combined Use:&lt;/strong>
vLLM supports using both parallelism strategies simultaneously for efficient deployment of giant models on large clusters.&lt;/p>
&lt;/li>
&lt;/ul>
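The core idea of tensor parallelism — each GPU holds a shard of a weight matrix, computes a partial result, and an All-Reduce combines them — can be shown with plain Python lists simulating two "GPUs". This is a conceptual sketch of the math, not a distributed implementation:

```python
def matmul(x, W):
    """x: vector (list), W: matrix (list of rows) -> x @ W."""
    cols = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(cols)]

def vec_add(a, b):
    """Element-wise sum: stands in for the All-Reduce across GPUs."""
    return [p + q for p, q in zip(a, b)]

# A 4x2 weight matrix, split row-wise across two "GPUs".
W = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1, 1, 1]

full = matmul(x, W)  # single-GPU reference result

# GPU 0 holds rows 0-1 and the matching slice of x; GPU 1 holds rows 2-3.
partial0 = matmul(x[:2], W[:2])
partial1 = matmul(x[2:], W[2:])
combined = vec_add(partial0, partial1)  # the "All-Reduce" step

print(full, combined)  # both [16, 20]
```

The partial results are useless on their own; the communication step is mandatory after every sharded layer, which is exactly the overhead the text warns about.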
&lt;hr>
&lt;h3 id="summary-and-best-practices">Summary and Best Practices&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Scenario&lt;/th>
&lt;th align="left">&lt;code>temperature&lt;/code>&lt;/th>
&lt;th align="left">&lt;code>top_p&lt;/code>&lt;/th>
&lt;th align="left">&lt;code>repetition_penalty&lt;/code>&lt;/th>
&lt;th align="left">&lt;code>gpu_memory_utilization&lt;/code>&lt;/th>
&lt;th align="left">&lt;code>max_num_seqs&lt;/code>&lt;/th>
&lt;th align="left">&lt;code>max_model_len&lt;/code>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">&lt;strong>Code Generation/Factual Q&amp;amp;A&lt;/strong>&lt;/td>
&lt;td align="left">&lt;code>0.0&lt;/code> - &lt;code>0.2&lt;/code>&lt;/td>
&lt;td align="left">(Not recommended to modify)&lt;/td>
&lt;td align="left">&lt;code>1.0&lt;/code>&lt;/td>
&lt;td align="left">&lt;code>0.9&lt;/code> (Default)&lt;/td>
&lt;td align="left">Adjust based on concurrency&lt;/td>
&lt;td align="left">Set as needed&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Article Summarization/Translation&lt;/strong>&lt;/td>
&lt;td align="left">&lt;code>0.2&lt;/code> - &lt;code>0.5&lt;/code>&lt;/td>
&lt;td align="left">(Not recommended to modify)&lt;/td>
&lt;td align="left">&lt;code>1.1&lt;/code>&lt;/td>
&lt;td align="left">&lt;code>0.9&lt;/code>&lt;/td>
&lt;td align="left">Adjust based on concurrency&lt;/td>
&lt;td align="left">Set to maximum possible document length&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>General Chat/Copywriting&lt;/strong>&lt;/td>
&lt;td align="left">&lt;code>0.7&lt;/code> (Default)&lt;/td>
&lt;td align="left">&lt;code>0.9&lt;/code> (Recommended)&lt;/td>
&lt;td align="left">&lt;code>1.1&lt;/code> - &lt;code>1.2&lt;/code>&lt;/td>
&lt;td align="left">&lt;code>0.9&lt;/code>&lt;/td>
&lt;td align="left">Adjust based on concurrency&lt;/td>
&lt;td align="left">Set as needed, e.g., &lt;code>4096&lt;/code>|&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Creative Writing/Brainstorming&lt;/strong>&lt;/td>
&lt;td align="left">&lt;code>0.8&lt;/code> - &lt;code>1.2&lt;/code>&lt;/td>
&lt;td align="left">&lt;code>0.95&lt;/code>&lt;/td>
&lt;td align="left">&lt;code>1.0&lt;/code>&lt;/td>
&lt;td align="left">&lt;code>0.9&lt;/code>&lt;/td>
&lt;td align="left">Adjust based on concurrency&lt;/td>
&lt;td align="left">Set as needed&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>High Concurrency Throughput Optimization&lt;/strong>&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">Try &lt;code>0.9&lt;/code> - &lt;code>0.95&lt;/code>&lt;/td>
&lt;td align="left">Gradually increase&lt;/td>
&lt;td align="left">Set to the &lt;strong>minimum&lt;/strong> value that meets business needs&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Low Latency Interaction Optimization&lt;/strong>&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">&lt;code>0.9&lt;/code> (Default)&lt;/td>
&lt;td align="left">Set to lower values (e.g., &lt;code>16-64&lt;/code>)&lt;/td>
&lt;td align="left">Set as needed&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Extremely Memory Constrained&lt;/strong>&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">Lower to &lt;code>0.8&lt;/code>&lt;/td>
&lt;td align="left">Set to lower values&lt;/td>
&lt;td align="left">Set to the &lt;strong>minimum&lt;/strong> value that meets business needs&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>Final Recommendations:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Start with Generation Parameters:&lt;/strong> First adjust &lt;code>temperature&lt;/code> or &lt;code>top_p&lt;/code> to achieve satisfactory output quality.&lt;/li>
&lt;li>&lt;strong>Set Deployment Parameters as Needed:&lt;/strong> When deploying, first set &lt;code>max_model_len&lt;/code> to a reasonable minimum value based on your application scenario.&lt;/li>
&lt;li>&lt;strong>Monitor and Iterate:&lt;/strong> Start with the default &lt;code>gpu_memory_utilization=0.9&lt;/code> and a moderate &lt;code>max_num_seqs&lt;/code>. Observe memory usage and preemption situations through monitoring tools (such as &lt;code>nvidia-smi&lt;/code> and vLLM logs), then gradually adjust these values to find the optimal balance for your specific hardware and workload.&lt;/li>
&lt;/ol></description></item><item><title>Ollama Practical Guide: Local Deployment and Management of Large Language Models</title><link>https://ziyanglin.netlify.app/en/post/ollama-documentation/</link><pubDate>Fri, 27 Jun 2025 02:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/ollama-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>Ollama is a powerful open-source tool designed to allow users to easily download, run, and manage large language models (LLMs) in local environments. Its core advantage lies in simplifying the deployment and use of complex models, enabling developers, researchers, and enthusiasts to experience and utilize state-of-the-art artificial intelligence technology on personal computers without specialized hardware or complex configurations.&lt;/p>
&lt;p>&lt;strong>Key Advantages:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Ease of Use:&lt;/strong> Complete model download, running, and interaction through simple command-line instructions.&lt;/li>
&lt;li>&lt;strong>Cross-Platform Support:&lt;/strong> Supports macOS, Windows, and Linux.&lt;/li>
&lt;li>&lt;strong>Rich Model Library:&lt;/strong> Supports numerous popular open-source models such as Llama 3, Mistral, Gemma, Phi-3, and more.&lt;/li>
&lt;li>&lt;strong>Highly Customizable:&lt;/strong> Through &lt;code>Modelfile&lt;/code>, users can easily customize model behavior, system prompts, and parameters.&lt;/li>
&lt;li>&lt;strong>API-Driven:&lt;/strong> Provides a REST API for easy integration with other applications and services.&lt;/li>
&lt;li>&lt;strong>Open Source Community:&lt;/strong> Has an active community continuously contributing new models and features.&lt;/li>
&lt;/ul>
&lt;p>This document will provide a comprehensive introduction to Ollama's various features, from basic fundamentals to advanced applications, helping you fully master this powerful tool.&lt;/p>
&lt;hr>
&lt;h2 id="2-quick-start">2. Quick Start&lt;/h2>
&lt;p>This section will guide you through installing and basic usage of Ollama.&lt;/p>
&lt;h3 id="21-installation">2.1 Installation&lt;/h3>
&lt;p>Visit the &lt;a href="https://ollama.com/">Ollama official website&lt;/a> to download and install the package suitable for your operating system.&lt;/p>
&lt;h3 id="22-running-your-first-model">2.2 Running Your First Model&lt;/h3>
&lt;p>After installation, open a terminal (or command prompt) and use the &lt;code>ollama run&lt;/code> command to download and run a model. For example, to run the Llama 3 model:&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama run llama3
&lt;/code>&lt;/pre>
&lt;p>On first run, Ollama will automatically download the required model files from the model library. Once the download is complete, you can directly converse with the model in the terminal.&lt;/p>
&lt;h3 id="23-managing-local-models">2.3 Managing Local Models&lt;/h3>
&lt;p>You can use the following commands to manage locally downloaded models:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>List Local Models:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama list
&lt;/code>&lt;/pre>
&lt;p>This command displays the name, ID, size, and modification time of all downloaded models.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Remove Local Models:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama rm &amp;lt;model_name&amp;gt;
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="3-core-concepts">3. Core Concepts&lt;/h2>
&lt;h3 id="31-modelfile">3.1 Modelfile&lt;/h3>
&lt;p>&lt;code>Modelfile&lt;/code> is one of Ollama's core features. It's a configuration file similar to &lt;code>Dockerfile&lt;/code> that allows you to define and create custom models. Through &lt;code>Modelfile&lt;/code>, you can:&lt;/p>
&lt;ul>
&lt;li>Specify a base model.&lt;/li>
&lt;li>Set model parameters (such as temperature, top_p, etc.).&lt;/li>
&lt;li>Define the model's system prompt.&lt;/li>
&lt;li>Customize the model's interaction template.&lt;/li>
&lt;li>Apply LoRA adapters.&lt;/li>
&lt;/ul>
&lt;p>A simple &lt;code>Modelfile&lt;/code> example:&lt;/p>
&lt;pre>&lt;code class="language-Modelfile"># Specify base model
FROM llama3
# Set model temperature
PARAMETER temperature 0.8
# Set system prompt
SYSTEM &amp;quot;&amp;quot;&amp;quot;
You are a helpful AI assistant. Your name is Roo.
&amp;quot;&amp;quot;&amp;quot;
&lt;/code>&lt;/pre>
&lt;p>Use the &lt;code>ollama create&lt;/code> command to create a new model based on a &lt;code>Modelfile&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama create my-custom-model -f ./Modelfile
&lt;/code>&lt;/pre>
&lt;h3 id="32-model-import">3.2 Model Import&lt;/h3>
&lt;p>Ollama supports importing models from external file systems, particularly from &lt;code>Safetensors&lt;/code> format weight files.&lt;/p>
&lt;p>In a &lt;code>Modelfile&lt;/code>, use the &lt;code>FROM&lt;/code> directive and provide the directory path containing &lt;code>safetensors&lt;/code> files:&lt;/p>
&lt;pre>&lt;code class="language-Modelfile">FROM /path/to/safetensors/directory
&lt;/code>&lt;/pre>
&lt;p>Then use the &lt;code>ollama create&lt;/code> command to create the model.&lt;/p>
&lt;h3 id="33-multimodal-models">3.3 Multimodal Models&lt;/h3>
&lt;p>Ollama supports multimodal models (such as LLaVA) that can process both text and image inputs simultaneously.&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama run llava &amp;quot;What's in this image? /path/to/image.png&amp;quot;
&lt;/code>&lt;/pre>
&lt;hr>
&lt;h2 id="4-api-reference">4. API Reference&lt;/h2>
&lt;p>Ollama provides a set of REST APIs for programmatically interacting with models. The default service address is &lt;code>http://localhost:11434&lt;/code>.&lt;/p>
&lt;h3 id="41-apigenerate">4.1 &lt;code>/api/generate&lt;/code>&lt;/h3>
&lt;p>Generate text.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Request (Streaming):&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/generate -d '{
&amp;quot;model&amp;quot;: &amp;quot;llama3&amp;quot;,
&amp;quot;prompt&amp;quot;: &amp;quot;Why is the sky blue?&amp;quot;
}'
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Request (Non-streaming):&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/generate -d '{
&amp;quot;model&amp;quot;: &amp;quot;llama3&amp;quot;,
&amp;quot;prompt&amp;quot;: &amp;quot;Why is the sky blue?&amp;quot;,
&amp;quot;stream&amp;quot;: false
}'
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h3 id="42-apichat">4.2 &lt;code>/api/chat&lt;/code>&lt;/h3>
&lt;p>Conduct multi-turn conversations.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Request:&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/chat -d '{
&amp;quot;model&amp;quot;: &amp;quot;llama3&amp;quot;,
&amp;quot;messages&amp;quot;: [
{
&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;,
&amp;quot;content&amp;quot;: &amp;quot;why is the sky blue?&amp;quot;
}
],
&amp;quot;stream&amp;quot;: false
}'
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h3 id="43-apiembed">4.3 &lt;code>/api/embed&lt;/code>&lt;/h3>
&lt;p>Generate embedding vectors for text.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Request:&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/embed -d '{
&amp;quot;model&amp;quot;: &amp;quot;all-minilm&amp;quot;,
&amp;quot;input&amp;quot;: [&amp;quot;Why is the sky blue?&amp;quot;, &amp;quot;Why is the grass green?&amp;quot;]
}'
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h3 id="44-apitags">4.4 &lt;code>/api/tags&lt;/code>&lt;/h3>
&lt;p>List all locally available models.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Request:&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/tags
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
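Since <code>/api/chat</code> is stateless, multi-turn conversation means resending the full <code>messages</code> history on every call. A small sketch of building such a request body (constructing the JSON only; the actual HTTP call to a running Ollama instance is omitted):

```python
import json

def build_chat_request(model, history, user_message):
    """Build the JSON body for Ollama's /api/chat.

    The server keeps no conversation state, so the caller appends
    each new turn to the history and resends the whole list.
    """
    messages = history + [{"role": "user", "content": user_message}]
    return json.dumps({"model": model, "messages": messages, "stream": False})

history = [
    {"role": "user", "content": "why is the sky blue?"},
    {"role": "assistant", "content": "Because of Rayleigh scattering."},
]
body = build_chat_request("llama3", history, "Explain that more simply.")
print(body)
```

After each response, the client would append the assistant's reply to <code>history</code> before the next turn.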
&lt;hr>
&lt;h2 id="5-command-line-tools-cli">5. Command Line Tools (CLI)&lt;/h2>
&lt;p>Ollama provides a rich set of command-line tools for managing models and interacting with the service.&lt;/p>
&lt;ul>
&lt;li>&lt;code>ollama run &amp;lt;model&amp;gt;&lt;/code>: Run a model.&lt;/li>
&lt;li>&lt;code>ollama create &amp;lt;model&amp;gt; -f &amp;lt;Modelfile&amp;gt;&lt;/code>: Create a model from a Modelfile.&lt;/li>
&lt;li>&lt;code>ollama pull &amp;lt;model&amp;gt;&lt;/code>: Pull a model from a remote repository.&lt;/li>
&lt;li>&lt;code>ollama push &amp;lt;model&amp;gt;&lt;/code>: Push a model to a remote repository.&lt;/li>
&lt;li>&lt;code>ollama list&lt;/code>: List local models.&lt;/li>
&lt;li>&lt;code>ollama cp &amp;lt;source_model&amp;gt; &amp;lt;dest_model&amp;gt;&lt;/code>: Copy a model.&lt;/li>
&lt;li>&lt;code>ollama rm &amp;lt;model&amp;gt;&lt;/code>: Delete a model.&lt;/li>
&lt;li>&lt;code>ollama ps&lt;/code>: View running models and their resource usage.&lt;/li>
&lt;li>&lt;code>ollama stop &amp;lt;model&amp;gt;&lt;/code>: Stop a running model and unload it from memory.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="6-advanced-features">6. Advanced Features&lt;/h2>
&lt;h3 id="61-openai-api-compatibility">6.1 OpenAI API Compatibility&lt;/h3>
&lt;p>Ollama provides an endpoint compatible with the OpenAI API, allowing you to seamlessly migrate existing OpenAI applications to Ollama. The default address is &lt;code>http://localhost:11434/v1&lt;/code>.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>List Models (Python):&lt;/strong>
&lt;pre>&lt;code class="language-python">from openai import OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama', # required, but unused
)
response = client.models.list()
print(response)
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h3 id="62-structured-output">6.2 Structured Output&lt;/h3>
&lt;p>By combining the OpenAI-compatible API with Pydantic, you can force the model to output JSON with a specific structure.&lt;/p>
&lt;pre>&lt;code class="language-python">from pydantic import BaseModel
from openai import OpenAI
client = OpenAI(base_url=&amp;quot;http://localhost:11434/v1&amp;quot;, api_key=&amp;quot;ollama&amp;quot;)
class UserInfo(BaseModel):
name: str
age: int
try:
completion = client.beta.chat.completions.parse(
model=&amp;quot;llama3.1:8b&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;My name is John and I am 30 years old.&amp;quot;}],
response_format=UserInfo,
)
print(completion.choices[0].message.parsed)
except Exception as e:
print(f&amp;quot;Error: {e}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h3 id="63-performance-tuning">6.3 Performance Tuning&lt;/h3>
&lt;p>You can adjust Ollama's performance and resource management through environment variables:&lt;/p>
&lt;ul>
&lt;li>&lt;code>OLLAMA_KEEP_ALIVE&lt;/code>: Set how long models remain active in memory. For example, &lt;code>10m&lt;/code>, &lt;code>24h&lt;/code>, or &lt;code>-1&lt;/code> (permanent).&lt;/li>
&lt;li>&lt;code>OLLAMA_MAX_LOADED_MODELS&lt;/code>: Maximum number of models loaded into memory simultaneously.&lt;/li>
&lt;li>&lt;code>OLLAMA_NUM_PARALLEL&lt;/code>: Number of requests each model can process in parallel.&lt;/li>
&lt;/ul>
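To make the <code>OLLAMA_KEEP_ALIVE</code> duration format concrete, here is a hypothetical helper that interprets values like <code>10m</code>, <code>24h</code>, or <code>-1</code>. This is purely illustrative — it is not Ollama's own parsing code:

```python
def keep_alive_seconds(value):
    """Interpret an OLLAMA_KEEP_ALIVE-style value (hypothetical helper).

    Returns seconds; -1 means keep the model loaded permanently.
    """
    value = str(value).strip()
    if value == "-1":
        return -1
    units = {"s": 1, "m": 60, "h": 3600}
    if value and value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)  # bare number: treated as seconds

print(keep_alive_seconds("10m"))  # 600
print(keep_alive_seconds("24h"))  # 86400
print(keep_alive_seconds(-1))     # -1
```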
&lt;h3 id="64-lora-adapters">6.4 LoRA Adapters&lt;/h3>
&lt;p>Use the &lt;code>ADAPTER&lt;/code> directive in a &lt;code>Modelfile&lt;/code> to apply a LoRA (Low-Rank Adaptation) adapter, changing the model's behavior without modifying the base model weights.&lt;/p>
&lt;pre>&lt;code class="language-Modelfile">FROM llama3
ADAPTER /path/to/your-lora-adapter.safetensors
&lt;/code>&lt;/pre>
&lt;hr>
&lt;h2 id="7-appendix">7. Appendix&lt;/h2>
&lt;h3 id="71-troubleshooting">7.1 Troubleshooting&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Check CPU Features:&lt;/strong> On Linux, you can use the following command to check if your CPU supports instruction sets like AVX, which are crucial for the performance of certain models.
&lt;pre>&lt;code class="language-shell">cat /proc/cpuinfo | grep flags | head -1
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h3 id="72-contribution-guidelines">7.2 Contribution Guidelines&lt;/h3>
&lt;p>Ollama is an open-source project, and community contributions are welcome. When submitting code, please follow good commit message formats, for example:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Good:&lt;/strong> &lt;code>llm/backend/mlx: support the llama architecture&lt;/code>&lt;/li>
&lt;li>&lt;strong>Bad:&lt;/strong> &lt;code>feat: add more emoji&lt;/code>&lt;/li>
&lt;/ul>
&lt;h3 id="73-related-links">7.3 Related Links&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Official Website:&lt;/strong> &lt;a href="https://ollama.com/">https://ollama.com/&lt;/a>&lt;/li>
&lt;li>&lt;strong>GitHub Repository:&lt;/strong> &lt;a href="https://github.com/ollama/ollama">https://github.com/ollama/ollama&lt;/a>&lt;/li>
&lt;li>&lt;strong>Model Library:&lt;/strong> &lt;a href="https://ollama.com/library">https://ollama.com/library&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>ngrok Technical Guide: Public Network Mapping and Tunneling for Local Services</title><link>https://ziyanglin.netlify.app/en/post/ngrok-documentation/</link><pubDate>Fri, 27 Jun 2025 01:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/ngrok-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;h3 id="11-what-is-ngrok">1.1 What is ngrok?&lt;/h3>
&lt;p>ngrok is a powerful reverse proxy tool that can expose your local development environment to the public internet. By creating a secure tunnel, ngrok can forward requests from the public internet to services running on your local machine. This makes it exceptionally easy to integrate with external services (such as Webhooks, APIs) during development and testing phases.&lt;/p>
&lt;h3 id="12-how-it-works">1.2 How It Works&lt;/h3>
&lt;p>The working principle of ngrok can be summarized in the following steps:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Start the ngrok client&lt;/strong>: You run the ngrok client on your local machine and specify the local port to expose.&lt;/li>
&lt;li>&lt;strong>Establish a secure tunnel&lt;/strong>: The ngrok client connects to the ngrok cloud service and establishes a secure encrypted tunnel.&lt;/li>
&lt;li>&lt;strong>Assign a public address&lt;/strong>: The ngrok cloud service assigns you a unique public URL (e.g., &lt;code>https://random-string.ngrok.io&lt;/code>).&lt;/li>
&lt;li>&lt;strong>Forward requests&lt;/strong>: When requests are sent to this public URL, the ngrok cloud service forwards them through the tunnel to your local ngrok client.&lt;/li>
&lt;li>&lt;strong>Access local service&lt;/strong>: The ngrok client then forwards the requests to the service running on the specified local port.&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant User as User/External Service
participant NgrokCloud as ngrok Cloud Service
participant NgrokClient as Local ngrok Client
participant LocalServer as Local Web Service
User-&amp;gt;&amp;gt;NgrokCloud: Request https://&amp;lt;subdomain&amp;gt;.ngrok.io
NgrokCloud-&amp;gt;&amp;gt;NgrokClient: Forward request through secure tunnel
NgrokClient-&amp;gt;&amp;gt;LocalServer: Request http://localhost:&amp;lt;port&amp;gt;
LocalServer--&amp;gt;&amp;gt;NgrokClient: Return response
NgrokClient--&amp;gt;&amp;gt;NgrokCloud: Return response through secure tunnel
NgrokCloud--&amp;gt;&amp;gt;User: Return final response
&lt;/code>&lt;/pre>
&lt;h3 id="13-why-use-ngrok">1.3 Why Use ngrok?&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Webhook Development&lt;/strong>: Locally develop and test applications that need to receive webhooks (like GitHub, Stripe, Twilio).&lt;/li>
&lt;li>&lt;strong>API Testing&lt;/strong>: Allow mobile applications or other external services to access APIs you're developing locally.&lt;/li>
&lt;li>&lt;strong>Project Demonstration&lt;/strong>: Quickly demonstrate a website or application under development to clients or colleagues without deploying to a server.&lt;/li>
&lt;li>&lt;strong>Debugging&lt;/strong>: Capture and inspect all HTTP requests and responses through the tunnel for easy debugging.&lt;/li>
&lt;/ul>
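&lt;p>As a concrete illustration of the webhook use case, a local receiver you might expose through ngrok can be as small as the sketch below (pure Python standard library; the handler, the &lt;code>/webhook&lt;/code> path, and port 8000 are arbitrary example choices, not ngrok requirements). Running &lt;code>ngrok http 8000&lt;/code> alongside it gives the webhook provider a public URL that forwards to this handler.&lt;/p>

```python
# Minimal local webhook receiver (illustrative sketch): accepts a JSON POST
# and replies with a small JSON acknowledgement.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and decode the JSON body sent by the webhook provider.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        print("Received webhook:", payload)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"status": "ok"}).encode())

    def log_message(self, *args):
        # Silence the default per-request console logging.
        pass

def serve(port=8000):
    HTTPServer(("127.0.0.1", port), WebhookHandler).serve_forever()

# To use: call serve() (it blocks), then in another terminal: ngrok http 8000
```

&lt;p>ngrok's local Web interface at &lt;code>http://127.0.0.1:4040&lt;/code> will then show every webhook delivery that passes through the tunnel to this handler.&lt;/p>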
&lt;hr>
&lt;h2 id="2-quick-start">2. Quick Start&lt;/h2>
&lt;h3 id="21-download-and-installation">2.1 Download and Installation&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Visit the official website&lt;/strong>: Go to &lt;a href="https://ngrok.com/download">ngrok's official website&lt;/a>.&lt;/li>
&lt;li>&lt;strong>Download the client&lt;/strong>: Download the ngrok client corresponding to your operating system (Windows, macOS, Linux).&lt;/li>
&lt;li>&lt;strong>Extract the file&lt;/strong>: After downloading, extract the compressed package. You'll get an executable file named &lt;code>ngrok&lt;/code>.&lt;/li>
&lt;/ol>
&lt;h3 id="22-account-and-authtoken">2.2 Account and Authtoken&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Register an account&lt;/strong>: Register a free account on &lt;a href="https://dashboard.ngrok.com/signup">ngrok's official website&lt;/a>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Get your Authtoken&lt;/strong>: After logging in, find your Authtoken on your &lt;a href="https://dashboard.ngrok.com/get-started/your-authtoken">dashboard&lt;/a> page.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Configure your Authtoken&lt;/strong>: Open a terminal, navigate to the directory containing the ngrok executable, and run the following command to add your Authtoken to the default configuration file &lt;code>ngrok.yml&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-bash">./ngrok config add-authtoken &amp;lt;YOUR_AUTHTOKEN&amp;gt;
&lt;/code>&lt;/pre>
&lt;p>After configuring your Authtoken, you'll be able to use more features, such as custom subdomains, longer session times, etc.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h3 id="23-establish-your-first-tunnel">2.3 Establish Your First Tunnel&lt;/h3>
&lt;p>Suppose you have a Web service running locally on port &lt;code>8000&lt;/code>. You can create a public tunnel for it with the following command:&lt;/p>
&lt;pre>&lt;code class="language-bash">./ngrok http 8000
&lt;/code>&lt;/pre>
&lt;p>After executing the command, you'll see output similar to the following in your terminal:&lt;/p>
&lt;pre>&lt;code>ngrok by @inconshreveable                                   (Ctrl+C to quit)

Session Status                online
Account                       Your Name (Plan: Free)
Version                       3.x.x
Region                        United States (us)
Web Interface                 http://127.0.0.1:4040
Forwarding                    https://9a1b-2c3d-4e5f-6a7b-8c9d.ngrok.io -&amp;gt; http://localhost:8000

Connections                   ttl     opn     rt1     rt5     p50     p90
                              0       0       0.00    0.00    0.00    0.00
&lt;/code>&lt;/pre>
&lt;p>Now, you can access your local service on port &lt;code>8000&lt;/code> through the public address &lt;code>https://9a1b-2c3d-4e5f-6a7b-8c9d.ngrok.io&lt;/code>.&lt;/p>
&lt;p>At the same time, you can also access ngrok's Web interface by visiting &lt;code>http://127.0.0.1:4040&lt;/code> in your browser, where you can view details of all requests and responses through the tunnel.&lt;/p>
&lt;hr>
&lt;h2 id="3-core-concepts">3. Core Concepts&lt;/h2>
&lt;h3 id="31-tunnel-protocols">3.1 Tunnel Protocols&lt;/h3>
&lt;p>ngrok supports multiple protocols for creating tunnels:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>HTTP/HTTPS&lt;/strong>: The most commonly used protocol for exposing Web services.&lt;/p>
&lt;pre>&lt;code class="language-bash"># Expose HTTP service on local port 80
ngrok http 80
# Expose HTTPS service on local port 3000
ngrok http https://localhost:3000
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>TCP&lt;/strong>: Used to expose non-HTTP services, such as SSH, database connections, game servers, etc.&lt;/p>
&lt;pre>&lt;code class="language-bash"># Expose SSH service on local port 22
ngrok tcp 22
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>TLS&lt;/strong>: Used to expose TCP services that require end-to-end TLS encryption.&lt;/p>
&lt;pre>&lt;code class="language-bash">ngrok tls --domain=your-domain.com 443
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h3 id="32-custom-domains">3.2 Custom Domains&lt;/h3>
&lt;p>For paid users, ngrok allows you to use custom subdomains or fully custom domains.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Custom subdomain&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-bash">ngrok http --subdomain=my-awesome-app 8080
&lt;/code>&lt;/pre>
&lt;p>This will expose your service at &lt;code>https://my-awesome-app.ngrok.io&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Custom domain&lt;/strong> (requires a paid plan and CNAME configuration):&lt;/p>
&lt;pre>&lt;code class="language-bash">ngrok http --hostname=dev.example.com 80
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="4-advanced-usage">4. Advanced Usage&lt;/h2>
&lt;h3 id="41-configuration-file">4.1 Configuration File&lt;/h3>
&lt;p>In addition to specifying parameters on the command line, you can also define tunnels through the &lt;code>ngrok.yml&lt;/code> configuration file. This is very useful for managing multiple tunnels and complex configurations.&lt;/p>
&lt;p>By default, the configuration file is located at:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>macOS&lt;/strong>: &lt;code>~/Library/Application Support/ngrok/ngrok.yml&lt;/code>&lt;/li>
&lt;li>&lt;strong>Linux&lt;/strong>: &lt;code>~/.config/ngrok/ngrok.yml&lt;/code>&lt;/li>
&lt;li>&lt;strong>Windows&lt;/strong>: &lt;code>C:\Users\YourUser\AppData\Local\ngrok\ngrok.yml&lt;/code>&lt;/li>
&lt;/ul>
&lt;p>An example of a configuration file:&lt;/p>
&lt;pre>&lt;code class="language-yaml">version: &amp;quot;2&amp;quot;
authtoken: &amp;lt;YOUR_AUTHTOKEN&amp;gt;
tunnels:
my-api:
proto: http
addr: 8080
subdomain: my-cool-api
ssh:
proto: tcp
addr: 22
&lt;/code>&lt;/pre>
&lt;p>After configuration, you can start tunnels by name:&lt;/p>
&lt;pre>&lt;code class="language-bash">ngrok start my-api
ngrok start ssh
ngrok start --all # Start all defined tunnels
&lt;/code>&lt;/pre>
&lt;h3 id="42-security-options">4.2 Security Options&lt;/h3>
&lt;p>ngrok provides various security features to protect your tunnels:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>HTTP Basic Authentication&lt;/strong>: Add username and password protection to your tunnel.&lt;/p>
&lt;pre>&lt;code class="language-bash">ngrok http --basic-auth=&amp;quot;username:password&amp;quot; 8000
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>OAuth 2.0&lt;/strong> (paid feature): Integrate with OAuth providers like Google, GitHub, Microsoft, etc., so that only authenticated users can access your tunnel.&lt;/p>
&lt;pre>&lt;code class="language-bash">ngrok http --oauth=google --oauth-allow-emails=user@example.com 8000
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>IP Restrictions&lt;/strong> (paid feature): Only allow or deny access from specific IP addresses or CIDR ranges.&lt;/p>
&lt;pre>&lt;code class="language-bash">ngrok http --ip-restriction-allow-cidrs=203.0.113.0/24 8000
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h3 id="43-webhook-verification-paid-feature">4.3 Webhook Verification (paid feature)&lt;/h3>
&lt;p>ngrok can automatically verify signatures of webhook requests from certain services (like Twilio, Stripe), increasing security.&lt;/p>
&lt;pre>&lt;code class="language-bash">ngrok http --verify-webhook=twilio --verify-webhook-secret=&amp;lt;YOUR_SECRET&amp;gt; 8000
&lt;/code>&lt;/pre>
&lt;hr>
&lt;h2 id="5-api-and-integration">5. API and Integration&lt;/h2>
&lt;p>ngrok provides official client libraries that allow you to control tunnels programmatically. &lt;code>@ngrok/ngrok&lt;/code> is the official Node.js library.&lt;/p>
&lt;h3 id="51-installation">5.1 Installation&lt;/h3>
&lt;pre>&lt;code class="language-bash">npm install @ngrok/ngrok
&lt;/code>&lt;/pre>
&lt;h3 id="52-example-starting-a-tunnel-in-a-nodejs-application">5.2 Example: Starting a Tunnel in a Node.js Application&lt;/h3>
&lt;pre>&lt;code class="language-javascript">const ngrok = require(&amp;quot;@ngrok/ngrok&amp;quot;);
// Set up Express application
const express = require('express');
const app = express();
const port = 8080;
app.get('/', (req, res) =&amp;gt; {
res.send('Hello from local server!');
});
app.listen(port, async () =&amp;gt; {
console.log(`Local server listening at http://localhost:${port}`);
// Start ngrok tunnel
try {
const listener = await ngrok.forward({
addr: port,
authtoken_from_env: true, // Read from NGROK_AUTHTOKEN environment variable
});
console.log(`Ingress established at: ${listener.url()}`);
} catch (error) {
console.error(&amp;quot;Error establishing ngrok tunnel:&amp;quot;, error);
}
});
&lt;/code>&lt;/pre>
&lt;hr>
&lt;h2 id="6-frequently-asked-questions-faq">6. Frequently Asked Questions (FAQ)&lt;/h2>
&lt;p>&lt;strong>Q: Is the ngrok tunnel address fixed?&lt;/strong>
A: In the free plan, you'll get a new random URL each time you restart the ngrok client. Paid plan users can use fixed subdomains or custom domains.&lt;/p>
&lt;p>&lt;strong>Q: How do I run ngrok in the background?&lt;/strong>
A: On Linux or macOS, you can use &lt;code>&amp;amp;&lt;/code> to place it in the background: &lt;code>./ngrok http 8000 &amp;amp;&lt;/code>. For more stable solutions, it's recommended to use tools like &lt;code>systemd&lt;/code> or &lt;code>supervisor&lt;/code> to manage the ngrok process.&lt;/p>
&lt;p>&lt;strong>Q: What are the main differences between the free and paid versions?&lt;/strong>
A: The paid version offers more advanced features, including:&lt;/p>
&lt;ul>
&lt;li>Custom/fixed subdomains&lt;/li>
&lt;li>Custom domains&lt;/li>
&lt;li>More concurrent tunnels&lt;/li>
&lt;li>IP whitelisting/blacklisting&lt;/li>
&lt;li>OAuth integration&lt;/li>
&lt;li>Longer session timeout times&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Q: Can I run multiple tunnels simultaneously?&lt;/strong>
A: Yes. You can define multiple tunnels in the configuration file and start them with &lt;code>ngrok start --all&lt;/code>, or open multiple terminal windows and run the &lt;code>ngrok&lt;/code> command separately. The free version has limitations on the number of concurrent tunnels.&lt;/p></description></item><item><title>Model Quantization Guide: A Comprehensive Analysis from Theory to Practice</title><link>https://ziyanglin.netlify.app/en/post/model-quantization-documentation/</link><pubDate>Fri, 27 Jun 2025 00:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/model-quantization-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>As large language models (LLMs) continue to grow in scale and complexity, their deployment and inference have become increasingly costly. Model quantization, as a key optimization technique, significantly reduces model storage requirements, memory consumption, and computational load by lowering the numerical precision of model weights and activation values, enabling efficient inference on resource-constrained devices such as mobile and edge devices.&lt;/p>
&lt;p>This document aims to provide a clear and comprehensive introduction to the core concepts of deep learning model quantization, mainstream approaches, and specific implementations in two leading inference frameworks—&lt;code>llama.cpp&lt;/code> and &lt;code>vLLM&lt;/code>. We will explore in detail the quantization types they support, underlying principles, usage methods, and future trends in quantization technology.&lt;/p>
&lt;h2 id="2-quantization-fundamentals">2. Quantization Fundamentals&lt;/h2>
&lt;p>Before diving into specific frameworks, we need to understand some basic concepts of quantization.&lt;/p>
&lt;h3 id="21-what-is-model-quantization">2.1 What is Model Quantization?&lt;/h3>
&lt;p>Model quantization refers to the process of converting floating-point numbers in a model (typically 32-bit floating-point, or &lt;code>FP32&lt;/code>) to integers with fewer bits (such as &lt;code>INT8&lt;/code>, &lt;code>INT4&lt;/code>) or lower-precision floating-point numbers (such as &lt;code>FP16&lt;/code>, &lt;code>FP8&lt;/code>). This process is essentially a form of information compression that attempts to significantly reduce model complexity while preserving model accuracy as much as possible.&lt;/p>
&lt;h3 id="22-why-is-quantization-needed">2.2 Why is Quantization Needed?&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Reduced Model Size&lt;/strong>: Lower bit-width numerical representations can significantly reduce the size of model files. For example, quantizing an &lt;code>FP32&lt;/code> model to &lt;code>INT8&lt;/code> can reduce the model size by approximately 4 times.&lt;/li>
&lt;li>&lt;strong>Lower Memory Bandwidth&lt;/strong>: Smaller data types mean less bandwidth is occupied when transferring data between memory and computational units, which is crucial for memory bandwidth-sensitive hardware.&lt;/li>
&lt;li>&lt;strong>Accelerated Computation&lt;/strong>: Many modern processors (CPUs, GPUs, TPUs) support integer operations more efficiently than floating-point operations, providing higher throughput and lower latency.&lt;/li>
&lt;li>&lt;strong>Reduced Power Consumption&lt;/strong>: Integer operations typically consume less energy than floating-point operations.&lt;/li>
&lt;/ul>
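&lt;p>The size reduction in the first point is easy to estimate with the rule of thumb &lt;code>bytes = parameters &amp;times; bits / 8&lt;/code>. A quick sketch (the 7B parameter count is an arbitrary example, not tied to any specific model):&lt;/p>

```python
# Back-of-envelope model-size estimate: bytes = parameters * bits / 8.
def model_size_gb(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9  # an assumed 7B-parameter model
for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {model_size_gb(n, bits):.1f} GB")
# FP32 gives 28.0 GB and INT8 gives 7.0 GB: the ~4x reduction mentioned above
```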
&lt;h3 id="23-quantization-principles-mapping-and-dequantization">2.3 Quantization Principles: Mapping and Dequantization&lt;/h3>
&lt;p>The core of quantization is mapping a larger range of floating-point values to a smaller range of fixed-point integer values. This process is defined by the following formula:&lt;/p>
&lt;pre>&lt;code>Q(r) = round(r / S + Z)
&lt;/code>&lt;/pre>
&lt;p>Where:&lt;/p>
&lt;ul>
&lt;li>&lt;code>r&lt;/code> is the original floating-point value.&lt;/li>
&lt;li>&lt;code>Q(r)&lt;/code> is the quantized integer value.&lt;/li>
&lt;li>&lt;code>S&lt;/code> is the &lt;strong>Scale factor&lt;/strong>, representing the floating-point value size corresponding to each quantized integer step.&lt;/li>
&lt;li>&lt;code>Z&lt;/code> is the &lt;strong>Zero-point&lt;/strong>, representing the quantized integer value corresponding to floating-point zero.&lt;/li>
&lt;/ul>
&lt;p>When performing calculations, the quantized values need to be dequantized back to the floating-point domain:&lt;/p>
&lt;pre>&lt;code>r' = S * (Q(r) - Z)
&lt;/code>&lt;/pre>
&lt;p>&lt;code>r'&lt;/code> is the dequantized floating-point number, which has some quantization error compared to the original value &lt;code>r&lt;/code>.&lt;/p>
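&lt;p>The mapping and dequantization formulas above can be exercised directly. The sketch below is a minimal pure-Python illustration; the float range &lt;code>[-1.0, 3.0]&lt;/code> and the unsigned 8-bit target range &lt;code>[0, 255]&lt;/code> are assumed example values.&lt;/p>

```python
# Direct implementation of the two formulas above:
#   Q(r) = round(r / S + Z)    and    r' = S * (Q(r) - Z)
def quantize(r, S, Z, qmin=0, qmax=255):
    q = round(r / S + Z)
    return max(qmin, min(qmax, q))  # clamp to the integer range

def dequantize(q, S, Z):
    return S * (q - Z)

# Map the assumed float range [-1.0, 3.0] onto unsigned 8-bit integers.
rmin, rmax = -1.0, 3.0
S = (rmax - rmin) / 255        # scale: float size of one integer step
Z = round(-rmin / S)           # zero-point: integer representing float 0.0

r = 0.5
q = quantize(r, S, Z)
r_prime = dequantize(q, S, Z)
print(q, r_prime, abs(r - r_prime))  # quantization error stays below S/2
```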
&lt;h3 id="24-symmetric-vs-asymmetric-quantization">2.4 Symmetric vs. Asymmetric Quantization&lt;/h3>
&lt;p>Based on the choice of zero-point, quantization can be divided into two modes:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Symmetric Quantization&lt;/strong>: Maps the floating-point range &lt;code>[-abs_max, abs_max]&lt;/code> symmetrically to the integer range. In this mode, the zero-point &lt;code>Z&lt;/code> is typically 0 (for signed integers) or &lt;code>2^(bits-1)&lt;/code> (for unsigned integer offset). Computation is relatively simple.&lt;/li>
&lt;li>&lt;strong>Asymmetric Quantization&lt;/strong>: Maps the complete floating-point range &lt;code>[min, max]&lt;/code> to the integer range. In this mode, the zero-point &lt;code>Z&lt;/code> is a floating-point number that can be adjusted according to data distribution. It can more accurately represent asymmetrically distributed data but is slightly more complex in computation.&lt;/li>
&lt;/ul>
&lt;h3 id="25-perlayer-vs-pergroupperchannel-quantization">2.5 Per-Layer vs. Per-Group/Per-Channel Quantization&lt;/h3>
&lt;p>The granularity of calculating scale factor &lt;code>S&lt;/code> and zero-point &lt;code>Z&lt;/code> also affects quantization accuracy:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Per-Layer/Per-Tensor&lt;/strong>: The entire weight tensor (or all weights in a layer) shares the same set of &lt;code>S&lt;/code> and &lt;code>Z&lt;/code>. This approach is the simplest, but if the value distribution within the tensor is uneven, it may lead to larger errors.&lt;/li>
&lt;li>&lt;strong>Per-Channel&lt;/strong>: For weights in convolutional layers, each output channel uses independent &lt;code>S&lt;/code> and &lt;code>Z&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Grouped Quantization&lt;/strong>: The weight tensor is divided into several groups, with each group using independent &lt;code>S&lt;/code> and &lt;code>Z&lt;/code>. This is currently a very popular approach in LLM quantization as it achieves a good balance between accuracy and overhead. The group size is a key hyperparameter.&lt;/li>
&lt;/ul>
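&lt;p>The effect of granularity can be made concrete with a small experiment: a toy tensor mixing one small-magnitude group with one outlier-heavy group, quantized to 4 bits once per-tensor and once per-group. The weight values and the group size of 4 are arbitrary illustrative choices.&lt;/p>

```python
# Sketch comparing per-tensor vs. grouped asymmetric 4-bit quantization error.
def quant_params(vals, qmax=15):  # 4-bit unsigned range [0, 15]
    lo, hi = min(vals), max(vals)
    S = (hi - lo) / qmax or 1.0   # scale; fall back to 1.0 for constant input
    Z = round(-lo / S)            # zero-point
    return S, Z

def roundtrip(vals, S, Z, qmax=15):
    # Quantize then dequantize each value with the given parameters.
    out = []
    for r in vals:
        q = max(0, min(qmax, round(r / S + Z)))
        out.append(S * (q - Z))
    return out

def mean_abs_error(vals, approx):
    return sum(abs(a - b) for a, b in zip(vals, approx)) / len(vals)

# One small-magnitude group and one outlier-heavy group in the same tensor.
weights = [0.01, -0.02, 0.03, -0.01, 4.0, -3.5, 2.8, -4.2]

# Per-tensor: a single (S, Z) pair for everything.
S, Z = quant_params(weights)
err_tensor = mean_abs_error(weights, roundtrip(weights, S, Z))

# Grouped: an independent (S, Z) pair for each group of 4 weights.
approx = []
for i in range(0, len(weights), 4):
    group = weights[i:i + 4]
    S, Z = quant_params(group)
    approx += roundtrip(group, S, Z)
err_group = mean_abs_error(weights, approx)

print(err_tensor, err_group)
# The small-magnitude group round-trips almost exactly with its own (S, Z),
# while the shared per-tensor scale flattens it to zero.
```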
&lt;h3 id="26-common-quantization-paradigms">2.6 Common Quantization Paradigms&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Post-Training Quantization (PTQ)&lt;/strong>: This is the most commonly used and convenient quantization method. It is performed after the model has been fully trained, without requiring retraining. PTQ typically needs a small calibration dataset to calculate the optimal quantization parameters (&lt;code>S&lt;/code> and &lt;code>Z&lt;/code>) by analyzing the distribution of weights and activation values.&lt;/li>
&lt;li>&lt;strong>Quantization-Aware Training (QAT)&lt;/strong>: This simulates the errors introduced by quantization during the model training process. By inserting pseudo-quantization nodes in the forward pass during training, it allows the model to adapt to the accuracy loss caused by quantization. QAT typically achieves higher accuracy than PTQ but requires a complete training process and dataset, making it more costly.&lt;/li>
&lt;/ul>
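&lt;p>The pseudo-quantization node that QAT inserts into the forward pass can be sketched as below. This toy version (pure Python, 8-bit symmetric, hypothetical weights and inputs) shows only the forward-pass simulation; real QAT frameworks additionally handle gradients, typically via a straight-through estimator.&lt;/p>

```python
# Toy fake-quantization node as used in QAT forward passes: the value is
# quantized and immediately dequantized, so downstream computation sees
# the rounding error that the deployed integer model will have.
def fake_quant(x, S, qmax=127):
    q = max(-qmax, min(qmax, round(x / S)))  # symmetric 8-bit, zero-point 0
    return S * q

def linear_forward(inputs, weights, S):
    # Dot product through fake-quantized weights (hypothetical toy layer).
    return sum(i * fake_quant(w, S) for i, w in zip(inputs, weights))

weights = [0.123, -0.456, 0.789]
inputs = [1.0, 2.0, 3.0]
S = max(abs(w) for w in weights) / 127  # per-tensor symmetric scale

exact = sum(i * w for i, w in zip(inputs, weights))
qat = linear_forward(inputs, weights, S)
print(exact, qat)  # the small gap is the error the model learns to absorb
```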
&lt;p>Now that we have the basic knowledge of quantization, let's delve into the specific implementations in &lt;code>llama.cpp&lt;/code> and &lt;code>vLLM&lt;/code>.&lt;/p>
&lt;h2 id="3-quantization-schemes-in-llamacpp">3. Quantization Schemes in llama.cpp&lt;/h2>
&lt;p>&lt;code>llama.cpp&lt;/code> is an efficient LLM inference engine written in C/C++, renowned for its excellent cross-platform performance and support for resource-constrained devices. One of its core advantages is its powerful and flexible quantization support, which revolves around its self-developed &lt;code>GGUF&lt;/code> (Georgi Gerganov Universal Format) file format.&lt;/p>
&lt;h3 id="31-gguf-format-and-quantization">3.1 GGUF Format and Quantization&lt;/h3>
&lt;p>GGUF is a binary format specifically designed for LLMs, used to store model metadata, vocabulary, and weights. A key feature is its native support for various quantized weights, allowing different precision tensors to be mixed within the same file. This enables &lt;code>llama.cpp&lt;/code> to directly use quantized weights when loading models, without additional conversion steps.&lt;/p>
&lt;h3 id="32-quantization-type-nomenclature-in-llamacpp">3.2 Quantization Type Nomenclature in &lt;code>llama.cpp&lt;/code>&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> defines a very specific quantization type naming convention, typically in the format &lt;code>Q&amp;lt;bits&amp;gt;_&amp;lt;type&amp;gt;&lt;/code>. Understanding these names is key to mastering &lt;code>llama.cpp&lt;/code> quantization.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>Q&lt;/code>&lt;/strong>: Represents quantization.&lt;/li>
&lt;li>&lt;strong>&lt;code>&amp;lt;bits&amp;gt;&lt;/code>&lt;/strong>: Indicates the average number of bits per weight, such as &lt;code>2&lt;/code>, &lt;code>3&lt;/code>, &lt;code>4&lt;/code>, &lt;code>5&lt;/code>, &lt;code>6&lt;/code>, &lt;code>8&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;code>&amp;lt;type&amp;gt;&lt;/code>&lt;/strong>: Indicates the specific quantization method or variant.&lt;/li>
&lt;/ul>
&lt;p>Below are some of the most common quantization types and their explanations:&lt;/p>
&lt;h4 id="321-basic-quantization-types-legacy">3.2.1 Basic Quantization Types (Legacy)&lt;/h4>
&lt;p>These are earlier quantization methods, most of which have now been replaced by &lt;code>K-Quants&lt;/code>, but are still retained for compatibility.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>Q4_0&lt;/code>, &lt;code>Q4_1&lt;/code>&lt;/strong>: 4-bit quantization. &lt;code>Q4_1&lt;/code> uses higher precision scale factors than &lt;code>Q4_0&lt;/code>, thus typically achieving higher accuracy.&lt;/li>
&lt;li>&lt;strong>&lt;code>Q5_0&lt;/code>, &lt;code>Q5_1&lt;/code>&lt;/strong>: 5-bit quantization.&lt;/li>
&lt;li>&lt;strong>&lt;code>Q8_0&lt;/code>&lt;/strong>: 8-bit symmetric quantization using block-wise scale factors. This is one of the quantization types closest to the original &lt;code>FP16&lt;/code> precision and often serves as a benchmark for performance and quality.&lt;/li>
&lt;li>&lt;strong>&lt;code>Q2_K&lt;/code>, &lt;code>Q3_K&lt;/code>, &lt;code>Q4_K&lt;/code>, &lt;code>Q5_K&lt;/code>, &lt;code>Q6_K&lt;/code>&lt;/strong>: These belong to the newer &lt;code>K-Quants&lt;/code> series, described in the next section.&lt;/li>
&lt;/ul>
&lt;h4 id="322-kquants-recommended">3.2.2 K-Quants (Recommended)&lt;/h4>
&lt;p>&lt;code>K-Quants&lt;/code> is a more advanced and flexible quantization scheme introduced in &lt;code>llama.cpp&lt;/code>. They achieve better precision preservation at extremely low bit rates through more refined block structures and the concept of super-blocks.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Block&lt;/strong>: Weights are divided into fixed-size blocks (typically 256 weights).&lt;/li>
&lt;li>&lt;strong>Super-block&lt;/strong>: Multiple blocks form a super-block. More detailed quantization parameters (such as min/max scale factors) are stored at the super-block level.&lt;/li>
&lt;/ul>
&lt;p>&lt;code>K-Quants&lt;/code> naming typically includes a suffix like &lt;code>_S&lt;/code>, &lt;code>_M&lt;/code>, &lt;code>_L&lt;/code>, indicating different sizes/complexities:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>S&lt;/code> (Small)&lt;/strong>: The smallest version, typically with the lowest precision.&lt;/li>
&lt;li>&lt;strong>&lt;code>M&lt;/code> (Medium)&lt;/strong>: Medium size, balancing precision and size.&lt;/li>
&lt;li>&lt;strong>&lt;code>L&lt;/code> (Large)&lt;/strong>: The largest version, typically with the highest precision.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Common K-Quants Types:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>Q4_K_M&lt;/code>&lt;/strong>: 4-bit K-Quant, medium size. This is currently one of the most commonly used and recommended 4-bit quantization types, achieving a good balance between size and performance.&lt;/li>
&lt;li>&lt;strong>&lt;code>Q4_K_S&lt;/code>&lt;/strong>: 4-bit K-Quant, small version.&lt;/li>
&lt;li>&lt;strong>&lt;code>Q5_K_M&lt;/code>&lt;/strong>: 5-bit K-Quant, medium size. Provides better precision than 4-bit while being smaller than &lt;code>Q8_0&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;code>Q6_K&lt;/code>&lt;/strong>: 6-bit K-Quant. Provides very high precision, close to &lt;code>Q8_0&lt;/code>, but with a smaller size.&lt;/li>
&lt;li>&lt;strong>&lt;code>IQ2_XS&lt;/code>, &lt;code>IQ2_S&lt;/code>, &lt;code>IQ2_XXS&lt;/code>&lt;/strong>: 2-bit quantization variants from the &lt;code>IQ&lt;/code> (&amp;ldquo;i-quant&amp;rdquo;) series, aimed at extreme model compression at the cost of larger precision loss; they generally rely on an importance matrix to remain usable.&lt;/li>
&lt;/ul>
&lt;h3 id="33-how-to-use-the-llamaquantize-tool">3.3 How to Use the &lt;code>llama-quantize&lt;/code> Tool&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> provides a command-line tool called &lt;code>llama-quantize&lt;/code> for converting &lt;code>FP32&lt;/code> or &lt;code>FP16&lt;/code> GGUF models to quantized GGUF models.&lt;/p>
&lt;p>&lt;strong>Basic Usage:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-quantize &amp;lt;input-gguf-file&amp;gt; &amp;lt;output-gguf-file&amp;gt; &amp;lt;quantization-type&amp;gt;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Example: Quantizing an FP16 Model to Q4_K_M&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash"># First, convert the original model (e.g., PyTorch format) to FP16 GGUF
python3 convert.py models/my-model/
# Then, use llama-quantize for quantization
./llama-quantize ./models/my-model/ggml-model-f16.gguf ./models/my-model/ggml-model-Q4_K_M.gguf Q4_K_M
&lt;/code>&lt;/pre>
&lt;h3 id="34-importance-matrix">3.4 Importance Matrix&lt;/h3>
&lt;p>To further reduce precision loss from quantization, &lt;code>llama.cpp&lt;/code> introduced the concept of an importance matrix (&lt;code>imatrix&lt;/code>). This matrix calculates the importance of each weight by running the model on a calibration dataset. During quantization, &lt;code>llama-quantize&lt;/code> references this matrix to apply smaller quantization errors to more important weights, thereby protecting critical information in the model.&lt;/p>
&lt;p>&lt;strong>Using &lt;code>imatrix&lt;/code> for Quantization:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash"># 1. Generate the importance matrix
./llama-imatrix -m model-f16.gguf -f calibration-data.txt -o imatrix.dat
# 2. Use imatrix for quantization
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-Q4_K_M-imatrix.gguf Q4_K_M
&lt;/code>&lt;/pre>
&lt;h3 id="35-summary">3.5 Summary&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code>'s quantization scheme is centered around the &lt;code>GGUF&lt;/code> format, providing a rich, efficient, and battle-tested set of quantization types. Its &lt;code>K-Quants&lt;/code> series performs exceptionally well in low-bit quantization, and when combined with advanced techniques like importance matrices, it can maximize model performance while significantly compressing the model. For scenarios requiring LLM deployment on CPUs or resource-limited hardware, &lt;code>llama.cpp&lt;/code> is an excellent choice.&lt;/p>
&lt;h2 id="4-vllms-quantization-ecosystem">4. vLLM's Quantization Ecosystem&lt;/h2>
&lt;p>Unlike &lt;code>llama.cpp&lt;/code>'s cohesive, self-contained quantization system, &lt;code>vLLM&lt;/code>, as a service engine focused on high-performance, high-throughput GPU inference, adopts a &amp;ldquo;best-of-breed&amp;rdquo; quantization strategy. &lt;code>vLLM&lt;/code> doesn't invent new quantization formats but instead embraces compatibility, supporting and integrating the most mainstream and cutting-edge quantization schemes and tool libraries from academia and industry.&lt;/p>
&lt;h3 id="41-mainstream-quantization-schemes-supported-by-vllm">4.1 Mainstream Quantization Schemes Supported by vLLM&lt;/h3>
&lt;p>&lt;code>vLLM&lt;/code> supports directly loading models quantized by various popular algorithms and tool libraries:&lt;/p>
&lt;h4 id="411-gptq-generalpurpose-posttraining-quantization">4.1.1 GPTQ (General-purpose Post-Training Quantization)&lt;/h4>
&lt;p>GPTQ is one of the earliest widely applied LLM PTQ algorithms. It quantizes weights column by column and updates weights using Hessian matrix information to minimize quantization error.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: Iteratively quantize each column of weights and update the remaining unquantized weights to compensate for errors introduced by already quantized columns.&lt;/li>
&lt;li>&lt;strong>vLLM Support&lt;/strong>: Can directly load GPTQ quantized models generated by libraries like &lt;code>AutoGPTQ&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Suitable Scenarios&lt;/strong>: Pursuing good 4-bit quantization performance with a large number of pre-quantized models available in the community.&lt;/li>
&lt;/ul>
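&lt;p>The error-compensation idea behind GPTQ can be conveyed with a toy greedy loop: quantize weights one at a time and fold each rounding error into the weights not yet quantized. Note this is &lt;em>not&lt;/em> GPTQ itself; it omits the Hessian-based weighting entirely, and the weight values and grid step are arbitrary illustrative choices.&lt;/p>

```python
# Toy illustration of GPTQ's core idea: quantize weights sequentially and
# fold each weight's rounding error into the next unquantized weight.
# Real GPTQ distributes the error using second-order (Hessian) information.
def round_to_grid(w, S):
    return S * round(w / S)

def greedy_compensated_quant(weights, S):
    w = list(weights)
    for i in range(len(w)):
        q = round_to_grid(w[i], S)
        err = w[i] - q
        w[i] = q
        nxt = i + 1
        if nxt != len(w):
            w[nxt] += err  # remaining weight absorbs the residual error
    return w

# Identical weights make same-sign rounding errors accumulate without
# compensation, which keeps the effect visible on a coarse grid.
weights = [0.30, 0.30, 0.30, 0.30]
S = 0.25  # deliberately coarse quantization step

naive = [round_to_grid(x, S) for x in weights]
comp = greedy_compensated_quant(weights, S)
print(sum(weights), sum(naive), sum(comp))
# The compensated sum stays much closer to the original sum than the naive one.
```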
&lt;h4 id="412-awq-activationaware-weight-quantization">4.1.2 AWQ (Activation-aware Weight Quantization)&lt;/h4>
&lt;p>AWQ observes that not all weights in a model are equally important, with a small portion of &amp;ldquo;significant weights&amp;rdquo; having a huge impact on model performance. Similar uneven distributions also exist in activation values.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: By analyzing the scale of activation values, identify and protect those &amp;ldquo;significant weights&amp;rdquo; that multiply with large activation values, giving them higher precision during quantization. It doesn't quantize activation values but makes weights adapt to the distribution of activation values.&lt;/li>
&lt;li>&lt;strong>vLLM Support&lt;/strong>: Can directly load AWQ quantized models generated by the &lt;code>AutoAWQ&lt;/code> library.&lt;/li>
&lt;li>&lt;strong>Suitable Scenarios&lt;/strong>: Seeking higher model precision than GPTQ at extremely low bits (such as 4-bit), especially when handling complex tasks.&lt;/li>
&lt;/ul>
&lt;h4 id="413-fp8-8bit-floating-point">4.1.3 FP8 (8-bit Floating Point)&lt;/h4>
&lt;p>FP8 is the latest low-precision floating-point format, pushed by hardware manufacturers like NVIDIA. It has a wider dynamic range than traditional &lt;code>INT8&lt;/code>, making it more suitable for representing extremely unevenly distributed activation values in LLMs.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: Use 8-bit floating-point numbers (typically in &lt;code>E4M3&lt;/code> or &lt;code>E5M2&lt;/code> format) to represent weights and/or activation values.&lt;/li>
&lt;li>&lt;strong>vLLM Support&lt;/strong>: Through integration with &lt;code>llm-compressor&lt;/code> and AMD's &lt;code>Quark&lt;/code> library, &lt;code>vLLM&lt;/code> provides strong support for FP8, including both dynamic and static quantization.&lt;/li>
&lt;li>&lt;strong>Suitable Scenarios&lt;/strong>: Pursuing ultimate inference speed and throughput on modern GPUs (such as H100) that support FP8 acceleration.&lt;/li>
&lt;/ul>
&lt;h4 id="414-fp8-kv-cache">4.1.4 FP8 KV Cache&lt;/h4>
&lt;p>This is a quantization technique specifically targeting the KV Cache, a major memory consumer during inference.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: Quantize the Key-Value cache stored in GPU memory from &lt;code>FP16&lt;/code> or &lt;code>BF16&lt;/code> to &lt;code>FP8&lt;/code>, thereby halving this portion of memory usage, allowing the model to support longer context windows or larger batch sizes.&lt;/li>
&lt;li>&lt;strong>vLLM Support&lt;/strong>: &lt;code>vLLM&lt;/code> provides native support, which can be enabled at startup with the parameter &lt;code>--kv-cache-dtype fp8&lt;/code>.&lt;/li>
&lt;/ul>
&lt;h4 id="415-bitsandbytes">4.1.5 BitsAndBytes&lt;/h4>
&lt;p>This is a very popular quantization library, known for its ease of use and &amp;ldquo;on-the-fly&amp;rdquo; quantization.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: Dynamically quantize during model loading, without needing pre-prepared quantized model files.&lt;/li>
&lt;li>&lt;strong>vLLM Support&lt;/strong>: &lt;code>vLLM&lt;/code> integrates &lt;code>BitsAndBytes&lt;/code>, allowing users to easily enable 4-bit quantization by setting the &lt;code>quantization=&amp;quot;bitsandbytes&amp;quot;&lt;/code> parameter.&lt;/li>
&lt;li>&lt;strong>Suitable Scenarios&lt;/strong>: Quick experimentation, user-friendly, avoiding complex offline quantization processes.&lt;/li>
&lt;/ul>
&lt;h4 id="416-other-schemes">4.1.6 Other Schemes&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>SqueezeLLM&lt;/strong>: A non-uniform quantization method that places quantization levels according to weight sensitivity rather than spacing them uniformly, so the weights that matter most to model output are represented more precisely.&lt;/li>
&lt;li>&lt;strong>TorchAO&lt;/strong>: PyTorch's official quantization tool library, which &lt;code>vLLM&lt;/code> is beginning to support.&lt;/li>
&lt;li>&lt;strong>BitBLAS&lt;/strong>: A low-level computation library aimed at accelerating low-bit (such as 1-bit, 2-bit, 4-bit) matrix operations through optimized kernel functions.&lt;/li>
&lt;/ul>
&lt;h3 id="42-how-to-use-quantized-models-in-vllm">4.2 How to Use Quantized Models in vLLM&lt;/h3>
&lt;p>Using quantization in &lt;code>vLLM&lt;/code> is straightforward: in most cases &lt;code>vLLM&lt;/code> automatically detects the quantization type from the model's configuration file (&lt;code>config.json&lt;/code>), and you can also specify it explicitly via the &lt;code>quantization&lt;/code> parameter in the &lt;code>LLM&lt;/code> constructor.&lt;/p>
&lt;p>&lt;strong>Example: Loading an AWQ Quantized Model&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM
# vLLM can auto-detect AWQ from the model's config.json; passing quantization=&amp;quot;awq&amp;quot; makes the choice explicit
llm = LLM(model=&amp;quot;TheBloke/My-Model-AWQ&amp;quot;, quantization=&amp;quot;awq&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Example: Enabling FP8 KV Cache&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM
llm = LLM(model=&amp;quot;meta-llama/Llama-2-7b-chat-hf&amp;quot;,
          kv_cache_dtype=&amp;quot;fp8&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h2 id="5-llamacpp-vs-vllm-comparison-and-summary">5. llama.cpp vs. vLLM: Comparison and Summary&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Feature&lt;/th>
&lt;th align="left">llama.cpp&lt;/th>
&lt;th align="left">vLLM&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">&lt;strong>Target Platform&lt;/strong>&lt;/td>
&lt;td align="left">CPU, Cross-platform, Edge devices&lt;/td>
&lt;td align="left">High-performance GPU servers&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Core Philosophy&lt;/strong>&lt;/td>
&lt;td align="left">Cohesive, self-contained, extreme optimization&lt;/td>
&lt;td align="left">Open, integrated, high throughput&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>File Format&lt;/strong>&lt;/td>
&lt;td align="left">GGUF (custom format)&lt;/td>
&lt;td align="left">Standard Hugging Face format&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Quantization Schemes&lt;/strong>&lt;/td>
&lt;td align="left">Built-in &lt;code>K-Quants&lt;/code>, &lt;code>IQ&lt;/code>, etc.&lt;/td>
&lt;td align="left">Integrates GPTQ, AWQ, FP8, BnB, etc.&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Ease of Use&lt;/strong>&lt;/td>
&lt;td align="left">Requires &lt;code>llama-quantize&lt;/code> conversion&lt;/td>
&lt;td align="left">Direct loading, automatic detection&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Ecosystem&lt;/strong>&lt;/td>
&lt;td align="left">Self-contained ecosystem&lt;/td>
&lt;td align="left">Embraces the entire Python AI ecosystem&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Latest Technology&lt;/strong>&lt;/td>
&lt;td align="left">Quickly follows up and implements own versions&lt;/td>
&lt;td align="left">Quickly integrates latest open-source libraries&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="6-latest-quantization-trends-and-outlook">6. Latest Quantization Trends and Outlook&lt;/h2>
&lt;p>The field of model quantization is still rapidly evolving. Here are some trends worth noting:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>1-bit/Binary Neural Networks (BNNs)&lt;/strong>: Ultimate model compression, restricting weights to +1 or -1. Although currently suffering significant precision loss in LLMs, its potential is enormous, with related research emerging constantly.&lt;/li>
&lt;li>&lt;strong>Non-uniform Quantization&lt;/strong>: Like SqueezeLLM, dynamically allocating bit numbers based on data distribution, theoretically superior to uniform quantization.&lt;/li>
&lt;li>&lt;strong>Hardware-Algorithm Co-design&lt;/strong>: New hardware (such as FP8, FP4, INT4 support) is driving the development of new quantization algorithms, while new algorithms are guiding future hardware design.&lt;/li>
&lt;li>&lt;strong>Combining Quantization with Sparsification&lt;/strong>: Combining quantization with sparsification techniques like pruning holds promise for achieving higher rates of model compression.&lt;/li>
&lt;/ul>
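&lt;p>To make the 1-bit idea concrete, here is a toy sketch (an illustration of the generic BNN-style approximation &lt;code>w ≈ alpha * sign(w)&lt;/code>, not any specific paper's algorithm): weights are reduced to their signs plus a single per-tensor scale equal to the mean absolute value:&lt;/p>
&lt;pre>&lt;code class="language-python">import math

def binarize(weights):
    # Single per-tensor scale: the mean absolute value of the weights
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [math.copysign(1.0, w) for w in weights]
    return alpha, signs

def dequantize(alpha, signs):
    return [alpha * s for s in signs]

alpha, signs = binarize([0.4, -0.2, 0.1, -0.5])
print(round(alpha, 6), signs)  # 0.3 [1.0, -1.0, 1.0, -1.0]
&lt;/code>&lt;/pre>
&lt;p>Each weight now costs one bit plus a shared scale, a 16x reduction versus FP16; keeping the resulting reconstruction error acceptable for LLMs is exactly what current 1-bit research targets.&lt;/p>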
&lt;h2 id="7-conclusion">7. Conclusion&lt;/h2>
&lt;p>Model quantization is a key technology for addressing the challenges of the large model era. &lt;code>llama.cpp&lt;/code> and &lt;code>vLLM&lt;/code> represent two different quantization philosophies: &lt;code>llama.cpp&lt;/code> provides ultimate local inference performance for resource-constrained devices through its elegant GGUF format and built-in K-Quants; while &lt;code>vLLM&lt;/code> has become the king of GPU cloud inference services through its open ecosystem and integration of various cutting-edge quantization schemes.&lt;/p>
&lt;p>Understanding the quantization implementations of these two frameworks not only helps us choose the right tool for specific scenarios but also gives us insight into the development trajectory and future directions of the entire LLM inference optimization field.&lt;/p></description></item><item><title>VAD Technical Guide: Principles and Practices of Voice Activity Detection</title><link>https://ziyanglin.netlify.app/en/post/vad-documentation/</link><pubDate>Thu, 26 Jun 2025 02:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/vad-documentation/</guid><description>&lt;h2 id="1-vad-technology-overview-a-macrolevel-understanding">1. VAD Technology Overview: A Macro-level Understanding&lt;/h2>
&lt;h3 id="11-what-is-vad">1.1 What is VAD?&lt;/h3>
&lt;p>VAD (Voice Activity Detection) is a technology designed to accurately identify the presence of human speech in audio signals. Its core task is to segment an audio stream into two parts: &lt;strong>segments containing speech&lt;/strong> and &lt;strong>silent/noise segments without speech&lt;/strong>.&lt;/p>
&lt;p>From a macro perspective, VAD serves as the &amp;ldquo;gatekeeper&amp;rdquo; or &amp;ldquo;preprocessor&amp;rdquo; in the speech processing pipeline. It is crucial and typically the first step in any system that needs to process human speech.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Raw Audio Stream&amp;quot;] --&amp;gt; B{&amp;quot;VAD Module&amp;quot;}
B --&amp;gt;|&amp;quot;Speech Detected&amp;quot;| C[&amp;quot;Speech Segments&amp;quot;]
B --&amp;gt;|&amp;quot;No Speech Detected&amp;quot;| D[&amp;quot;Silence/Noise Segments&amp;quot;]
C --&amp;gt; E[&amp;quot;Further Processing: ASR, Voice Print, etc.&amp;quot;]
D --&amp;gt; F[&amp;quot;Discard or Use for Noise Modeling&amp;quot;]
&lt;/code>&lt;/pre>
&lt;h3 id="12-why-is-vad-so-important">1.2 Why is VAD So Important?&lt;/h3>
&lt;p>The value of VAD is reflected in several key aspects:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Conserving Computational Resources&lt;/strong>: In compute-intensive tasks like automatic speech recognition (ASR), processing only detected speech segments avoids unnecessary computation on silence and background noise, often saving 50% or more of CPU/GPU resources, since much of typical conversational audio is non-speech.&lt;/li>
&lt;li>&lt;strong>Improving Downstream Task Accuracy&lt;/strong>: Removing silent segments reduces interference for ASR models, voice print recognition models, or emotion analysis models, thereby improving their accuracy.&lt;/li>
&lt;li>&lt;strong>Optimizing Network Bandwidth&lt;/strong>: In real-time voice communication (like VoIP, WebRTC), silent segments can be either not transmitted or transmitted at extremely low bit rates (known as &amp;ldquo;Discontinuous Transmission&amp;rdquo;, DTX), significantly reducing network bandwidth usage.&lt;/li>
&lt;li>&lt;strong>Enhancing User Experience&lt;/strong>: In smart assistants and voice interaction scenarios, precise VAD enables more natural interaction, avoiding premature interruption of recognition during user pauses or false triggering in noisy environments.&lt;/li>
&lt;li>&lt;strong>Data Preprocessing and Annotation&lt;/strong>: When building large speech datasets, VAD can automatically segment and annotate effective speech segments, greatly improving data processing efficiency.&lt;/li>
&lt;/ul>
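&lt;p>The bandwidth point can be made concrete with a back-of-the-envelope sketch (the bitrates below are illustrative assumptions, not measurements): with DTX, silence is replaced by occasional comfort-noise updates, so the average bitrate scales roughly with the fraction of time speech is active.&lt;/p>
&lt;pre>&lt;code class="language-python">def avg_bitrate_kbps(speech_ratio, voice_kbps=24.0, dtx_kbps=0.5):
    # speech_ratio: fraction of time speech is active (0..1)
    # voice_kbps: codec bitrate while speaking; dtx_kbps: comfort-noise updates
    return speech_ratio * voice_kbps + (1 - speech_ratio) * dtx_kbps

# In a two-way call, each party typically speaks less than half the time
print(avg_bitrate_kbps(0.4))  # about 9.9 kbps instead of a constant 24 kbps
&lt;/code>&lt;/pre>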
&lt;h2 id="2-traditional-vad-implementation-methods">2. Traditional VAD Implementation Methods&lt;/h2>
&lt;p>Before deep learning became popular, VAD primarily relied on manually designed acoustic features. These methods are computationally simple and fast but have poor robustness in complex noisy environments.&lt;/p>
&lt;p>The main methods include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Energy-based&lt;/strong>: The simplest method. It's generally assumed that the short-time energy of speech signals is much greater than background noise. Speech and silence are distinguished by setting an energy threshold.
&lt;ul>
&lt;li>&lt;strong>Advantage&lt;/strong>: Extremely simple computation.&lt;/li>
&lt;li>&lt;strong>Disadvantage&lt;/strong>: Very sensitive to noise and volume changes, with thresholds difficult to set.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Zero-Crossing Rate (ZCR)&lt;/strong>: ZCR describes the frequency at which a signal crosses zero. Unvoiced sounds (like &amp;lsquo;s&amp;rsquo;) have a higher ZCR, while voiced sounds and background noise have a lower ZCR.
&lt;ul>
&lt;li>&lt;strong>Advantage&lt;/strong>: Insensitive to signal amplitude, making it robust to volume changes.&lt;/li>
&lt;li>&lt;strong>Disadvantage&lt;/strong>: Poor discrimination between certain unvoiced sounds and noise.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Spectral Features&lt;/strong>: Such as spectral entropy, spectral flatness, etc. Speech signals typically have more complex and regular spectral structures than noise, resulting in lower spectral entropy and less flat spectra.&lt;/li>
&lt;li>&lt;strong>Combined Features&lt;/strong>: In practical applications, multiple features (such as energy+ZCR) are often combined with smoothing filter techniques to enhance stability. The famous &lt;strong>WebRTC VAD&lt;/strong> is a classic example based on Gaussian Mixture Models (GMM), extracting features across multiple frequency bands with good performance and efficiency.&lt;/li>
&lt;/ul>
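&lt;p>A minimal sketch of the combined energy + ZCR approach (pure Python; the fixed thresholds are assumptions tuned to the synthetic signals below, whereas real systems estimate them from the noise floor):&lt;/p>
&lt;pre>&lt;code class="language-python">import math

def frame_features(frame):
    energy = sum(x * x for x in frame) / len(frame)  # short-time energy
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a &amp;gt;= 0) != (b &amp;gt;= 0))
    zcr = crossings / (len(frame) - 1)               # zero-crossing rate
    return energy, zcr

def is_speech(frame, energy_thresh=0.01, zcr_thresh=0.4):
    energy, zcr = frame_features(frame)
    # Voiced speech has high energy; unvoiced speech has high ZCR
    return energy &amp;gt; energy_thresh or zcr &amp;gt; zcr_thresh

sr = 8000  # 30 ms frames at 8 kHz = 240 samples
noise = [0.001 * math.sin(2 * math.pi * 50 * n / sr) for n in range(240)]
voiced = [0.5 * math.sin(2 * math.pi * 200 * n / sr) for n in range(240)]
print(is_speech(noise), is_speech(voiced))  # False True
&lt;/code>&lt;/pre>
&lt;p>The low-amplitude hum is rejected on both features, while the voiced tone passes on energy alone; an unvoiced fricative with low energy would instead pass on its high ZCR.&lt;/p>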
&lt;h2 id="3-deep-learningbased-vad">3. Deep Learning-based VAD&lt;/h2>
&lt;p>With the development of deep learning, neural network-based VAD methods far outperform traditional methods, especially in low signal-to-noise ratio (SNR) and complex noise environments. The core idea is to &lt;strong>let the model automatically learn the distinguishing features between speech and non-speech from data&lt;/strong>, rather than relying on manually designed rules.&lt;/p>
&lt;p>The general workflow for these models is as follows:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Audio Input&amp;quot;] --&amp;gt; B[&amp;quot;Feature Extraction&amp;lt;br&amp;gt;(e.g., MFCC, Fbank)&amp;quot;]
B --&amp;gt; C[&amp;quot;Deep Neural Network&amp;lt;br&amp;gt;(CNN, RNN, Transformer, etc.)&amp;quot;]
C --&amp;gt; D[&amp;quot;Output Layer&amp;lt;br&amp;gt;(Sigmoid/Softmax)&amp;quot;]
D --&amp;gt; E[&amp;quot;Speech/Non-speech Probability&amp;quot;]
E --&amp;gt; F{&amp;quot;Post-processing&amp;lt;br&amp;gt;(Threshold, Smoothing)&amp;quot;}
F --&amp;gt; G[&amp;quot;Final Decision&amp;quot;]
&lt;/code>&lt;/pre>
&lt;h2 id="4-indepth-analysis-of-the-silero-vad-model">4. In-depth Analysis of the Silero VAD Model&lt;/h2>
&lt;p>&lt;strong>Silero VAD&lt;/strong> is one of the leading VAD models in the industry, renowned for its &lt;strong>extremely high accuracy, amazing computational efficiency, and universality across multiple languages&lt;/strong>. Its achievements are primarily based on the &lt;code>snakers4/silero-vad&lt;/code> repository.&lt;/p>
&lt;h3 id="41-core-features">4.1 Core Features&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>High Precision&lt;/strong>: Its accuracy rivals or even surpasses many large, complex models in various noisy environments.&lt;/li>
&lt;li>&lt;strong>Extremely Lightweight&lt;/strong>: The model size is very small (on the order of 1 to 2 MB), making it easy to deploy on browsers, mobile devices, and even embedded systems.&lt;/li>
&lt;li>&lt;strong>Language-Independent&lt;/strong>: It is not trained on specific languages but learns the universal acoustic characteristics of human speech, making it effective for almost all languages worldwide.&lt;/li>
&lt;li>&lt;strong>Real-time Performance&lt;/strong>: Extremely low processing latency, making it ideal for real-time communication applications.&lt;/li>
&lt;/ul>
&lt;h3 id="42-model-architecture">4.2 Model Architecture&lt;/h3>
&lt;p>The core architecture of Silero VAD is a hybrid &lt;strong>CNN + GRU&lt;/strong> model. This architecture combines the advantages of both:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>CNN (Convolutional Neural Network)&lt;/strong>: Used to extract local features with translation invariance from raw audio or spectrograms. CNNs can effectively capture the instantaneous characteristics of sound events.&lt;/li>
&lt;li>&lt;strong>GRU (Gated Recurrent Unit)&lt;/strong>: A type of RNN (Recurrent Neural Network) used to process sequential data. It can capture the contextual dependencies of audio signals in the time dimension, such as the beginning and end of a syllable.&lt;/li>
&lt;/ul>
&lt;p>Its detailed architecture can be macroscopically understood as:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;Silero VAD Model&amp;quot;
A[&amp;quot;Input Audio Chunk&amp;lt;br&amp;gt; (e.g., 30ms, 16kHz)&amp;quot;] --&amp;gt; B(&amp;quot;Single-layer CNN&amp;quot;)
B --&amp;gt; C(&amp;quot;Multi-layer GRU&amp;quot;)
C --&amp;gt; D(&amp;quot;Fully Connected Layer&amp;quot;)
D --&amp;gt; E[&amp;quot;Output&amp;lt;br&amp;gt;(Sigmoid Activation)&amp;quot;]
end
E --&amp;gt; F[&amp;quot;Speech Probability (0-1)&amp;quot;]
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Diving into the details&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Input&lt;/strong>: The model receives a small segment of audio as input, such as a 480-sample chunk (equivalent to 30 milliseconds at a 16kHz sampling rate). The model processes &lt;strong>chunk-by-chunk&lt;/strong>.&lt;/li>
&lt;li>&lt;strong>Feature Extraction&lt;/strong>: Unlike many models, Silero VAD may operate directly on raw waveforms or very low-level features, with the first CNN layer automatically learning effective acoustic features, rather than relying on manually designed features like MFCC.&lt;/li>
&lt;li>&lt;strong>CNN Layer&lt;/strong>: This layer acts like a filter bank, scanning the input audio chunk to capture phoneme-level micro-patterns.&lt;/li>
&lt;li>&lt;strong>GRU Layer&lt;/strong>: This is the memory core of the model. The feature vector of each audio chunk after CNN processing is fed into the GRU. The internal state of the GRU is updated based on the current input and the previous state. This allows the model to understand &amp;ldquo;whether the sound I'm hearing now is a continuation of the previous sound or the beginning of a completely new sound event.&amp;rdquo; This is crucial for accurately judging the first word after a long silence or brief pauses in the middle of a sentence.&lt;/li>
&lt;li>&lt;strong>Fully Connected Layer &amp;amp; Output&lt;/strong>: The output of the GRU goes through one or more fully connected layers for integration, and finally through a &lt;code>Sigmoid&lt;/code> function, outputting a floating-point number between 0 and 1. This number represents &lt;strong>the probability that the current input audio chunk contains speech&lt;/strong>.&lt;/li>
&lt;/ol>
&lt;h3 id="43-technical-implementation-details">4.3 Technical Implementation Details&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>State Maintenance (Stateful)&lt;/strong>: To process continuous audio streams, Silero VAD is a stateful model. You need to maintain an internal state of the model (mainly the hidden state of the GRU) for each independent audio stream. After processing an audio chunk, the model's hidden state needs to be saved and used as input for processing the next audio chunk. This enables uninterrupted real-time detection.&lt;/li>
&lt;li>&lt;strong>Sampling Rate Support&lt;/strong>: Typically supports 8kHz and 16kHz, which are the most common sampling rates in voice communication.&lt;/li>
&lt;li>&lt;strong>Audio Chunk Size&lt;/strong>: The model has strict requirements for the size of input audio chunks, such as 256, 512, 768 (8kHz) or 512, 1024, 1536 (16kHz) samples. Developers need to buffer and segment the audio stream from microphones or networks into these fixed-size chunks.&lt;/li>
&lt;li>&lt;strong>Post-processing&lt;/strong>: The model only outputs the speech probability for a single chunk. In practical applications, a simple post-processing logic is also needed. For example:
&lt;ul>
&lt;li>&lt;code>trigger_level&lt;/code>: Speech activation threshold (e.g., 0.5).&lt;/li>
&lt;li>&lt;code>speech_pad_ms&lt;/code>: Additional audio retention after the speech end signal is issued, to prevent premature cutting.&lt;/li>
&lt;li>&lt;code>min_silence_duration_ms&lt;/code>: Minimum duration required to be classified as a silence segment.&lt;/li>
&lt;li>&lt;code>min_speech_duration_ms&lt;/code>: Minimum duration required to be classified as a speech segment, preventing brief noises (like coughs) from being misclassified as speech.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
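&lt;p>These parameters can be wired together as a small state machine over the per-chunk probabilities. The sketch below is a simplified, hypothetical version of such post-processing (the real Silero utilities are more elaborate); it assumes 32 ms chunks and returns speech segments in milliseconds:&lt;/p>
&lt;pre>&lt;code class="language-python">def probs_to_segments(probs, chunk_ms=32, trigger_level=0.5,
                      min_speech_duration_ms=250, min_silence_duration_ms=100):
    # Turn per-chunk speech probabilities into (start_ms, end_ms) segments
    segments, start, silence_run = [], None, 0
    for i, p in enumerate(probs):
        if p &amp;gt;= trigger_level:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            # Close the segment only after enough consecutive silence
            if silence_run * chunk_ms &amp;gt;= min_silence_duration_ms:
                end = i - silence_run + 1
                if (end - start) * chunk_ms &amp;gt;= min_speech_duration_ms:
                    segments.append((start * chunk_ms, end * chunk_ms))
                start, silence_run = None, 0
    if start is not None and (len(probs) - start) * chunk_ms &amp;gt;= min_speech_duration_ms:
        segments.append((start * chunk_ms, len(probs) * chunk_ms))
    return segments

probs = [0.1] + [0.9] * 9 + [0.1] * 5  # one utterance surrounded by silence
print(probs_to_segments(probs))  # [(32, 320)]
&lt;/code>&lt;/pre>
&lt;p>A &lt;code>speech_pad_ms&lt;/code> equivalent could be added by widening each returned segment by a fixed margin; note how a brief noise shorter than &lt;code>min_speech_duration_ms&lt;/code> would be dropped entirely.&lt;/p>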
&lt;h2 id="5-application-of-vad-in-realtime-voice-communication">5. Application of VAD in Real-time Voice Communication&lt;/h2>
&lt;h3 id="51-frontend-applications-browserclient">5.1 Frontend Applications (Browser/Client)&lt;/h3>
&lt;p>Running VAD on the frontend allows processing of voice data before it leaves the user's device, achieving maximum bandwidth savings and minimal latency.&lt;/p>
&lt;p>&lt;strong>Typical Scenarios&lt;/strong>: Web-based online meetings, browser-embedded customer service dialogue systems.&lt;/p>
&lt;p>&lt;strong>Implementation Process&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant User
participant Mic as Microphone
participant Browser
participant VAD as &amp;quot;Silero VAD (WASM/ONNX.js)&amp;quot;
participant Network as Network Module
User-&amp;gt;&amp;gt;Mic: Start speaking
Mic-&amp;gt;&amp;gt;Browser: Capture raw audio stream
Browser-&amp;gt;&amp;gt;Browser: Get audio via WebAudio API
Note right of Browser: &amp;quot;Create AudioContext and&amp;lt;br&amp;gt;ScriptProcessorNode/AudioWorklet&amp;quot;
loop Real-time Processing
Browser-&amp;gt;&amp;gt;VAD: Pass fixed-size audio chunk
VAD-&amp;gt;&amp;gt;VAD: Calculate speech probability
VAD--&amp;gt;&amp;gt;Browser: &amp;quot;Return speech probability (e.g., 0.9)&amp;quot;
end
Browser-&amp;gt;&amp;gt;Browser: Judge based on probability and post-processing logic
alt Speech Detected
Browser-&amp;gt;&amp;gt;Network: Encode and send the audio chunk
else No Speech Detected
Browser-&amp;gt;&amp;gt;Network: Discard audio chunk or send DTX signal
end
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Technology Stack&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Audio Capture&lt;/strong>: &lt;code>navigator.mediaDevices.getUserMedia()&lt;/code>&lt;/li>
&lt;li>&lt;strong>Audio Processing&lt;/strong>: Web Audio API (&lt;code>AudioContext&lt;/code>, &lt;code>AudioWorkletNode&lt;/code>)&lt;/li>
&lt;li>&lt;strong>VAD Model Running&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>WebAssembly (WASM)&lt;/strong>: Compile the VAD inference engine implemented in C++/Rust into WASM for near-native performance. Silero officially provides such an implementation.&lt;/li>
&lt;li>&lt;strong>ONNX.js / TensorFlow.js&lt;/strong>: Convert the VAD model to ONNX or TF.js format to run directly in JavaScript, simpler to deploy but slightly lower performance than WASM.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="52-backend-applications-server">5.2 Backend Applications (Server)&lt;/h3>
&lt;p>Running VAD on the backend allows centralized processing of all incoming audio streams, suitable for scenarios where client behavior cannot be controlled, or server-side recording and analysis are needed.&lt;/p>
&lt;p>&lt;strong>Typical Scenarios&lt;/strong>: ASR as a service, mixing and recording of multi-party calls, intelligent voice monitoring.&lt;/p>
&lt;p>&lt;strong>Implementation Process&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant Client
participant Server as &amp;quot;Voice Server (e.g., WebRTC SFU)&amp;quot;
participant VAD as Backend VAD Module
participant ASR as ASR Service
Client-&amp;gt;&amp;gt;Server: &amp;quot;Send continuous audio stream (RTP packets)&amp;quot;
Server-&amp;gt;&amp;gt;VAD: Feed decoded audio stream into VAD module
Note right of VAD: &amp;quot;Maintain an independent VAD state&amp;lt;br&amp;gt;for each client connection&amp;quot;
loop Real-time Processing
VAD-&amp;gt;&amp;gt;VAD: Process chunk by chunk, calculate speech probability
VAD--&amp;gt;&amp;gt;Server: &amp;quot;Return 'Speech Start' / 'Speech Continue' / 'Speech End' events&amp;quot;
end
alt &amp;quot;Speech Start&amp;quot; Event
Server-&amp;gt;&amp;gt;ASR: Create a new ASR task, start sending subsequent speech data
else &amp;quot;Speech End&amp;quot; Event
Server-&amp;gt;&amp;gt;ASR: End the ASR task, get recognition results
end
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Technology Stack&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Voice Server&lt;/strong>: Open-source projects like &lt;code>livekit&lt;/code>, &lt;code>ion-sfu&lt;/code>, or self-developed media servers.&lt;/li>
&lt;li>&lt;strong>VAD Module&lt;/strong>: Typically implemented in Python, C++, or Go, directly calling Silero's PyTorch model or its ONNX/C++ implementation.&lt;/li>
&lt;li>&lt;strong>Inter-service Communication&lt;/strong>: If VAD is an independent microservice, gRPC or message queues can be used to communicate with the main business server.&lt;/li>
&lt;/ul>
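&lt;p>The per-connection state and the &amp;ldquo;Speech Start / Continue / End&amp;rdquo; events in the diagram can be modeled with a small wrapper class, one instance per client stream. The sketch below is a hypothetical illustration of that bookkeeping, driven by the per-chunk probabilities a Silero-style model would return:&lt;/p>
&lt;pre>&lt;code class="language-python">class VadEventStream:
    # One instance per client connection (mirrors the per-connection VAD state)
    def __init__(self, trigger_level=0.5, start_chunks=2, end_chunks=5):
        self.trigger_level = trigger_level
        self.start_chunks = start_chunks  # speech chunks needed to fire 'start'
        self.end_chunks = end_chunks      # silence chunks needed to fire 'end'
        self.in_speech = False
        self.run = 0

    def push(self, prob):
        # Returns 'start', 'continue', 'end', or None for each chunk
        active = prob &amp;gt;= self.trigger_level
        if not self.in_speech:
            self.run = self.run + 1 if active else 0
            if self.run &amp;gt;= self.start_chunks:
                self.in_speech, self.run = True, 0
                return 'start'
            return None
        self.run = self.run + 1 if not active else 0
        if self.run &amp;gt;= self.end_chunks:
            self.in_speech, self.run = False, 0
            return 'end'
        return 'continue'

vad = VadEventStream()
events = [vad.push(p) for p in [0.1, 0.9, 0.9, 0.8, 0.2, 0.1, 0.1, 0.1, 0.1]]
print(events)
&lt;/code>&lt;/pre>
&lt;p>For the probability sequence above, the stream emits &lt;code>start&lt;/code> at the third chunk and &lt;code>end&lt;/code> after five consecutive silent chunks; on &lt;code>start&lt;/code> the server would open an ASR task, and on &lt;code>end&lt;/code> close it and collect the result.&lt;/p>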
&lt;h2 id="6-summary-and-outlook">6. Summary and Outlook&lt;/h2>
&lt;p>Although VAD seems like a simple task, it is the cornerstone of building efficient, intelligent voice applications.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Traditional VAD&lt;/strong> is simple and fast but struggles in complex scenarios.&lt;/li>
&lt;li>&lt;strong>Modern deep learning VAD, represented by Silero VAD&lt;/strong>, achieves an excellent balance of &lt;strong>accuracy, efficiency, and universality&lt;/strong> through careful model design, making high-quality VAD easy to deploy on any device, from cloud to edge.&lt;/li>
&lt;/ul>
&lt;p>In the future, VAD technology may evolve in more refined directions, such as:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Deeper integration with noise suppression&lt;/strong>: Not just detecting speech, but directly outputting clean speech.&lt;/li>
&lt;li>&lt;strong>Multimodal detection&lt;/strong>: Combining lip movement information from video (Lip-VAD) to achieve even greater accuracy.&lt;/li>
&lt;li>&lt;strong>More complex acoustic scene understanding&lt;/strong>: Not only distinguishing between speech and non-speech but also differentiating between different types of non-speech (such as music, applause, environmental noise), providing richer contextual information for downstream tasks.&lt;/li>
&lt;/ul></description></item><item><title>SGLang Technical Guide: High-Performance Structured Generation Framework</title><link>https://ziyanglin.netlify.app/en/post/sglang-documentation/</link><pubDate>Thu, 26 Jun 2025 01:07:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/sglang-documentation/</guid><description>&lt;h2 id="1-sglang-introduction">1. SGLang Introduction&lt;/h2>
&lt;p>SGLang (Structured Generation Language) is a high-performance service framework designed for large language models (LLMs) and vision language models (VLMs). Its core goal is to address the challenges faced by complex LLM programs in real-world applications, maximizing inference performance while maintaining flexibility.&lt;/p>
&lt;p>Traditional LLM service frameworks (like vLLM) excel at handling simple, one-shot prompting but face limitations in complex scenarios requiring multi-turn interactions, structured outputs, function calls, or control flow. SGLang effectively bridges this gap by introducing a novel frontend language and an efficient backend runtime.&lt;/p>
&lt;p>&lt;strong>Core advantages of SGLang include:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Exceptional Performance:&lt;/strong> SGLang introduces &lt;strong>RadixAttention&lt;/strong>, an innovative attention mechanism that automatically and losslessly reuses key-value caches (KV Cache), significantly improving inference speed in scenarios with complex prompts (like CoT, ReAct) or multi-turn conversations. Compared to leading frameworks like vLLM, SGLang can achieve several times higher throughput in these scenarios.&lt;/li>
&lt;li>&lt;strong>Powerful Programming Capabilities:&lt;/strong> SGLang provides an intuitive domain-specific language (DSL) that allows developers to orchestrate complex generation tasks in a Pythonic way. You can easily define variables, use loops and conditional statements, call external tools, and seamlessly integrate these logic elements with the LLM's generation process. This makes building complex AI agents, multi-turn dialogue systems, and structured data extraction tasks unprecedentedly simple.&lt;/li>
&lt;li>&lt;strong>Unified Frontend-Backend Interface:&lt;/strong> SGLang decouples frontend programming logic from backend inference services. The frontend defines &amp;ldquo;what to generate,&amp;rdquo; while the backend handles &amp;ldquo;how to efficiently generate it.&amp;rdquo; This design not only simplifies the development process but also makes SGLang compatible with OpenAI's API standards, allowing users to easily migrate existing applications to SGLang and immediately benefit from performance gains.&lt;/li>
&lt;li>&lt;strong>Flexible Structured Output:&lt;/strong> SGLang provides powerful structured output constraint capabilities. Whether through regular expressions, EBNF grammar, or JSON Schema, you can precisely control the output format of the LLM, ensuring that the generated content conforms to the expected structure, which is crucial for applications requiring reliable data formats.&lt;/li>
&lt;/ul>
&lt;p>In summary, SGLang is not just an LLM inference acceleration engine but a complete programming and execution framework for complex generation tasks. It aims to enable developers to fully unleash the potential of large language models in an efficient and intuitive way.&lt;/p>
&lt;h2 id="2-core-features">2. Core Features&lt;/h2>
&lt;p>The power of SGLang lies in its unique design, which combines an intuitive frontend programming model with an efficient backend execution engine. Below are detailed introductions to several of its core features.&lt;/p>
&lt;h3 id="21-radixattention-kv-cache-optimization-for-complex-prompts">2.1 RadixAttention: KV Cache Optimization for Complex Prompts&lt;/h3>
&lt;p>When processing complex LLM programs, such as Chain-of-Thought, multi-turn dialogues, or agents that need to call tools, prompts often contain large shared prefixes. Traditional attention mechanisms produce redundant computation and storage when handling these shared prefixes.&lt;/p>
&lt;p>SGLang introduces &lt;strong>RadixAttention&lt;/strong>, a novel KV cache optimization technique. Its core idea is to organize prompts into a radix tree and perform attention calculations on this tree.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Automatic Sharing and Reuse&lt;/strong>: RadixAttention can automatically identify and share common prefixes between different requests, avoiding duplicate computation and storage. For example, in multi-turn dialogues, the conversation history of each turn can be losslessly reused by subsequent turns.&lt;/li>
&lt;li>&lt;strong>Performance Improvement&lt;/strong>: By maximizing KV cache reuse, RadixAttention significantly reduces memory usage and computational load, increasing throughput by 2 to 5 times, especially when handling long prompts or high-concurrency requests.&lt;/li>
&lt;/ul>
&lt;p>Below is a Mermaid diagram that visually demonstrates how RadixAttention handles requests with shared prefixes:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;Traditional Method (No Sharing)&amp;quot;
req1[&amp;quot;Request 1: 'A B C D'&amp;quot;]
req2[&amp;quot;Request 2: 'A B E F'&amp;quot;]
kv1[&amp;quot;KV Cache: [A, B, C, D]&amp;quot;]
kv2[&amp;quot;KV Cache: [A, B, E, F]&amp;quot;]
req1 --&amp;gt; kv1
req2 --&amp;gt; kv2
end
subgraph &amp;quot;SGLang RadixAttention&amp;quot;
Root(&amp;quot;Root&amp;quot;) --&amp;gt; A(&amp;quot;Token 'A'&amp;quot;);
A --&amp;gt; B(&amp;quot;Token 'B'&amp;quot;);
B --&amp;gt; C(&amp;quot;Token 'C'&amp;quot;);
B --&amp;gt; E(&amp;quot;Token 'E'&amp;quot;);
C --&amp;gt; D(&amp;quot;Token 'D'&amp;quot;);
E --&amp;gt; F(&amp;quot;Token 'F'&amp;quot;);
style A fill:#9f9
style B fill:#9f9
end
&lt;/code>&lt;/pre>
&lt;p>In the diagram above, for two requests &lt;code>'A B C D'&lt;/code> and &lt;code>'A B E F'&lt;/code>, the traditional method creates two independent KV caches. RadixAttention, however, organizes them into a tree, sharing the computation and storage of the common prefix &lt;code>'A B'&lt;/code> (green nodes), creating new branches only for the different parts (C, D, E, F). This greatly improves memory and computational efficiency.&lt;/p>
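&lt;p>The sharing in the diagram can be sketched with a plain token trie (an illustration of the idea only, not SGLang's actual data structure): inserting the two requests stores, and in cache terms &amp;ldquo;computes&amp;rdquo;, the shared prefix &lt;code>'A B'&lt;/code> only once.&lt;/p>
&lt;pre>&lt;code class="language-python">class TrieNode:
    def __init__(self):
        self.children = {}

def insert(root, tokens):
    # Returns (reused, created): reused tokens hit an existing cache entry
    node, reused, created = root, 0, 0
    for t in tokens:
        if t in node.children:
            reused += 1   # KV cache for this token already exists
        else:
            node.children[t] = TrieNode()
            created += 1  # compute and cache KV for a new token
        node = node.children[t]
    return reused, created

root = TrieNode()
print(insert(root, ['A', 'B', 'C', 'D']))  # (0, 4): cold cache
print(insert(root, ['A', 'B', 'E', 'F']))  # (2, 2): prefix 'A B' reused
&lt;/code>&lt;/pre>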
&lt;h3 id="22-unified-frontend-programming-language-dsl">2.2 Unified Frontend Programming Language (DSL)&lt;/h3>
&lt;p>SGLang provides an expressive domain-specific language (DSL) deeply integrated with Python, allowing developers to build complex generation logic in a natural and intuitive way.&lt;/p>
&lt;h3 id="sglang-architecture-overview">SGLang Architecture Overview&lt;/h3>
&lt;p>To better understand how SGLang works, we can observe its core architecture through the following flowchart:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph User Side
A[Developer defines SGLang program&amp;lt;br&amp;gt;using function decorator] --&amp;gt; B{Call run method};
end
subgraph SGLang Frontend
B --&amp;gt; C[1. Parse Python AST&amp;lt;br&amp;gt;Separate deterministic logic and generation instructions];
C --&amp;gt; D[2. Build portable&amp;lt;br&amp;gt;SGLang IR intermediate representation];
end
subgraph Network Communication
D -- HTTP Request --&amp;gt; E[SGLang backend service SRT];
end
subgraph SGLang Backend SRT
E --&amp;gt; F[3. Receive IR and schedule];
F --&amp;gt; G{RadixAttention engine};
G --&amp;gt; H[4. Efficient execution&amp;lt;br&amp;gt;KV cache reuse];
H --&amp;gt; I[LLM/VLM model];
I --&amp;gt; J[5. Generate results];
end
subgraph Return Path
J -- HTTP Response --&amp;gt; K[Return results to frontend];
K --&amp;gt; L[6. Fill state object `s`];
L --&amp;gt; M[User gets final results];
end
style B fill:#f9f,stroke:#333,stroke-width:2px
style E fill:#ccf,stroke:#333,stroke-width:2px
style G fill:#9cf,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;p>This diagram clearly shows how SGLang decouples and combines the programming convenience of the frontend with the high-performance execution engine of the backend.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Pythonic Control Flow&lt;/strong>: You can directly use standard Python control flow statements like &lt;code>if/else&lt;/code> and &lt;code>for&lt;/code> loops in SGLang functions to dynamically build prompts.&lt;/li>
&lt;li>&lt;strong>Integration of Generation and Logic&lt;/strong>: Through the &lt;code>@function&lt;/code> decorator and &lt;code>gen()&lt;/code> instruction, SGLang seamlessly combines the LLM's generation process (the &amp;ldquo;non-deterministic&amp;rdquo; part) with the program's deterministic logic.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example: Generating Different Content Based on Conditions&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from sglang import function, system, user, assistant, gen
@function
def tool_use(s, question):
    s += system(&amp;quot;You are a helpful assistant.&amp;quot;)
    s += user(question)
    s += assistant(
        &amp;quot;To answer this question, I need to use a &amp;quot;
        + gen(&amp;quot;tool&amp;quot;, choices=[&amp;quot;calculator&amp;quot;, &amp;quot;search engine&amp;quot;])
        + &amp;quot;. &amp;quot;
    )
    if s[&amp;quot;tool&amp;quot;] == &amp;quot;calculator&amp;quot;:
        s += assistant(&amp;quot;The math expression is: &amp;quot; + gen(&amp;quot;expression&amp;quot;))
    elif s[&amp;quot;tool&amp;quot;] == &amp;quot;search engine&amp;quot;:
        s += assistant(&amp;quot;The key word to search is: &amp;quot; + gen(&amp;quot;word&amp;quot;))
state = tool_use.run(&amp;quot;What is the population of London?&amp;quot;)
print(state[&amp;quot;tool&amp;quot;])
# Output: search engine
print(state[&amp;quot;word&amp;quot;])
# Output: population of London
&lt;/code>&lt;/pre>
&lt;p>In this example, the program first asks the LLM to choose between &amp;ldquo;calculator&amp;rdquo; and &amp;ldquo;search engine&amp;rdquo; as a tool, then executes different logic branches based on the LLM's choice, guiding the LLM to generate the next step of content.&lt;/p>
&lt;h3 id="23-powerful-structured-output">2.3 Powerful Structured Output&lt;/h3>
&lt;p>To ensure that content generated by the LLM can be reliably parsed and used by downstream programs, SGLang provides multiple powerful structured output constraint mechanisms.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Regular Expressions (Regex)&lt;/strong>: You can provide a regular expression to force the model's output to strictly match that pattern. This is useful for generating identifiers, numbers, or simple text fragments in specific formats.&lt;/p>
&lt;pre>&lt;code class="language-python">response = client.chat.completions.create(
model=&amp;quot;deepseek-ai/DeepSeek-R1-Distill-Qwen-7B&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;assistant&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;What is the capital of France?&amp;quot;}],
extra_body={&amp;quot;regex&amp;quot;: &amp;quot;(Paris|London)&amp;quot;},
)
# response.choices[0].message.content will necessarily be &amp;quot;Paris&amp;quot; or &amp;quot;London&amp;quot;
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>EBNF Grammar&lt;/strong>: For more complex grammatical structures, you can use Extended Backus-Naur Form (EBNF) to define a complete grammar. This allows you to generate code, DSLs, or other structured text that strictly adheres to specific syntax.&lt;/p>
&lt;pre>&lt;code class="language-python">ebnf_grammar = &amp;quot;&amp;quot;&amp;quot;
root ::= city &amp;quot; is the capital of &amp;quot; country
city ::= &amp;quot;London&amp;quot; | &amp;quot;Paris&amp;quot; | &amp;quot;Berlin&amp;quot; | &amp;quot;Rome&amp;quot;
country ::= &amp;quot;England&amp;quot; | &amp;quot;France&amp;quot; | &amp;quot;Germany&amp;quot; | &amp;quot;Italy&amp;quot;
&amp;quot;&amp;quot;&amp;quot;
response = client.chat.completions.create(
    model=&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;,
    messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Give me the information of the capital of France.&amp;quot;}],
    extra_body={&amp;quot;ebnf&amp;quot;: ebnf_grammar},
)
# response.choices[0].message.content will be &amp;quot;Paris is the capital of France&amp;quot;
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>JSON Schema&lt;/strong>: SGLang supports using JSON Schema to constrain the model to generate structured JSON objects. You can directly define a JSON Schema or use a Pydantic model to automatically generate one. This is crucial for APIs and data processing tasks that require reliable, verifiable JSON output.&lt;/p>
&lt;pre>&lt;code class="language-python">from pydantic import BaseModel
class CapitalInfo(BaseModel):
name: str
population: int
response = client.chat.completions.create(
model=&amp;quot;deepseek-ai/DeepSeek-R1-Distill-Qwen-7B&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;assistant&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Give me the information and population of the capital of France in the JSON format.&amp;quot;}],
response_format={
&amp;quot;type&amp;quot;: &amp;quot;json_schema&amp;quot;,
&amp;quot;json_schema&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;capital_info&amp;quot;,
&amp;quot;schema&amp;quot;: CapitalInfo.model_json_schema(),
},
},
)
# response.choices[0].message.content will be a JSON string conforming to the CapitalInfo structure
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
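&lt;p>Because the schema constraint guarantees syntactically valid JSON, the downstream consumer can parse the response directly, with no error-tolerant cleanup. A minimal sketch (the response string below is a stand-in for illustration, not a real model output):&lt;/p>

```python
import json

# Illustrative response content, in the shape a schema-constrained
# call would return it (stand-in data, not a real model response):
content = '{"name": "Paris", "population": 2102650}'

# Under the schema constraint the string is guaranteed to parse,
# so a plain json.loads suffices.
info = json.loads(content)
print(info["name"], info["population"])
```
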
&lt;h2 id="3-quick-start">3. Quick Start&lt;/h2>
&lt;p>This section will guide you through installing SGLang, starting the service, and basic usage, allowing you to experience SGLang's powerful features in just a few minutes.&lt;/p>
&lt;h3 id="31-installation">3.1 Installation&lt;/h3>
&lt;p>SGLang can be installed via &lt;code>pip&lt;/code> or the faster &lt;code>uv&lt;/code>. For the best experience and full functionality, it's recommended to install with the &lt;code>all&lt;/code> extra.&lt;/p>
&lt;p>&lt;strong>Using pip:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install --upgrade pip
pip install &amp;quot;sglang[all]&amp;quot;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Using uv (recommended, faster):&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install uv
uv pip install &amp;quot;sglang[all]&amp;quot;
&lt;/code>&lt;/pre>
&lt;blockquote>
&lt;p>&lt;strong>Note&lt;/strong>: The installation process may require compiling CUDA kernels (such as &lt;code>flashinfer&lt;/code>). Please ensure that the &lt;code>CUDA_HOME&lt;/code> environment variable is correctly configured in your environment and that the CUDA version is compatible with your PyTorch version.&lt;/p>
&lt;/blockquote>
&lt;h3 id="32-starting-the-backend-service-srt">3.2 Starting the Backend Service (SRT)&lt;/h3>
&lt;p>After installation, the next step is to start SGLang's backend service (SRT, SGLang Runtime). This service will load the specified language model and provide an interface compatible with the OpenAI API.&lt;/p>
&lt;p>Run the following command in your terminal:&lt;/p>
&lt;pre>&lt;code class="language-bash">python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Parameter Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;code>--model-path&lt;/code>: Specifies the path to the model to load. This can be a model name on the Hugging Face Hub (as shown in this example) or a local model path.&lt;/li>
&lt;li>&lt;code>--host&lt;/code>: The host address the service listens on. &lt;code>0.0.0.0&lt;/code> means allowing access from any network interface.&lt;/li>
&lt;li>&lt;code>--port&lt;/code>: The port number the service listens on.&lt;/li>
&lt;/ul>
&lt;p>When the service starts successfully, you'll see output similar to the following, indicating that the model has been loaded and is ready to receive requests.&lt;/p>
&lt;pre>&lt;code>INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
INFO: Started server process [12345]
INFO: Waiting for application startup.
INFO: Application startup complete.
&lt;/code>&lt;/pre>
&lt;h3 id="33-sending-your-first-request">3.3 Sending Your First Request&lt;/h3>
&lt;p>With the service running, we can now interact with it using OpenAI's Python client library.&lt;/p>
&lt;p>Create a Python file named &lt;code>test_sglang.py&lt;/code> and fill it with the following content:&lt;/p>
&lt;pre>&lt;code class="language-python">import openai
# Initialize the client, pointing to our locally started SGLang service
client = openai.Client(
    base_url=&amp;quot;http://127.0.0.1:30000/v1&amp;quot;,
    api_key=&amp;quot;EMPTY&amp;quot;,  # SGLang service doesn't require an API Key
)

# Create a chat completion request
response = client.chat.completions.create(
    model=&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;,  # Must match the model loaded by the service
    messages=[
        {&amp;quot;role&amp;quot;: &amp;quot;system&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;You are a helpful assistant.&amp;quot;},
        {&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;What is the capital of France and why is it famous?&amp;quot;},
    ],
    temperature=0.7,
    max_tokens=150,
)

# Print the model's response
print(response.choices[0].message.content)
&lt;/code>&lt;/pre>
&lt;p>Run this script:&lt;/p>
&lt;pre>&lt;code class="language-bash">python test_sglang.py
&lt;/code>&lt;/pre>
&lt;p>You'll see the model's detailed answer about Paris. At this point, you've successfully completed the entire process from service deployment to inference request using SGLang!&lt;/p>
&lt;h2 id="4-frontend-language-sglang-dsl">4. Frontend Language (SGLang DSL)&lt;/h2>
&lt;p>SGLang's frontend language (DSL) is the core of its usability. It allows you to define complex generation processes in a declarative way, perfectly combining Python's flexibility with the generative capabilities of LLMs.&lt;/p>
&lt;h3 id="41-function-decorator">4.1 &lt;code>@function&lt;/code> Decorator&lt;/h3>
&lt;p>All SGLang programs begin with a Python function decorated by &lt;code>@function&lt;/code>. This decorator transforms an ordinary Python function into an executable SGLang program template.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>State Management&lt;/strong>: The first parameter of the function (typically named &lt;code>s&lt;/code>) represents the current generation state. It's a dictionary-like object used to store and pass all variables produced during the generation process.&lt;/li>
&lt;li>&lt;strong>Delayed Execution&lt;/strong>: Functions decorated with &lt;code>@function&lt;/code> are not executed immediately when defined. Instead, they create a reusable template. The program only executes when the &lt;code>.run()&lt;/code> or &lt;code>.run_batch()&lt;/code> method is called.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Interaction Flow&lt;/strong>&lt;/p>
&lt;p>The entire function call interaction flow can be represented by the following sequence diagram:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant User
participant App as Application (Python)
participant SGLang as SGLang Service
participant Tool as External Tool (e.g., Weather API)
User-&amp;gt;&amp;gt;+App: &amp;quot;What's the weather like in Boston?&amp;quot;
App-&amp;gt;&amp;gt;+SGLang: Send request with messages and tools
SGLang-&amp;gt;&amp;gt;SGLang: Model decides to call get_current_weather
SGLang--&amp;gt;&amp;gt;-App: Return tool_calls with function name and parameters
App-&amp;gt;&amp;gt;App: Parse tool_calls
App-&amp;gt;&amp;gt;+Tool: Call get_current_weather(city=&amp;quot;Boston&amp;quot;, unit=&amp;quot;fahrenheit&amp;quot;)
Tool--&amp;gt;&amp;gt;-App: Return weather result: &amp;quot;68°F&amp;quot;
App-&amp;gt;&amp;gt;+SGLang: Send new request with weather result
SGLang-&amp;gt;&amp;gt;SGLang: Model generates final reply based on weather result
SGLang--&amp;gt;&amp;gt;-App: Return final natural language reply
App--&amp;gt;&amp;gt;-User: &amp;quot;It's currently 68°F in Boston.&amp;quot;
&lt;/code>&lt;/pre>
&lt;p>This sequence diagram clearly shows the complete loop from user question to model decision, tool call, result integration, and final response.&lt;/p>
&lt;h3 id="42-core-instructions">4.2 Core Instructions&lt;/h3>
&lt;p>Within SGLang functions, you use a series of instructions to build prompts and control the generation flow.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Role Instructions&lt;/strong>: &lt;code>system()&lt;/code>, &lt;code>user()&lt;/code>, &lt;code>assistant()&lt;/code>
These instructions are used to define different parts of a conversation, conforming to the standard multi-turn dialogue format. You can pass strings directly to them.&lt;/li>
&lt;li>&lt;strong>Generation Instruction&lt;/strong>: &lt;code>gen()&lt;/code>
This is the most important instruction in SGLang. It tells the LLM to generate text at the current position.
&lt;ul>
&lt;li>&lt;code>s += gen(&amp;quot;variable_name&amp;quot;, ...)&lt;/code>: The first parameter of &lt;code>gen()&lt;/code> is required and specifies the variable name in which the generation result will be stored in the state &lt;code>s&lt;/code>.&lt;/li>
&lt;li>&lt;code>max_tokens&lt;/code>: Limits the maximum number of tokens to generate.&lt;/li>
&lt;li>&lt;code>stop&lt;/code>: Defines one or more stop strings. When the model generates these strings, the generation process ends early.&lt;/li>
&lt;li>&lt;code>choices&lt;/code>: Provides a list of strings, forcing the model to choose one of these options for generation.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example: A Complete Frontend Function&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from sglang import function, system, user, assistant, gen, set_default_backend, OpenAI
# Set the backend to the OpenAI-compatible service provided by SGLang
set_default_backend(OpenAI(&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;))
@function
def multi_turn_qa(s, question1, question2):
s += system(&amp;quot;You are a helpful assistant.&amp;quot;)
s += user(question1)
s += assistant(gen(&amp;quot;answer1&amp;quot;, max_tokens=128))
s += user(question2)
s += assistant(gen(&amp;quot;answer2&amp;quot;, max_tokens=128))
# Execute the SGLang program
state = multi_turn_qa.run(
question1=&amp;quot;What is the capital of the UK?&amp;quot;,
question2=&amp;quot;What is its population?&amp;quot;,
temperature=0.1
)
print(&amp;quot;Answer 1:&amp;quot;, state[&amp;quot;answer1&amp;quot;])
print(&amp;quot;Answer 2:&amp;quot;, state[&amp;quot;answer2&amp;quot;])
&lt;/code>&lt;/pre>
&lt;h3 id="43-streaming-output">4.3 Streaming Output&lt;/h3>
&lt;p>For applications requiring real-time feedback, SGLang supports streaming output. Simply set &lt;code>stream=True&lt;/code> when calling &lt;code>.run()&lt;/code> and iterate over the generator returned by the state object's &lt;code>.text_iter()&lt;/code> method.&lt;/p>
&lt;pre>&lt;code class="language-python">state = multi_turn_qa.run(
question1=&amp;quot;Write a short story about a robot.&amp;quot;,
question2=&amp;quot;Continue the story.&amp;quot;,
stream=True
)
for out in state.text_iter(&amp;quot;answer2&amp;quot;):
print(out, end=&amp;quot;&amp;quot;, flush=True)
&lt;/code>&lt;/pre>
&lt;h2 id="5-backend-service-srt-and-api-reference">5. Backend Service (SRT) and API Reference&lt;/h2>
&lt;p>SGLang's backend, the SGLang Runtime (SRT), is a high-performance inference server implemented in Python. It's responsible for loading models, managing KV caches (through RadixAttention), and handling requests from clients. SRT provides two main API endpoints.&lt;/p>
&lt;h3 id="51-native-api-generate">5.1 Native API: &lt;code>/generate&lt;/code>&lt;/h3>
&lt;p>This is a lower-level API that provides the finest control over the generation process.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Endpoint&lt;/strong>: &lt;code>POST /generate&lt;/code>&lt;/li>
&lt;li>&lt;strong>Description&lt;/strong>: Generate text starting from a given text prompt.&lt;/li>
&lt;li>&lt;strong>Core Parameters&lt;/strong>:
&lt;ul>
&lt;li>&lt;code>text&lt;/code> (string, required): The input text prompt.&lt;/li>
&lt;li>&lt;code>sampling_params&lt;/code> (object, optional): A JSON object containing sampling parameters.
&lt;ul>
&lt;li>&lt;code>temperature&lt;/code> (float): Sampling temperature.&lt;/li>
&lt;li>&lt;code>max_new_tokens&lt;/code> (int): Maximum number of new tokens to generate.&lt;/li>
&lt;li>&lt;code>stop&lt;/code> (string or list[string]): Stop tokens.&lt;/li>
&lt;li>&lt;code>json_schema&lt;/code> (string): JSON Schema string for constraining output.&lt;/li>
&lt;li>&lt;code>regex&lt;/code> (string): Regular expression for constraining output.&lt;/li>
&lt;li>&lt;code>ebnf&lt;/code> (string): EBNF grammar for constraining output.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;code>stream&lt;/code> (boolean, optional): Whether to use streaming.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example (using &lt;code>requests&lt;/code>)&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-python">import requests
import json
url = &amp;quot;http://127.0.0.1:30000/generate&amp;quot;
data = {
    &amp;quot;text&amp;quot;: &amp;quot;The capital of France is&amp;quot;,
    &amp;quot;sampling_params&amp;quot;: {
        &amp;quot;temperature&amp;quot;: 0,
        &amp;quot;max_new_tokens&amp;quot;: 16,
    },
}
response = requests.post(url, json=data)
print(response.json())
# {'text': ' Paris.\n\nThe capital of France is Paris. It is the most populous city in', 'meta': ...}
&lt;/code>&lt;/pre>
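&lt;p>When &lt;code>stream&lt;/code> is enabled, the response arrives as server-sent events, one &lt;code>data:&lt;/code> line per chunk. A minimal parsing helper might look like this (a sketch assuming SSE-style &lt;code>data:&lt;/code> lines carrying a JSON payload and a terminating &lt;code>[DONE]&lt;/code> marker; the example chunk is illustrative):&lt;/p>

```python
import json

def parse_sse_line(line: bytes):
    """Parse one server-sent-events line from a streaming response.

    Returns the decoded JSON payload, or None for empty lines,
    non-data lines, and the terminating [DONE] marker.
    """
    if not line.startswith(b"data:"):
        return None
    payload = line[len(b"data:"):].strip()
    if payload == b"[DONE]":
        return None
    return json.loads(payload)

# Illustrative chunk in the shape such a stream might emit:
chunk = parse_sse_line(b'data: {"text": " Paris", "meta_info": {}}')
print(chunk["text"])
```

&lt;p>In practice you would feed this helper from &lt;code>requests.post(url, json=data, stream=True)&lt;/code> by iterating &lt;code>response.iter_lines()&lt;/code>.&lt;/p>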
&lt;h3 id="52-openai-compatible-api-v1chatcompletions">5.2 OpenAI Compatible API: &lt;code>/v1/chat/completions&lt;/code>&lt;/h3>
&lt;p>For easy migration and integration, SGLang provides a chat completion API fully compatible with OpenAI. You can seamlessly use OpenAI's official client library.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Endpoint&lt;/strong>: &lt;code>POST /v1/chat/completions&lt;/code>&lt;/li>
&lt;li>&lt;strong>Description&lt;/strong>: Perform chat-style text generation.&lt;/li>
&lt;li>&lt;strong>Core Parameters&lt;/strong>:
&lt;ul>
&lt;li>&lt;code>model&lt;/code> (string, required): The name of the model.&lt;/li>
&lt;li>&lt;code>messages&lt;/code> (list[object], required): List of conversation messages.&lt;/li>
&lt;li>&lt;code>temperature&lt;/code>, &lt;code>max_tokens&lt;/code>, &lt;code>stream&lt;/code>, etc.&lt;/li>
&lt;li>&lt;code>response_format&lt;/code> (object, optional): For specifying structured output, such as &lt;code>{&amp;quot;type&amp;quot;: &amp;quot;json_schema&amp;quot;, &amp;quot;json_schema&amp;quot;: ...}&lt;/code>.&lt;/li>
&lt;li>&lt;code>extra_body&lt;/code> (object, optional): SGLang-specific extension parameters, such as &lt;code>{&amp;quot;regex&amp;quot;: &amp;quot;...&amp;quot;}&lt;/code> or &lt;code>{&amp;quot;ebnf&amp;quot;: &amp;quot;...&amp;quot;}&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example (using the &lt;code>openai&lt;/code> library)&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-python">import openai
client = openai.Client(base_url=&amp;quot;http://127.0.0.1:30000/v1&amp;quot;, api_key=&amp;quot;EMPTY&amp;quot;)
response = client.chat.completions.create(
    model=&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;,
    messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;List 3 countries and their capitals.&amp;quot;}],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
&lt;/code>&lt;/pre>
&lt;h2 id="6-advanced-usage-function-callingtool-usage">6. Advanced Usage: Function Calling/Tool Usage&lt;/h2>
&lt;p>SGLang's powerful programming model makes it very suitable for building AI agents capable of calling external tools. This is typically achieved through structured output, where the model is guided to generate text in a specific format (usually JSON) describing a function call.&lt;/p>
&lt;p>Here are the steps to build a simple weather query agent:&lt;/p>
&lt;p>&lt;strong>1. Define Tool Schema&lt;/strong>&lt;/p>
&lt;p>First, use JSON Schema to define your tool. This tells the model the name of the tool, its purpose, and what parameters it needs.&lt;/p>
&lt;pre>&lt;code class="language-python">tools = [
{
&amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;,
&amp;quot;function&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;get_current_weather&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;Get the current weather in a given location&amp;quot;,
&amp;quot;parameters&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
&amp;quot;properties&amp;quot;: {
&amp;quot;city&amp;quot;: {&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;, &amp;quot;description&amp;quot;: &amp;quot;The city name&amp;quot;},
&amp;quot;unit&amp;quot;: {&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;, &amp;quot;enum&amp;quot;: [&amp;quot;celsius&amp;quot;, &amp;quot;fahrenheit&amp;quot;]},
},
&amp;quot;required&amp;quot;: [&amp;quot;city&amp;quot;, &amp;quot;unit&amp;quot;],
},
},
}
]
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>2. Guide the Model to Make Function Calls&lt;/strong>&lt;/p>
&lt;p>In the &lt;code>messages&lt;/code> sent to the model, include a system prompt indicating that the model can use these tools. Then, pass &lt;code>tools&lt;/code> and &lt;code>tool_choice=&amp;quot;auto&amp;quot;&lt;/code> in the API call.&lt;/p>
&lt;pre>&lt;code class="language-python">import json
messages = [
    {&amp;quot;role&amp;quot;: &amp;quot;system&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;You are a helpful assistant that can access external tools.&amp;quot;},
    {&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;What's the weather like in Boston in fahrenheit?&amp;quot;},
]
response = client.chat.completions.create(
    model=&amp;quot;meta-llama/Meta-Llama-3.1-8B-Instruct&amp;quot;,
    messages=messages,
    tools=tools,
    tool_choice=&amp;quot;auto&amp;quot;,
)

# Check if the model decided to call a tool
response_message = response.choices[0].message
tool_calls = response_message.tool_calls
if tool_calls:
    # Model decided to call a tool
    for tool_call in tool_calls:
        function_name = tool_call.function.name
        function_args = json.loads(tool_call.function.arguments)
        print(f&amp;quot;Function Call: {function_name}&amp;quot;)
        print(f&amp;quot;Arguments: {function_args}&amp;quot;)
        # Here, you could actually execute the function call
        # e.g., result = get_current_weather(**function_args)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Output:&lt;/strong>&lt;/p>
&lt;pre>&lt;code>Function Call: get_current_weather
Arguments: {'city': 'Boston', 'unit': 'fahrenheit'}
&lt;/code>&lt;/pre>
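&lt;p>To close the loop shown in the sequence diagram earlier, the tool result is appended to the conversation as a &lt;code>tool&lt;/code> message and the model is queried again for the final natural-language reply. A sketch of that step (the &lt;code>get_current_weather&lt;/code> implementation and the &lt;code>tool_call_id&lt;/code> value are stand-ins for illustration):&lt;/p>

```python
import json

# Stand-in tool implementation; a real agent would call a weather API here.
def get_current_weather(city: str, unit: str) -> str:
    return "68°F" if unit == "fahrenheit" else "20°C"

# Suppose the model returned this tool call (matching the example output above):
tool_call_id = "call_0"  # hypothetical id taken from tool_call.id
function_args = json.loads('{"city": "Boston", "unit": "fahrenheit"}')

# Append the tool result with role "tool" and the matching tool_call_id;
# the updated messages list is then sent back via
# client.chat.completions.create(...) to obtain the final reply.
tool_message = {
    "role": "tool",
    "tool_call_id": tool_call_id,
    "content": get_current_weather(**function_args),
}
print(tool_message["content"])
```
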
&lt;p>In this way, you can build powerful AI applications capable of interacting with the external world.&lt;/p></description></item><item><title>Llama.cpp Technical Guide: Lightweight LLM Inference Engine</title><link>https://ziyanglin.netlify.app/en/post/llama-cpp-documentation/</link><pubDate>Thu, 26 Jun 2025 01:06:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/llama-cpp-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>Llama.cpp is a high-performance, lightweight inference framework for large language models (LLMs) written in C/C++. It focuses on efficiently running LLMs on consumer-grade hardware, making local inference possible on ordinary laptops and even smartphones.&lt;/p>
&lt;p>&lt;strong>Core Advantages:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High Performance:&lt;/strong> Achieves extremely fast inference speeds through optimized C/C++ code, quantization techniques, and hardware acceleration support (such as Apple Metal, CUDA, OpenCL, SYCL).&lt;/li>
&lt;li>&lt;strong>Lightweight:&lt;/strong> Extremely low memory and computational resource consumption, eliminating the need for expensive GPUs.&lt;/li>
&lt;li>&lt;strong>Cross-Platform:&lt;/strong> Supports multiple platforms including macOS, Linux, Windows, Docker, Android, and iOS.&lt;/li>
&lt;li>&lt;strong>Open Ecosystem:&lt;/strong> Features an active community and rich ecosystem, including Python bindings, UI tools, and OpenAI-compatible servers.&lt;/li>
&lt;li>&lt;strong>Continuous Innovation:&lt;/strong> Quickly follows and implements the latest model architectures and inference optimization techniques.&lt;/li>
&lt;/ul>
&lt;h2 id="2-core-concepts">2. Core Concepts&lt;/h2>
&lt;h3 id="21-gguf-model-format">2.1. GGUF Model Format&lt;/h3>
&lt;p>GGUF (Georgi Gerganov Universal Format) is the core model file format used by &lt;code>llama.cpp&lt;/code>, an evolution of its predecessor GGML. GGUF is a binary format designed for fast loading and memory mapping.&lt;/p>
&lt;p>&lt;strong>Key Features:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Unified File:&lt;/strong> Packages model metadata, vocabulary, and all tensors (weights) in a single file.&lt;/li>
&lt;li>&lt;strong>Extensibility:&lt;/strong> Allows adding new metadata without breaking compatibility.&lt;/li>
&lt;li>&lt;strong>Backward Compatibility:&lt;/strong> Guarantees compatibility with older versions of GGUF models.&lt;/li>
&lt;li>&lt;strong>Memory Efficiency:&lt;/strong> Supports memory mapping (mmap), allowing multiple processes to share the same model weights, thereby saving memory.&lt;/li>
&lt;/ul>
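&lt;p>Per the GGUF specification, every file begins with the 4-byte magic &lt;code>GGUF&lt;/code> followed by a little-endian &lt;code>uint32&lt;/code> version. A minimal header check (this sketch writes a fake 8-byte header to a temporary file purely for demonstration; it is not a valid model file):&lt;/p>

```python
import struct
import tempfile

def read_gguf_header(path):
    """Return (is_gguf, version) from the first 8 bytes of a file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            return False, None
        version = struct.unpack("<I", f.read(4))[0]
    return True, version

# Demonstrate on a fake header (magic + version 3):
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as f:
    f.write(b"GGUF" + struct.pack("<I", 3))
    path = f.name

print(read_gguf_header(path))
```
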
&lt;h3 id="22-quantization">2.2. Quantization&lt;/h3>
&lt;p>Quantization is one of the core advantages of &lt;code>llama.cpp&lt;/code>. It is a technique that converts model weights from high-precision floating-point numbers (such as 32-bit or 16-bit) to low-precision integers (such as 4-bit, 5-bit, or 8-bit).&lt;/p>
&lt;p>&lt;strong>Main Benefits:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Reduced Model Size:&lt;/strong> Significantly reduces the size of model files, making them easier to distribute and store.&lt;/li>
&lt;li>&lt;strong>Lower Memory Usage:&lt;/strong> Reduces the RAM required to load the model into memory.&lt;/li>
&lt;li>&lt;strong>Faster Inference:&lt;/strong> Low-precision calculations are typically faster than high-precision ones, especially on CPUs.&lt;/li>
&lt;/ul>
&lt;p>&lt;code>llama.cpp&lt;/code> supports various quantization methods, particularly &lt;strong>k-quants&lt;/strong>, an advanced quantization technique that achieves extremely high compression rates while maintaining high model performance.&lt;/p>
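&lt;p>The memory savings can be estimated directly from the bit width. A back-of-the-envelope calculation for a 7B-parameter model (weights only; real GGUF files are slightly larger due to per-block scales and metadata):&lt;/p>

```python
# Rough weights-only footprint of a 7B-parameter model at several precisions.
params = 7e9
for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q5_K", 5), ("Q4_K", 4)]:
    gib = params * bits / 8 / 2**30  # bytes -> GiB
    print(f"{name}: ~{gib:.1f} GiB")
```
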
&lt;h3 id="23-multimodal-support">2.3. Multimodal Support&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> is not limited to text models; it has evolved into a powerful multimodal inference engine that supports processing text, images, and even audio simultaneously.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Supported Models:&lt;/strong> Supports various mainstream multimodal models such as LLaVA, MobileVLM, Granite, Qwen2.5 Omni, InternVL, SmolVLM, etc.&lt;/li>
&lt;li>&lt;strong>Working Principle:&lt;/strong> Typically converts images into embedding vectors through a vision encoder (such as CLIP), and then inputs these vectors along with text embedding vectors into the LLM.&lt;/li>
&lt;li>&lt;strong>Tools:&lt;/strong> &lt;code>llama-mtmd-cli&lt;/code> and &lt;code>llama-server&lt;/code> provide native support for multimodal models.&lt;/li>
&lt;/ul>
&lt;h2 id="3-usage-methods">3. Usage Methods&lt;/h2>
&lt;h3 id="31-compilation">3.1. Compilation&lt;/h3>
&lt;p>Compiling &lt;code>llama.cpp&lt;/code> from source is very simple.&lt;/p>
&lt;pre>&lt;code class="language-bash">git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
make
&lt;/code>&lt;/pre>
&lt;p>For specific hardware acceleration (such as CUDA or Metal), use the corresponding compilation options:&lt;/p>
&lt;pre>&lt;code class="language-bash"># For CUDA
make LLAMA_CUDA=1
# For Metal (on macOS)
make LLAMA_METAL=1
&lt;/code>&lt;/pre>
&lt;h3 id="32-basic-inference">3.2. Basic Inference&lt;/h3>
&lt;p>After compilation, you can use the &lt;code>llama-cli&lt;/code> tool for inference.&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m ./models/7B/ggml-model-q4_0.gguf -p &amp;quot;Building a website can be done in 10 simple steps:&amp;quot; -n 400
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;code>-m&lt;/code>: Specifies the path to the GGUF model file.&lt;/li>
&lt;li>&lt;code>-p&lt;/code>: Specifies the prompt.&lt;/li>
&lt;li>&lt;code>-n&lt;/code>: Specifies the maximum number of tokens to generate.&lt;/li>
&lt;/ul>
&lt;h3 id="33-openai-compatible-server">3.3. OpenAI Compatible Server&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> provides a built-in HTTP server with an API compatible with OpenAI's API. This makes it easy to integrate with existing tools like LangChain and LlamaIndex.&lt;/p>
&lt;p>Starting the server:&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-server -m models/7B/ggml-model-q4_0.gguf -c 4096
&lt;/code>&lt;/pre>
&lt;p>You can then send requests to &lt;code>http://localhost:8080/v1/chat/completions&lt;/code> just like you would with the OpenAI API.&lt;/p>
&lt;h2 id="4-advanced-features">4. Advanced Features&lt;/h2>
&lt;h3 id="41-speculative-decoding">4.1. Speculative Decoding&lt;/h3>
&lt;p>This is an advanced inference optimization technique that significantly accelerates generation speed by using a small &amp;ldquo;draft&amp;rdquo; model to predict the output of the main model.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Working Principle:&lt;/strong> The draft model quickly generates a draft token sequence, which is then validated all at once by the main model. If validated, it saves the time of generating tokens one by one.&lt;/li>
&lt;li>&lt;strong>Usage:&lt;/strong> Use the &lt;code>--model-draft&lt;/code> (&lt;code>-md&lt;/code>) parameter in &lt;code>llama-server&lt;/code> or the &lt;code>llama-speculative&lt;/code> example to specify a small, fast draft model.&lt;/li>
&lt;/ul>
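&lt;p>The acceptance step can be illustrated with a toy greedy version. Real speculative decoding uses a probabilistic acceptance test over the two models' token distributions; this sketch only shows why matching draft tokens come &amp;ldquo;for free&amp;rdquo;:&lt;/p>

```python
# Toy greedy illustration of the verify step in speculative decoding:
# the target model accepts the longest prefix of the draft that matches
# its own greedy choices, then contributes one token of its own.
# Assumes target_greedy is at least one token longer than draft_tokens.
def verify(draft_tokens, target_greedy):
    accepted = []
    for d, t in zip(draft_tokens, target_greedy):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # first mismatch: take the target's token and stop
            break
    else:
        # all draft tokens accepted; target contributes the next token for free
        accepted.append(target_greedy[len(draft_tokens)])
    return accepted

print(verify(["The", "cat", "sat"], ["The", "cat", "slept", "on"]))
```
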
&lt;h3 id="42-lora-support">4.2. LoRA Support&lt;/h3>
&lt;p>LoRA (Low-Rank Adaptation) allows fine-tuning a model's behavior by training a small adapter without modifying the original model weights. &lt;code>llama.cpp&lt;/code> supports loading one or more LoRA adapters during inference.&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m base-model.gguf --lora lora-adapter.gguf
&lt;/code>&lt;/pre>
&lt;p>You can even set different weights for different LoRA adapters:&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m base.gguf --lora-scaled lora_A.gguf 0.5 --lora-scaled lora_B.gguf 0.5
&lt;/code>&lt;/pre>
&lt;h3 id="43-grammars">4.3. Grammars&lt;/h3>
&lt;p>Grammars are a very powerful feature that allows you to force the model's output to follow a specific format, such as a strict JSON schema.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Format:&lt;/strong> Uses a format called GBNF (GGML BNF) to define grammar rules.&lt;/li>
&lt;li>&lt;strong>Application:&lt;/strong> By providing GBNF rules through the &lt;code>grammar&lt;/code> parameter in API requests, you can ensure that the model returns correctly formatted, directly parsable JSON data, avoiding output format errors and tedious post-processing.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example:&lt;/strong> Using a Pydantic model to generate a JSON Schema, then converting it to GBNF to ensure the model output conforms to the expected Python object structure.&lt;/p>
&lt;pre>&lt;code class="language-python">import json
from typing import List
from pydantic import BaseModel
class QAPair(BaseModel):
    question: str
    answer: str

class Summary(BaseModel):
    key_facts: List[str]
    qa_pairs: List[QAPair]

# Generate JSON Schema and print
schema = Summary.model_json_schema()
print(json.dumps(schema, indent=2))
&lt;/code>&lt;/pre>
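&lt;p>For reference, a hand-written GBNF grammar has the following shape. This illustrative rule set restricts output to a tiny yes/no JSON object (see the &lt;code>grammars/&lt;/code> directory in the llama.cpp repository for complete examples):&lt;/p>

```
root   ::= "{" ws "\"answer\":" ws answer ws "}"
answer ::= "\"yes\"" | "\"no\""
ws     ::= [ \t\n]*
```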
&lt;h2 id="5-ecosystem">5. Ecosystem&lt;/h2>
&lt;p>The success of &lt;code>llama.cpp&lt;/code> has spawned a vibrant ecosystem:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://github.com/abetlen/llama-cpp-python">llama-cpp-python&lt;/a>:&lt;/strong> The most popular Python binding, providing interfaces to almost all features of &lt;code>llama.cpp&lt;/code> and deeply integrated with frameworks like LangChain and LlamaIndex.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://ollama.com/">Ollama&lt;/a>:&lt;/strong> A tool for packaging, distributing, and running models, using &lt;code>llama.cpp&lt;/code> under the hood, greatly simplifying the process of running LLMs locally.&lt;/li>
&lt;li>&lt;strong>Numerous UI Tools:&lt;/strong> The community has developed a large number of graphical interface tools, allowing non-technical users to easily interact with local models.&lt;/li>
&lt;/ul>
&lt;h2 id="6-conclusion">6. Conclusion&lt;/h2>
&lt;p>&lt;code>llama.cpp&lt;/code> is not just an inference engine; it has become a key force in driving the localization and popularization of LLMs. Through its excellent performance, highly optimized resource usage, and continuously expanding feature set (such as multimodality and grammar constraints), &lt;code>llama.cpp&lt;/code> provides developers and researchers with a powerful and flexible platform, enabling them to explore and deploy AI applications on various devices, ushering in a new era of low-cost, privacy-protecting local AI.&lt;/p></description></item><item><title>vLLM Technical Guide: High-Performance LLM Inference Engine</title><link>https://ziyanglin.netlify.app/en/post/vllm-documentation/</link><pubDate>Thu, 26 Jun 2025 01:05:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/vllm-documentation/</guid><description>&lt;h2 id="1-introduction-to-vllm">1. Introduction to vLLM&lt;/h2>
&lt;p>vLLM is an open-source inference and serving engine designed for large language models (LLMs), renowned for its high throughput and memory efficiency. In the field of LLM serving, vLLM addresses a core pain point: traditional inference systems are inefficient when handling the key-value cache (KV Cache) in Transformer models&amp;rsquo; attention mechanism, resulting in significant memory waste and limited inference speed.&lt;/p>
&lt;p>The memory bottleneck in LLM inference primarily stems from the KV Cache. This cache stores attention keys and values for each previous token in a sequence to accelerate the generation of subsequent tokens. However, the size of the KV Cache is dynamic and difficult to predict, creating enormous challenges for memory management. Traditional systems (like HuggingFace Transformers) typically pre-allocate a large contiguous memory space to store the KV Cache, leading to severe memory fragmentation and waste.&lt;/p>
&lt;p>vLLM fundamentally solves this problem by introducing its core innovation: the &lt;strong>PagedAttention&lt;/strong> mechanism.&lt;/p>
&lt;h2 id="2-core-features-and-advantages">2. Core Features and Advantages&lt;/h2>
&lt;p>vLLM stands out among numerous LLM inference frameworks thanks to several key features:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Extremely High Throughput&lt;/strong>: Through PagedAttention and Continuous Batching, vLLM significantly improves GPU utilization. Its throughput is several times higher than HuggingFace Transformers and outperforms other mainstream inference libraries.&lt;/li>
&lt;li>&lt;strong>Efficient Memory Management&lt;/strong>: The PagedAttention mechanism divides the KV Cache into fixed-size, non-contiguous memory blocks, greatly reducing internal and external memory fragmentation. According to official data, it can save up to 55% of memory, meaning you can load larger models or serve more concurrent requests with the same hardware.&lt;/li>
&lt;li>&lt;strong>Flexible Decoding Strategies&lt;/strong>: vLLM supports various complex decoding algorithms, including Parallel Sampling, Beam Search, and Top-K/Top-P sampling, meeting the needs of different application scenarios.&lt;/li>
&lt;li>&lt;strong>OpenAI API Compatibility&lt;/strong>: vLLM provides a service endpoint that is fully compatible with the OpenAI API. This means you can seamlessly integrate vLLM into existing application ecosystems built on the OpenAI API with just a few configuration changes.&lt;/li>
&lt;li>&lt;strong>Distributed Inference&lt;/strong>: For ultra-large models that cannot fit on a single GPU, vLLM supports Tensor Parallelism, distributing model weights and computational load across multiple GPUs for efficient distributed inference.&lt;/li>
&lt;li>&lt;strong>Streaming and Structured Output&lt;/strong>: Supports streaming of generated tokens and can produce structured outputs in specific formats (such as JSON Schema or regular expressions) through Guided Generation.&lt;/li>
&lt;/ul>
&lt;h2 id="3-core-architecture-deep-dive-into-pagedattention">3. Core Architecture: Deep Dive into PagedAttention&lt;/h2>
&lt;p>PagedAttention is the soul of vLLM, with its design inspiration coming from the paging technique used in modern operating systems to manage virtual memory.&lt;/p>
&lt;h3 id="31-working-principle">3.1 Working Principle&lt;/h3>
&lt;p>In traditional methods, the KV Cache for each sequence is stored in contiguous memory space. While this approach seems simple, it leads to severe memory fragmentation due to the vast differences in sequence lengths.&lt;/p>
&lt;p>PagedAttention divides each sequence's KV Cache into fixed-size &lt;strong>blocks&lt;/strong>. Each block can store keys and values for a fixed number of tokens. During inference, vLLM's core scheduler dynamically allocates these blocks to sequences as needed.&lt;/p>
&lt;p>The advantages of this design include:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Minimal Internal Fragmentation&lt;/strong>: Since blocks are of fixed size, only a sequence's last block can contain unused space, and this waste is far less than that caused by reserving contiguous memory for the entire sequence.&lt;/li>
&lt;li>&lt;strong>Flexible Memory Allocation&lt;/strong>: Blocks are stored in non-contiguous memory space, making memory management more flexible, similar to how operating systems manage physical memory pages.&lt;/li>
&lt;li>&lt;strong>Efficient Memory Sharing&lt;/strong>: PagedAttention makes sharing KV Cache between different sequences exceptionally simple and efficient. For example, in parallel sampling or beam search, multiple candidate sequences originate from the same prompt. vLLM allows these sequences to share KV blocks storing the prompt portion, only needing to allocate new, independent blocks for each sequence when generating new tokens. This &amp;ldquo;Copy-on-Write&amp;rdquo; mechanism greatly reduces the memory overhead of complex decoding algorithms.&lt;/li>
&lt;/ol>
&lt;p>Below is a Mermaid diagram that more intuitively illustrates PagedAttention's memory management approach:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph Physical_Memory [KV Cache Physical Memory]
direction LR
B1(Block 1)
B2(Block 2)
B3(Block 3)
B4(Block 4)
B5(Block 5)
B6(Block 6)
B7(Block 7)
B8(Block 8)
end
subgraph Logical_View [Sequence Logical View]
direction TB
subgraph Seq1 [Sequence 1]
P1(Prompt) --&amp;gt; T1(Token 1)
end
subgraph Seq2 [Sequence 2]
P2(Prompt) --&amp;gt; T2(Token 1) --&amp;gt; T3(Token 2)
end
subgraph Seq3 [Parallel Sampling]
P3(Prompt) --&amp;gt; T4(Token 1a)
P3 --&amp;gt; T5(Token 1b)
end
end
subgraph Block_Table [Block Table]
direction TB
Map1[&amp;quot;Seq 1: [B1, B5]&amp;quot;]
Map2[&amp;quot;Seq 2: [B2, B6, B8]&amp;quot;]
Map3[&amp;quot;Seq 3a: [B3, B7]&amp;quot;]
Map4[&amp;quot;Seq 3b: [B3, B4]&amp;quot;]
end
Seq1 --&amp;gt; Map1
Seq2 --&amp;gt; Map2
Seq3 --&amp;gt; Map3
Seq3 --&amp;gt; Map4
Map1 --&amp;gt; B1
Map1 --&amp;gt; B5
Map2 --&amp;gt; B2
Map2 --&amp;gt; B6
Map2 --&amp;gt; B8
Map3 --&amp;gt; B3
Map3 --&amp;gt; B7
Map4 --&amp;gt; B3
Map4 --&amp;gt; B4
style B3 fill:#f9f,stroke:#333,stroke-width:2px
linkStyle 14 stroke-width:2px,stroke:green,fill:none;
linkStyle 16 stroke-width:2px,stroke:green,fill:none;
&lt;/code>&lt;/pre>
&lt;p>&lt;em>Diagram explanation:&lt;/em>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>KV Cache Physical Memory&lt;/strong>: Represents non-contiguous physical memory blocks on the GPU.&lt;/li>
&lt;li>&lt;strong>Sequence Logical View&lt;/strong>: Represents multiple requests (sequences) being processed.&lt;/li>
&lt;li>&lt;strong>Block Table&lt;/strong>: vLLM's core component that maps logical token positions to physical memory blocks.&lt;/li>
&lt;li>&lt;strong>Memory Sharing&lt;/strong>: Note that the two branches in &amp;ldquo;Parallel Sampling&amp;rdquo; (3a and 3b) share the same Prompt block (B3), demonstrating PagedAttention's efficient memory sharing.&lt;/li>
&lt;/ul>
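&lt;p>To make the block table concrete, here is a toy Python sketch (illustrative only, not vLLM's actual implementation) of a paged allocator with copy-on-write sharing: forking a sequence for parallel sampling shares the prompt blocks by bumping reference counts, and each branch allocates fresh blocks only for its newly generated tokens:&lt;/p>
&lt;pre>&lt;code class="language-python">BLOCK_SIZE = 4  # tokens per physical block (vLLM's default is larger, e.g. 16)

class BlockManager:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # ids of free physical blocks
        self.refcount = {}                   # block id to number of users
        self.tables = {}                     # sequence id to list of block ids

    def allocate(self, seq_id, num_tokens):
        # Reserve just enough fixed-size blocks for the given token count
        needed = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        blocks = [self.free.pop() for _ in range(needed)]
        for b in blocks:
            self.refcount[b] = 1
        self.tables[seq_id] = blocks

    def fork(self, parent_id, child_id):
        # Share the parent's blocks (e.g. a common prompt) instead of copying
        self.tables[child_id] = list(self.tables[parent_id])
        for b in self.tables[child_id]:
            self.refcount[b] += 1

    def append_block(self, seq_id):
        # Copy-on-write in spirit: a diverging branch writes to its own block
        b = self.free.pop()
        self.refcount[b] = 1
        self.tables[seq_id].append(b)

mgr = BlockManager(num_blocks=8)
mgr.allocate('seq-a', num_tokens=6)  # a 6-token prompt needs 2 blocks
mgr.fork('seq-a', 'seq-b')           # a parallel sample shares those 2 blocks
mgr.append_block('seq-a')            # each branch then gets its own block
mgr.append_block('seq-b')
shared = set(mgr.tables['seq-a']).intersection(mgr.tables['seq-b'])
print(len(shared))  # 2: the prompt blocks are shared, the new ones are not
&lt;/code>&lt;/pre>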
&lt;h3 id="32-continuous-batching">3.2 Continuous Batching&lt;/h3>
&lt;p>Based on PagedAttention, vLLM implements a more advanced batching strategy—continuous batching. Traditional static batching requires waiting for all sequences in a batch to complete generation before processing the next batch. Continuous batching, however, allows new requests to be inserted into the batch immediately after a sequence in the batch completes generation, avoiding GPU idle waiting and further improving throughput.&lt;/p>
&lt;p>Below is a comparison of the two batching methods using a Mermaid sequence diagram:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant C as Client
participant S as Server
participant G as GPU
note over C, G: --- Static Batching ---
C-&amp;gt;&amp;gt;S: Request [R1, R2, R3, R4]
S-&amp;gt;&amp;gt;G: Process Batch 1 [R1, R2, R3, R4]
note right of G: All requests process in parallel
G--&amp;gt;&amp;gt;S: Batch 1 Finished
note right of S: Wait for the entire batch to complete
S--&amp;gt;&amp;gt;C: Response [O1, O2, O3, O4]
C-&amp;gt;&amp;gt;S: Request [R5, R6]
S-&amp;gt;&amp;gt;G: Process Batch 2 [R5, R6]
note over C, G: --- Continuous Batching ---
C-&amp;gt;&amp;gt;S: Request [R1, R2, R3, R4]
S-&amp;gt;&amp;gt;G: Process [R1, R2, R3, R4]
G--&amp;gt;&amp;gt;S: R2 Finished
S--&amp;gt;&amp;gt;C: Response O2
C-&amp;gt;&amp;gt;S: New Request R5
S-&amp;gt;&amp;gt;G: Add R5 to queue (GPU is not idle)
note right of G: R1, R3, R4, R5 are now processing
G--&amp;gt;&amp;gt;S: R4 Finished
S--&amp;gt;&amp;gt;C: Response O4
&lt;/code>&lt;/pre>
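&lt;p>The difference can be quantified with a toy scheduler simulation (illustrative only, not vLLM code): each request needs a fixed number of decode steps, the batch holds up to four sequences, and we count how many GPU steps it takes to drain the queue under each policy:&lt;/p>
&lt;pre>&lt;code class="language-python">def static_batching(jobs, batch_size=4):
    # The whole batch is held until its slowest member finishes
    steps = 0
    queue = list(jobs)
    while queue:
        batch = [queue.pop(0) for _ in range(min(batch_size, len(queue)))]
        steps += max(batch)
    return steps

def continuous_batching(jobs, batch_size=4):
    # Freed slots are refilled from the queue on every step
    steps = 0
    queue = list(jobs)
    running = []
    while queue or running:
        while queue and len(running) != batch_size:
            running.append(queue.pop(0))
        steps += 1
        running = [r - 1 for r in running]        # one decode step for everyone
        running = [r for r in running if r != 0]  # finished requests leave
    return steps

jobs = [2, 8, 3, 8, 2, 2]  # decode steps needed by six requests
print(static_batching(jobs), continuous_batching(jobs))  # 10 8
&lt;/code>&lt;/pre>
&lt;p>Even in this tiny example the continuous policy finishes earlier because short requests stop blocking on long ones; real workloads with highly variable output lengths amplify the gap.&lt;/p>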
&lt;h2 id="4-quick-start-guide">4. Quick Start Guide&lt;/h2>
&lt;p>Below, we'll demonstrate how to install and use vLLM through a few simple steps.&lt;/p>
&lt;h3 id="41-installation">4.1 Installation&lt;/h3>
&lt;p>You can install vLLM using either &lt;code>pip&lt;/code> or &lt;code>uv&lt;/code> (a faster package installation tool). Using &lt;code>uv&lt;/code> is recommended as it can automatically detect your CUDA version and install the matching PyTorch backend.&lt;/p>
&lt;p>&lt;strong>Using uv (recommended):&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash"># Create and activate a virtual environment
uv venv
source .venv/bin/activate
# Install vLLM
uv pip install vllm --torch-backend=auto
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Using pip:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install vllm
&lt;/code>&lt;/pre>
&lt;h3 id="42-offline-inference">4.2 Offline Inference&lt;/h3>
&lt;p>The &lt;code>vllm.LLM&lt;/code> class makes offline inference very convenient.&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM, SamplingParams
# Define input prompts
prompts = [
&amp;quot;Hello, my name is&amp;quot;,
&amp;quot;The capital of France is&amp;quot;,
&amp;quot;The future of AI is&amp;quot;,
]
# Define sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Initialize the LLM engine (model will be automatically downloaded from Hugging Face)
llm = LLM(model=&amp;quot;facebook/opt-125m&amp;quot;)
# Generate text
outputs = llm.generate(prompts, sampling_params)
# Print results
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f&amp;quot;Prompt: {prompt!r}, Generated text: {generated_text!r}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h3 id="43-launching-an-openaicompatible-server">4.3 Launching an OpenAI-Compatible Server&lt;/h3>
&lt;p>One of vLLM's most powerful features is its built-in API server. With just one command, you can start a service compatible with the OpenAI API.&lt;/p>
&lt;pre>&lt;code class="language-bash">vllm serve Qwen/Qwen2.5-1.5B-Instruct
&lt;/code>&lt;/pre>
&lt;p>By default, the server will run on &lt;code>http://localhost:8000&lt;/code>.&lt;/p>
&lt;h3 id="44-interacting-with-the-server">4.4 Interacting with the Server&lt;/h3>
&lt;p>You can interact with the server using &lt;code>curl&lt;/code> or the &lt;code>openai&lt;/code> Python client.&lt;/p>
&lt;p>&lt;strong>Using curl:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">curl http://localhost:8000/v1/completions \
-H &amp;quot;Content-Type: application/json&amp;quot; \
-d '{
&amp;quot;model&amp;quot;: &amp;quot;Qwen/Qwen2.5-1.5B-Instruct&amp;quot;,
&amp;quot;prompt&amp;quot;: &amp;quot;San Francisco is a&amp;quot;,
&amp;quot;max_tokens&amp;quot;: 7,
&amp;quot;temperature&amp;quot;: 0
}'
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Using the OpenAI Python client:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from openai import OpenAI
client = OpenAI(
base_url=&amp;quot;http://localhost:8000/v1&amp;quot;,
api_key=&amp;quot;not-used&amp;quot; # API key is not required
)
completion = client.chat.completions.create(
model=&amp;quot;Qwen/Qwen2.5-1.5B-Instruct&amp;quot;,
messages=[
{&amp;quot;role&amp;quot;: &amp;quot;system&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;You are a helpful assistant.&amp;quot;},
{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Who won the world series in 2020?&amp;quot;}
]
)
print(completion.choices[0].message)
&lt;/code>&lt;/pre>
&lt;h2 id="5-model-serving">5. Model Serving&lt;/h2>
&lt;h3 id="51-distributed-serving">5.1 Distributed Serving&lt;/h3>
&lt;p>If a model is too large to fit on a single GPU, you can distribute it across multiple GPUs using tensor parallelism.&lt;/p>
&lt;pre>&lt;code class="language-bash"># Start a service on 4 GPUs
vllm serve facebook/opt-13b --tensor-parallel-size 4
&lt;/code>&lt;/pre>
&lt;h3 id="52-docker-deployment">5.2 Docker Deployment&lt;/h3>
&lt;p>vLLM provides official Docker images for convenient containerized deployment.&lt;/p>
&lt;pre>&lt;code class="language-bash">docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env &amp;quot;HUGGING_FACE_HUB_TOKEN=&amp;lt;your-hf-token&amp;gt;&amp;quot; \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-v0.1
&lt;/code>&lt;/pre>
&lt;h2 id="6-advanced-features">6. Advanced Features&lt;/h2>
&lt;h3 id="61-structured-outputs">6.1 Structured Outputs&lt;/h3>
&lt;p>vLLM supports various ways to constrain the model's output format, which is crucial for applications requiring reliable, parsable outputs.&lt;/p>
&lt;p>&lt;strong>Generating JSON using Pydantic models:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from pydantic import BaseModel
from openai import OpenAI
client = OpenAI(base_url=&amp;quot;http://localhost:8000/v1&amp;quot;, api_key=&amp;quot;dummy&amp;quot;)
model = client.models.list().data[0].id
class People(BaseModel):
name: str
age: int
completion = client.chat.completions.create(
model=model,
messages=[
{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Generate a JSON with the name and age of one random person.&amp;quot;}
],
response_format={
&amp;quot;type&amp;quot;: &amp;quot;json_schema&amp;quot;,
&amp;quot;json_schema&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;people&amp;quot;,
&amp;quot;schema&amp;quot;: People.model_json_schema()
}
},
)
print(completion.choices[0].message.content)
&lt;/code>&lt;/pre>
&lt;h3 id="62-lora-support">6.2 LoRA Support&lt;/h3>
&lt;p>vLLM can efficiently serve multiple LoRA adapters on the same base model. This is particularly useful for scenarios requiring customized models for different customers or tasks.&lt;/p>
&lt;p>&lt;strong>Enabling LoRA support in the engine:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM
llm = LLM(model=&amp;quot;meta-llama/Llama-2-7b-hf&amp;quot;, enable_lora=True)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Specifying a LoRA adapter in a request (the &lt;code>model&lt;/code> field names the adapter, here &lt;code>sql-lora&lt;/code>):&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">curl http://localhost:8000/v1/completions \
-H &amp;quot;Content-Type: application/json&amp;quot; \
-d '{
&amp;quot;model&amp;quot;: &amp;quot;sql-lora&amp;quot;,
&amp;quot;prompt&amp;quot;: &amp;quot;San Francisco is a&amp;quot;,
&amp;quot;max_tokens&amp;quot;: 7
}'
&lt;/code>&lt;/pre>
&lt;h3 id="63-quantization">6.3 Quantization&lt;/h3>
&lt;p>Quantization is a technique to reduce model size and memory usage by lowering the precision of model weights. vLLM supports various quantization schemes, such as AWQ and FP8 KV cache.&lt;/p>
&lt;p>&lt;strong>Enabling FP8 KV cache:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM
llm = LLM(
model=&amp;quot;meta-llama/Llama-2-7b-chat-hf&amp;quot;,
kv_cache_dtype=&amp;quot;fp8&amp;quot;,
calculate_kv_scales=True # Dynamically calculate quantization scales
)
&lt;/code>&lt;/pre>
&lt;h2 id="7-framework-integration">7. Framework Integration&lt;/h2>
&lt;p>vLLM can be easily integrated with popular LLM application frameworks like LangChain and LlamaIndex for building complex systems such as Retrieval-Augmented Generation (RAG). Typically, vLLM serves as a backend providing fast LLM inference and embedding generation services.&lt;/p>
&lt;p>&lt;strong>Installing related dependencies:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install -U vllm langchain_openai langchain_community
&lt;/code>&lt;/pre>
&lt;p>Afterward, in LangChain, you can point the &lt;code>base_url&lt;/code> of &lt;code>ChatOpenAI&lt;/code> or &lt;code>OpenAIEmbeddings&lt;/code> to your vLLM server's address to complete the integration.&lt;/p>
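&lt;p>As a hedged configuration sketch (assuming &lt;code>langchain_openai&lt;/code> is installed and a vLLM server is already running locally on port 8000, as started in section 4.3), the wiring looks like this:&lt;/p>
&lt;pre>&lt;code class="language-python">from langchain_openai import ChatOpenAI

# Point LangChain's OpenAI-compatible client at the local vLLM server
llm = ChatOpenAI(
    base_url='http://localhost:8000/v1',
    api_key='not-used',  # vLLM does not validate the key by default
    model='Qwen/Qwen2.5-1.5B-Instruct',
)
# response = llm.invoke('Summarize PagedAttention in one sentence.')
&lt;/code>&lt;/pre>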
&lt;h2 id="8-conclusion">8. Conclusion&lt;/h2>
&lt;p>Through its innovative PagedAttention architecture, vLLM successfully addresses memory management and performance bottlenecks in LLM inference, providing developers with an extremely efficient, flexible, and easy-to-use inference serving engine. Whether conducting quick offline experiments or deploying production-grade, high-concurrency LLM services, vLLM demonstrates excellent performance and powerful functionality. As the community continues to develop, vLLM is becoming one of the standard tools in the field of LLM serving.&lt;/p></description></item><item><title>WebRTC Technical Guide: Web-Based Real-Time Communication Framework</title><link>https://ziyanglin.netlify.app/en/post/webrtc-documentation/</link><pubDate>Thu, 26 Jun 2025 01:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/webrtc-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>WebRTC (Web Real-Time Communication) is an open-source technology that enables real-time voice and video communication in web browsers. It allows direct peer-to-peer (P2P) audio, video, and data sharing between browsers without requiring any plugins or third-party software.&lt;/p>
&lt;p>The main goal of WebRTC is to provide high-quality, low-latency real-time communication, making it easy for developers to build rich communication features into web applications.&lt;/p>
&lt;h3 id="core-advantages">Core Advantages&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Cross-platform and browser compatibility&lt;/strong>: WebRTC is an open standard by W3C and IETF, widely supported by major browsers (Chrome, Firefox, Safari, Edge).&lt;/li>
&lt;li>&lt;strong>No plugins required&lt;/strong>: Users can use real-time communication features directly in their browsers without downloading or installing any extensions.&lt;/li>
&lt;li>&lt;strong>Peer-to-peer communication&lt;/strong>: When possible, data is transmitted directly between users, reducing server bandwidth pressure and latency.&lt;/li>
&lt;li>&lt;strong>High security&lt;/strong>: All WebRTC communications are mandatorily encrypted (via SRTP and DTLS), ensuring data confidentiality and integrity.&lt;/li>
&lt;li>&lt;strong>High-quality audio and video&lt;/strong>: WebRTC includes advanced signal processing components like echo cancellation, noise suppression, and automatic gain control to provide excellent audio/video quality.&lt;/li>
&lt;/ul>
&lt;h2 id="2-core-concepts">2. Core Concepts&lt;/h2>
&lt;p>WebRTC consists of several key JavaScript APIs that work together to enable real-time communication.&lt;/p>
&lt;h3 id="21-rtcpeerconnection">2.1. &lt;code>RTCPeerConnection&lt;/code>&lt;/h3>
&lt;p>&lt;code>RTCPeerConnection&lt;/code> is the core interface of WebRTC, responsible for establishing and managing connections between two peers. Its main responsibilities include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Media negotiation&lt;/strong>: Handling parameters for audio/video codecs, resolution, etc.&lt;/li>
&lt;li>&lt;strong>Network path discovery&lt;/strong>: Finding the best connection path through the ICE framework.&lt;/li>
&lt;li>&lt;strong>Connection maintenance&lt;/strong>: Managing the connection lifecycle, including establishment, maintenance, and closure.&lt;/li>
&lt;li>&lt;strong>Data transmission&lt;/strong>: Handling the actual transmission of audio/video streams (SRTP) and data channels (SCTP/DTLS).&lt;/li>
&lt;/ul>
&lt;p>An &lt;code>RTCPeerConnection&lt;/code> object represents a WebRTC connection from the local computer to a remote peer.&lt;/p>
&lt;h3 id="22-mediastream">2.2. &lt;code>MediaStream&lt;/code>&lt;/h3>
&lt;p>The &lt;code>MediaStream&lt;/code> API represents streams of media content. A &lt;code>MediaStream&lt;/code> object can contain one or more media tracks (&lt;code>MediaStreamTrack&lt;/code>), which can be:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Audio tracks (&lt;code>AudioTrack&lt;/code>)&lt;/strong>: Audio data from a microphone.&lt;/li>
&lt;li>&lt;strong>Video tracks (&lt;code>VideoTrack&lt;/code>)&lt;/strong>: Video data from a camera.&lt;/li>
&lt;/ul>
&lt;p>Developers typically use the &lt;code>navigator.mediaDevices.getUserMedia()&lt;/code> method to obtain a local &lt;code>MediaStream&lt;/code>, which prompts the user to authorize access to their camera and microphone. The obtained stream can then be added to an &lt;code>RTCPeerConnection&lt;/code> for transmission to the remote peer.&lt;/p>
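&lt;p>In the browser (this sketch runs in a page context, not in Node.js), capturing local media and attaching it to a connection looks like this:&lt;/p>
&lt;pre>&lt;code class="language-javascript">// Prompt the user for camera/microphone access and send the tracks
async function startLocalMedia(peerConnection) {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: true,
    video: { width: 1280, height: 720 },
  });
  // Each MediaStreamTrack is added to the RTCPeerConnection individually
  for (const track of stream.getTracks()) {
    peerConnection.addTrack(track, stream);
  }
  return stream; // can also be assigned to a video element's srcObject
}
&lt;/code>&lt;/pre>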
&lt;h3 id="23-rtcdatachannel">2.3. &lt;code>RTCDataChannel&lt;/code>&lt;/h3>
&lt;p>In addition to audio and video, WebRTC supports the transmission of arbitrary binary data between peers through the &lt;code>RTCDataChannel&lt;/code> API. This provides powerful functionality for:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>File sharing&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Real-time text chat&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Online game state synchronization&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Remote desktop control&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>The &lt;code>RTCDataChannel&lt;/code> API is designed similarly to WebSockets, offering reliable and unreliable, ordered and unordered transmission modes that developers can choose based on application requirements. It uses the SCTP protocol (Stream Control Transmission Protocol) for transmission and is encrypted via DTLS.&lt;/p>
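&lt;p>A browser-side sketch: the same connection can carry a reliable, ordered channel for chat and an unreliable one for frequent state updates where only the latest snapshot matters:&lt;/p>
&lt;pre>&lt;code class="language-javascript">const pc = new RTCPeerConnection();

// Defaults are ordered and reliable, suitable for chat messages
const chat = pc.createDataChannel('chat');
chat.onmessage = function (event) { console.log('chat:', event.data); };

// Unordered and lossy: dropped packets are not retransmitted
const state = pc.createDataChannel('game-state', {
  ordered: false,
  maxRetransmits: 0,
});
state.onopen = function () {
  state.send(JSON.stringify({ x: 10, y: 20 }));
};
&lt;/code>&lt;/pre>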
&lt;h2 id="3-connection-process-in-detail">3. Connection Process in Detail&lt;/h2>
&lt;p>Establishing a WebRTC connection is a complex multi-stage process involving signaling, session description, and network path discovery.&lt;/p>
&lt;h3 id="31-signaling">3.1. Signaling&lt;/h3>
&lt;p>Interestingly, the WebRTC API itself does not include a signaling mechanism. Signaling is the process of exchanging metadata between peers before establishing communication. Developers must choose or implement their own signaling channel. Common technologies include WebSocket or XMLHttpRequest.&lt;/p>
&lt;p>The signaling server acts as an intermediary, helping two clients who want to communicate exchange three types of information:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Session control messages&lt;/strong>: Used to open or close communication.&lt;/li>
&lt;li>&lt;strong>Network configuration&lt;/strong>: Information about the client's IP address and port.&lt;/li>
&lt;li>&lt;strong>Media capabilities&lt;/strong>: Codecs and resolutions supported by the client.&lt;/li>
&lt;/ol>
&lt;p>This process typically follows these steps:&lt;/p>
&lt;ol>
&lt;li>Client A sends a &amp;ldquo;request call&amp;rdquo; message to the signaling server.&lt;/li>
&lt;li>The signaling server forwards this request to client B.&lt;/li>
&lt;li>Client B agrees to the call.&lt;/li>
&lt;li>Afterward, clients A and B exchange SDP and ICE candidates through the signaling server until they find a viable connection path.&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant ClientA as Client A
participant SignalingServer as Signaling Server
participant ClientB as Client B
ClientA-&amp;gt;&amp;gt;SignalingServer: Initiate call request (join room)
SignalingServer-&amp;gt;&amp;gt;ClientB: Forward call request
ClientB--&amp;gt;&amp;gt;SignalingServer: Accept call
SignalingServer--&amp;gt;&amp;gt;ClientA: B has joined
loop Offer/Answer &amp;amp; ICE Exchange
ClientA-&amp;gt;&amp;gt;SignalingServer: Send SDP Offer / ICE Candidate
SignalingServer-&amp;gt;&amp;gt;ClientB: Forward SDP Offer / ICE Candidate
ClientB-&amp;gt;&amp;gt;SignalingServer: Send SDP Answer / ICE Candidate
SignalingServer-&amp;gt;&amp;gt;ClientA: Forward SDP Answer / ICE Candidate
end
&lt;/code>&lt;/pre>
&lt;h3 id="32-session-description-protocol-sdp">3.2. Session Description Protocol (SDP)&lt;/h3>
&lt;p>SDP (Session Description Protocol) is a standard format for describing multimedia connection content. It doesn't transmit media data itself but describes the connection parameters. An SDP object includes:&lt;/p>
&lt;ul>
&lt;li>Session unique identifier and version.&lt;/li>
&lt;li>Media types (audio, video, data).&lt;/li>
&lt;li>Codecs used (e.g., VP8, H.264, Opus).&lt;/li>
&lt;li>Network transport information (IP addresses and ports).&lt;/li>
&lt;li>Bandwidth information.&lt;/li>
&lt;/ul>
&lt;p>WebRTC uses the &lt;strong>Offer/Answer model&lt;/strong> to exchange SDP information:&lt;/p>
&lt;ol>
&lt;li>The &lt;strong>Caller&lt;/strong> creates an &lt;strong>Offer&lt;/strong> SDP describing the communication parameters it desires and sends it to the receiver through the signaling server.&lt;/li>
&lt;li>The &lt;strong>Callee&lt;/strong> receives the Offer and creates an &lt;strong>Answer&lt;/strong> SDP describing the communication parameters it can support, sending it back to the caller through the signaling server.&lt;/li>
&lt;li>Once both parties accept each other's SDP, they have reached a consensus on the session parameters.&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant Caller
participant SignalingServer as Signaling Server
participant Callee
Caller-&amp;gt;&amp;gt;Caller: createOffer()
Caller-&amp;gt;&amp;gt;Caller: setLocalDescription(offer)
Caller-&amp;gt;&amp;gt;SignalingServer: Send Offer
SignalingServer-&amp;gt;&amp;gt;Callee: Forward Offer
Callee-&amp;gt;&amp;gt;Callee: setRemoteDescription(offer)
Callee-&amp;gt;&amp;gt;Callee: createAnswer()
Callee-&amp;gt;&amp;gt;Callee: setLocalDescription(answer)
Callee-&amp;gt;&amp;gt;SignalingServer: Send Answer
SignalingServer-&amp;gt;&amp;gt;Caller: Forward Answer
Caller-&amp;gt;&amp;gt;Caller: setRemoteDescription(answer)
&lt;/code>&lt;/pre>
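&lt;p>In browser JavaScript, the exchange above maps onto a handful of &lt;code>RTCPeerConnection&lt;/code> calls. In this sketch, &lt;code>signaling&lt;/code> is a placeholder for whatever channel you implemented (e.g. a WebSocket):&lt;/p>
&lt;pre>&lt;code class="language-javascript">// Caller side: create and send the Offer
async function makeCall(pc, signaling) {
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  signaling.send(JSON.stringify({ type: 'offer', sdp: offer.sdp }));
}

// Callee side: accept the Offer and reply with an Answer
async function handleOffer(pc, signaling, message) {
  await pc.setRemoteDescription({ type: 'offer', sdp: message.sdp });
  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);
  signaling.send(JSON.stringify({ type: 'answer', sdp: answer.sdp }));
}
&lt;/code>&lt;/pre>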
&lt;h3 id="33-interactive-connectivity-establishment-ice">3.3. Interactive Connectivity Establishment (ICE)&lt;/h3>
&lt;p>Since most devices are behind NAT (Network Address Translation) or firewalls and don't have public IP addresses, establishing direct P2P connections becomes challenging. ICE (Interactive Connectivity Establishment) is a framework specifically designed to solve this problem.&lt;/p>
&lt;p>The ICE workflow is as follows:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Gather candidate addresses&lt;/strong>: Each client collects its network address candidates from different sources:
&lt;ul>
&lt;li>&lt;strong>Local addresses&lt;/strong>: The device's IP address within the local network.&lt;/li>
&lt;li>&lt;strong>Server Reflexive Address&lt;/strong>: The device's public IP address and port discovered through a STUN server.&lt;/li>
&lt;li>&lt;strong>Relayed Address&lt;/strong>: A relay address obtained through a TURN server. When P2P direct connection fails, all data will be forwarded through the TURN server.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Exchange candidates&lt;/strong>: Clients exchange their collected ICE candidate lists through the signaling server.&lt;/li>
&lt;li>&lt;strong>Connectivity checks&lt;/strong>: Clients pair up the received candidate addresses and send STUN requests for connectivity checks (called &amp;ldquo;pings&amp;rdquo;) to determine which paths are available.&lt;/li>
&lt;li>&lt;strong>Select the best path&lt;/strong>: Once a viable address pair is found, the ICE agent selects it as the communication path and begins transmitting media data. P2P direct connection paths are typically prioritized because they have the lowest latency.&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph Client A
A1(Start) --&amp;gt; A2{Gather Candidates};
A2 --&amp;gt; A3[Local Address];
A2 --&amp;gt; A4[STUN Address];
A2 --&amp;gt; A5[TURN Address];
end
subgraph Client B
B1(Start) --&amp;gt; B2{Gather Candidates};
B2 --&amp;gt; B3[Local Address];
B2 --&amp;gt; B4[STUN Address];
B2 --&amp;gt; B5[TURN Address];
end
A2 --&amp;gt; C1((Signaling Server));
B2 --&amp;gt; C1;
C1 --&amp;gt; A6(Exchange Candidates);
C1 --&amp;gt; B6(Exchange Candidates);
A6 --&amp;gt; A7{Connectivity Checks};
B6 --&amp;gt; B7{Connectivity Checks};
A7 -- STUN Request --&amp;gt; B7;
B7 -- STUN Response --&amp;gt; A7;
A7 --&amp;gt; A8(Select Best Path);
B7 --&amp;gt; B8(Select Best Path);
A8 --&amp;gt; A9((P2P Connection Established));
B8 --&amp;gt; B9((P2P Connection Established));
&lt;/code>&lt;/pre>
&lt;h2 id="4-nat-traversal-stun-and-turn">4. NAT Traversal: STUN and TURN&lt;/h2>
&lt;p>To achieve P2P connections, WebRTC heavily relies on STUN and TURN servers to solve NAT-related issues.&lt;/p>
&lt;h3 id="41-stun-servers">4.1. STUN Servers&lt;/h3>
&lt;p>STUN (Session Traversal Utilities for NAT) servers are very lightweight, with a simple task: telling a client behind NAT what its public IP address and port are.&lt;/p>
&lt;p>When a WebRTC client sends a request to a STUN server, the server checks the source IP and port of the request and returns them to the client. This way, the client knows &amp;ldquo;what it looks like on the internet&amp;rdquo; and can share this public address as an ICE candidate with other peers.&lt;/p>
&lt;p>Using STUN servers is the preferred approach for establishing P2P connections because they are only needed during the connection establishment phase and don't participate in actual data transmission, resulting in minimal overhead.&lt;/p>
&lt;h3 id="42-turn-servers">4.2. TURN Servers&lt;/h3>
&lt;p>However, in some complex network environments (such as symmetric NAT), peers cannot establish direct connections even if they know their public addresses. This is where TURN (Traversal Using Relays around NAT) servers come in.&lt;/p>
&lt;p>A TURN server is a more powerful relay server. When P2P connection fails, both clients connect to the TURN server, which then forwards all audio, video, and data between them. This is no longer true P2P communication, but it ensures that connections can still be established under the worst network conditions.&lt;/p>
&lt;p>Using TURN servers increases latency and server bandwidth costs, so they are typically used as a last resort.&lt;/p>
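&lt;p>In practice, both server types are handed to the browser in the &lt;code>RTCPeerConnection&lt;/code> configuration. In this sketch the TURN URL and credentials are placeholders, and &lt;code>sendToSignalingServer&lt;/code> is a hypothetical helper for your own signaling channel:&lt;/p>
&lt;pre>&lt;code class="language-javascript">const pc = new RTCPeerConnection({
  iceServers: [
    // A public STUN server is enough for most home/office NATs
    { urls: 'stun:stun.l.google.com:19302' },
    // TURN relay as a fallback for symmetric NATs (placeholder values)
    {
      urls: 'turn:turn.example.com:3478',
      username: 'webrtc-user',
      credential: 'secret',
    },
  ],
});
pc.onicecandidate = function (event) {
  // Forward each gathered candidate (local, reflexive, or relayed) to the peer
  if (event.candidate) { sendToSignalingServer(event.candidate); }
};
&lt;/code>&lt;/pre>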
&lt;h2 id="5-security">5. Security&lt;/h2>
&lt;p>Security is a core principle in WebRTC design, with all communications mandatorily encrypted and unable to be disabled.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Signaling security&lt;/strong>: The WebRTC standard doesn't specify a signaling protocol but recommends using secure WebSocket (WSS) or HTTPS to encrypt signaling messages.&lt;/li>
&lt;li>&lt;strong>Media encryption&lt;/strong>: All audio/video streams use &lt;strong>SRTP (Secure Real-time Transport Protocol)&lt;/strong> for encryption. SRTP prevents eavesdropping and content tampering by encrypting and authenticating RTP packets.&lt;/li>
&lt;li>&lt;strong>Data encryption&lt;/strong>: All &lt;code>RTCDataChannel&lt;/code> data is encrypted using &lt;strong>DTLS (Datagram Transport Layer Security)&lt;/strong>. DTLS is a protocol based on TLS that provides the same security guarantees for datagrams.&lt;/li>
&lt;/ul>
&lt;p>Key exchange is automatically completed during the &lt;code>RTCPeerConnection&lt;/code> establishment process through the DTLS handshake. This means a secure channel is established before any media or data exchange occurs.&lt;/p>
&lt;h2 id="6-practical-application-cases">6. Practical Application Cases&lt;/h2>
&lt;p>With its powerful features, WebRTC has been widely applied in various scenarios:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Video conferencing systems&lt;/strong>: Such as Google Meet, Jitsi Meet, etc., allowing multi-party real-time audio/video calls.&lt;/li>
&lt;li>&lt;strong>Online education platforms&lt;/strong>: Enabling remote interactive teaching between teachers and students.&lt;/li>
&lt;li>&lt;strong>Telemedicine&lt;/strong>: Allowing doctors to conduct video consultations with patients remotely.&lt;/li>
&lt;li>&lt;strong>P2P file sharing&lt;/strong>: Using &lt;code>RTCDataChannel&lt;/code> for fast file transfers between browsers.&lt;/li>
&lt;li>&lt;strong>Cloud gaming and real-time games&lt;/strong>: Providing low-latency instruction and data synchronization for games.&lt;/li>
&lt;li>&lt;strong>Online customer service and video support&lt;/strong>: Businesses providing real-time video support services to customers through web pages.&lt;/li>
&lt;/ul>
&lt;h2 id="7-conclusion">7. Conclusion&lt;/h2>
&lt;p>WebRTC is a revolutionary technology that brings real-time communication capabilities directly into browsers, greatly lowering the barrier to developing rich media applications. Through the three core APIs of &lt;code>RTCPeerConnection&lt;/code>, &lt;code>MediaStream&lt;/code>, and &lt;code>RTCDataChannel&lt;/code>, combined with powerful signaling, ICE, and security mechanisms, WebRTC provides a complete, robust, and secure real-time communication solution.&lt;/p>
&lt;p>As network technology develops and 5G becomes more widespread, WebRTC's application scenarios will become even broader, with its potential in emerging fields such as IoT, augmented reality (AR), and virtual reality (VR) gradually becoming apparent. For developers looking to integrate high-quality, low-latency communication features into their applications, WebRTC is undoubtedly one of the most worthwhile technologies to focus on and learn about today.&lt;/p></description></item><item><title>LoRA Technical Guide: Parameter-Efficient Fine-Tuning for Large Models</title><link>https://ziyanglin.netlify.app/en/post/lora-documentation/</link><pubDate>Thu, 26 Jun 2025 00:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/lora-documentation/</guid><description>&lt;h2 id="1-introduction-why-lora">1. Introduction: Why LoRA?&lt;/h2>
&lt;p>In today's rapidly evolving landscape of Large Language Models (LLMs) and generative AI, we've witnessed an explosive growth in model sizes, ranging from hundreds of millions to trillions of parameters. These massive models demonstrate remarkable capabilities across various tasks. However, a significant challenge emerges: how can we fine-tune these models for specific downstream tasks?&lt;/p>
&lt;p>The traditional &lt;strong>Full Fine-Tuning&lt;/strong> approach, which updates all parameters of a model, faces severe challenges:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High computational cost&lt;/strong>: Fine-tuning a model with billions of parameters requires enormous computational resources and hundreds of GB of GPU memory, which is prohibitively expensive for most developers and small to medium-sized enterprises.&lt;/li>
&lt;li>&lt;strong>Massive storage requirements&lt;/strong>: Each fine-tuned model for a specific task requires storing a complete model copy, leading to rapidly escalating storage costs.&lt;/li>
&lt;li>&lt;strong>Deployment difficulties&lt;/strong>: Maintaining and switching between multiple massive model copies for different tasks in a production environment is a nightmare.&lt;/li>
&lt;/ul>
&lt;p>To address these pain points, &lt;strong>Parameter-Efficient Fine-Tuning (PEFT)&lt;/strong> techniques have emerged. The core idea is to freeze most parameters of the pre-trained model during fine-tuning and only adjust a small portion (typically far less than 1% of the total) of new or specific parameters.&lt;/p>
&lt;p>Among the various PEFT techniques, &lt;strong>LoRA (Low-Rank Adaptation of Large Language Models)&lt;/strong> stands out for its excellent performance, efficiency, and implementation simplicity, becoming one of the most mainstream and widely applied solutions today. This document will provide an in-depth yet accessible introduction to the core principles of LoRA and offer detailed practical guidance.&lt;/p>
&lt;h2 id="2-core-principles-the-magic-of-lora">2. Core Principles: The Magic of LoRA&lt;/h2>
&lt;p>LoRA's core assumption is that &lt;strong>the weight changes in large language models when adapting to new tasks are low-rank&lt;/strong>. In other words, although the weight matrix &lt;code>W&lt;/code> of the pre-trained model is very large (e.g., &lt;code>d x d&lt;/code> dimensions), the weight change &lt;code>ΔW&lt;/code> during fine-tuning has a very low &amp;ldquo;intrinsic rank.&amp;rdquo;&lt;/p>
&lt;p>Based on this assumption, LoRA doesn't directly update &lt;code>W&lt;/code>, but instead approximates &lt;code>ΔW&lt;/code> by training two smaller, low-rank matrices &lt;code>B&lt;/code> and &lt;code>A&lt;/code>, such that &lt;code>ΔW ≈ BA&lt;/code>.&lt;/p>
&lt;ul>
&lt;li>&lt;code>W&lt;/code> is the pre-trained, frozen weight matrix.&lt;/li>
&lt;li>&lt;code>A&lt;/code> is an &lt;code>r x d&lt;/code> dimensional matrix, where &lt;code>r&lt;/code> is a rank much smaller than &lt;code>d&lt;/code>.&lt;/li>
&lt;li>&lt;code>B&lt;/code> is a &lt;code>d x r&lt;/code> dimensional matrix.&lt;/li>
&lt;/ul>
&lt;p>During fine-tuning, only the parameters of matrices &lt;code>A&lt;/code> and &lt;code>B&lt;/code> are trainable. The forward propagation computation process is accordingly changed to:&lt;/p>
&lt;p>&lt;code>h = Wx + BAx&lt;/code>&lt;/p>
&lt;p>Here's a diagram that illustrates this process more intuitively:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Input x] --&amp;gt; B(Pre-trained weights W);
A --&amp;gt; C(Low-rank matrix A);
C --&amp;gt; D(Low-rank matrix B);
B --&amp;gt; E[Wx];
D --&amp;gt; F[BAx];
E --&amp;gt; G((Sum));
F --&amp;gt; G;
G --&amp;gt; H[Final output h];
style B fill:#eee,stroke:#333,stroke-width:2px,stroke-dasharray: 5, 5
style C fill:#9cf,stroke:#333,stroke-width:2px
style D fill:#9cf,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;p>Where &lt;code>x&lt;/code> is the input and &lt;code>h&lt;/code> is the output. This approach greatly reduces the number of parameters that need to be trained. For example, if &lt;code>d = 4096&lt;/code> and &lt;code>r = 8&lt;/code>, the original matrix &lt;code>W&lt;/code> has &lt;code>4096 * 4096 ≈ 16.7M&lt;/code> parameters, while &lt;code>A&lt;/code> and &lt;code>B&lt;/code> together have only &lt;code>4096 * 8 + 8 * 4096 ≈ 65K&lt;/code> parameters, reducing the parameter count by approximately 256 times!&lt;/p>
&lt;p>&lt;strong>Key parameter &lt;code>r&lt;/code>&lt;/strong>: The rank &lt;code>r&lt;/code> is the most important hyperparameter in LoRA. It controls the size of the low-rank matrices and directly determines the number of new parameters.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Smaller &lt;code>r&lt;/code>&lt;/strong>: Fewer trainable parameters, faster training speed, lower memory usage, but may not fully capture complex features of the task.&lt;/li>
&lt;li>&lt;strong>Larger &lt;code>r&lt;/code>&lt;/strong>: More trainable parameters and stronger fitting capability, but higher computational cost and a greater risk of overfitting.&lt;/li>
&lt;/ul>
&lt;p>In practice, &lt;code>r&lt;/code> is typically set to 8, 16, 32, or 64, which achieves a good balance between performance and efficiency.&lt;/p>
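&lt;p>To make the parameter arithmetic concrete, here is a minimal NumPy sketch of the LoRA forward pass &lt;code>h = Wx + BAx&lt;/code> with the example dimensions above (&lt;code>d = 4096&lt;/code>, &lt;code>r = 8&lt;/code>). The shapes and zero initialization of &lt;code>B&lt;/code> follow the standard LoRA formulation; this is an illustration, not any particular library's implementation:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

d, r = 4096, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen pre-trained weight, d x d
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank matrix, r x d
B = np.zeros((d, r))                     # trainable low-rank matrix, d x r (zero-initialized, so BA = 0 at the start)

x = rng.standard_normal(d)
h = W @ x + B @ (A @ x)                  # h = Wx + BAx

print(d * d)                             # 16777216 full parameters
print(A.size + B.size)                   # 65536 trainable LoRA parameters
print((d * d) // (A.size + B.size))      # 256x reduction
&lt;/code>&lt;/pre>
&lt;p>Because &lt;code>B&lt;/code> starts at zero, the adapted model is exactly the pre-trained model at the beginning of training, which helps make LoRA fine-tuning stable.&lt;/p>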
&lt;h2 id="3-significant-advantages-of-lora">3. Significant Advantages of LoRA&lt;/h2>
&lt;p>Compared to full fine-tuning, LoRA demonstrates overwhelming advantages in multiple aspects:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Extreme parameter efficiency&lt;/strong>: As mentioned above, LoRA only requires training a tiny fraction of parameters. We can see this intuitively through the &lt;code>print_trainable_parameters()&lt;/code> function, where the proportion of trained parameters is typically less than 1%.&lt;/li>
&lt;li>&lt;strong>Faster training speed&lt;/strong>: With a significantly reduced number of parameters for gradient computation and updates, training time is also shortened, accelerating the iteration cycle.&lt;/li>
&lt;li>&lt;strong>Lower hardware requirements&lt;/strong>: LoRA significantly reduces GPU memory (VRAM) usage during training, making it possible to fine-tune models with tens of billions of parameters on consumer-grade GPUs (such as RTX 3090/4090).&lt;/li>
&lt;li>&lt;strong>Flexibility in deployment and management&lt;/strong>: This is one of LoRA's most attractive advantages. The pre-trained model remains unchanged and can be shared across all tasks. For each downstream task, we only need to save a lightweight (typically just a few MB to tens of MB) LoRA adapter (i.e., the weights of matrices A and B). During deployment, the appropriate adapter can be loaded dynamically according to needs, greatly simplifying model management and switching in multi-task scenarios.&lt;/li>
&lt;/ol>
&lt;h2 id="4-handson-practice-lora-training-methods">4. Hands-on Practice: LoRA Training Methods&lt;/h2>
&lt;p>Below, we'll demonstrate a complete example of how to fine-tune a large model using LoRA with the &lt;code>transformers&lt;/code>, &lt;code>peft&lt;/code>, and &lt;code>trl&lt;/code> libraries from the Hugging Face ecosystem.&lt;/p>
&lt;h3 id="step-1-environment-preparation">Step 1: Environment Preparation&lt;/h3>
&lt;p>First, ensure you have installed the necessary Python libraries:&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install transformers peft trl datasets torch
&lt;/code>&lt;/pre>
&lt;h3 id="step-2-load-model-tokenizer-and-dataset">Step 2: Load Model, Tokenizer, and Dataset&lt;/h3>
&lt;p>We select a pre-trained model as the foundation and load the corresponding tokenizer. At the same time, we load a dataset from the Hugging Face Hub for fine-tuning.&lt;/p>
&lt;pre>&lt;code class="language-python">from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from datasets import load_dataset
# Model ID, can be any supported Causal LM
model_id = &amp;quot;facebook/opt-350m&amp;quot;
# Load pre-trained model
model = AutoModelForCausalLM.from_pretrained(model_id)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load dataset (using English quotes dataset as an example)
dataset = load_dataset(&amp;quot;Abirate/english_quotes&amp;quot;, split=&amp;quot;train&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h3 id="step-3-configure-lora-loraconfig">Step 3: Configure LoRA (&lt;code>LoraConfig&lt;/code>)&lt;/h3>
&lt;p>This is the core step of LoRA fine-tuning. We need to create a &lt;code>LoraConfig&lt;/code> object to define the behavior of the LoRA adapter.&lt;/p>
&lt;pre>&lt;code class="language-python">from peft import LoraConfig
lora_config = LoraConfig(
r=16, # Rank of the low-rank matrices, recommended values are 8, 16, 32
lora_alpha=32, # Scaling factor, typically set to twice the value of r
target_modules=[&amp;quot;q_proj&amp;quot;, &amp;quot;v_proj&amp;quot;], # Specify which model layers to apply LoRA to. For Transformer models, typically q_proj and v_proj
lora_dropout=0.05, # Dropout probability for LoRA layers
bias=&amp;quot;none&amp;quot;, # Whether to train bias terms, &amp;quot;none&amp;quot; means not training
task_type=&amp;quot;CAUSAL_LM&amp;quot; # Task type, here it's causal language modeling
)
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;code>target_modules&lt;/code>: This parameter is crucial. It tells the PEFT library which modules (typically &lt;code>nn.Linear&lt;/code> layers) in the model should have LoRA applied. For most Transformer models, applying it to the query and value projection layers in the Attention mechanism (i.e., &lt;code>q_proj&lt;/code> and &lt;code>v_proj&lt;/code>) is a common practice. You can print the &lt;code>model&lt;/code> object to see the names of all its modules to determine which can be targeted.&lt;/li>
&lt;/ul>
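&lt;p>If you are unsure which module names exist in your model, a quick way to find candidates for &lt;code>target_modules&lt;/code> is to iterate over the named modules and keep only the linear layers. This sketch assumes &lt;code>model&lt;/code> is the model loaded in Step 2:&lt;/p>
&lt;pre>&lt;code class="language-python">import torch.nn as nn

# Print the name of every linear submodule; names such as
# 'q_proj' or 'v_proj' are valid entries for target_modules
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        print(name)
&lt;/code>&lt;/pre>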
&lt;h3 id="step-4-apply-lora-and-train-with-sfttrainer">Step 4: Apply LoRA and Train with &lt;code>SFTTrainer&lt;/code>&lt;/h3>
&lt;p>The &lt;code>SFTTrainer&lt;/code> (Supervised Fine-tuning Trainer) provided by the &lt;code>trl&lt;/code> library greatly simplifies the fine-tuning process. It has built-in support for &lt;code>peft&lt;/code>, so we just need to pass the model, tokenizer, dataset, and &lt;code>peft_config&lt;/code> to it.&lt;/p>
&lt;pre>&lt;code class="language-python">from trl import SFTTrainer
# Define training parameters
training_args = TrainingArguments(
output_dir=&amp;quot;./lora_finetuned_model&amp;quot;, # Model output directory
num_train_epochs=3, # Number of training epochs
per_device_train_batch_size=4, # Training batch size per device
logging_dir='./logs', # Logging directory
logging_steps=50, # Log every 50 steps
learning_rate=2e-4, # Learning rate
)
# Initialize SFTTrainer
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
args=training_args,
train_dataset=dataset,
peft_config=lora_config, # Pass in LoRA configuration
dataset_text_field=&amp;quot;quote&amp;quot;, # Field name containing text in the dataset
)
# Start training
trainer.train()
# Save the trained LoRA adapter
trainer.save_model()
&lt;/code>&lt;/pre>
&lt;p>After training is complete, an &lt;code>adapter_model.bin&lt;/code> file and an &lt;code>adapter_config.json&lt;/code> file will be generated in the &lt;code>output_dir&lt;/code> directory. Together, these files make up the lightweight LoRA adapter we've trained.&lt;/p>
&lt;h3 id="step-5-inference-with-the-trained-lora-adapter">Step 5: Inference with the Trained LoRA Adapter&lt;/h3>
&lt;p>For inference, we first load the original pre-trained model, then load the trained LoRA adapter weights.&lt;/p>
&lt;pre>&lt;code class="language-python">from peft import PeftModel
# Load the original, non-fine-tuned model
base_model = AutoModelForCausalLM.from_pretrained(model_id)
# Load the LoRA adapter
model_with_lora = PeftModel.from_pretrained(base_model, &amp;quot;./lora_finetuned_model&amp;quot;)
# Now model_with_lora is a model with LoRA weights integrated, ready for inference
prompt = &amp;quot;The best way to predict the future is to&amp;quot;
inputs = tokenizer(prompt, return_tensors=&amp;quot;pt&amp;quot;)
# Generate text
outputs = model_with_lora.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
&lt;/code>&lt;/pre>
&lt;h2 id="5-lora-model-deployment-from-static-to-dynamic">5. LoRA Model Deployment: From Static to Dynamic&lt;/h2>
&lt;p>After training, efficiently deploying LoRA models into production environments is the crucial next step. LoRA deployment strategies mainly fall into two categories: &lt;strong>Weight Merging (Static Deployment)&lt;/strong> and &lt;strong>Dynamic Adapter Loading (Dynamic Deployment)&lt;/strong>. The following flowcharts illustrate these two paths:&lt;/p>
&lt;p>&lt;strong>Option 1: Weight Merging (Static Deployment)&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[LoRA Training Complete] --&amp;gt; B[Base Model + LoRA Adapter];
B --&amp;gt; C[&amp;quot;Call merge_and_unload()&amp;quot;];
C --&amp;gt; D[Generate standalone full model];
D --&amp;gt; E[Standard deployment];
style D fill:#c9f,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Option 2: Dynamic Adapter Loading (Dynamic Deployment)&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[LoRA Training Complete] --&amp;gt; B[vLLM / TGI server];
B --&amp;gt; C[Load Base Model];
C --&amp;gt; D[Load multiple LoRA Adapters];
D --&amp;gt; E[On-demand inference combinations];
style E fill:#9cf,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;h3 id="option-1-weight-merging-and-standard-deployment-static">Option 1: Weight Merging and Standard Deployment (Static)&lt;/h3>
&lt;p>This is the simplest and most direct deployment approach. The core idea is to merge the lightweight LoRA adapter weights into the original base model weights, generating a new, standalone full model.&lt;/p>
&lt;p>&lt;strong>Method&lt;/strong>:
This can be done with the &lt;code>merge_and_unload()&lt;/code> method from the &lt;code>peft&lt;/code> library.&lt;/p>
&lt;pre>&lt;code class="language-python">from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Assuming model_id and lora_path are defined
base_model = AutoModelForCausalLM.from_pretrained(model_id)
model_with_lora = PeftModel.from_pretrained(base_model, &amp;quot;./lora_finetuned_model&amp;quot;)
# Merge weights
merged_model = model_with_lora.merge_and_unload()
# Now merged_model is a standard Transformers model
# You can save it like any other model
merged_model.save_pretrained(&amp;quot;./merged_lora_model&amp;quot;)
tokenizer.save_pretrained(&amp;quot;./merged_lora_model&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>Afterward, you can load and use this &lt;code>merged_lora_model&lt;/code> just like any regular Hugging Face model.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Advantages&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Zero inference latency&lt;/strong>: After merging, the inference process is identical to a standard model, with no additional computational overhead.&lt;/li>
&lt;li>&lt;strong>Simple deployment&lt;/strong>: No need for any additional inference framework support, can be used directly with standard libraries like &lt;code>transformers&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Loss of flexibility&lt;/strong>: For each LoRA adapter, you need to save and load a complete model copy, defeating the lightweight purpose of LoRA.&lt;/li>
&lt;li>&lt;strong>High storage cost&lt;/strong>: If you have multiple adapters, the storage overhead is enormous.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="option-2-highperformance-dynamic-deployment-with-vllm-recommended">Option 2: High-Performance Dynamic Deployment with vLLM (Recommended)&lt;/h3>
&lt;p>For scenarios requiring simultaneous service of multiple LoRA adapters, &lt;strong>vLLM&lt;/strong> is currently an industry-leading high-performance inference and serving engine. Built on core technologies such as &lt;strong>PagedAttention&lt;/strong> for GPU memory management, it supports efficient management and dynamic loading of many LoRA adapters, delivering very high throughput without significantly sacrificing latency.&lt;/p>
&lt;p>&lt;strong>Method&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Install vLLM&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install vllm
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Start vLLM server&lt;/strong>:
Use the &lt;code>vllm serve&lt;/code> command to start an OpenAI-compatible API server. The key is to enable LoRA support with &lt;code>--enable-lora&lt;/code> and optionally preload adapters with &lt;code>--lora-modules&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-bash"># lora_path points to your trained adapter directory
vllm serve meta-llama/Llama-2-7b-hf \
--enable-lora \
--lora-modules my_sql_lora=/path/to/your/sql_lora_adapter
&lt;/code>&lt;/pre>
&lt;p>Here, we've preloaded an adapter named &lt;code>my_sql_lora&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Send inference requests&lt;/strong>:
You can send requests to the vLLM server using &lt;code>curl&lt;/code> or any HTTP client. Just specify the &lt;code>model&lt;/code> in the request body as the name of your loaded LoRA adapter.&lt;/p>
&lt;pre>&lt;code class="language-bash">curl http://localhost:8000/v1/completions \
-H &amp;quot;Content-Type: application/json&amp;quot; \
-d '{
&amp;quot;model&amp;quot;: &amp;quot;my_sql_lora&amp;quot;,
&amp;quot;prompt&amp;quot;: &amp;quot;Write a SQL query for all users.&amp;quot;,
&amp;quot;max_tokens&amp;quot;: 64
}'
&lt;/code>&lt;/pre>
&lt;p>vLLM will automatically route the request to the corresponding LoRA adapter for inference.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Using Python Client&lt;/strong>:
vLLM also provides a Python API for direct calls in code.&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
# Initialize LLM engine with LoRA support
llm = LLM(model=&amp;quot;meta-llama/Llama-2-7b-hf&amp;quot;, enable_lora=True)
sampling_params = SamplingParams(max_tokens=64)
# In the generate call, specify which adapter to use via lora_request
outputs = llm.generate(
&amp;quot;Write a SQL query for all users.&amp;quot;,
sampling_params,
lora_request=LoRARequest(&amp;quot;my_sql_lora&amp;quot;, 1, &amp;quot;/path/to/your/sql_lora_adapter&amp;quot;)
)
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;strong>Advantages&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Extremely high throughput&lt;/strong>: Designed for large-scale concurrent inference.&lt;/li>
&lt;li>&lt;strong>Dynamic flexibility&lt;/strong>: Can simultaneously serve hundreds or thousands of LoRA adapters, loading them on demand, perfect for multi-tenant scenarios.&lt;/li>
&lt;li>&lt;strong>Memory efficient&lt;/strong>: PagedAttention mechanism effectively manages GPU memory, avoiding waste.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Slightly more complex deployment&lt;/strong>: Requires additional learning and configuration of vLLM service.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="option-3-other-dynamic-deployment-options-eg-tgi">Option 3: Other Dynamic Deployment Options (e.g., TGI)&lt;/h3>
&lt;p>Hugging Face's own &lt;strong>Text Generation Inference (TGI)&lt;/strong> is another powerful production-grade inference server. Similar to vLLM, TGI also supports loading multiple LoRA adapters at startup and dynamically applying them based on incoming request headers. It integrates best with the Hugging Face ecosystem and is a strong competitor to vLLM.&lt;/p>
&lt;h3 id="deployment-options-comparison-summary">Deployment Options Comparison Summary&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Feature&lt;/th>
&lt;th align="left">Weight Merging (Static)&lt;/th>
&lt;th align="left">vLLM (Dynamic)&lt;/th>
&lt;th align="left">TGI (Dynamic)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">&lt;strong>Performance/Throughput&lt;/strong>&lt;/td>
&lt;td align="left">Highest (lowest single request latency)&lt;/td>
&lt;td align="left">Very High&lt;/td>
&lt;td align="left">High&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Flexibility&lt;/strong>&lt;/td>
&lt;td align="left">Low (no dynamic capability)&lt;/td>
&lt;td align="left">Very High&lt;/td>
&lt;td align="left">High&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Deployment Complexity&lt;/strong>&lt;/td>
&lt;td align="left">Low&lt;/td>
&lt;td align="left">Medium&lt;/td>
&lt;td align="left">Medium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Memory Usage&lt;/strong>&lt;/td>
&lt;td align="left">Very High (N adapters = N times memory)&lt;/td>
&lt;td align="left">Low (efficient sharing)&lt;/td>
&lt;td align="left">Low (efficient sharing)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Suitable Scenarios&lt;/strong>&lt;/td>
&lt;td align="left">Single, fixed tasks&lt;/td>
&lt;td align="left">Multi-tenant, high-concurrency, multi-task scenarios&lt;/td>
&lt;td align="left">Production deployment in Hugging Face ecosystem&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="6-advanced-topics">6. Advanced Topics&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Multi-adapter Management&lt;/strong>: PEFT supports dynamically adding, switching, and disabling multiple adapters on a single model using methods like &lt;code>model.add_adapter()&lt;/code> and &lt;code>model.set_adapter()&lt;/code>, providing great convenience for building flexible multi-task systems.&lt;/li>
&lt;/ul>
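&lt;p>As a minimal sketch of this multi-adapter API, the following uses a toy &lt;code>nn.Sequential&lt;/code> model so nothing needs to be downloaded; the adapter names &lt;code>task_a&lt;/code> and &lt;code>task_b&lt;/code> are made up for illustration:&lt;/p>
&lt;pre>&lt;code class="language-python">import torch.nn as nn
from peft import LoraConfig, get_peft_model

# A tiny stand-in model: its single linear layer is named '0'
base = nn.Sequential(nn.Linear(16, 16))
cfg = LoraConfig(r=4, target_modules=['0'])

# Attach a first adapter, then add and switch to a second one
model = get_peft_model(base, cfg, adapter_name='task_a')
model.add_adapter('task_b', cfg)
model.set_adapter('task_b')
print(model.active_adapter)               # task_b
&lt;/code>&lt;/pre>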
&lt;h2 id="7-conclusion">7. Conclusion&lt;/h2>
&lt;p>As a revolutionary parameter-efficient fine-tuning technique, LoRA successfully addresses the high cost challenges of fine-tuning in the era of large models. Through clever low-rank decomposition ideas, it greatly reduces computational resource and storage requirements while maintaining fine-tuning effectiveness. Combined with advanced inference engines like vLLM, LoRA deployment and service have become unprecedentedly efficient and flexible, driving the application of large models in more specific scenarios.&lt;/p></description></item><item><title>2020 Assessing the Funniness of Edited News Headlines</title><link>https://ziyanglin.netlify.app/en/project/my2020_nlp_funniness_estimation/</link><pubDate>Mon, 27 Jul 2020 04:56:23 +0100</pubDate><guid>https://ziyanglin.netlify.app/en/project/my2020_nlp_funniness_estimation/</guid><description>&lt;p>This project aims to develop potential solutions for the tasks posed by the competition
&lt;a href="https://competitions.codalab.org/competitions/20970#learn_the_details" title="competition">&lt;code>Assessing the Funniness of Edited News Headlines (SemEval-2020)&lt;/code>&lt;/a> on the platform &lt;a href="https://competitions.codalab.org" title="competition">CodaLab&lt;/a>&lt;/p>
&lt;p>As of July 26, 2020, my trained model (&amp;lsquo;bert-base-uncased&amp;rsquo; from &lt;a href="https://huggingface.co/transformers/index.html" title="huggingface">Huggingface transformers&lt;/a>) &lt;code>ranked third&lt;/code> on the
Post Evaluation leaderboard for Task 1 on CodaLab&lt;/p>
&lt;p align="center">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task1_ranking.png" width="600" />
&lt;/p>
&lt;hr>
&lt;h2 id="contents">Contents&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="#tasks-description">Tasks Description&lt;/a>&lt;/li>
&lt;li>&lt;a href="#data-preprocessing">Data Preprocessing&lt;/a>&lt;/li>
&lt;li>&lt;a href="#models-choices--design">Models Choices &amp;amp; Design&lt;/a>&lt;/li>
&lt;li>&lt;a href="#design-of-training-processes-for-task-two-only">Design of Training Processes (for task two only)&lt;/a>&lt;/li>
&lt;li>&lt;a href="#optimizer--learning-rate-scheduler">Optimizer &amp;amp; Learning Rate Scheduler&lt;/a>&lt;/li>
&lt;li>&lt;a href="#prime-hyperparameters">Prime Hyperparameters&lt;/a>&lt;/li>
&lt;li>&lt;a href="#results">Results&lt;/a>&lt;/li>
&lt;li>&lt;a href="#discussion">Discussion&lt;/a>&lt;/li>
&lt;li>&lt;a href="#prospective">Prospective&lt;/a>&lt;/li>
&lt;li>&lt;a href="#License">License&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="tasks-description">Tasks Description&lt;/h2>
&lt;ul>
&lt;li>&lt;code>Task one&lt;/code> - Given one edited headline, design a regression model to predict how funny it is&lt;/li>
&lt;li>&lt;code>Task two&lt;/code> - Given the original headline and two manually edited versions, design a model to predict which edited version is the funnier of the two&lt;/li>
&lt;/ul>
&lt;h2 id="data-preprocessing">Data Preprocessing&lt;/h2>
&lt;h3 id="task-one">Task One&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Convert original headlines into normal sentences (remove the &lt;code>&amp;lt;&lt;/code> and &lt;code>/&amp;gt;&lt;/code> markers by applying regular expressions)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Get the edited version of each headline by performing word substitution with regular expressions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Do tokenization and lowercasing for each edited-original headlines pair&lt;/p>
&lt;p>Data preprocessing for pre-trained LMs (BERT-like LMs):&lt;/p>
&lt;ul>
&lt;li>Version 1 - Concatenate original headlines and new headlines&lt;/li>
&lt;li>Version 2 - Concatenate new headlines and new words&lt;/li>
&lt;li>Version 3 – Contain only new headlines&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="task-two">Task Two&lt;/h3>
&lt;p>There are 3 versions of data preprocessing:&lt;/p>
&lt;ul>
&lt;li>The normal version&lt;/li>
&lt;li>The headlines truncated version&lt;/li>
&lt;li>The punctuation removal version&lt;/li>
&lt;/ul>
&lt;h2 id="models-choices--design">Models Choices &amp;amp; Design&lt;/h2>
&lt;h3 id="task-one1">Task One&lt;/h3>
&lt;ul>
&lt;li>Two Inputs FFNN&lt;/li>
&lt;li>Two Inputs CNN&lt;/li>
&lt;li>Two Inputs RNN&lt;/li>
&lt;li>Two Inputs Concatenated RNN&lt;/li>
&lt;li>Pre-trained LM + a regression layer (LMs applied: BERT, ALBERT, XLNet, ELECTRA)&lt;/li>
&lt;/ul>
&lt;h4 id="two-inputs-ffnn">Two Inputs FFNN&lt;/h4>
&lt;p>This model is a two-input feed-forward neural network. Two input matrices, representing all the original headlines and their corresponding
edited headlines, are passed simultaneously through an embedding layer that produces a fixed-dimension word embedding
for each word in a headline. The model then averages the word embeddings of each headline to obtain a single document representation (vector)
per headline. These headline vectors are passed through three stacked fully connected layers, which encode
information about how humorous each headline is. A ReLU activation follows each of the first two hidden layers to introduce non-linearity and mitigate
vanishing gradients. Finally, the row-wise weighted sum (vector product) between the n-th row of the original matrix and the n-th
row of the edited matrix is computed, yielding a vector of size (origin_headlines_num, 1).&lt;/p>
&lt;p align="center">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/two_inputs_FFNN.png" width="700" />
&lt;/p>
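&lt;p>The row-wise output layer described above is simply a per-row dot product between the two representation matrices. A minimal NumPy sketch (the shapes are illustrative, not the project's actual dimensions):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
orig = rng.standard_normal((32, 128))     # (num_headlines, hidden_dim) for original headlines
edited = rng.standard_normal((32, 128))   # same shape for edited headlines

# Dot product between the n-th original row and the n-th edited row:
# one funniness score per headline pair
scores = (orig * edited).sum(axis=1, keepdims=True)
print(scores.shape)                       # (32, 1)
&lt;/code>&lt;/pre>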
&lt;h4 id="two-inputs-cnn">Two Inputs CNN&lt;/h4>
&lt;p>This model uses a text-CNN architecture with a single window size instead of an FFNN for the regression task. The original headlines tensor and the edited
headlines tensor are taken as the two inputs. In the output layer, instead of a normal matrix multiplication, the row-wise weighted sum (vector product)
between the n-th row of the original matrix and the n-th row of the edited matrix is computed, yielding a vector of size (origin_headlines_num, 1).&lt;/p>
&lt;p align="center">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/two_inputs_cnn.png" width="500" />
&lt;/p>
&lt;h4 id="two-inputs-rnn">Two Inputs RNN&lt;/h4>
&lt;p>This model uses a single-layer bidirectional RNN architecture for the regression task. Like the Two Inputs CNN, it takes two tensors as its
inputs and performs a row-wise weighted summation in the output layer.&lt;/p>
&lt;p align="center">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/two_inputs_rnn.png" width="500" />
&lt;/p>
&lt;h4 id="two-inputs-concatenated-rnn">Two Inputs Concatenated RNN&lt;/h4>
&lt;p>This model is identical to the Two Inputs RNN, except that it concatenates the two last hidden states of the original and edited headlines into
a single representation and performs a normal matrix multiplication in the output layer.&lt;/p>
&lt;h4 id="pretrained-lm--a-regression-layer-lms-applied-bert-albert-xlnet-electra">Pre-trained LM + a regression layer (LMs applied: BERT, ALBERT, XLNet, ELECTRA)&lt;/h4>
&lt;h5 id="version-1--concatenate-original-headlines-and-new-headlines">Version 1 - Concatenate original headlines and new headlines&lt;/h5>
&lt;p align="left">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/lm_inputs_version_1.png" width="600" />
&lt;/p>
&lt;h5 id="version-2--concatenate-new-headlines-and-new-words">Version 2 - Concatenate new headlines and new words&lt;/h5>
&lt;p align="left">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/lm_inputs_version_2.png" width="600" />
&lt;/p>
&lt;h5 id="version-3--contain-only-new-headlines">Version 3 – Contain only new headlines&lt;/h5>
&lt;p align="left">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/lm_inputs_version_3.png" width="600" />
&lt;/p>
&lt;h3 id="task-two1">Task Two&lt;/h3>
&lt;h4 id="pretrained-lm--a-classification-layer">Pre-trained LM + a classification layer&lt;/h4>
&lt;h5 id="concatenate-edited-headline-1-and-edited-headline-2">Concatenate edited headline 1 and edited headline 2&lt;/h5>
&lt;p align="left">
&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/2_seq_inputs_lm.png" width="650" />
&lt;/p>
&lt;h2 id="design-of-training-processes-for-task-two-only">Design of Training Processes (for task two only)&lt;/h2>
&lt;h3 id="version-1">Version 1:&lt;/h3>
&lt;ul>
&lt;li>Training the model “Pre-trained LM + a classification layer” directly on the real classification task&lt;/li>
&lt;/ul>
&lt;h3 id="version-2-fake-task--real-task">Version 2 (Fake Task + Real Task):&lt;/h3>
&lt;ul>
&lt;li>First, train the model “Pre-trained LM + a regression layer” on a fake regression task over the training dataset&lt;/li>
&lt;li>Once trained, discard the regression layer and add a freshly initialized classification layer on top of the pre-trained LM&lt;/li>
&lt;li>Finally, train the model on the real classification task&lt;/li>
&lt;/ul>
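&lt;p>The head swap at the core of this two-stage scheme can be sketched as follows; a toy feed-forward encoder stands in for the pre-trained LM, and all dimensions are illustrative assumptions:&lt;/p>

```python
import torch
import torch.nn as nn

class EncoderWithHead(nn.Module):
    """Same encoder, swappable task head: a regression head for the fake
    task, later replaced by a fresh classification head for the real task."""
    def __init__(self, hidden_dim=64, out_dim=1):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(16, hidden_dim), nn.ReLU())
        self.head = nn.Linear(hidden_dim, out_dim)  # regression head first

    def forward(self, x):
        return self.head(self.encoder(x))

model = EncoderWithHead(out_dim=1)      # Stage 1: fake regression task
# ... train on the regression objective here ...
model.head = nn.Linear(64, 3)           # Stage 2: new classification head
logits = model(torch.randn(4, 16))      # encoder weights are carried over
```

&lt;p>Only the head is reinitialized; the encoder (the pre-trained LM in the actual setup) keeps the weights learned on the fake task.&lt;/p>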
&lt;h2 id="optimizer--learning-rate-scheduler">Optimizer &amp;amp; Learning Rate Scheduler&lt;/h2>
&lt;h3 id="for-ffnn-cnn-rnn">For FFNN, CNN, RNN:&lt;/h3>
&lt;ul>
&lt;li>The optimizer &lt;code>AdamW&lt;/code> and the scheduler &lt;code>CosineAnnealingLR&lt;/code> provided by PyTorch&lt;/li>
&lt;/ul>
&lt;h3 id="for-pretrained-lms-bertliked-lms">For pre-trained LMs (BERT-liked LMs):&lt;/h3>
&lt;ul>
&lt;li>The optimizer &lt;code>AdamW&lt;/code> and the scheduler &lt;code>get_linear_schedule_with_warmup&lt;/code> from &lt;a href="https://huggingface.co/transformers/index.html" title="huggingface">Huggingface transformers&lt;/a>&lt;/li>
&lt;/ul>
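&lt;p>A sketch of both optimizer/scheduler setups with illustrative hyperparameters; the warmup schedule is reproduced with a plain &lt;code>LambdaLR&lt;/code> so the snippet depends only on PyTorch (Huggingface's &lt;code>get_linear_schedule_with_warmup&lt;/code> does the equivalent for the BERT-like LMs):&lt;/p>

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LambdaLR

params = [torch.nn.Parameter(torch.zeros(2, 2))]  # stand-in model parameters

# FFNN / CNN / RNN: AdamW with cosine annealing over the epochs.
ffnn_opt = AdamW(params, lr=1e-4, weight_decay=0.01)
cosine = CosineAnnealingLR(ffnn_opt, T_max=100)

# Pre-trained LMs: AdamW with linear warmup then linear decay,
# written as a per-step multiplicative factor for LambdaLR.
def linear_warmup_decay(step, warmup=10, total=100):
    # Rises linearly to 1.0 over `warmup` steps, then decays linearly to 0.0.
    return min(step / max(1, warmup),
               max(0.0, (total - step) / max(1, total - warmup)))

lm_opt = AdamW(params, lr=2e-5)
warmup_sched = LambdaLR(lm_opt, linear_warmup_decay)
```

&lt;p>Calling &lt;code>warmup_sched.step()&lt;/code> after each optimizer step walks the learning rate up to its peak and back down.&lt;/p>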
&lt;h2 id="prime-hyperparameters">Prime Hyperparameters&lt;/h2>
&lt;ul>
&lt;li>Learning Rate&lt;/li>
&lt;li>Fine-tuning Rate&lt;/li>
&lt;li>Adam Epsilon&lt;/li>
&lt;li>Weight Decay&lt;/li>
&lt;li>Warmup Ratio&lt;/li>
&lt;li>Number of Steps&lt;/li>
&lt;/ul>
&lt;h2 id="results">Results&lt;/h2>
&lt;h3 id="task-one2">Task One&lt;/h3>
&lt;h4 id="best-performance-achieved-by-two-inputs-ffnn">Best performance achieved by Two Inputs FFNN&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>EPOCHS&lt;/th>
&lt;th>LRATE&lt;/th>
&lt;th>EMBEDDING_DIM&lt;/th>
&lt;th>HIDDEN_DIM_1&lt;/th>
&lt;th>HIDDEN_DIM_2&lt;/th>
&lt;th>HIDDEN_DIM_3&lt;/th>
&lt;th>Train Loss&lt;/th>
&lt;th>Val. Loss&lt;/th>
&lt;th>Test Loss&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>100&lt;/td>
&lt;td>0.145&lt;/td>
&lt;td>300&lt;/td>
&lt;td>100&lt;/td>
&lt;td>50&lt;/td>
&lt;td>10&lt;/td>
&lt;td>0.575&lt;/td>
&lt;td>0.581&lt;/td>
&lt;td>0.576&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="best-performance-achieved-by-two-inputs-cnn">Best performance achieved by Two Inputs CNN&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>EPOCHS&lt;/th>
&lt;th>LRATE&lt;/th>
&lt;th>EMBEDDING_DIM&lt;/th>
&lt;th>FC_OUT_DIM&lt;/th>
&lt;th>N_OUT_CHANNELS&lt;/th>
&lt;th>WINDOW_SIZE&lt;/th>
&lt;th>DROPOUT&lt;/th>
&lt;th>Train Loss&lt;/th>
&lt;th>Val. Loss&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>500&lt;/td>
&lt;td>5e-3&lt;/td>
&lt;td>50&lt;/td>
&lt;td>25&lt;/td>
&lt;td>100&lt;/td>
&lt;td>3&lt;/td>
&lt;td>0.7&lt;/td>
&lt;td>0.624&lt;/td>
&lt;td>0.661&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="best-performance-achieved-by-two-inputs-rnn">Best performance achieved by Two Inputs RNN&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>EPOCHS&lt;/th>
&lt;th>LRATE&lt;/th>
&lt;th>EMBEDDING_DIM&lt;/th>
&lt;th>HIDDEN_DIM&lt;/th>
&lt;th>FC_OUTPUT_DIM&lt;/th>
&lt;th>BIDIRECTIONAL&lt;/th>
&lt;th>DROPOUT&lt;/th>
&lt;th>Train Loss&lt;/th>
&lt;th>Val. Loss&lt;/th>
&lt;th>Test Loss&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>30&lt;/td>
&lt;td>1e-4&lt;/td>
&lt;td>50&lt;/td>
&lt;td>128&lt;/td>
&lt;td>32&lt;/td>
&lt;td>True&lt;/td>
&lt;td>0.3&lt;/td>
&lt;td>0.586&lt;/td>
&lt;td>0.576&lt;/td>
&lt;td>0.571&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="best-performance-achieved-by-pretrained-lms">Best performance achieved by Pre-trained LMs&lt;/h4>
&lt;ul>
&lt;li>Without Data Augmentation
&lt;ul>
&lt;li>Model: bert_base_uncased&lt;/li>
&lt;li>Inputs structure: new headlines + new words&lt;/li>
&lt;li>Test loss: 0.52937&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>With Data Augmentation (adding the “funlines” training dataset)
&lt;ul>
&lt;li>Model: bert_base_uncased&lt;/li>
&lt;li>Inputs structure: new headlines + new words&lt;/li>
&lt;li>&lt;code>Test loss: 0.52054 (Best performance achieved among all trials)&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task1_log.png" alt="task1_log">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T1 Pre-trained LMs Log&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h3 id="task-two2">Task Two&lt;/h3>
&lt;h4 id="version-1-straightly-training-the-model-for-the-real-task">Version 1: Straightly training the model for the real task&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task2_v1_log1.png" alt="task2_v1_log1">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T2 Log 1&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task2_v1_log2.png" alt="task2_v1_log2">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T2 Log 2&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h4 id="version-2-fake-task-training--real-task-training">Version 2: Fake Task Training + Real Task Training&lt;/h4>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task2_v2_f_log.png" alt="task2_v2_f_log">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T2 Fake Task Log&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="center">&lt;img src="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/images/task2_v2_r_log.png" alt="task2_v2_r_log">&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="center">&lt;em>T2 Real Task Log&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;h3 id="task-one3">Task One&lt;/h3>
&lt;ul>
&lt;li>The Two Inputs RNN performs only slightly better than the Two Inputs FFNN (0.5759702196 vs. 0.5751694002), while its time complexity is much higher; in that sense the current version of the Two Inputs RNN is arguably a waste of resources.&lt;/li>
&lt;li>The Two Inputs CNN with a single window size performs worse than both the Two Inputs FFNN and the Two Inputs RNN; one possible reason is that it only looks at n-grams of a single size and therefore ignores the information carried by n-grams of other lengths.&lt;/li>
&lt;/ul>
&lt;h3 id="task-two3">Task Two&lt;/h3>
&lt;ul>
&lt;li>Among the preprocessing methods, the headline-truncation and punctuation-removal versions perform the same as the normal one, except that truncating headlines reduces the training time per epoch.&lt;/li>
&lt;li>Overfitting on the training dataset is hard to overcome when applying BERT-like pre-trained LMs (although several methods, such as data augmentation, weight decay, and increased dropout, were tried to mitigate the problem).&lt;/li>
&lt;li>Surprisingly, the fake-task training for pre-trained LMs does not improve the model's performance on the real task at all.&lt;/li>
&lt;li>With the same hyperparameter settings for a given task, the most recently proposed pre-trained LM is not necessarily the best performer.&lt;/li>
&lt;/ul>
&lt;h2 id="prospective">Prospective&lt;/h2>
&lt;ul>
&lt;li>Construct a pre-training LM for a binary classification task in which the model learns to decide whether a word in an edited headline is original or edited. Take the embeddings from this pre-trained model and use them to initialize the model for the real regression task, so that the embeddings carry some knowledge about the relationship between original and edited headlines.&lt;/li>
&lt;li>Build a pre-training LM for a text translation task on the training dataset and use its embeddings to initialize the model for the real regression task (aiming to learn the semantics of funniness).&lt;/li>
&lt;li>Intuitively, the performance of the Two Inputs CNN might be improved by increasing the number of window sizes (different n-gram filters).&lt;/li>
&lt;li>Apply the pre-trained LM Longformer rather than other BERT-like models to task two: Longformer's ‘global attention mask’ can probably better model the relationship between the edited word and the other words in a headline (e.g. &lt;code>How important is the edited word for the whole headline in order to make it funnier?&lt;/code> / &lt;code>How does the edited word contribute to the meaning of the whole sentence?&lt;/code>).&lt;/li>
&lt;/ul>
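&lt;p>The multi-window idea in the third point above can be sketched like this; the window sizes, channel counts, and pooling choice are illustrative assumptions, not the project's actual configuration:&lt;/p>

```python
import torch
import torch.nn as nn

class MultiWindowCNN(nn.Module):
    """Convolve the headline with several window sizes (n-gram filters),
    max-pool each feature map over time, and regress from the concatenation."""
    def __init__(self, vocab_size=10000, embedding_dim=50,
                 n_channels=100, window_sizes=(2, 3, 4)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embedding_dim, n_channels, kernel_size=w)
            for w in window_sizes
        )
        self.out = nn.Linear(n_channels * len(window_sizes), 1)

    def forward(self, ids):                            # ids: (B, T)
        x = self.embedding(ids).transpose(1, 2)        # (B, E, T)
        # One max-pooled feature vector per n-gram width, then concatenate.
        pooled = [conv(x).relu().amax(dim=-1) for conv in self.convs]
        return self.out(torch.cat(pooled, dim=-1)).squeeze(-1)
```

&lt;p>Each &lt;code>Conv1d&lt;/code> sees a different n-gram width, so the regressor combines evidence from bigrams, trigrams, and 4-grams instead of a single window size.&lt;/p>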
&lt;h2 id="license">License&lt;/h2>
&lt;p>This project follows the MIT License, as described in the &lt;a href="https://github.com/JackyLin97/2020_NLP_Funniness_Estimation-PyTorch/raw/master/LICENSE">LICENSE&lt;/a> file.&lt;/p>
&lt;hr></description></item></channel></rss>