<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Artificial Intelligence | Ziyang Lin</title><link>https://ziyanglin.netlify.app/en/categories/artificial-intelligence/</link><atom:link href="https://ziyanglin.netlify.app/en/categories/artificial-intelligence/index.xml" rel="self" type="application/rss+xml"/><description>Artificial Intelligence</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Mon, 30 Jun 2025 11:00:00 +0000</lastBuildDate><image><url>https://ziyanglin.netlify.app/img/icon-192.png</url><title>Artificial Intelligence</title><link>https://ziyanglin.netlify.app/en/categories/artificial-intelligence/</link></image><item><title>LLM Agent Multi-Turn Dialogue: Architecture Design and Implementation Strategies</title><link>https://ziyanglin.netlify.app/en/post/llm-agent-multi-turn-dialogue/</link><pubDate>Mon, 30 Jun 2025 11:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/llm-agent-multi-turn-dialogue/</guid><description>&lt;h2 id="1-introduction-why-multiturn-dialogue-is-the-core-lifeline-of-agents">1. Introduction: Why Multi-Turn Dialogue is the Core Lifeline of Agents&lt;/h2>
&lt;p>In the wave of human-machine interaction, Large Language Model (LLM) driven Agents are evolving from simple &amp;ldquo;question-answer&amp;rdquo; tools into &amp;ldquo;intelligent assistants&amp;rdquo; capable of executing complex tasks with reasoning and planning abilities. The core of this evolution lies in &lt;strong>Multi-turn Dialogue&lt;/strong> capabilities.&lt;/p>
&lt;p>Single-turn dialogue resembles a one-time query, while multi-turn dialogue is a continuous, memory-driven, goal-oriented exchange. Users may not provide all information at once, requiring Agents to understand evolving needs, clarify ambiguous instructions, call external tools, and ultimately achieve the user's goals through continuous interaction.&lt;/p>
&lt;p>This document will thoroughly analyze the core challenges faced by LLM Agents in implementing efficient and reliable multi-turn dialogues, and provide a detailed explanation of current mainstream technical architectures and implementation details.&lt;/p>
&lt;h2 id="2-core-challenges-thorny-issues-in-multiturn-dialogues">2. Core Challenges: &amp;ldquo;Thorny Issues&amp;rdquo; in Multi-Turn Dialogues&lt;/h2>
&lt;p>To build a powerful multi-turn dialogue Agent, we must address several fundamental challenges:&lt;/p>
&lt;h3 id="21-context-window-limitation">2.1 Context Window Limitation&lt;/h3>
&lt;p>This is the most fundamental physical constraint. LLMs can only process a limited length of text (tokens). As conversation turns increase, the complete dialogue history quickly exceeds the model's context window.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Macro Issue&lt;/strong>: Leads to &amp;ldquo;memory loss,&amp;rdquo; where the Agent cannot recall early critical information, causing dialogue coherence to break.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>: Directly truncating early dialogue history is the simplest but crudest method, potentially losing important premises. For example, preferences set by the user at the beginning of a conversation (&amp;ldquo;I prefer window seats&amp;rdquo;) might be forgotten during subsequent booking steps.&lt;/li>
&lt;/ul>
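&lt;p>As an illustration of the trade-off above, here is a minimal truncation sketch that drops the oldest turns to fit a token budget while pinning an initial system message (such as the user's seat preference). The whitespace-based token count and message format are assumptions standing in for a real tokenizer and chat schema.&lt;/p>

```python
# Minimal sketch: trim the oldest non-system turns until the history fits a
# token budget. Assumption: len(text.split()) stands in for a real tokenizer.

def trim_history(history, max_tokens=1000, keep_system=True):
    """Drop the oldest non-system turns until the total fits the budget."""
    def n_tokens(turn):
        return len(turn["content"].split())

    trimmed = list(history)
    while trimmed and sum(n_tokens(t) for t in trimmed) > max_tokens:
        # Preserve the first system message (e.g., user preferences) if asked.
        idx = 1 if keep_system and trimmed[0]["role"] == "system" else 0
        if idx >= len(trimmed):
            break
        trimmed.pop(idx)
    return trimmed
```

&lt;p>Note that the pinned system message is exactly what protects premises like &amp;ldquo;I prefer window seats&amp;rdquo; from being cut.&lt;/p>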
&lt;h3 id="22-state-maintenance-complexity">2.2 State Maintenance Complexity&lt;/h3>
&lt;p>Agents need to precisely track the dialogue state, such as: What stage is the current task at? What information has the user provided? What information is still needed?&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Macro Issue&lt;/strong>: If the state is confused, the Agent appears &amp;ldquo;muddled,&amp;rdquo; repeatedly asking for known information or getting &amp;ldquo;lost&amp;rdquo; in the task flow.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>: State is more than just dialogue history. It's a structured data collection that may include user intent, extracted entities (like dates, locations), API call results, current task nodes, etc. Designing a robust, scalable state management mechanism is a significant engineering challenge.&lt;/li>
&lt;/ul>
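&lt;p>The structured state described above might be captured in a small container like the following sketch; the field names and defaults are illustrative, not a fixed schema.&lt;/p>

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative structured dialogue state covering the kinds of fields
# described above: intent, extracted entities, tool results, task node.
@dataclass
class DialogueState:
    intent: Optional[str] = None                      # e.g., "book_flight"
    entities: dict = field(default_factory=dict)      # extracted slots: dates, locations...
    tool_results: list = field(default_factory=list)  # raw API call outputs
    current_node: str = "start"                       # current node in the task flow

    def missing_slots(self, required):
        """Return the required slots the user has not yet provided."""
        return [s for s in required if s not in self.entities]
```

&lt;p>A helper like &lt;code>missing_slots&lt;/code> is what lets the Agent avoid re-asking for information it already holds.&lt;/p>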
&lt;h3 id="23-intent-drifting--goal-forgetting">2.3 Intent Drifting &amp;amp; Goal Forgetting&lt;/h3>
&lt;p>In long conversations, user intent may change, or a large goal may be broken down into multiple subtasks.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Macro Issue&lt;/strong>: Agents need to understand and adapt to these dynamic changes rather than rigidly adhering to the initial goal. If a user checks the weather and then says, &amp;ldquo;Book me a flight there,&amp;rdquo; the Agent must recognize this as a new, related intent.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>: This requires the Agent to have strong intent recognition and reasoning capabilities to determine whether the current user input is continuing, modifying, or starting a completely new task.&lt;/li>
&lt;/ul>
&lt;h3 id="24-error-handling--selfcorrection">2.4 Error Handling &amp;amp; Self-Correction&lt;/h3>
&lt;p>When tool calls fail (e.g., API timeout), information extraction errors occur, or understanding deviates, the Agent cannot simply crash or give up.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Macro Issue&lt;/strong>: A reliable Agent should be able to identify failures and proactively initiate correction processes, such as retrying, clarifying with the user, or finding alternatives.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>: This requires designing fault tolerance and retry mechanisms at the architectural level. The Agent needs to &amp;ldquo;understand&amp;rdquo; error messages returned by tools and generate new &amp;ldquo;thoughts&amp;rdquo; based on these to plan the next corrective action.&lt;/li>
&lt;/ul>
&lt;h2 id="3-technical-architecture-evolution-and-analysis">3. Technical Architecture Evolution and Analysis&lt;/h2>
&lt;p>To address the above challenges, the industry has explored various solutions, from simple history compression to complex Agentic architectures.&lt;/p>
&lt;h3 id="31-early-attempts-dialogue-history-compression">3.1 Early Attempts: Dialogue History Compression&lt;/h3>
&lt;p>This is the most direct approach to solving context window limitations.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Summary Memory&lt;/strong>: After each round of dialogue, or when the history length approaches a threshold, another LLM call summarizes the existing conversation.
&lt;ul>
&lt;li>&lt;strong>Advantage&lt;/strong>: Effectively reduces length.&lt;/li>
&lt;li>&lt;strong>Disadvantage&lt;/strong>: The summarization process may lose details and adds additional LLM call costs and latency.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
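&lt;p>A minimal sketch of summary memory, with the LLM summarizer replaced by a stub (&lt;code>summarize&lt;/code>) so the bookkeeping is visible: once the buffer exceeds a threshold, older turns are folded into a rolling summary.&lt;/p>

```python
# Sketch of summary memory. `summarize` is a stand-in for an LLM call; a real
# system would pay the extra call cost and latency mentioned above.

def summarize(turns):
    # Placeholder: a real system would prompt an LLM to condense these turns.
    return "Summary of %d earlier turns." % len(turns)

class SummaryMemory:
    def __init__(self, max_turns=6, keep_recent=2):
        self.summary = ""          # rolling summary of older dialogue
        self.turns = []            # recent verbatim turns
        self.max_turns = max_turns
        self.keep_recent = keep_recent

    def add(self, turn):
        self.turns.append(turn)
        if len(self.turns) > self.max_turns:
            old = self.turns[: -self.keep_recent]
            self.turns = self.turns[-self.keep_recent :]
            prefix = [self.summary] if self.summary else []
            self.summary = summarize(prefix + old)

    def context(self):
        parts = ([self.summary] if self.summary else []) + self.turns
        return "\n".join(parts)
```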
&lt;h3 id="32-react-architecture-giving-agents-the-ability-to-think">3.2 ReAct Architecture: Giving Agents the Ability to &amp;ldquo;Think&amp;rdquo;&lt;/h3>
&lt;p>ReAct (Reason + Act) is the cornerstone of today's mainstream Agent architectures. Through an elegant &amp;ldquo;think-act-observe&amp;rdquo; cycle, it transforms an LLM from a mere text generator into an entity with reasoning and execution capabilities.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Macro Concept&lt;/strong>: Mimics the human problem-solving pattern: first analyze (Reason), then take action (Act), and finally observe the results (Observation) and adjust the approach.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Underlying Implementation&lt;/strong>: Carefully designed prompts guide the LLM to generate text with specific markers.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Thought&lt;/strong>: The LLM performs an &amp;ldquo;inner monologue&amp;rdquo; at this step, analyzing the current situation and planning the next action. This content is invisible to users.&lt;/li>
&lt;li>&lt;strong>Action&lt;/strong>: The LLM decides which tool to call and what parameters to pass. For example, &lt;code>search(&amp;quot;Beijing weather today&amp;quot;)&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Observation&lt;/strong>: Feeds back the results of tool execution (such as API returned data, database query results) to the LLM.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>This cycle repeats until the Agent considers the task complete.&lt;/p>
&lt;h4 id="react-work-cycle">ReAct Work Cycle&lt;/h4>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;User Input&amp;quot;] --&amp;gt; B{&amp;quot;LLM Generates Thought and Action&amp;quot;};
B -- Thought --&amp;gt; C[&amp;quot;Inner Monologue: What should I do?&amp;quot;];
C --&amp;gt; D{&amp;quot;Action: Call Tool&amp;quot;};
D -- &amp;quot;Tool Input&amp;quot; --&amp;gt; E[&amp;quot;External Tool (API, DB)&amp;quot;];
E -- &amp;quot;Tool Output&amp;quot; --&amp;gt; F[&amp;quot;Observation: Get Result&amp;quot;];
F --&amp;gt; G{&amp;quot;LLM Generates New Thought Based on Observation&amp;quot;};
G -- &amp;quot;Thought&amp;quot; --&amp;gt; H[&amp;quot;Inner Monologue: ...&amp;quot;];
H --&amp;gt; I{&amp;quot;Is Task Complete?&amp;quot;};
I -- &amp;quot;No&amp;quot; --&amp;gt; D;
I -- &amp;quot;Yes&amp;quot; --&amp;gt; J[&amp;quot;Final Answer&amp;quot;];
J --&amp;gt; K[&amp;quot;Respond to User&amp;quot;];
&lt;/code>&lt;/pre>
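&lt;p>The cycle above can be sketched as a small loop. The LLM is replaced by a scripted stub (&lt;code>fake_llm&lt;/code>) and the tool by a canned function, so this illustrates only the control flow, not a real model or API integration.&lt;/p>

```python
import re

# Minimal ReAct loop. The Thought/Action/Observation markers match the
# description above; fake_llm and search are stubs for illustration.

def fake_llm(prompt):
    if "Observation:" not in prompt:
        return 'Thought: I need the weather.\nAction: search("Beijing weather today")'
    return "Thought: I have the answer.\nFinal Answer: Sunny, 25°C."

def search(query):
    return "Sunny, 25°C in Beijing."  # stand-in for a real weather API

TOOLS = {"search": search}

def react_loop(question, max_steps=5):
    prompt = question
    for _ in range(max_steps):
        output = fake_llm(prompt)
        if "Final Answer:" in output:
            return output.split("Final Answer:", 1)[1].strip()
        match = re.search(r'Action:\s*(\w+)\("(.+)"\)', output)
        if not match:
            break
        tool, arg = match.group(1), match.group(2)
        observation = TOOLS[tool](arg)  # Act, then feed the result back
        prompt += f"\n{output}\nObservation: {observation}"
    return None
```

&lt;p>The &lt;code>max_steps&lt;/code> cap is a common safeguard against the loop running forever when the model never emits a final answer.&lt;/p>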
&lt;h3 id="33-finite-state-machine-fsm-building-tracks-for-dialogue-flow">3.3 Finite State Machine (FSM): Building &amp;ldquo;Tracks&amp;rdquo; for Dialogue Flow&lt;/h3>
&lt;p>For tasks with clear goals and relatively fixed processes (such as food ordering, customer service), Finite State Machines (FSM) are an extremely powerful and reliable architecture.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Macro Concept&lt;/strong>: Abstract complex dialogue processes into a series of discrete &amp;ldquo;states&amp;rdquo; and &amp;ldquo;transition conditions&amp;rdquo; between these states. The Agent is in a clear state at any moment and can only transition to the next state through predefined paths.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Underlying Implementation&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>States&lt;/strong>: Define possible nodes in the dialogue, such as &lt;code>AskLocation&lt;/code>, &lt;code>AskCuisine&lt;/code>, &lt;code>ConfirmOrder&lt;/code>, &lt;code>OrderPlaced&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Transitions&lt;/strong>: Define rules for state switching, typically triggered by user input or tool output. For example, in the &lt;code>AskLocation&lt;/code> state, if location information is successfully extracted from user input, transition to the &lt;code>AskCuisine&lt;/code> state.&lt;/li>
&lt;li>&lt;strong>State Handler&lt;/strong>: Each state is associated with a handler function responsible for executing specific logic in that state (such as asking the user questions, calling APIs).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="a-simple-food-ordering-agent">A Simple Food Ordering Agent&lt;/h4>
&lt;pre>&lt;code class="language-mermaid">stateDiagram-v2
[*] --&amp;gt; Awaiting_Order
Awaiting_Order: User initiates food order
Awaiting_Order --&amp;gt; Collect_Cuisine: Identify ordering intent
Collect_Cuisine: &amp;quot;What cuisine would you like?&amp;quot;
Collect_Cuisine --&amp;gt; Collect_Headcount: User provides cuisine
Collect_Headcount: &amp;quot;How many people dining?&amp;quot;
Collect_Headcount --&amp;gt; Confirmation: User provides headcount
state Confirmation {
direction LR
[*] --&amp;gt; Show_Summary
Show_Summary: &amp;quot;Booking [headcount] for [cuisine], confirm?&amp;quot;
Show_Summary --&amp;gt; Finalize: User confirms
Finalize --&amp;gt; [*]
}
Confirmation --&amp;gt; Collect_Cuisine: User modifies
&lt;/code>&lt;/pre>
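&lt;p>The ordering flow above maps naturally onto a table-driven FSM. The sketch below is a minimal version; the event names and handler prompts are assumptions chosen to mirror the diagram.&lt;/p>

```python
# Table-driven FSM for the ordering flow above. Transitions fire on events
# (e.g., a slot successfully extracted); prompts play the role of handlers.

TRANSITIONS = {
    ("Awaiting_Order", "order_intent"): "Collect_Cuisine",
    ("Collect_Cuisine", "got_cuisine"): "Collect_Headcount",
    ("Collect_Headcount", "got_headcount"): "Confirmation",
    ("Confirmation", "confirmed"): "Order_Placed",
    ("Confirmation", "modify"): "Collect_Cuisine",
}

PROMPTS = {
    "Collect_Cuisine": "What cuisine would you like?",
    "Collect_Headcount": "How many people dining?",
    "Confirmation": "Booking [headcount] for [cuisine], confirm?",
}

class OrderFSM:
    def __init__(self):
        self.state = "Awaiting_Order"

    def step(self, event):
        nxt = TRANSITIONS.get((self.state, event))
        if nxt is None:
            # Unknown event in this state: stay put and re-prompt.
            return f"Sorry, I didn't catch that. {PROMPTS.get(self.state, '')}"
        self.state = nxt
        return PROMPTS.get(nxt, "Done!")
```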
&lt;h4 id="modern-evolution-of-fsm-dynamic-and-hierarchical">Modern Evolution of FSM: Dynamic and Hierarchical&lt;/h4>
&lt;p>Traditional FSMs rely on hardcoded rules for state transitions, which can be rigid when facing complex, changing real-world scenarios. Modern Agent design deeply integrates FSM with LLM capabilities, giving rise to more intelligent and flexible architectures.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>LLM-Driven State Transitions&lt;/strong>: Rather than using fixed &lt;code>if-else&lt;/code> rules to determine state changes, let the LLM make decisions. In each cycle, pass the dialogue history, current user input, and a list of all possible target states to the LLM, allowing it to determine the most appropriate next state based on its powerful context understanding. This upgrades state transitions from &amp;ldquo;rule-driven&amp;rdquo; to &amp;ldquo;intelligence-driven.&amp;rdquo;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>State-Specific Prompts&lt;/strong>: This is a powerful application of dynamic prompting. For each core state node in the FSM, a highly optimized set of dedicated prompts can be pre-designed. When the Agent enters a certain state (such as &lt;code>Collect_Cuisine&lt;/code>), the system immediately activates the prompt corresponding to that state. This prompt not only guides the LLM on how to interact with users at that node but can also define tools that can be called in that state, rules to follow, etc. This allows the Agent to &amp;ldquo;wear different hats&amp;rdquo; at different task stages, exhibiting high professionalism and task relevance.&lt;/p>
&lt;/li>
&lt;/ul>
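&lt;p>An LLM-driven transition might look like the following sketch: the allowed successor states are offered to the model, and a guardrail rejects any choice outside the predefined graph. &lt;code>fake_llm&lt;/code> stands in for a real model call.&lt;/p>

```python
import json

# Sketch: the current state's allowed successors are presented to the LLM,
# which returns its choice as JSON. The model call is stubbed.

def fake_llm(prompt):
    return '{"next_state": "Collect_Headcount"}'

def next_state(current, candidates, history, user_input, llm=fake_llm):
    prompt = (
        f"Current state: {current}\n"
        f"Allowed next states: {candidates}\n"
        f"History: {history}\nUser: {user_input}\n"
        'Reply as JSON: {"next_state": "..."}'
    )
    choice = json.loads(llm(prompt)).get("next_state")
    # Guardrail: never accept a state outside the predefined graph.
    return choice if choice in candidates else current
```

&lt;p>Keeping the candidate list explicit preserves the FSM's predictability even though the decision itself is &amp;ldquo;intelligence-driven.&amp;rdquo;&lt;/p>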
&lt;h5 id="example-statespecific-prompt-for-queryflights-state-in-flight-booking-subprocess">Example: State-Specific Prompt for &lt;code>Query_Flights&lt;/code> State in Flight Booking Sub-Process&lt;/h5>
&lt;pre>&lt;code># IDENTITY
You are a world-class flight booking assistant AI.
# STATE &amp;amp; GOAL
You are currently in the &amp;quot;Query_Flights&amp;quot; state.
Your SOLE GOAL is to collect the necessary information to search for flights.
The necessary information is: origin city, destination city, and departure date.
# AVAILABLE TOOLS
- `flight_search_api(origin: str, destination: str, date: str)`: Use this tool to search for flights.
# CONTEXT
- Conversation History:
{conversation_history}
- User Profile:
{user_profile}
- Current State Data:
{state_data} # e.g., {&amp;quot;origin&amp;quot;: &amp;quot;Shanghai&amp;quot;, &amp;quot;destination&amp;quot;: &amp;quot;Beijing&amp;quot;, &amp;quot;date&amp;quot;: null}
# RULES
1. Analyze the Current State Data first.
2. If any necessary information (origin, destination, date) is missing, you MUST ask the user for it clearly.
3. Phrase your questions to sound helpful and natural.
4. Once all information is collected, your FINAL ACTION MUST be to call the `flight_search_api` tool with the correct parameters.
5. Do not make up information. Do not ask for information that is not required (e.g., return date, unless specified by the user).
# OUTPUT FORMAT
Your output must be a single JSON object.
- To ask a question: {&amp;quot;action&amp;quot;: &amp;quot;ask_user&amp;quot;, &amp;quot;question&amp;quot;: &amp;quot;Your question here.&amp;quot;}
- To call a tool: {&amp;quot;action&amp;quot;: &amp;quot;call_tool&amp;quot;, &amp;quot;tool_name&amp;quot;: &amp;quot;flight_search_api&amp;quot;, &amp;quot;tool_params&amp;quot;: {&amp;quot;origin&amp;quot;: &amp;quot;...&amp;quot;, &amp;quot;destination&amp;quot;: &amp;quot;...&amp;quot;, &amp;quot;date&amp;quot;: &amp;quot;...&amp;quot;}}
&lt;/code>&lt;/pre>
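&lt;p>The JSON output contract above implies a small dispatcher on the application side. The sketch below routes the two actions defined in the prompt; the tool implementation is a stub.&lt;/p>

```python
import json

# Dispatcher for the JSON output format above: route "ask_user" back to the
# user and "call_tool" to the registered tool. The API itself is stubbed.

def flight_search_api(origin, destination, date):
    return f"Flights from {origin} to {destination} on {date}: ..."

TOOLS = {"flight_search_api": flight_search_api}

def dispatch(llm_output):
    msg = json.loads(llm_output)
    if msg["action"] == "ask_user":
        return ("ask", msg["question"])
    if msg["action"] == "call_tool":
        tool = TOOLS[msg["tool_name"]]
        return ("tool", tool(**msg["tool_params"]))
    raise ValueError("Unknown action: %s" % msg["action"])
```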
&lt;ul>
&lt;li>&lt;strong>Hierarchical FSM&lt;/strong>: For large complex tasks, a single flat state diagram is difficult to manage. Hierarchical FSMs introduce the concept of &amp;ldquo;SOP nesting&amp;rdquo; or &amp;ldquo;sub-state diagrams.&amp;rdquo; A high-level FSM (main SOP) is responsible for planning the macro business process (such as &amp;ldquo;complete a travel booking&amp;rdquo;), and when the process reaches a certain macro state (such as &amp;ldquo;book flight&amp;rdquo;), it can activate an embedded, more detailed sub-FSM (sub-SOP) that specifically handles a series of refined operations like &amp;ldquo;query flights -&amp;gt; select seats -&amp;gt; confirm payment.&amp;rdquo; This pattern greatly enhances the modularity and manageability of task decomposition.&lt;/li>
&lt;/ul>
&lt;h5 id="hierarchical-state-machine-sop-nesting-example">Hierarchical State Machine (SOP Nesting) Example&lt;/h5>
&lt;pre>&lt;code class="language-mermaid">stateDiagram-v2
direction LR
[*] --&amp;gt; MainSOP
state &amp;quot;Main Process: Travel Planning (Main SOP)&amp;quot; as MainSOP {
[*] --&amp;gt; Collect_Trip_Info
note right of Collect_Trip_Info
User: &amp;quot;Help me plan a trip to Beijing&amp;quot;
end note
Collect_Trip_Info --&amp;gt; Book_Flight_Sub_SOP : &amp;quot;OK, let's book flights first&amp;quot;
state &amp;quot;Sub-Process: Flight Booking&amp;quot; as Book_Flight_Sub_SOP {
direction LR
[*] --&amp;gt; Query_Flights: &amp;quot;When do you want to depart?&amp;quot;
Query_Flights --&amp;gt; Select_Seat: &amp;quot;Found flights, please select seat&amp;quot;
Select_Seat --&amp;gt; Confirm_Payment: &amp;quot;Seat selected, please pay&amp;quot;
Confirm_Payment --&amp;gt; [*]: Payment successful
}
Book_Flight_Sub_SOP --&amp;gt; Book_Hotel: &amp;quot;Flight booked, now for hotel&amp;quot;
Book_Hotel --&amp;gt; Finalize_Trip: &amp;quot;Hotel booked, final confirmation&amp;quot;
Finalize_Trip --&amp;gt; [*]
}
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>FSM vs. ReAct&lt;/strong>: FSM is structured, predictable, and easy to debug, making it very suitable for task-oriented dialogues. ReAct is more flexible and versatile, suitable for handling open-ended tasks requiring complex reasoning and dynamic planning. In practice, the two are often combined (for example, using ReAct to handle an open-ended subtask within an FSM state, or as mentioned above, using an LLM to drive FSM state transitions).&lt;/p>
&lt;h2 id="4-core-components-agents-memory-system">4. Core Components: Agent's &amp;ldquo;Memory&amp;rdquo; System&lt;/h2>
&lt;p>Regardless of the architecture used, a powerful memory system is the cornerstone of effective multi-turn dialogue.&lt;/p>
&lt;h3 id="41-shortterm-memory">4.1 Short-term Memory&lt;/h3>
&lt;p>Also known as working memory, primarily responsible for storing recent dialogue history.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Typical Implementation&lt;/strong>: &lt;code>ConversationBufferMemory&lt;/code> or &lt;code>ConversationBufferWindowMemory&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>:
&lt;ul>
&lt;li>&lt;code>ConversationBufferMemory&lt;/code>: Stores complete dialogue history. Simple and direct, but quickly exhausts the context window in long conversations.&lt;/li>
&lt;li>&lt;code>ConversationBufferWindowMemory&lt;/code>: Only keeps the most recent &lt;code>k&lt;/code> turns of dialogue. This sliding window mechanism effectively controls length but risks losing important early information.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
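&lt;p>The sliding-window mechanism can be sketched in a few lines; this illustrates the idea behind &lt;code>ConversationBufferWindowMemory&lt;/code>, not the library's actual implementation.&lt;/p>

```python
from collections import deque

# Sliding-window memory sketch: keep only the most recent k turns,
# where one turn is a (user, ai) message pair.

class WindowMemory:
    def __init__(self, k=3):
        self.buffer = deque(maxlen=2 * k)  # k pairs = 2k messages

    def save(self, user_msg, ai_msg):
        self.buffer.append(("user", user_msg))
        self.buffer.append(("ai", ai_msg))

    def load(self):
        return "\n".join(f"{role}: {text}" for role, text in self.buffer)
```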
&lt;h3 id="42-longterm-memory">4.2 Long-term Memory&lt;/h3>
&lt;p>Responsible for storing cross-dialogue, persistent knowledge and information.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Typical Implementation&lt;/strong>: Retrieval-Augmented Generation (RAG) based on &lt;strong>vector databases&lt;/strong>.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>:
&lt;ol>
&lt;li>Chunk external documents (such as product manuals, knowledge base articles) or key information from past conversations.&lt;/li>
&lt;li>Use an Embedding model to convert text blocks into vectors.&lt;/li>
&lt;li>Store vectors in a vector database (such as Chroma, Pinecone, FAISS).&lt;/li>
&lt;li>When a user asks a question, convert their question into a vector as well.&lt;/li>
&lt;li>Perform similarity search in the vector database to find the most relevant text blocks.&lt;/li>
&lt;li>Inject these text blocks as context along with the user's question into the LLM's prompt, guiding it to generate more precise answers.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
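&lt;p>A toy end-to-end version of steps 1-6, using a bag-of-words &amp;ldquo;embedding&amp;rdquo; and in-memory cosine search so it runs without a real embedding model or vector database (both of which a production system would use):&lt;/p>

```python
import math
import re
from collections import Counter

# Toy illustration of steps 1-6 above. The "embedding" is a bag-of-words
# Counter; swap in a real embedding model and vector DB in practice.

def embed(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    def __init__(self, chunks):
        # Steps 1-3: chunk, embed, and index the knowledge base.
        self.index = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, question, k=1):
        q = embed(question)  # step 4: embed the question
        ranked = sorted(self.index, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]  # step 5: top-k similar chunks

store = TinyVectorStore([
    "Refunds are processed within 7 business days.",
    "The warranty covers manufacturing defects for two years.",
])
# Step 6: the retrieved chunk is injected into the LLM prompt as context.
context = store.retrieve("When are refunds processed?")[0]
```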
&lt;h3 id="43-structured-memory">4.3 Structured Memory&lt;/h3>
&lt;p>Stores and retrieves information in a structured way, especially key entities and their relationships from conversations.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Typical Implementation&lt;/strong>: Entity-relationship storage based on knowledge graphs, such as the &lt;code>Graphiti&lt;/code> project using Neo4j.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>:
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Knowledge Graph Advantages&lt;/strong>: Unlike simple key-value storage, knowledge graphs can capture complex relationship networks between entities. For example, not just recording a person named &amp;ldquo;John,&amp;rdquo; but also recording &amp;ldquo;John is Mary's manager,&amp;rdquo; &amp;ldquo;John is responsible for Project A,&amp;rdquo; and other relationship information.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Graphiti Project Analysis&lt;/strong>: &lt;a href="https://github.com/getzep/graphiti">Graphiti&lt;/a> is a knowledge graph memory system designed specifically for LLM Agents, seamlessly integrating Neo4j's graph database capabilities with LLM's natural language processing abilities.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Workflow&lt;/strong>:
&lt;ol>
&lt;li>&lt;strong>Entity and Relationship Extraction&lt;/strong>: LLM analyzes conversation content, identifying key entities and their relationships&lt;/li>
&lt;li>&lt;strong>Graph Construction&lt;/strong>: Transforms identified entities and relationships into Cypher query statements, dynamically updating the Neo4j graph database&lt;/li>
&lt;li>&lt;strong>Context Enhancement&lt;/strong>: In subsequent conversations, retrieves relevant entity networks through graph queries, injecting them as context into the LLM's prompt&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>&lt;strong>Technical Highlights&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Automatic Schema Inference&lt;/strong>: No need to predefine entity types and relationships; the system can automatically infer appropriate graph structures from conversations&lt;/li>
&lt;li>&lt;strong>Incremental Updates&lt;/strong>: As conversations progress, the graph is continuously enriched and corrected, forming an increasingly complete knowledge network&lt;/li>
&lt;li>&lt;strong>Relationship Reasoning&lt;/strong>: Supports multi-hop queries, able to discover indirectly associated information (e.g., &amp;ldquo;Who are the colleagues of John's manager?&amp;rdquo;)&lt;/li>
&lt;li>&lt;strong>Temporal Awareness&lt;/strong>: Graphiti/Zep's core feature is its Temporal Knowledge Graph architecture, where each node and relationship carries timestamp attributes, enabling the system to:
&lt;ul>
&lt;li>Track how entity states change over time (e.g., &amp;ldquo;John was a developer last year, promoted to project manager this year&amp;rdquo;)&lt;/li>
&lt;li>Perform temporal reasoning (e.g., &amp;ldquo;What was B's status before event A occurred?&amp;rdquo;)&lt;/li>
&lt;li>Resolve time-related queries (e.g., &amp;ldquo;How is the project mentioned last month progressing now?&amp;rdquo;)&lt;/li>
&lt;li>Automatically identify and handle outdated information, ensuring answers are based on the latest factual state&lt;/li>
&lt;li>Build event timelines, helping the Agent understand causal relationships and event sequences&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Practical Application Example&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-python">from graphiti import GraphMemory
# Initialize graph memory
graph_memory = GraphMemory(
    neo4j_uri=&amp;quot;neo4j://localhost:7687&amp;quot;,
    neo4j_user=&amp;quot;neo4j&amp;quot;,
    neo4j_password=&amp;quot;password&amp;quot;
)
# Update graph in conversation
user_message = &amp;quot;My project manager John said we're starting a new project next week&amp;quot;
graph_memory.update_from_text(user_message)
# Retrieve relevant information in subsequent conversations
query = &amp;quot;Who is the project manager?&amp;quot;
context = graph_memory.retrieve_relevant_context(query)
# Returns: &amp;quot;John is the project manager, responsible for a new project starting next week.&amp;quot;
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Comparison with Traditional Entity Memory&lt;/strong>: Traditional methods can only store flat entity-attribute pairs, while knowledge graph methods can express and query complex multi-level relationship networks, providing Agents with richer, more insightful contextual information.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Essentially a Form of Long-term Memory&lt;/strong>: Although we discuss structured memory as a separate category, knowledge graph systems like Graphiti/Zep are essentially an advanced form of long-term memory. They not only persistently store information across conversations but also organize this information in a more structured, queryable, and reasoning-friendly way. Compared to semantic similarity retrieval in vector databases, knowledge graphs provide more precise relationship navigation and reasoning capabilities.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="graphitizep-temporal-knowledge-graph-architecture-and-workflow">Graphiti/Zep Temporal Knowledge Graph Architecture and Workflow&lt;/h4>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;User Conversation History&amp;quot;
A1[&amp;quot;Conversation 1: 'I'm John, a software engineer'&amp;quot;] --&amp;gt; A2[&amp;quot;Conversation 2: 'I'm responsible for Project A'&amp;quot;]
A2 --&amp;gt; A3[&amp;quot;Conversation 3: 'I was a developer last year, promoted to project manager this year'&amp;quot;]
A3 --&amp;gt; A4[&amp;quot;Conversation 4: 'Mary is a member of my team'&amp;quot;]
end
subgraph &amp;quot;Entity and Relationship Extraction&amp;quot;
B[&amp;quot;LLM Analyzer&amp;quot;] --&amp;gt; C[&amp;quot;Entity Recognition: John, Project A, Mary&amp;quot;]
B --&amp;gt; D[&amp;quot;Relationship Extraction: John-responsible for-Project A, John-manages-Mary&amp;quot;]
B --&amp;gt; E[&amp;quot;Temporal Attributes: John.role(2024)=project manager, John.role(2023)=developer&amp;quot;]
end
subgraph &amp;quot;Temporal Knowledge Graph&amp;quot;
F[&amp;quot;John (Person)&amp;quot;] -- &amp;quot;role(2023)&amp;quot; --&amp;gt; G[&amp;quot;Developer&amp;quot;]
F -- &amp;quot;role(2024)&amp;quot; --&amp;gt; H[&amp;quot;Project Manager&amp;quot;]
F -- &amp;quot;responsible for(2024)&amp;quot; --&amp;gt; I[&amp;quot;Project A&amp;quot;]
F -- &amp;quot;manages(2024)&amp;quot; --&amp;gt; J[&amp;quot;Mary (Person)&amp;quot;]
end
subgraph &amp;quot;Query and Reasoning&amp;quot;
K[&amp;quot;User Question: 'What was John's position last year?'&amp;quot;]
L[&amp;quot;Graph Query: MATCH (p:Person {name:'John'})-[r:role {year:2023}]-&amp;gt;(role) RETURN role&amp;quot;]
M[&amp;quot;Result: 'Developer'&amp;quot;]
N[&amp;quot;Temporal Reasoning: 'John's career progression is from developer to project manager'&amp;quot;]
end
A4 --&amp;gt; B
E --&amp;gt; F
K --&amp;gt; L
L --&amp;gt; M
M --&amp;gt; N
style F fill:#f9f,stroke:#333,stroke-width:2px
style I fill:#bbf,stroke:#333,stroke-width:2px
style J fill:#f9f,stroke:#333,stroke-width:2px
style G fill:#bfb,stroke:#333,stroke-width:2px
style H fill:#bfb,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;p>This diagram shows how Graphiti/Zep transforms conversation history into a knowledge graph with a temporal dimension, supporting time-based queries and reasoning. Timestamps enable the system to track the evolution of entity attributes and relationships, answering &amp;ldquo;when&amp;rdquo; and &amp;ldquo;how changed&amp;rdquo; types of questions, capabilities that traditional knowledge graphs and vector stores struggle to achieve.&lt;/p>
&lt;h3 id="44-summary-memory">4.4 Summary Memory&lt;/h3>
&lt;p>As mentioned earlier, saves space by creating rolling summaries of dialogue history.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Typical Implementation&lt;/strong>: &lt;code>ConversationSummaryMemory&lt;/code> or &lt;code>ConversationSummaryBufferMemory&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>:
&lt;ul>
&lt;li>&lt;code>ConversationSummaryMemory&lt;/code>: Summarizes the entire dialogue history each time, which is costly.&lt;/li>
&lt;li>&lt;code>ConversationSummaryBufferMemory&lt;/code>: A hybrid strategy. It keeps the most recent &lt;code>k&lt;/code> turns of complete dialogue while maintaining a rolling summary of earlier conversations. This achieves a good balance between cost and information fidelity.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="45-user-profile-memory">4.5 User Profile Memory&lt;/h3>
&lt;p>This is a more proactive, advanced form of structured memory, aimed at going beyond single conversations to establish a persistent, dynamically updated &amp;ldquo;profile&amp;rdquo; for users. The Agent not only remembers conversation content but also &amp;ldquo;who you are.&amp;rdquo;&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Macro Concept&lt;/strong>: Structurally store user preferences, habits, historical choices, and even demographic information (with user authorization). In each interaction, inject this &amp;ldquo;user profile&amp;rdquo; as key context directly into the prompt, allowing the LLM to &amp;ldquo;understand&amp;rdquo; its conversation partner from the start.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Underlying Implementation&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Data Structure&lt;/strong>: Typically maintains user metadata in the form of key-value pairs (such as JSON objects). For example: &lt;code>{&amp;quot;user_id&amp;quot;: &amp;quot;123&amp;quot;, &amp;quot;preferred_language&amp;quot;: &amp;quot;English&amp;quot;, &amp;quot;dietary_restrictions&amp;quot;: [&amp;quot;vegetarian&amp;quot;], &amp;quot;home_city&amp;quot;: &amp;quot;Shanghai&amp;quot;}&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Prompt Injection&lt;/strong>: When building the final prompt, include the serialized user profile string (such as &lt;code>[UserProfile]...[/UserProfile]&lt;/code>) as a fixed part of the context.&lt;/li>
&lt;li>&lt;strong>Dynamic Maintenance&lt;/strong>: This is the core of the mechanism. After a conversation ends, the Agent or a background process analyzes the interaction to determine if the user profile needs updating. For example, when a user says &amp;ldquo;I recently moved to Beijing,&amp;rdquo; the system needs a mechanism to update the &lt;code>home_city&lt;/code> field. This update process itself may require a separate LLM call for information extraction and decision-making.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Advantages&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High Personalization&lt;/strong>: The Agent can provide forward-looking, highly customized services.&lt;/li>
&lt;li>&lt;strong>Conversation Efficiency&lt;/strong>: Avoids repeatedly asking users for basic preferences, making interactions smoother.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Challenges&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Update Mechanism Complexity&lt;/strong>: How to accurately and safely update user profiles is a technical challenge.&lt;/li>
&lt;li>&lt;strong>Token Consumption&lt;/strong>: User profiles occupy valuable context window space.&lt;/li>
&lt;li>&lt;strong>Data Privacy&lt;/strong>: Must strictly adhere to user privacy policies.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
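&lt;p>A minimal sketch of the injection and update mechanics described above; the profile fields and the keyword-based extractor are placeholders for a real LLM-driven update step.&lt;/p>

```python
import json

# User-profile memory sketch: serialize the profile into the prompt (step 2)
# and update it from conversation (step 3). The extractor is a stub rule
# standing in for an LLM extraction call.

def build_prompt(user_msg, profile):
    # The profile rides along as a fixed context block in every turn.
    return f"[UserProfile]{json.dumps(profile)}[/UserProfile]\nUser: {user_msg}"

def extract_profile_updates(user_msg):
    # Placeholder rule: a real system would ask an LLM what changed.
    if "moved to" in user_msg:
        return {"home_city": user_msg.rsplit("moved to", 1)[1].strip(" .")}
    return {}

def update_profile(profile, user_msg):
    profile.update(extract_profile_updates(user_msg))
    return profile
```

&lt;p>Separating extraction from application also makes it easier to add safety checks (e.g., confirming a change with the user) before the profile is mutated.&lt;/p>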
&lt;h2 id="5-summary-and-outlook">5. Summary and Outlook&lt;/h2>
&lt;p>Building an LLM Agent capable of smooth, intelligent multi-turn dialogue is a complex system engineering task. It requires us to:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Face Physical Limitations&lt;/strong>: Overcome context window bottlenecks through clever &lt;strong>memory management mechanisms&lt;/strong> (such as summaries, RAG).&lt;/li>
&lt;li>&lt;strong>Choose Appropriate Architecture&lt;/strong>: Balance &lt;strong>flexibility (ReAct)&lt;/strong> and &lt;strong>structure (FSM)&lt;/strong> based on task complexity, or even combine both.&lt;/li>
&lt;li>&lt;strong>Design Robust Processes&lt;/strong>: Build in &lt;strong>state tracking&lt;/strong>, &lt;strong>intent recognition&lt;/strong>, and &lt;strong>error correction&lt;/strong> capabilities to keep the Agent stable and reliable in complex interactions.&lt;/li>
&lt;/ol>
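&lt;p>As a minimal illustration of the memory management mentioned in point 1, the sketch below keeps the most recent turns verbatim and collapses older turns into a summary. The &lt;code>summarize&lt;/code> function is a hypothetical stand-in for what would normally be a separate LLM call:&lt;/p>

```python
# Minimal sketch of summary-based conversation memory.
# In a real agent, `summarize` would be an LLM call; here it is a
# placeholder that just concatenates a prefix of each turn.

def summarize(turns):
    # Hypothetical stand-in for an LLM summarization call.
    return 'Summary: ' + '; '.join(t['content'][:20] for t in turns)

def build_context(history, keep_recent=4):
    """Keep the last `keep_recent` turns verbatim; summarize the rest."""
    if len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [{'role': 'system', 'content': summarize(older)}] + recent

history = [{'role': 'user', 'content': f'message {i}'} for i in range(10)]
context = build_context(history)
print(len(context))  # 5: one summary turn plus four recent turns
```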
&lt;p>Future development will focus more on the Agent's autonomous learning and evolution capabilities. Agents will not only execute tasks but also learn new skills from interactions with users, optimize their tool calling strategies, and dynamically adjust their conversation style, ultimately becoming truly personalized intelligent partners.&lt;/p></description></item><item><title>Retrieval-Augmented Generation (RAG): A Comprehensive Technical Analysis</title><link>https://ziyanglin.netlify.app/en/post/rag-technical-documentation/</link><pubDate>Mon, 30 Jun 2025 10:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/rag-technical-documentation/</guid><description>&lt;h2 id="1-macro-overview-why-rag">1. Macro Overview: Why RAG?&lt;/h2>
&lt;h3 id="11-what-is-rag">1.1 What is RAG?&lt;/h3>
&lt;p>RAG, or Retrieval-Augmented Generation, is a technical framework that combines information retrieval from external knowledge bases with the powerful generative capabilities of large language models (LLMs). In simple terms, when a user asks a question, a RAG system first retrieves the most relevant information snippets from a vast, updatable knowledge base (such as company internal documents, product manuals, or the latest web information), and then &amp;ldquo;feeds&amp;rdquo; this information along with the original question to the language model, enabling it to generate answers based on precise, up-to-date context.&lt;/p>
&lt;p>To use an analogy: Imagine a student taking an open-book exam. This student (the LLM) has already learned a lot of knowledge (pre-training data), but when answering very specific questions or those involving the latest information, they can refer to reference books (external knowledge base). RAG is this &amp;ldquo;open-book&amp;rdquo; process, allowing the LLM to consult the most recent and authoritative materials when answering questions, thus providing more accurate and comprehensive answers.&lt;/p>
&lt;h3 id="12-rags-core-value-solving-llms-inherent-limitations">1.2 RAG's Core Value: Solving LLM's Inherent Limitations&lt;/h3>
&lt;p>Despite their power, large language models have several inherent limitations that RAG technology specifically addresses.&lt;/p>
&lt;p>&lt;strong>Limitation 1: Knowledge Cut-off&lt;/strong>&lt;/p>
&lt;p>An LLM's knowledge is frozen at the time of its last training. For example, a model whose training was completed in early 2023 cannot answer questions about events that occurred after that point. RAG solves this problem by introducing an external knowledge base that can be updated at any time. Companies can update their knowledge bases with the latest product information, financial reports, market dynamics, etc., and the RAG system can immediately leverage this new knowledge to answer questions.&lt;/p>
&lt;p>&lt;strong>Limitation 2: Hallucination&lt;/strong>&lt;/p>
&lt;p>When LLMs encounter questions outside their knowledge domain or with uncertain answers, they sometimes &amp;ldquo;confidently make things up,&amp;rdquo; fabricating facts and producing what are known as &amp;ldquo;hallucinations.&amp;rdquo; RAG greatly constrains model output by providing clear, fact-based reference materials. The model is required to answer based on the retrieved context, which effectively defines the scope of its response, significantly reducing the probability of hallucinations.&lt;/p>
&lt;p>&lt;strong>Limitation 3: Lack of Domain-Specific Knowledge&lt;/strong>&lt;/p>
&lt;p>General-purpose LLMs often perform poorly when handling specialized questions in specific industries or enterprises. For example, they don't understand a company's internal processes or the technical specifications of particular products. Through RAG, enterprises can build a specialized knowledge base containing internal regulations, technical documentation, customer support records, and more. This equips the LLM with domain expert knowledge, enabling it to handle highly specialized Q&amp;amp;A tasks.&lt;/p>
&lt;p>&lt;strong>Limitation 4: Lack of Transparency &amp;amp; Interpretability&lt;/strong>&lt;/p>
&lt;p>The answer generation process of traditional LLMs is a &amp;ldquo;black box&amp;rdquo;: we cannot know what information they based their conclusions on. This is unacceptable in fields requiring high credibility, such as finance, healthcare, and law. The RAG architecture naturally enhances transparency because the system can clearly state: &amp;ldquo;I derived this answer based on these documents (Source 1, Source 2&amp;hellip;).&amp;rdquo; Users can trace and verify the sources of information, greatly enhancing trust in the answers.&lt;/p>
&lt;h3 id="13-rags-macro-workflow">1.3 RAG's Macro Workflow&lt;/h3>
&lt;p>At the highest level, RAG's workflow can be depicted as a simple yet elegant architecture.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;User Query&amp;quot;] --&amp;gt; B{RAG System};
B --&amp;gt; C[&amp;quot;Retrieve&amp;quot;];
C --&amp;gt; D[&amp;quot;External Knowledge Base&amp;quot;];
D --&amp;gt; C;
C --&amp;gt; E[&amp;quot;Augment&amp;quot;];
A --&amp;gt; E;
E --&amp;gt; F[&amp;quot;Generate&amp;quot;];
F --&amp;gt; G[LLM];
G --&amp;gt; F;
F --&amp;gt; H[&amp;quot;Final Answer with Sources&amp;quot;];
&lt;/code>&lt;/pre>
&lt;p>This workflow can be interpreted as:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Retrieve&lt;/strong>: After receiving a user's question, the system first converts it into a format suitable for searching (such as a vector), then quickly matches and retrieves the most relevant information snippets from the knowledge base.&lt;/li>
&lt;li>&lt;strong>Augment&lt;/strong>: The system integrates the retrieved information snippets with the user's original question into a richer &amp;ldquo;prompt.&amp;rdquo;&lt;/li>
&lt;li>&lt;strong>Generate&lt;/strong>: This enhanced prompt is sent to the LLM, guiding it to generate a content-rich and accurate answer based on the provided context, along with sources of information.&lt;/li>
&lt;/ol>
&lt;p>Through this process, RAG successfully transforms the LLM from a &amp;ldquo;closed-world scholar&amp;rdquo; into an &amp;ldquo;open-world, verifiable expert.&amp;rdquo;&lt;/p>
&lt;h2 id="2-rag-core-architecture-dual-process-analysis">2. RAG Core Architecture: Dual Process Analysis&lt;/h2>
&lt;p>The lifecycle of a RAG system can be clearly divided into two core processes:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Offline Process: Indexing&lt;/strong>: This is a preprocessing stage responsible for transforming raw data sources into a knowledge base ready for quick retrieval. This process typically runs in the background and is triggered whenever the knowledge base content needs updating.&lt;/li>
&lt;li>&lt;strong>Online Process: Retrieval &amp;amp; Generation&lt;/strong>: This is the real-time process of user interaction with the system, responsible for retrieving information from the index based on user input and generating answers.&lt;/li>
&lt;/ol>
&lt;p>Below, we'll analyze these two processes through detailed diagrams and explanations.&lt;/p>
&lt;h3 id="21-offline-process-indexing">2.1 Offline Process: Indexing&lt;/h3>
&lt;p>The goal of this process is to transform unstructured or semi-structured raw data into structured, easily queryable indices.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;Offline Indexing Pipeline&amp;quot;
A[&amp;quot;Data Sources&amp;quot;] --&amp;gt; B[&amp;quot;Load&amp;quot;];
B --&amp;gt; C[&amp;quot;Split/Chunk&amp;quot;];
C --&amp;gt; D[&amp;quot;Embed&amp;quot;];
D --&amp;gt; E[&amp;quot;Store/Index&amp;quot;];
end
A --&amp;gt; A_Details(&amp;quot;e.g.: PDFs, .txt, .md, Notion, Confluence, databases&amp;quot;);
B --&amp;gt; B_Details(&amp;quot;Using data loaders, e.g., LlamaIndex Readers&amp;quot;);
C --&amp;gt; C_Details(&amp;quot;Strategies: fixed size, recursive splitting, semantic chunking&amp;quot;);
D --&amp;gt; D_Details(&amp;quot;Using Embedding models, e.g., BERT, Sentence-BERT, e5-large-v2&amp;quot;);
E --&amp;gt; E_Details(&amp;quot;Store in vector databases, e.g., Chroma, Pinecone, FAISS&amp;quot;);
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Process Details:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Load&lt;/strong>: The system first needs to load original documents from various specified data sources. These sources can be diverse, such as PDF files, Markdown documents, web pages, Notion pages, database records, etc. Modern RAG frameworks (like LlamaIndex, LangChain) provide rich data loader ecosystems to simplify this process.&lt;/li>
&lt;li>&lt;strong>Split/Chunk&lt;/strong>: Due to the limited context window of language models, directly embedding a long document (like a PDF with hundreds of pages) as a single vector performs poorly and loses many details. Therefore, it's essential to split long texts into smaller, semantically complete chunks. The chunking strategy is crucial and directly affects retrieval precision.&lt;/li>
&lt;li>&lt;strong>Embed&lt;/strong>: This is the core step of transforming textual information into machine-understandable mathematical representations. The system uses a pre-trained embedding model to map each text chunk to a high-dimensional vector. This vector captures the semantic information of the text, with semantically similar text chunks being closer to each other in the vector space.&lt;/li>
&lt;li>&lt;strong>Store/Index&lt;/strong>: Finally, the system stores the vector representations of all text chunks along with their metadata (such as source document, chapter, page number, etc.) in a specialized database, typically a vector database. Vector databases are specially optimized to support efficient similarity searches across massive-scale vector data.&lt;/li>
&lt;/ol>
&lt;h3 id="22-online-process-retrieval--generation">2.2 Online Process: Retrieval &amp;amp; Generation&lt;/h3>
&lt;p>This process is triggered when a user submits a query, with the goal of generating precise, evidence-based answers in real-time.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;User Query&amp;quot;] --&amp;gt; B[&amp;quot;Embed Query&amp;quot;];
B --&amp;gt; C[&amp;quot;Vector Search&amp;quot;];
C &amp;lt;--&amp;gt; D[&amp;quot;Vector Database&amp;quot;];
C --&amp;gt; E[&amp;quot;Get Top-K Chunks&amp;quot;];
E --&amp;gt; F[&amp;quot;(Optional) Re-ranking&amp;quot;];
A &amp;amp; F --&amp;gt; G[&amp;quot;Build Prompt&amp;quot;];
G --&amp;gt; H[&amp;quot;LLM Generation&amp;quot;];
H --&amp;gt; I[&amp;quot;Final Answer&amp;quot;];
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Process Details:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Embed Query&lt;/strong>: When a user inputs a question, the system uses the &lt;strong>same embedding model&lt;/strong> as in the indexing phase to convert this question into a query vector.&lt;/li>
&lt;li>&lt;strong>Vector Search&lt;/strong>: The system takes this query vector and performs a similarity search in the vector database. The most common algorithm is &amp;ldquo;K-Nearest Neighbors&amp;rdquo; (KNN), aiming to find the K text chunk vectors closest to the query vector in the vector space.&lt;/li>
&lt;li>&lt;strong>Get Top-K Chunks&lt;/strong>: Based on the search results, the system retrieves the original content of these K most relevant text chunks from the database. These K text chunks form the core context for answering the question.&lt;/li>
&lt;li>&lt;strong>Re-ranking (Optional)&lt;/strong>: In some advanced RAG systems, there's an additional re-ranking step. This is because high vector similarity doesn't always equate to high relevance to the question. A re-ranker is a lighter-weight model that re-examines the relevance of these Top-K text chunks to the original question and reorders them, selecting the highest quality ones as the final context.&lt;/li>
&lt;li>&lt;strong>Build Prompt&lt;/strong>: The system combines the original question and the filtered context information according to a predefined template into a complete prompt. This prompt typically includes instructions like: &amp;ldquo;Please answer this question based on the following context information. Question: [&amp;hellip;] Context: [&amp;hellip;]&amp;rdquo;.&lt;/li>
&lt;li>&lt;strong>LLM Generation&lt;/strong>: Finally, this enhanced prompt is sent to the large language model (LLM). The LLM, following the instructions, comprehensively utilizes its internal knowledge and the provided context to generate a fluent, accurate, and information-rich answer. The system can also cite the sources of the context, enhancing the credibility of the answer.&lt;/li>
&lt;/ol>
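&lt;p>Step 5 above can be sketched as simple template assembly. The template wording below is illustrative, not a fixed standard; production systems tune this prompt carefully:&lt;/p>

```python
# Sketch of the "Build Prompt" step: merge retrieved chunks and the
# user question into one augmented prompt.

def build_prompt(question, chunks):
    context = '\n\n'.join(
        f'[Source {i + 1}] {chunk}' for i, chunk in enumerate(chunks)
    )
    return (
        'Please answer the question based only on the context below, '
        'and cite the sources you used.\n\n'
        f'Context:\n{context}\n\n'
        f'Question: {question}'
    )

chunks = ['RAG combines retrieval with generation.',
          'Re-ranking filters the retrieved chunks.']
prompt = build_prompt('What is RAG?', chunks)
print(prompt.count('[Source'))  # 2
```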
&lt;h2 id="3-indexing-deep-dive">3. Indexing Deep Dive&lt;/h2>
&lt;p>Indexing is the cornerstone of RAG systems. The quality of this process directly determines the effectiveness of subsequent retrieval and generation phases. A well-designed indexing process ensures that information in the knowledge base is accurately and completely transformed into retrievable units. Let's explore each component in depth.&lt;/p>
&lt;h3 id="31-data-loading">3.1 Data Loading&lt;/h3>
&lt;p>The first step is to load raw data from various sources into the processing pipeline.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Loaders&lt;/strong>: Modern RAG frameworks provide powerful loader ecosystems. For example, LangChain's &lt;code>Document Loaders&lt;/code> support loading data from over 100 different sources, including:
&lt;ul>
&lt;li>&lt;strong>Files&lt;/strong>: &lt;code>TextLoader&lt;/code> (plain text), &lt;code>PyPDFLoader&lt;/code> (PDF), &lt;code>JSONLoader&lt;/code>, &lt;code>CSVLoader&lt;/code>, &lt;code>UnstructuredFileLoader&lt;/code> (capable of processing Word, PowerPoint, HTML, XML, and other formats).&lt;/li>
&lt;li>&lt;strong>Web Content&lt;/strong>: &lt;code>WebBaseLoader&lt;/code> (web scraping), &lt;code>YoutubeLoader&lt;/code> (loading YouTube video captions).&lt;/li>
&lt;li>&lt;strong>Collaboration Platforms&lt;/strong>: &lt;code>NotionDirectoryLoader&lt;/code>, &lt;code>ConfluenceLoader&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Databases&lt;/strong>: &lt;code>AzureCosmosDBLoader&lt;/code>, &lt;code>PostgresLoader&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Choosing the right loader allows enterprises to easily integrate their existing knowledge assets into RAG systems without complex data format conversions.&lt;/p>
&lt;h3 id="32-text-splitting--chunking">3.2 Text Splitting / Chunking&lt;/h3>
&lt;p>&lt;strong>Why is chunking necessary?&lt;/strong>
Directly vectorizing an entire document (like a PDF with hundreds of pages) is impractical for three reasons:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Context Length Limitations&lt;/strong>: Most embedding models and LLMs have token input limits.&lt;/li>
&lt;li>&lt;strong>Noise Issues&lt;/strong>: A single vector representing a lengthy document contains too many topics and details, diluting the semantic information and making it difficult to precisely match specific user questions during retrieval.&lt;/li>
&lt;li>&lt;strong>Retrieval Cost&lt;/strong>: Feeding an entire document as context to an LLM consumes substantial computational resources and costs.&lt;/li>
&lt;/ol>
&lt;p>Therefore, splitting documents into semantically related chunks is a crucial step. &lt;strong>The quality of chunks determines the ceiling of RAG performance.&lt;/strong>&lt;/p>
&lt;h4 id="321-core-parameters-chunksize-and-chunkoverlap">3.2.1 Core Parameters: &lt;code>chunk_size&lt;/code> and &lt;code>chunk_overlap&lt;/code>&lt;/h4>
&lt;ul>
&lt;li>&lt;code>chunk_size&lt;/code>: Defines the size of each text block, typically calculated in character count or token count. Choosing this value requires balancing &amp;ldquo;information density&amp;rdquo; and &amp;ldquo;context completeness.&amp;rdquo; Too small may fragment complete semantics; too large may introduce excessive noise.&lt;/li>
&lt;li>&lt;code>chunk_overlap&lt;/code>: Defines the number of characters (or tokens) that overlap between adjacent text blocks. Setting overlap can effectively prevent cutting off a complete sentence or paragraph at block boundaries, ensuring semantic continuity.&lt;/li>
&lt;/ul>
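&lt;p>A toy fixed-size splitter makes the interaction of the two parameters concrete (real splitters such as &lt;code>RecursiveCharacterTextSplitter&lt;/code> also respect separators; this only shows the windowing):&lt;/p>

```python
# Toy fixed-size splitter illustrating chunk_size and chunk_overlap.
# Each chunk starts chunk_size - chunk_overlap characters after the
# previous one, so adjacent chunks share chunk_overlap characters.

def split_fixed(text, chunk_size, chunk_overlap):
    step = chunk_size - chunk_overlap
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):
            break
    return chunks

chunks = split_fixed('abcdefghijklmnop', chunk_size=8, chunk_overlap=3)
print(chunks)  # ['abcdefgh', 'fghijklm', 'klmnop'] -- note the shared 'fgh' and 'klm'
```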
&lt;h4 id="322-mainstream-chunking-strategies">3.2.2 Mainstream Chunking Strategies&lt;/h4>
&lt;p>The choice of chunking strategy depends on the structure and content of the document.&lt;/p>
&lt;p>&lt;strong>Strategy 1: Character Splitting&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Representative&lt;/strong>: &lt;code>CharacterTextSplitter&lt;/code>&lt;/li>
&lt;li>&lt;strong>Principle&lt;/strong>: This is the simplest direct method. It splits text based on a fixed character (like &lt;code>\n\n&lt;/code> newline) and then forcibly chunks according to the preset &lt;code>chunk_size&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Advantages&lt;/strong>: Simple, fast, low computational cost.&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>: Completely ignores the semantics and logical structure of the text, easily breaking sentences in the middle or abruptly cutting off complete concept descriptions.&lt;/li>
&lt;li>&lt;strong>Applicable Scenarios&lt;/strong>: Suitable for texts with no obvious structure or where semantic coherence is not a high requirement.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-python"># Example: CharacterTextSplitter
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
separator=&amp;quot;\n\n&amp;quot;,
chunk_size=1000,
chunk_overlap=200,
length_function=len,
)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Strategy 2: Recursive Character Splitting&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Representative&lt;/strong>: &lt;code>RecursiveCharacterTextSplitter&lt;/code>&lt;/li>
&lt;li>&lt;strong>Principle&lt;/strong>: This is currently the most commonly used and recommended strategy. It attempts to split recursively according to a set of preset separators (like &lt;code>[&amp;quot;\n\n&amp;quot;, &amp;quot;\n&amp;quot;, &amp;quot; &amp;quot;, &amp;quot;&amp;quot;]&lt;/code>). It first tries to split using the first separator (&lt;code>\n\n&lt;/code>, paragraph); if the resulting blocks are still larger than &lt;code>chunk_size&lt;/code>, it continues using the next separator (&lt;code>\n&lt;/code>, line) to split these large blocks, and so on until the block size meets requirements.&lt;/li>
&lt;li>&lt;strong>Advantages&lt;/strong>: Makes the greatest effort to maintain the integrity of paragraphs, sentences, and other semantic units, striking a good balance between universality and effectiveness.&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>: Still based on character rules rather than true semantic understanding.&lt;/li>
&lt;li>&lt;strong>Applicable Scenarios&lt;/strong>: The preferred strategy for the vast majority of scenarios.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-python"># Example: RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Strategy 3: Token-Based Splitting&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Representative&lt;/strong>: &lt;code>TokenTextSplitter&lt;/code>, &lt;code>CharacterTextSplitter.from_tiktoken_encoder&lt;/code>&lt;/li>
&lt;li>&lt;strong>Principle&lt;/strong>: It calculates &lt;code>chunk_size&lt;/code> by token count rather than character count. This is more consistent with how language models process text and allows for more precise control over the length of content input to the model.&lt;/li>
&lt;li>&lt;strong>Advantages&lt;/strong>: More precise control over cost and input length for model API calls.&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>: Computation is slightly more complex than character splitting.&lt;/li>
&lt;li>&lt;strong>Applicable Scenarios&lt;/strong>: When strict control over costs and API call input lengths is needed.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Strategy 4: Semantic Chunking&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Principle&lt;/strong>: This is a more advanced experimental method. Instead of being based on fixed rules, it's based on understanding the semantics of the text. The splitter calculates embedding similarity between sentences and splits when it detects that the semantic difference between adjacent sentences exceeds a threshold.&lt;/li>
&lt;li>&lt;strong>Advantages&lt;/strong>: Can generate highly semantically consistent text blocks, theoretically the best splitting method.&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>: Very high computational cost, as it requires multiple embedding calculations during the splitting phase.&lt;/li>
&lt;li>&lt;strong>Applicable Scenarios&lt;/strong>: Scenarios requiring extremely high retrieval quality, regardless of computational cost.&lt;/li>
&lt;/ul>
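&lt;p>The idea behind semantic chunking can be sketched as follows. A simple bag-of-words vector stands in for the embedding model here; a real implementation would call an actual embedding model for each sentence:&lt;/p>

```python
import math

# Sketch of semantic chunking: start a new chunk wherever adjacent
# sentences are semantically dissimilar (cosine below a threshold).

def embed(sentence):
    # Toy bag-of-words "embedding"; stand-in for a real model.
    vec = {}
    for word in sentence.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.2):
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append(' '.join(current))
            current = []
        current.append(cur)
    chunks.append(' '.join(current))
    return chunks

sentences = ['the cat sat on the mat',
             'the cat likes the mat',
             'quarterly revenue grew strongly']
result = semantic_chunks(sentences)
print(len(result))  # 2: the topic shift triggers a split
```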
&lt;h3 id="33-embedding">3.3 Embedding&lt;/h3>
&lt;p>Embedding is the process of transforming text chunks into high-dimensional numerical vectors, which serve as mathematical representations of the text's semantics.&lt;/p>
&lt;h4 id="331-embedding-model-selection">3.3.1 Embedding Model Selection&lt;/h4>
&lt;p>The choice of embedding model directly affects retrieval quality and system cost.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Closed-Source Commercial Models (e.g., OpenAI)&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Representatives&lt;/strong>: &lt;code>text-embedding-ada-002&lt;/code>, &lt;code>text-embedding-3-small&lt;/code>, &lt;code>text-embedding-3-large&lt;/code>&lt;/li>
&lt;li>&lt;strong>Advantages&lt;/strong>: Powerful performance, typically ranking high in various evaluation benchmarks, simple to use (API calls).&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>: Requires payment, data must be sent to third-party servers, privacy risks exist.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-python"># Example: Using OpenAI Embeddings
from langchain_openai import OpenAIEmbeddings
embeddings_model = OpenAIEmbeddings(model=&amp;quot;text-embedding-3-small&amp;quot;)
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;strong>Open-Source Models (e.g., Hugging Face)&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Representatives&lt;/strong>: &lt;code>sentence-transformers/all-mpnet-base-v2&lt;/code> (English general), &lt;code>bge-large-zh-v1.5&lt;/code> (Chinese), &lt;code>m3e-large&lt;/code> (Chinese-English) etc.&lt;/li>
&lt;li>&lt;strong>Advantages&lt;/strong>: Free, can be deployed locally, no data privacy leakage risk, numerous fine-tuned models available for specific languages or domains.&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>: Requires managing model deployment and computational resources yourself; performance may lag somewhat behind top commercial models.&lt;/li>
&lt;li>&lt;strong>MTEB Leaderboard&lt;/strong>: The Massive Text Embedding Benchmark (MTEB) is a public leaderboard for evaluating and comparing the performance of different embedding models, an important reference for selecting open-source models.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-python"># Example: Using open-source models from Hugging Face
from langchain_huggingface import HuggingFaceEmbeddings
model_name = &amp;quot;sentence-transformers/all-mpnet-base-v2&amp;quot;
embeddings_model = HuggingFaceEmbeddings(model_name=model_name)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Core Principle&lt;/strong>: Throughout the entire RAG process, &lt;strong>the same embedding model must be used in both the indexing phase and the online retrieval phase&lt;/strong>. Otherwise, the query vectors and document vectors will exist in different vector spaces, making meaningful similarity comparisons impossible.&lt;/p>
&lt;h2 id="4-retrieval-technology-deep-dive">4. Retrieval Technology Deep Dive&lt;/h2>
&lt;p>Retrieval is the &amp;ldquo;heart&amp;rdquo; of RAG systems. Finding the most relevant contextual information is the prerequisite for generating high-quality answers. If the retrieved content is irrelevant or inaccurate, even the most powerful LLM will be ineffective - this is the so-called &amp;ldquo;Garbage In, Garbage Out&amp;rdquo; principle.&lt;/p>
&lt;p>Retrieval technology has evolved from traditional keyword matching to modern semantic vector search, and has now developed various advanced strategies to address complex challenges in different scenarios.&lt;/p>
&lt;h3 id="41-traditional-foundation-sparse-retrieval">4.1 Traditional Foundation: Sparse Retrieval&lt;/h3>
&lt;p>Sparse retrieval is a classic information retrieval method based on word frequency statistics, independent of deep learning models. Its core idea is that the more times a word appears in a specific document and the fewer times it appears across all documents, the more representative that word is for that document.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Representative Algorithms&lt;/strong>: &lt;strong>TF-IDF&lt;/strong> &amp;amp; &lt;strong>BM25 (Best Match 25)&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Principle Brief (using BM25 as an example)&lt;/strong>:
&lt;ol>
&lt;li>&lt;strong>Term Frequency (TF)&lt;/strong>: Calculate the frequency of each query term in the document.&lt;/li>
&lt;li>&lt;strong>Inverse Document Frequency (IDF)&lt;/strong>: Measure the &amp;ldquo;rarity&amp;rdquo; of a term. Rarer terms have higher weights.&lt;/li>
&lt;li>&lt;strong>Document Length Penalty&lt;/strong>: Penalize overly long documents to prevent them from getting artificially high scores just because they contain more words.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>&lt;strong>Advantages&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Precise Keyword Matching&lt;/strong>: Performs excellently for queries containing specific terms, abbreviations, or product models (like &amp;ldquo;iPhone 15 Pro&amp;rdquo;).&lt;/li>
&lt;li>&lt;strong>Strong Interpretability&lt;/strong>: Score calculation logic is clear, easy to understand and debug.&lt;/li>
&lt;li>&lt;strong>Fast Computation&lt;/strong>: No complex model inference required.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Cannot Understand Semantics&lt;/strong>: Unable to handle synonyms, near-synonyms, or conceptual relevance. For example, searching for &amp;ldquo;Apple phone&amp;rdquo; won't match documents containing &amp;ldquo;iPhone&amp;rdquo;.&lt;/li>
&lt;li>&lt;strong>&amp;ldquo;Vocabulary Gap&amp;rdquo; Problem&lt;/strong>: Relies on literal matching between queries and documents.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Applicable Scenarios&lt;/strong>: As part of hybrid retrieval, handling keyword and proper noun matching.&lt;/li>
&lt;/ul>
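&lt;p>The three ingredients above can be combined into a compact, simplified BM25 scorer. &lt;code>k1&lt;/code> and &lt;code>b&lt;/code> are the standard free parameters; production systems use tuned library implementations rather than this sketch:&lt;/p>

```python
import math

# Simplified BM25 scoring over a toy corpus: term frequency, inverse
# document frequency, and a document-length penalty.

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(d) for d in tokenized) / n
    scores = []
    for doc in tokenized:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in tokenized if term in d)
            if df == 0:
                continue
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            tf = doc.count(term)
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

docs = ['iPhone 15 Pro battery life',
        'Android phone battery tips',
        'gardening tools for spring']
scores = bm25_scores('iPhone battery', docs)
best = scores.index(max(scores))
print(best)  # 0: the document matching both query terms wins
```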
&lt;h3 id="42-modern-core-dense-retrieval--vector-search">4.2 Modern Core: Dense Retrieval / Vector Search&lt;/h3>
&lt;p>Dense retrieval is the mainstream technology in current RAG systems. It uses deep learning models (the embedding models we discussed earlier) to encode the semantic information of text into dense vectors, enabling retrieval based on &amp;ldquo;semantic similarity&amp;rdquo; rather than &amp;ldquo;literal similarity&amp;rdquo;.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: Semantically similar texts have vectors that are close to each other in multidimensional space.&lt;/li>
&lt;li>&lt;strong>Workflow&lt;/strong>:
&lt;ol>
&lt;li>Offline: Vectorize all document chunks and store them in a vector database.&lt;/li>
&lt;li>Online: Vectorize the user query.&lt;/li>
&lt;li>In the vector database, calculate the distance/similarity between the query vector and all document vectors (such as cosine similarity, Euclidean distance).&lt;/li>
&lt;li>Return the Top-K document chunks with the closest distances.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
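&lt;p>The online half of this workflow reduces to a cosine-similarity top-K search. The sketch below uses toy 3-dimensional vectors in place of real embeddings and a plain dictionary in place of a vector database:&lt;/p>

```python
import math

# Dense retrieval in miniature: rank documents by cosine similarity
# between their vectors and the query vector, then take the top K.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

doc_vectors = {
    'doc_refunds':  [0.9, 0.1, 0.0],
    'doc_shipping': [0.1, 0.9, 0.1],
    'doc_returns':  [0.8, 0.2, 0.1],
}

query_vector = [1.0, 0.1, 0.0]  # pretend embedding of the user query

top_k = sorted(doc_vectors,
               key=lambda d: cosine(query_vector, doc_vectors[d]),
               reverse=True)[:2]
print(top_k)  # ['doc_refunds', 'doc_returns']
```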
&lt;h4 id="421-approximate-nearest-neighbor-ann-search">4.2.1 Approximate Nearest Neighbor (ANN) Search&lt;/h4>
&lt;p>Since performing exact &amp;ldquo;nearest neighbor&amp;rdquo; searches among millions or even billions of vectors is extremely computationally expensive, the industry widely adopts &lt;strong>Approximate Nearest Neighbor (ANN)&lt;/strong> algorithms. ANN sacrifices minimal precision in exchange for query speed improvements of several orders of magnitude.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Mainstream ANN Algorithm&lt;/strong>: &lt;strong>HNSW (Hierarchical Navigable Small World)&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>HNSW Principle Brief&lt;/strong>: It constructs a hierarchical graph structure. In the higher-level graph, it performs rough, large-step searches to quickly locate the target area; then in the lower-level graph, it performs fine, small-step searches to finally find the nearest neighbor vectors. This is like finding an address in a city - first determining which district (higher level), then which street (lower level).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Advantages&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Powerful Semantic Understanding&lt;/strong>: Can cross literal barriers to understand concepts and intentions.&lt;/li>
&lt;li>&lt;strong>High Recall Rate&lt;/strong>: Can retrieve more semantically relevant documents with different wording.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Disadvantages&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Keyword Insensitivity&lt;/strong>: Sometimes less effective than sparse retrieval for matching specific keywords or proper nouns.&lt;/li>
&lt;li>&lt;strong>Strong Dependence on Embedding Models&lt;/strong>: Effectiveness completely depends on the quality of the embedding model.&lt;/li>
&lt;li>&lt;strong>&amp;ldquo;Black Box&amp;rdquo; Problem&lt;/strong>: The process of generating and matching vectors is less intuitive than sparse retrieval.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="43-powerful-combination-hybrid-search">4.3 Powerful Combination: Hybrid Search&lt;/h3>
&lt;p>Since sparse retrieval and dense retrieval each have their own strengths and weaknesses, the most natural idea is to combine them to leverage their respective advantages. Hybrid search was born for this purpose.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Implementation Method&lt;/strong>:
&lt;ol>
&lt;li>&lt;strong>Parallel Execution&lt;/strong>: Simultaneously process user queries using sparse retrieval (like BM25) and dense retrieval (vector search).&lt;/li>
&lt;li>&lt;strong>Score Fusion&lt;/strong>: Obtain two sets of results and their corresponding scores.&lt;/li>
&lt;li>&lt;strong>Result Re-ranking&lt;/strong>: Use a fusion algorithm (such as &lt;strong>Reciprocal Rank Fusion, RRF&lt;/strong>) to merge the two sets of results and re-rank them based on the fused scores to get the final Top-K results. The RRF algorithm gives higher weight to documents that rank high in different retrieval methods.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;Hybrid Search&amp;quot;
A[&amp;quot;User Query&amp;quot;] --&amp;gt; B[&amp;quot;BM25 Retriever&amp;quot;];
A --&amp;gt; C[&amp;quot;Vector Retriever&amp;quot;];
B --&amp;gt; D[&amp;quot;Sparse Results (Top-K)&amp;quot;];
C --&amp;gt; E[&amp;quot;Dense Results (Top-K)&amp;quot;];
D &amp;amp; E --&amp;gt; F{&amp;quot;Fusion &amp;amp; Reranking (e.g., RRF)&amp;quot;};
F --&amp;gt; G[&amp;quot;Final Ranked Results&amp;quot;];
end
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;strong>Advantages&lt;/strong>: Balances the precision of keyword matching and the breadth of semantic understanding, achieving better results than single retrieval methods in most scenarios.&lt;/li>
&lt;li>&lt;strong>Applicable Scenarios&lt;/strong>: Almost all RAG applications requiring high-quality retrieval.&lt;/li>
&lt;/ul>
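&lt;p>The RRF fusion step itself fits in a few lines. The constant &lt;code>k=60&lt;/code> is the value commonly used with RRF; it damps the influence of lower-ranked items:&lt;/p>

```python
# Minimal Reciprocal Rank Fusion: each list contributes 1 / (k + rank)
# per document, so documents ranked high in both lists float to the top.

def rrf(result_lists, k=60):
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ['doc_a', 'doc_b', 'doc_c']   # BM25 ranking
dense = ['doc_b', 'doc_d', 'doc_a']    # vector-search ranking
fused = rrf([sparse, dense])
print(fused[0])  # doc_b: ranked well by both retrievers
```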
&lt;h3 id="44-frontier-exploration-advanced-retrieval-strategies">4.4 Frontier Exploration: Advanced Retrieval Strategies&lt;/h3>
&lt;p>To address more complex query intentions and data structures, academia and industry have developed a series of advanced retrieval strategies.&lt;/p>
&lt;h4 id="441-contextual-compression--reranking">4.4.1 Contextual Compression &amp;amp; Re-ranking&lt;/h4>
&lt;p>&lt;strong>Problem&lt;/strong>: The Top-K document chunks returned by vector search may only partially contain content truly relevant to the question, and some high-ranking blocks might actually be &amp;ldquo;false positives.&amp;rdquo; Directly feeding this redundant or irrelevant information to the LLM increases noise and cost.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: Add an intermediate &amp;ldquo;filtering&amp;rdquo; and &amp;ldquo;sorting&amp;rdquo; layer between retrieval and generation.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[&amp;quot;Initial Retrieval&amp;quot;] --&amp;gt; B[&amp;quot;Top-K Documents&amp;quot;];
B --&amp;gt; C{&amp;quot;Compressor / Re-ranker&amp;quot;};
UserQuery --&amp;gt; C;
C --&amp;gt; D[&amp;quot;Filtered &amp;amp; Re-ranked Documents&amp;quot;];
D --&amp;gt; E[&amp;quot;LLM Generation&amp;quot;];
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;strong>Implementation Method&lt;/strong>: Using LangChain's &lt;code>ContextualCompressionRetriever&lt;/code>.
&lt;ul>
&lt;li>&lt;strong>&lt;code>LLMChainExtractor&lt;/code>&lt;/strong>: Uses an LLM to judge whether each document chunk is relevant to the query and only extracts relevant sentences.&lt;/li>
&lt;li>&lt;strong>&lt;code>EmbeddingsFilter&lt;/code>&lt;/strong>: Recalculates the similarity between query vectors and document chunk vectors, filtering out documents below a certain threshold.&lt;/li>
&lt;li>&lt;strong>Re-ranker&lt;/strong>: This is currently the most effective and commonly used approach. It uses a &lt;strong>cross-encoder&lt;/strong> model specifically trained to calculate relevance scores, far lighter-weight than a full LLM. Unlike the bi-encoder used in the retrieval phase (which encodes queries and documents separately), a cross-encoder receives both the query and the document chunk as input simultaneously, enabling more fine-grained relevance judgment. Common re-rankers include &lt;code>Cohere Rerank&lt;/code>, &lt;code>BAAI/bge-reranker-*&lt;/code>, and models provided by open-source or cloud service vendors.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
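&lt;p>Mechanically, re-ranking is just &amp;ldquo;score every (query, document) pair, sort, truncate.&amp;rdquo; In the sketch below, &lt;code>overlap_score&lt;/code> is a deliberately crude stand-in for a real cross-encoder such as a &lt;code>bge-reranker&lt;/code> model; only the pipeline shape is meant to be taken literally:&lt;/p>
&lt;pre>&lt;code class="language-python">def rerank(query, documents, score_fn, top_n=3):
    # A cross-encoder scores each (query, document) pair jointly,
    # unlike a bi-encoder, which embeds the two sides separately.
    scored = [(score_fn(query, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

# Stand-in scorer: word overlap between query and document.
# A production system would call a trained cross-encoder here.
def overlap_score(query, doc):
    return len(set(query.lower().split()).intersection(doc.lower().split()))

docs = [
    'paris is the capital of france',
    'the eiffel tower is in paris',
    'berlin is the capital of germany',
]
top = rerank('what is the capital of france', docs, overlap_score, top_n=2)
&lt;/code>&lt;/pre>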
&lt;h4 id="442-selfquerying-retriever">4.4.2 Self-Querying Retriever&lt;/h4>
&lt;p>&lt;strong>Problem&lt;/strong>: User queries are typically in natural language but may contain filtering requirements for &lt;strong>metadata&lt;/strong>. For example: &amp;ldquo;Recommend some science fiction movies released after 2000 with ratings above 8.5?&amp;rdquo;&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: Let the LLM itself &amp;ldquo;translate&amp;rdquo; natural language queries into structured query statements containing metadata filtering conditions.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Workflow&lt;/strong>:
&lt;ol>
&lt;li>User inputs a natural language query.&lt;/li>
&lt;li>&lt;code>SelfQueryRetriever&lt;/code> sends the query to the LLM.&lt;/li>
&lt;li>Based on predefined metadata field information (such as &lt;code>year&lt;/code>, &lt;code>rating&lt;/code>, &lt;code>genre&lt;/code>), the LLM generates a structured query containing:
&lt;ul>
&lt;li>&lt;code>query&lt;/code>: The keyword part for vector search (&amp;ldquo;science fiction movies&amp;rdquo;).&lt;/li>
&lt;li>&lt;code>filter&lt;/code>: Conditions for metadata filtering (&lt;code>year &amp;gt; 2000 AND rating &amp;gt; 8.5&lt;/code>).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>The retriever uses this structured query to perform a &amp;ldquo;filter first, then search&amp;rdquo; operation on the vector database, greatly narrowing the search scope and improving precision.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-python"># Core settings for Self-Querying in LangChain
metadata_field_info = [
AttributeInfo(name=&amp;quot;genre&amp;quot;, ...),
AttributeInfo(name=&amp;quot;year&amp;quot;, ...),
AttributeInfo(name=&amp;quot;rating&amp;quot;, ...),
]
retriever = SelfQueryRetriever.from_llm(
llm,
vectorstore,
document_content_description,
metadata_field_info,
)
&lt;/code>&lt;/pre>
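&lt;p>Conceptually, the structured query produced in step 3 is a small dictionary, and step 4's &amp;ldquo;filter first, then search&amp;rdquo; applies the metadata predicate before any vector scoring. The sketch below uses invented data and an illustrative filter encoding; the field names follow the &lt;code>metadata_field_info&lt;/code> example above:&lt;/p>
&lt;pre>&lt;code class="language-python">movies = [
    {'title': 'Interstellar', 'year': 2014, 'rating': 8.7, 'genre': 'sci-fi'},
    {'title': 'Blade Runner', 'year': 1982, 'rating': 8.1, 'genre': 'sci-fi'},
    {'title': 'Inception', 'year': 2010, 'rating': 8.8, 'genre': 'sci-fi'},
    {'title': 'Amelie', 'year': 2001, 'rating': 8.3, 'genre': 'romance'},
]

# What the LLM might emit for the example query in the text.
structured_query = {
    'query': 'science fiction movies',
    'filter': {'year_min': 2000, 'rating_min': 8.5, 'genre': 'sci-fi'},
}

def passes(doc, flt):
    # max(value, minimum) == value holds exactly when value is at
    # or above minimum, so each line is a lower-bound check.
    year_ok = max(doc['year'], flt['year_min']) == doc['year']
    rating_ok = max(doc['rating'], flt['rating_min']) == doc['rating']
    genre_ok = doc['genre'] == flt['genre']
    return year_ok and rating_ok and genre_ok

# Filter first; vector search on structured_query['query'] would
# then run over this much smaller candidate set.
candidates = [m for m in movies if passes(m, structured_query['filter'])]
titles = [m['title'] for m in candidates]
&lt;/code>&lt;/pre>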
&lt;h4 id="443-multivector-retriever">4.4.3 Multi-Vector Retriever&lt;/h4>
&lt;p>&lt;strong>Problem&lt;/strong>: A single vector struggles to perfectly summarize a longer document chunk, especially when the chunk contains multiple subtopics.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: Generate &lt;strong>multiple&lt;/strong> vectors representing different aspects for each document chunk, rather than a single vector.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Implementation Methods&lt;/strong>:
&lt;ol>
&lt;li>&lt;strong>Smaller Sub-chunks&lt;/strong>: Further split the original document chunk into smaller sentences or paragraphs, and generate vectors for these small chunks.&lt;/li>
&lt;li>&lt;strong>Summary Vectors&lt;/strong>: Use an LLM to generate a summary for each document chunk, then vectorize the summary.&lt;/li>
&lt;li>&lt;strong>Hypothetical Question Vectors&lt;/strong>: Use an LLM to pose several possible questions about each document chunk, then vectorize these questions.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;p>During querying, the query vector is matched against all of these sub-vectors (sub-chunks, summaries, questions). When a match succeeds, what is returned is the &lt;strong>complete original document chunk&lt;/strong> the sub-vector belongs to. This gains the precision of fine-grained matching while ensuring that the context provided to the final LLM is complete.&lt;/p>
&lt;h4 id="444-parent-document-retriever">4.4.4 Parent Document Retriever&lt;/h4>
&lt;p>This is a common implementation of the multi-vector retriever. It splits documents into &amp;ldquo;parent chunks&amp;rdquo; and &amp;ldquo;child chunks.&amp;rdquo; Indexing and retrieval happen on the smaller &amp;ldquo;child chunks,&amp;rdquo; but what's ultimately returned to the LLM is the larger &amp;ldquo;parent chunk&amp;rdquo; that the child belongs to. This solves the &amp;ldquo;context loss&amp;rdquo; problem, ensuring that the LLM sees a more complete linguistic context when generating answers.&lt;/p>
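&lt;p>A minimal in-memory sketch of this child-to-parent indirection, with a word-overlap scorer standing in for embedding similarity (the documents are invented):&lt;/p>
&lt;pre>&lt;code class="language-python"># Index small child chunks for precise matching, but hand the LLM
# the larger parent chunk the best-matching child belongs to.
parents = {
    'p1': 'Section on pricing. Plans start at 10 dollars. Annual billing saves 20 percent.',
    'p2': 'Section on support. Email support answers within one day. Phone support never closes.',
}
children = [
    {'id': 'c1', 'parent': 'p1', 'text': 'plans start at 10 dollars'},
    {'id': 'c2', 'parent': 'p1', 'text': 'annual billing saves 20 percent'},
    {'id': 'c3', 'parent': 'p2', 'text': 'email support answers within one day'},
]

def overlap(query, text):
    return len(set(query.lower().split()).intersection(text.lower().split()))

def retrieve_parent(query, children, parents, score_fn):
    # Match against the fine-grained children, return the full parent.
    best = max(children, key=lambda c: score_fn(query, c['text']))
    return parents[best['parent']]

context = retrieve_parent('how fast does email support reply', children, parents, overlap)
&lt;/code>&lt;/pre>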
&lt;h4 id="445-graph-rag">4.4.5 Graph RAG&lt;/h4>
&lt;p>&lt;strong>Problem&lt;/strong>: Traditional RAG views knowledge as independent text blocks, ignoring the complex, web-like relationships between knowledge points.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: Build the knowledge base into a &lt;strong>Knowledge Graph&lt;/strong>, where entities are nodes and relationships are edges.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Workflow&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>During querying, the system first identifies the core entities in the query.&lt;/li>
&lt;li>It then explores neighboring nodes and relationships related to these entities in the graph, forming a subgraph containing rich structured information.&lt;/li>
&lt;li>This subgraph information is linearized (converted to text) and provided to the LLM as context.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Advantages&lt;/strong>: Can answer more complex questions requiring multi-hop reasoning (e.g., &amp;ldquo;Who is A's boss's wife?&amp;rdquo;), providing deeper context than &amp;ldquo;text blocks.&amp;rdquo;&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Implementation Case: Graphiti/Zep&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Introduction&lt;/strong>: &lt;a href="https://github.com/getzep/graphiti">Graphiti&lt;/a> is a temporal knowledge graph architecture designed specifically for LLM Agents, seamlessly integrating Neo4j's graph database capabilities with LLM's natural language processing abilities.&lt;/li>
&lt;li>&lt;strong>Core Features&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Temporal Awareness&lt;/strong>: Each node and relationship carries timestamp attributes, enabling tracking of how entity states change over time.&lt;/li>
&lt;li>&lt;strong>Automatic Schema Inference&lt;/strong>: No need to predefine entity types and relationships; the system can automatically infer appropriate graph structures from conversations.&lt;/li>
&lt;li>&lt;strong>Multi-hop Reasoning&lt;/strong>: Supports complex relationship path queries, capable of discovering indirectly associated information.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Application Scenarios&lt;/strong>: Particularly suitable for multi-turn dialogue systems requiring long-term memory and temporal reasoning, such as customer support, personal assistants, and other scenarios needing to &amp;ldquo;remember&amp;rdquo; user historical interactions.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
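&lt;p>The query-time side of this workflow (steps 1-3) can be sketched with a plain adjacency list. Entity extraction is assumed to have already identified the core entity, and the graph data is invented:&lt;/p>
&lt;pre>&lt;code class="language-python"># Tiny knowledge graph: entity to list of (relation, target) edges.
graph = {
    'Alice': [('reports_to', 'Bob')],
    'Bob': [('married_to', 'Carol'), ('reports_to', 'Dana')],
    'Carol': [],
    'Dana': [],
}

def expand(entity, graph, hops=2):
    # Breadth-first expansion around the query entity, collecting
    # the triples that make up the relevant subgraph.
    triples, frontier = [], [entity]
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for relation, target in graph.get(node, []):
                triples.append((node, relation, target))
                next_frontier.append(target)
        frontier = next_frontier
    return triples

def linearize(triples):
    # Convert the subgraph to text the LLM can read as context.
    return '. '.join(' '.join(t) for t in triples)

# Who is Alice's boss's wife? Two hops: reports_to, then married_to.
context = linearize(expand('Alice', graph, hops=2))
&lt;/code>&lt;/pre>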
&lt;h4 id="446-agentic-rag--adaptive-rag">4.4.6 Agentic RAG / Adaptive RAG&lt;/h4>
&lt;p>This is the latest evolutionary direction of RAG, endowing RAG systems with certain &amp;ldquo;thinking&amp;rdquo; and &amp;ldquo;decision-making&amp;rdquo; capabilities, allowing them to adaptively select the best retrieval strategy based on the complexity of the question.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Core Idea&lt;/strong>: Transform the traditional linear RAG process into a dynamic process driven by an LLM Agent that can loop and iterate.&lt;/li>
&lt;li>&lt;strong>Possible Workflow&lt;/strong>:
&lt;ol>
&lt;li>&lt;strong>Question Analysis&lt;/strong>: The Agent first analyzes the user's question. Is this a simple question or a complex one? Does it need keyword matching or semantic search?&lt;/li>
&lt;li>&lt;strong>Strategy Selection&lt;/strong>:
&lt;ul>
&lt;li>If the question is simple, directly perform vector search.&lt;/li>
&lt;li>If the question contains metadata, switch to Self-Querying.&lt;/li>
&lt;li>If the question is ambiguous, the Agent might first rewrite the question (Query Rewriting), generating several different query variants and executing them separately.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Result Reflection &amp;amp; Iteration&lt;/strong>: The Agent examines the preliminary retrieved results. If the results are not ideal (e.g., low relevance or conflicting information), it can decide to:
&lt;ul>
&lt;li>&lt;strong>Query Again&lt;/strong>: Use different keywords or strategies to retrieve again.&lt;/li>
&lt;li>&lt;strong>Web Search&lt;/strong>: If the internal knowledge base doesn't have an answer, it can call search engine tools to find information online.&lt;/li>
&lt;li>&lt;strong>Multi-step Reasoning&lt;/strong>: Break down complex questions into several sub-questions, retrieving and answering step by step.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
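&lt;p>The strategy-selection step can be sketched as a routing function. The rules below are illustrative heuristics only; in a real Agentic RAG system the LLM itself makes this decision and can loop back after reflecting on the retrieved results:&lt;/p>
&lt;pre>&lt;code class="language-python">def choose_strategy(question):
    # Hypothetical routing rules standing in for an LLM's judgment.
    q = question.lower()
    metadata_hints = ['after', 'before', 'rating', 'year', 'above', 'below']
    if any(hint in q for hint in metadata_hints):
        return 'self_query'          # metadata filters detected
    if len(q.split()) in range(1, 4):
        return 'vector_search'       # short, simple question
    return 'query_rewrite'           # long or ambiguous: rewrite first

routes = [
    choose_strategy('capital of France'),
    choose_strategy('sci-fi movies released after 2000'),
    choose_strategy('please explain how the billing cycle interacts with refund policies'),
]
&lt;/code>&lt;/pre>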
&lt;p>Agentic RAG is no longer a fixed pipeline but a flexible, intelligent framework, representing the future direction of RAG development.&lt;/p>
&lt;h2 id="5-generation-phase-the-final-touch">5. Generation Phase: The Final Touch&lt;/h2>
&lt;p>The generation phase is the endpoint of the RAG process and the ultimate manifestation of its value. In this phase, the system combines the &amp;ldquo;essence&amp;rdquo; context obtained from previous retrieval, filtering, and re-ranking with the user's original question to form a final prompt, which is then sent to the large language model (LLM) to generate an answer.&lt;/p>
&lt;h3 id="51-core-task-effective-prompt-engineering">5.1 Core Task: Effective Prompt Engineering&lt;/h3>
&lt;p>The core task of this phase is &lt;strong>Prompt Engineering&lt;/strong>. A well-designed prompt template can clearly instruct the LLM on its task, ensuring it thinks and answers along the right track.&lt;/p>
&lt;p>A typical RAG prompt template structure is as follows:&lt;/p>
&lt;pre>&lt;code class="language-text">You are a professional, rigorous Q&amp;amp;A assistant. Please answer the user's question based on the context information provided below.
Your answer must be completely based on the given context, and you are prohibited from using your internal knowledge for any supplementation or imagination.
If there is not enough information in the context to answer the question, please clearly state &amp;quot;Based on the available information, I cannot answer this question.&amp;quot;
At the end of your answer, please list all the context source IDs you referenced.
---
[Context Information]
{context}
---
[User Question]
{question}
---
[Your Answer]
&lt;/code>&lt;/pre>
&lt;h4 id="511-template-key-elements-analysis">5.1.1 Template Key Elements Analysis&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Persona&lt;/strong>: &amp;ldquo;You are a professional, rigorous Q&amp;amp;A assistant.&amp;rdquo; This helps set the tone and style of the LLM's output.&lt;/li>
&lt;li>&lt;strong>Core Instruction&lt;/strong>: &amp;ldquo;Please answer the user's question based on the context information provided below.&amp;rdquo; This is the most critical task instruction.&lt;/li>
&lt;li>&lt;strong>Constraints &amp;amp; Guardrails&lt;/strong>:
&lt;ul>
&lt;li>&amp;ldquo;Must be completely based on the given context, prohibited from&amp;hellip; supplementation or imagination.&amp;rdquo; -&amp;gt; This is key to suppressing model hallucinations.&lt;/li>
&lt;li>&amp;ldquo;If there is not enough information, please clearly state&amp;hellip;&amp;rdquo; -&amp;gt; This defines the model's &amp;ldquo;escape route&amp;rdquo; when information is insufficient, preventing it from guessing.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Attribution/Citation&lt;/strong>: &amp;ldquo;Please list all the context source IDs you referenced.&amp;rdquo; -&amp;gt; This is the foundation for answer explainability and credibility.&lt;/li>
&lt;li>&lt;strong>Placeholders&lt;/strong>:
&lt;ul>
&lt;li>&lt;code>{context}&lt;/code>: This will be filled with the processed content of the multiple document chunks obtained from the retrieval phase.&lt;/li>
&lt;li>&lt;code>{question}&lt;/code>: This will be filled with the user's original question.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
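&lt;p>Putting these elements together, the sketch below shows one way the template might be filled in code, with each chunk tagged by a source ID so the attribution instruction can be satisfied (the template wording is abridged and the chunk data invented):&lt;/p>
&lt;pre>&lt;code class="language-python">PROMPT_TEMPLATE = (
    'You are a professional, rigorous QA assistant. '
    'Answer only from the context below; if it is insufficient, say so. '
    'List the source IDs you referenced.\n'
    '---\n[Context Information]\n{context}\n'
    '---\n[User Question]\n{question}\n'
    '---\n[Your Answer]\n'
)

def build_prompt(question, chunks):
    # Label every retrieved chunk with its source ID so the LLM can cite it.
    context = '\n'.join(
        '[source: {}] {}'.format(c['id'], c['text']) for c in chunks
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)

chunks = [
    {'id': 'doc1#3', 'text': 'Refunds are issued within 14 days.'},
    {'id': 'doc2#1', 'text': 'Annual plans renew automatically.'},
]
prompt = build_prompt('How long do refunds take?', chunks)
&lt;/code>&lt;/pre>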
&lt;h3 id="52-context-and-question-fusion">5.2 Context and Question Fusion&lt;/h3>
&lt;p>When the system fills multiple document chunks (e.g., Top-5 chunks) into the &lt;code>{context}&lt;/code> placeholder, these chunks are packaged together with the original question and sent to the LLM. The LLM reads the entire enhanced prompt and then:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Understands the Question&lt;/strong>: Clarifies the user's query intent.&lt;/li>
&lt;li>&lt;strong>Locates Information&lt;/strong>: Searches for sentences and paragraphs directly related to the question within the provided multiple context blocks.&lt;/li>
&lt;li>&lt;strong>Synthesizes &amp;amp; Refines&lt;/strong>: Integrates, understands, and refines scattered information points found from different context blocks.&lt;/li>
&lt;li>&lt;strong>Generates an Answer&lt;/strong>: Based on the refined information, generates a final answer using fluent, coherent natural language.&lt;/li>
&lt;li>&lt;strong>Cites Sources&lt;/strong>: According to instructions, includes the document sources that the answer is based on.&lt;/li>
&lt;/ol>
&lt;p>Through this carefully designed &amp;ldquo;open-book exam&amp;rdquo; process, the RAG system ultimately generates a high-quality answer that combines both the LLM's powerful language capabilities and fact-based information.&lt;/p>
&lt;h2 id="6-rag-evaluation-framework-how-to-measure-system-quality">6. RAG Evaluation Framework: How to Measure System Quality?&lt;/h2>
&lt;p>Building a RAG system is just the first step. Scientifically and quantitatively evaluating its performance, and continuously iterating and optimizing based on this evaluation, is equally important. A good evaluation framework can help us diagnose whether the system's bottleneck is in the retrieval module (&amp;ldquo;not found&amp;rdquo;) or in the generation module (&amp;ldquo;not well expressed&amp;rdquo;).&lt;/p>
&lt;p>Industry-leading RAG evaluation frameworks, such as &lt;strong>RAGAS (Retrieval Augmented Generation Assessment)&lt;/strong> and &lt;strong>TruLens&lt;/strong>, provide a series of metrics to score RAG system performance from different dimensions.&lt;/p>
&lt;h3 id="61-core-evaluation-dimensions">6.1 Core Evaluation Dimensions&lt;/h3>
&lt;p>RAG evaluation can be divided into two levels: &lt;strong>component level&lt;/strong> (evaluating retrieval and generation separately) and &lt;strong>end-to-end level&lt;/strong> (evaluating the quality of the final answer).&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;RAG Evaluation Dimensions&amp;quot;
A(&amp;quot;Evaluation&amp;quot;) --&amp;gt; B[&amp;quot;Component-Level Evaluation&amp;quot;];
A --&amp;gt; C[&amp;quot;End-to-End Evaluation&amp;quot;];
B --&amp;gt; B1[&amp;quot;Retriever Quality Evaluation&amp;quot;];
B --&amp;gt; B2[&amp;quot;Generator Quality Evaluation&amp;quot;];
B1 --&amp;gt; B1_Metrics(&amp;quot;Context Precision, Context Recall&amp;quot;);
B2 --&amp;gt; B2_Metrics(&amp;quot;Faithfulness&amp;quot;);
C --&amp;gt; C_Metrics(&amp;quot;Answer Relevancy, Answer Correctness&amp;quot;);
end
&lt;/code>&lt;/pre>
&lt;h3 id="62-key-evaluation-metrics-using-ragas-as-an-example">6.2 Key Evaluation Metrics (Using RAGAS as an Example)&lt;/h3>
&lt;p>Below we explain in detail several core metrics in the RAGAS framework. Most of these metrics do not require manually annotated reference answers (Reference-Free), which greatly reduces evaluation costs; Context Recall, which needs a reference answer, is the exception.&lt;/p>
&lt;h4 id="621-evaluating-generation-quality">6.2.1 Evaluating Generation Quality&lt;/h4>
&lt;p>&lt;strong>Metric 1: Faithfulness&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Definition&lt;/strong>: Measures the extent to which the generated answer is completely based on the provided context. High faithfulness means that every statement in the answer can find evidence in the context.&lt;/li>
&lt;li>&lt;strong>Evaluation Method&lt;/strong>: RAGAS uses an LLM to analyze the answer, breaking it down into a series of statements. Then, for each statement, it verifies in the context whether there is evidence supporting that statement. The final score is (number of statements supported by the context) / (total number of statements).&lt;/li>
&lt;li>&lt;strong>Problem Diagnosed&lt;/strong>: This metric is the &lt;strong>core indicator for measuring &amp;ldquo;model hallucination&amp;rdquo;&lt;/strong>. A low score means the generator (LLM) is freely making up information that doesn't exist in the context.&lt;/li>
&lt;li>&lt;strong>Data Required&lt;/strong>: &lt;code>question&lt;/code>, &lt;code>answer&lt;/code>, &lt;code>context&lt;/code>.&lt;/li>
&lt;/ul>
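&lt;p>The scoring formula is easy to make concrete. In the sketch below, a naive word-containment check stands in for the LLM judge that RAGAS actually uses; the statements and context are toy data:&lt;/p>
&lt;pre>&lt;code class="language-python">def faithfulness(statements, context, supported_fn):
    # RAGAS-style score: supported statements divided by total statements.
    supported = sum(1 for s in statements if supported_fn(s, context))
    return supported / len(statements)

# Stand-in for the LLM judge: a statement counts as supported if
# every one of its words appears in the context.
def word_support(statement, context):
    return set(statement.lower().split()).issubset(context.lower().split())

context = 'the eiffel tower is in paris and opened in 1889'
statements = [
    'the eiffel tower is in paris',
    'the eiffel tower opened in 1889',
    'the eiffel tower is 300 meters tall',   # unsupported: a hallucination
]
score = faithfulness(statements, context, word_support)   # 2 of 3 supported
&lt;/code>&lt;/pre>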
&lt;h4 id="622-evaluating-both-retrieval-and-generation-quality">6.2.2 Evaluating Both Retrieval and Generation Quality&lt;/h4>
&lt;p>&lt;strong>Metric 2: Answer Relevancy&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Definition&lt;/strong>: Measures the relevance of the generated answer to the user's original question. An answer faithful to the context might still be off-topic.&lt;/li>
&lt;li>&lt;strong>Evaluation Method&lt;/strong>: RAGAS uses an Embedding model to measure the semantic similarity between the question and answer. It also uses an LLM to identify &amp;ldquo;noise&amp;rdquo; or irrelevant sentences in the answer and penalizes them.&lt;/li>
&lt;li>&lt;strong>Problem Diagnosed&lt;/strong>: A low score means that although the answer may be based on the context, it doesn't directly or effectively answer the user's question, or it contains too much irrelevant information.&lt;/li>
&lt;li>&lt;strong>Data Required&lt;/strong>: &lt;code>question&lt;/code>, &lt;code>answer&lt;/code>.&lt;/li>
&lt;/ul>
&lt;h4 id="623-evaluating-retrieval-quality">6.2.3 Evaluating Retrieval Quality&lt;/h4>
&lt;p>&lt;strong>Metric 3: Context Precision&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Definition&lt;/strong>: Measures how much of the retrieved context is truly relevant to the question - the &amp;ldquo;signal-to-noise ratio.&amp;rdquo;&lt;/li>
&lt;li>&lt;strong>Evaluation Method&lt;/strong>: RAGAS analyzes the context sentence by sentence and has an LLM judge whether each sentence is necessary for answering the user's question. The final score is (number of sentences deemed useful) / (total number of sentences in the context).&lt;/li>
&lt;li>&lt;strong>Problem Diagnosed&lt;/strong>: A low score indicates that the retriever returned many irrelevant &amp;ldquo;noise&amp;rdquo; documents, which interferes with the generator's judgment and increases costs. This suggests that the &lt;strong>retrieval algorithm needs optimization&lt;/strong>.&lt;/li>
&lt;li>&lt;strong>Data Required&lt;/strong>: &lt;code>question&lt;/code>, &lt;code>context&lt;/code>.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Metric 4: Context Recall&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Definition&lt;/strong>: Measures whether the retrieved context contains all the necessary information to answer the question.&lt;/li>
&lt;li>&lt;strong>Evaluation Method&lt;/strong>: This metric requires a &lt;strong>manually annotated reference answer (Ground Truth)&lt;/strong> as a benchmark. RAGAS has an LLM analyze this reference answer and judge whether each sentence in it can find support in the retrieved context.&lt;/li>
&lt;li>&lt;strong>Problem Diagnosed&lt;/strong>: A low score means the retriever &lt;strong>failed to find&lt;/strong> key information needed to answer the question, indicating &amp;ldquo;missed retrievals.&amp;rdquo; This might suggest that the document chunking strategy is unreasonable, or the Embedding model cannot understand the query well.&lt;/li>
&lt;li>&lt;strong>Data Required&lt;/strong>: &lt;code>question&lt;/code>, &lt;code>ground_truth&lt;/code> (reference answer), &lt;code>context&lt;/code>.&lt;/li>
&lt;/ul>
&lt;h3 id="63-using-evaluation-to-guide-iteration">6.3 Using Evaluation to Guide Iteration&lt;/h3>
&lt;p>By comprehensively evaluating a RAG system using the above metrics, we can get a clear performance profile and make targeted optimizations:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Low Faithfulness Score&lt;/strong>: The problem is in the &lt;strong>generator&lt;/strong>. Need to optimize the Prompt, add stronger constraints, or switch to an LLM with stronger instruction-following capabilities.&lt;/li>
&lt;li>&lt;strong>Low Answer Relevancy Score&lt;/strong>: The problem could be in either the generator or retriever. Need to check if the Prompt is guiding the model off-topic, or if the retrieved content is of poor quality.&lt;/li>
&lt;li>&lt;strong>Low Context Precision Score&lt;/strong>: The problem is in the &lt;strong>retriever&lt;/strong>. Indicates that the recalled documents are of poor quality with much noise. Can try better retrieval strategies, such as adding a Re-ranker to filter irrelevant documents.&lt;/li>
&lt;li>&lt;strong>Low Context Recall Score&lt;/strong>: The problem is in the &lt;strong>retriever&lt;/strong>. Indicates that key information wasn't found. Need to check if the Chunking strategy is fragmenting key information, or try methods like Multi-Query to expand the retrieval scope.&lt;/li>
&lt;/ul>
&lt;p>Through the &amp;ldquo;evaluate-diagnose-optimize&amp;rdquo; closed loop, we can continuously improve the overall performance of the RAG system.&lt;/p>
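&lt;p>This diagnostic table maps naturally onto a small lookup, a sketch of how an evaluation harness might surface what to fix first (the scores and fix descriptions are illustrative):&lt;/p>
&lt;pre>&lt;code class="language-python"># Weakest metric to (component to blame, first fix to try).
DIAGNOSIS = {
    'faithfulness': ('generator', 'tighten prompt constraints or use a stronger LLM'),
    'answer_relevancy': ('generator or retriever', 'check prompt focus and retrieval quality'),
    'context_precision': ('retriever', 'add a re-ranker to filter noisy documents'),
    'context_recall': ('retriever', 'revisit chunking or expand queries with multi-query'),
}

def diagnose(scores):
    # Start iterating on whichever metric is currently lowest.
    weakest = min(scores, key=scores.get)
    component, fix = DIAGNOSIS[weakest]
    return weakest, component, fix

weakest, component, fix = diagnose({
    'faithfulness': 0.92,
    'answer_relevancy': 0.88,
    'context_precision': 0.55,
    'context_recall': 0.81,
})
&lt;/code>&lt;/pre>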
&lt;h2 id="7-challenges-and-future-outlook">7. Challenges and Future Outlook&lt;/h2>
&lt;p>Although RAG has greatly expanded the capabilities of large language models and has become the de facto standard for building knowledge-intensive applications, it still faces some challenges while also pointing to exciting future development directions.&lt;/p>
&lt;h3 id="71-current-challenges">7.1 Current Challenges&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>&amp;ldquo;Needle-in-a-Haystack&amp;rdquo; Problem&lt;/strong>: As LLM context windows grow larger (e.g., million-token scale), precisely finding and utilizing key information in lengthy, noisy contexts becomes increasingly difficult. Research shows that LLM performance on long contexts depends on where information sits within them, a phenomenon known as being &amp;ldquo;lost in the middle.&amp;rdquo;&lt;/li>
&lt;li>&lt;strong>Imperfect Chunking&lt;/strong>: How to optimally split documents remains an open question. Existing rule-based or simple semantic splitting methods may damage information integrity or introduce irrelevant context, affecting retrieval and generation quality.&lt;/li>
&lt;li>&lt;strong>Evaluation Complexity and Cost&lt;/strong>: Although frameworks like RAGAS provide automated evaluation metrics, building a comprehensive, reliable evaluation set still requires significant human effort. Especially in domains requiring fine judgment, machine evaluation results may differ from human perception.&lt;/li>
&lt;li>&lt;strong>Integration of Structured and Multimodal Data&lt;/strong>: Knowledge in the real world isn't just text. How to efficiently integrate tables, charts, images, audio, and other multimodal information, and enable RAG systems to understand and utilize them, is an actively explored area.&lt;/li>
&lt;li>&lt;strong>Production Environment Complexity&lt;/strong>: Deploying a RAG prototype to a production environment requires considering data updates, permission management, version control, cost monitoring, low-latency responses, and a series of engineering challenges.&lt;/li>
&lt;/ol>
&lt;h3 id="72-future-outlook">7.2 Future Outlook&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Smarter Indexing&lt;/strong>: Future indexing processes will no longer be simple &amp;ldquo;split-vectorize&amp;rdquo; operations. They will more deeply understand document structures, automatically build knowledge graphs, identify entities and relationships, generate multi-level, multi-perspective representations (such as summaries, questions), creating a richer, more queryable knowledge network.&lt;/li>
&lt;li>&lt;strong>Adaptive Retrieval&lt;/strong>: As demonstrated by Agentic RAG, future RAG systems will have stronger autonomy. They can dynamically decide whether to perform simple vector searches or execute complex multi-step queries, or even call external tools (such as search engines, calculators, APIs) to obtain information based on the specific situation of the question. Retrieval will evolve from a fixed step to a flexible, agent-driven process.&lt;/li>
&lt;li>&lt;strong>LLM as Part of RAG&lt;/strong>: As LLM capabilities strengthen, they will participate more deeply in every aspect of RAG. Not just in the generation phase, but also in indexing (generating metadata, summaries), querying (query rewriting, expansion), retrieval (as a re-ranker), and other phases, playing a core role.&lt;/li>
&lt;li>&lt;strong>End-to-End Optimization&lt;/strong>: Future frameworks may allow end-to-end joint fine-tuning of various RAG components (Embedding models, LLM generators, etc.), making the entire system highly optimized for a specific task or domain, rather than simply piecing together individual components.&lt;/li>
&lt;li>&lt;strong>Native Multimodal RAG&lt;/strong>: RAG will natively support understanding and retrieving content like images, audio, and video. Users can ask questions like &amp;ldquo;Find me that picture of &amp;lsquo;a cat playing piano&amp;rsquo;&amp;rdquo; and the system can directly perform semantic retrieval in multimedia databases and return results.&lt;/li>
&lt;/ol>
&lt;p>In summary, RAG is evolving from a relatively fixed &amp;ldquo;retrieve-augment-generate&amp;rdquo; pipeline to a more dynamic, intelligent, adaptive knowledge processing framework. It will continue to serve as the key bridge connecting large language models with the vast external world, continuously unleashing AI's application potential across various industries in the foreseeable future.&lt;/p></description></item><item><title>Model Context Protocol (MCP): A Standardized Framework for AI Capability Extension</title><link>https://ziyanglin.netlify.app/en/post/mcp-documentation/</link><pubDate>Mon, 30 Jun 2025 08:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/mcp-documentation/</guid><description>&lt;h2 id="1-macro-introduction-why-do-we-need-mcp-beyond-tool-calling">1. Macro Introduction: Why Do We Need MCP Beyond Tool Calling?&lt;/h2>
&lt;p>In our previous document on general LLM tool calling, we revealed how LLMs can break their knowledge boundaries by calling external functions. This is a powerful &lt;strong>programming paradigm&lt;/strong>, but it doesn't define a &lt;strong>standardized set of communication rules&lt;/strong>. Each developer must decide for themselves how to organize APIs, manage tools, and handle data formats, leading to ecosystem fragmentation.&lt;/p>
&lt;p>The &lt;strong>Model Context Protocol (MCP)&lt;/strong> was born precisely to solve this problem. It doesn't aim to replace the general concept of tool calling, but rather builds a layer of &lt;strong>standardized, pluggable, service-oriented protocol&lt;/strong> on top of it.&lt;/p>
&lt;p>If &amp;ldquo;tool calling&amp;rdquo; is teaching a car how to &amp;ldquo;refuel&amp;rdquo; (use external capabilities), then MCP establishes &lt;strong>standardized gas stations and fuel nozzle interfaces&lt;/strong> for the world. No matter what car you drive (different LLMs) or what fuel you need (different tools), as long as you follow the MCP standard, you can connect seamlessly and plug-and-play.&lt;/p>
&lt;p>The core value of MCP lies in:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Standardization&lt;/strong>: Defines unified message formats and interaction patterns for communication between models and external tool services. Developers no longer need to customize tool integration solutions for each model or application.&lt;/li>
&lt;li>&lt;strong>Decoupling&lt;/strong>: Completely separates the &lt;strong>implementation&lt;/strong> of tools (running on MCP servers) from their &lt;strong>use&lt;/strong> (initiated by LLMs). Models don't need to know the internal code of tools, only how to communicate with them through the protocol.&lt;/li>
&lt;li>&lt;strong>Reusability&lt;/strong>: Once a tool or data source is encapsulated as an MCP server, it can be easily reused by any model or application that supports the MCP protocol, greatly improving development efficiency.&lt;/li>
&lt;li>&lt;strong>Discoverability&lt;/strong>: MCP makes tools service-oriented, laying the foundation for building tool marketplaces and enabling automatic discovery and orchestration of tools in the future.&lt;/li>
&lt;/ul>
&lt;p>In simple terms, MCP elevates scattered &amp;ldquo;function calls&amp;rdquo; to the level of &amp;ldquo;distributed service calls,&amp;rdquo; serving as a key infrastructure for building scalable, interoperable AI Agent ecosystems.&lt;/p>
&lt;h2 id="2-mcp-core-architecture-a-trinity-collaboration-model">2. MCP Core Architecture: A Trinity Collaboration Model&lt;/h2>
&lt;p>The MCP architecture consists of three core components that interact through clearly defined protocols, forming a solid &amp;ldquo;trinity&amp;rdquo; collaboration model.&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Model/Agent&lt;/strong>: The decision core. It is responsible for understanding user intent and generating requests that follow the MCP format to call external tools or access external resources.&lt;/li>
&lt;li>&lt;strong>MCP Client&lt;/strong>: The communication hub. It serves as a bridge between the model and MCP servers, parsing MCP requests generated by the model, communicating with the corresponding MCP servers through standardized transmission methods (such as Stdio, HTTP SSE), and handling returned results.&lt;/li>
&lt;li>&lt;strong>MCP Server&lt;/strong>: The capability provider. This is a separate process or service that encapsulates one or more tools or data sources and provides standardized access interfaces through the MCP protocol.&lt;/li>
&lt;/ol>
&lt;p>Below is a visual explanation of this architecture:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph Agent [Model/Agent]
A[LLM] -- Generates Request --&amp;gt; B(MCP XML Request);
end
subgraph Client [MCP Client]
C{Request Parser};
B -- Parse Request --&amp;gt; C;
end
subgraph LocalServer [MCP Server - Local]
D[Stdio Communication];
end
subgraph RemoteServer [MCP Server - Remote]
E[HTTP SSE Communication];
end
subgraph ServerCore [MCP Server Internal]
F[Protocol Processor] -- Execute Tool --&amp;gt; G[Tool/Resource Implementation];
end
C -- Route to Local --&amp;gt; D;
C -- Route to Remote --&amp;gt; E;
D -- Local Transport --&amp;gt; F;
E -- Remote Transport --&amp;gt; F;
G -- Return Result --&amp;gt; F;
F -- Protocol Return --&amp;gt; C;
C -- Submit Result --&amp;gt; A;
style A fill:#cde4ff,stroke:#333;
style B fill:#e6ffc2,stroke:#333;
style C fill:#fce8b2,stroke:#333;
style D fill:#f9c5b4,stroke:#333;
style E fill:#f9c5b4,stroke:#333;
style F fill:#d4a8e3,stroke:#333;
style G fill:#b4f9f2,stroke:#333;
&lt;/code>&lt;/pre>
&lt;h3 id="detailed-architecture-responsibilities">Detailed Architecture Responsibilities:&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Model Generates Request&lt;/strong>: When an LLM needs external capabilities, it no longer generates JSON for specific APIs, but instead generates an XML message that conforms to the MCP specification, such as &lt;code>&amp;lt;use_mcp_tool&amp;gt;&lt;/code>. This message clearly specifies which &lt;code>server_name&lt;/code> to communicate with and which &lt;code>tool_name&lt;/code> to call.&lt;/li>
&lt;li>&lt;strong>Client Parsing and Routing&lt;/strong>: The MCP client (typically part of the model's runtime environment) captures and parses this XML request. It queries a service registry based on the &lt;code>server_name&lt;/code> to determine whether the target server is a local process or a remote service.&lt;/li>
&lt;li>&lt;strong>Selecting Communication Channel&lt;/strong>:
&lt;ul>
&lt;li>If the target is a &lt;strong>local MCP server&lt;/strong> (e.g., a locally running Python script), the client will communicate with that server process through &lt;strong>standard input/output (stdio)&lt;/strong>.&lt;/li>
&lt;li>If the target is a &lt;strong>remote MCP server&lt;/strong> (e.g., a service deployed in the cloud), the client will establish a connection with it through the &lt;strong>HTTP Server-Sent Events (SSE)&lt;/strong> protocol.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Server Processing Request&lt;/strong>: After receiving the request, the protocol processor on the MCP server calls the specific tool function or resource handler that has been registered internally based on the &lt;code>tool_name&lt;/code> or &lt;code>uri&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Execution and Return&lt;/strong>: The server executes the specific logic (calling APIs, querying databases, etc.) and encapsulates the results in the MCP standard format, returning them to the client through the same route.&lt;/li>
&lt;li>&lt;strong>Result Feedback to Model&lt;/strong>: After receiving the server's response, the client organizes and formats it as the execution result of the external tool, and submits it back to the LLM for the LLM to generate the final natural language reply, completing the entire interaction loop.&lt;/li>
&lt;/ol>
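&lt;p>A minimal sketch of the parse-and-route step (items 2 and 3 above), in Python. The registry structure, transport labels, and server entries here are illustrative assumptions, not part of the MCP specification:&lt;/p>

```python
import xml.etree.ElementTree as ET

# Hypothetical service registry mapping server_name to transport details.
# Real clients define their own registry format; this shape is illustrative.
SERVER_REGISTRY = {
    "weather-server": {"transport": "stdio", "command": ["python", "weather_server.py"]},
    "internal-docs": {"transport": "sse", "url": "https://docs.example.com/mcp"},
}

def build_use_mcp_tool(server_name: str, tool_name: str, arguments_json: str) -> str:
    """Build a use_mcp_tool request the way a model runtime might."""
    root = ET.Element("use_mcp_tool")
    ET.SubElement(root, "server_name").text = server_name
    ET.SubElement(root, "tool_name").text = tool_name
    ET.SubElement(root, "arguments").text = arguments_json
    return ET.tostring(root, encoding="unicode")

def route_request(request_xml: str) -> dict:
    """Parse the request and pick a transport based on the registry."""
    root = ET.fromstring(request_xml)
    server_name = root.find("server_name").text
    entry = SERVER_REGISTRY.get(server_name)
    if entry is None:
        raise ValueError(f"Unknown MCP server: {server_name}")
    # The transport field decides between local stdio and remote HTTP SSE.
    return {"server": server_name, "transport": entry["transport"]}

request = build_use_mcp_tool("weather-server", "get_forecast", '{"city": "Tokyo"}')
decision = route_request(request)
```

&lt;p>Building the request with &lt;code>xml.etree.ElementTree&lt;/code> rather than string concatenation also avoids escaping mistakes in tool arguments.&lt;/p>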
&lt;p>The brilliance of this architecture is that the LLM is completely decoupled from the physical location and network implementation details of the tools: it only needs to learn to &amp;ldquo;speak&amp;rdquo; the MCP &amp;ldquo;common language&amp;rdquo; to interact with any service in the entire MCP ecosystem.&lt;/p>
&lt;h2 id="3-communication-protocol-deep-dive-mcps-neural-network">3. Communication Protocol Deep Dive: MCP's Neural Network&lt;/h2>
&lt;p>The power of MCP lies in its standardized communication methods. It primarily connects clients and servers through two distinctly different protocols to accommodate different deployment scenarios.&lt;/p>
&lt;h3 id="31-local-communication-standard-inputoutput-stdio">3.1. Local Communication: Standard Input/Output (Stdio)&lt;/h3>
&lt;p>When the MCP server is a local executable file or script (e.g., a Python script, a Go program), the MCP client uses &lt;strong>Standard Input/Output (Stdio)&lt;/strong> for communication. This is a classic and efficient form of inter-process communication (IPC).&lt;/p>
&lt;p>&lt;strong>Workflow Breakdown&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Launch Subprocess&lt;/strong>: The MCP client (such as a VS Code extension) launches the MCP server program as a &lt;strong>subprocess&lt;/strong> (e.g., executing &lt;code>python mcp_server.py&lt;/code>).&lt;/li>
&lt;li>&lt;strong>Pipe Establishment&lt;/strong>: The operating system automatically establishes three pipes between the parent process (client) and child process (server):
&lt;ul>
&lt;li>&lt;code>stdin&lt;/code> (standard input): The channel for the client to send data to the server.&lt;/li>
&lt;li>&lt;code>stdout&lt;/code> (standard output): The channel for the server to send successful results to the client.&lt;/li>
&lt;li>&lt;code>stderr&lt;/code> (standard error): The channel for the server to send error messages to the client.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Message Exchange&lt;/strong>:
&lt;ul>
&lt;li>The client writes the MCP request (e.g., an XML string like &lt;code>&amp;lt;use_mcp_tool&amp;gt;...&lt;/code>) to the server process's &lt;code>stdin&lt;/code>. Because stdio is a raw byte stream with no built-in message boundaries, messages are typically framed with delimiters (such as a newline &lt;code>\n&lt;/code>) or length prefixes.&lt;/li>
&lt;li>The server reads and parses the request from its own &lt;code>stdin&lt;/code> and executes the corresponding logic.&lt;/li>
&lt;li>The server writes the execution result (also an XML string in MCP format) to its own &lt;code>stdout&lt;/code>.&lt;/li>
&lt;li>If any errors occur during the process, error details are written to &lt;code>stderr&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Lifecycle Management&lt;/strong>: The client is responsible for monitoring the lifecycle of the server subprocess and can terminate it when it's no longer needed.&lt;/li>
&lt;/ol>
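&lt;p>The launch-and-exchange flow above can be sketched with Python's &lt;code>subprocess&lt;/code> module. To keep the example self-contained, the child below is a stand-in echo process rather than a real MCP server, and it handles a single request; a real client would launch something like &lt;code>python mcp_server.py&lt;/code> and keep the pipes open across many requests:&lt;/p>

```python
import subprocess
import sys

# Stand-in child process: reads one line from stdin, echoes a reply on stdout.
# A real MCP client would launch the actual server command instead.
CHILD_SCRIPT = 'import sys; line = sys.stdin.readline().strip(); print(f"got:{line}")'

def call_over_stdio(request: str) -> str:
    """Send one newline-delimited request to a child process and read one reply."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SCRIPT],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    # communicate() writes to the child's stdin, closes it, and collects output;
    # a long-lived client would instead write/read the pipes incrementally.
    out, err = proc.communicate(request + "\n", timeout=10)
    if proc.returncode != 0:
        raise RuntimeError(err)
    return out.strip()

reply = call_over_stdio("ping")
```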
&lt;p>&lt;strong>Advantages&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Extremely Low Latency&lt;/strong>: Since it's local inter-process communication, there's almost no network overhead.&lt;/li>
&lt;li>&lt;strong>Simple and Reliable&lt;/strong>: Simple implementation, not dependent on the network stack.&lt;/li>
&lt;li>&lt;strong>High Security&lt;/strong>: Data doesn't leave the machine, providing natural isolation.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Applicable Scenarios&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Local tools requiring high performance and high-frequency calls.&lt;/li>
&lt;li>Tools that directly operate on the local file system or hardware.&lt;/li>
&lt;li>Development and debugging environments.&lt;/li>
&lt;/ul>
&lt;h3 id="32-remote-communication-serversent-events-http-sse">3.2. Remote Communication: Server-Sent Events (HTTP SSE)&lt;/h3>
&lt;p>When the MCP server is deployed on a remote host or in the cloud, communication is done through the HTTP-based &lt;strong>Server-Sent Events (SSE)&lt;/strong> protocol. SSE is a web technology that allows servers to push events to clients in a one-way fashion.&lt;/p>
&lt;p>&lt;strong>Workflow Breakdown&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>HTTP Connection&lt;/strong>: The MCP client initiates a regular HTTP GET request to a specific endpoint of the MCP server (e.g., &lt;code>https://api.my-mcp-server.com/v1/mcp&lt;/code>). The key is that the client includes &lt;code>Accept: text/event-stream&lt;/code> in the request header, indicating it wants to establish an SSE connection.&lt;/li>
&lt;li>&lt;strong>Persistent Connection&lt;/strong>: Upon receiving the request, the server doesn't immediately close the connection but keeps it open as a long-lived stream. The &lt;code>Content-Type&lt;/code> header of the response is set to &lt;code>text/event-stream&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Event Pushing&lt;/strong>:
&lt;ul>
&lt;li>The client sends the MCP request (XML string) in the body of a separate HTTP POST to a companion endpoint on the server.&lt;/li>
&lt;li>After processing the request, the server encapsulates the response data in the SSE event format and &lt;strong>pushes&lt;/strong> it back to the client over the previously established event stream. Each event consists of fields such as &lt;code>event: &amp;lt;event_name&amp;gt;&lt;/code> and &lt;code>data: &amp;lt;event_data&amp;gt;&lt;/code>.&lt;/li>
&lt;li>MCP typically defines different types of events, such as &lt;code>result&lt;/code> for success, &lt;code>error&lt;/code> for failure, and &lt;code>log&lt;/code> for transmitting logs.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
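&lt;p>The event format described above is simple enough to parse by hand. Below is a simplified parser for illustration only: it handles the &lt;code>event:&lt;/code> and &lt;code>data:&lt;/code> fields and blank-line event boundaries, ignoring comments and the &lt;code>id:&lt;/code> and &lt;code>retry:&lt;/code> fields, so a production client should use a dedicated SSE library. The event names follow this article's examples rather than a fixed specification:&lt;/p>

```python
def parse_sse_events(stream_text: str) -> list:
    """Parse a text/event-stream payload into (event_name, data) pairs."""
    events = []
    name, data = "message", []  # per the SSE spec, the default event name is "message"
    for line in stream_text.splitlines():
        if line == "":
            # A blank line marks the end of one event.
            if data:
                events.append((name, "\n".join(data)))
            name, data = "message", []
        elif line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
    if data:  # flush a trailing event with no final blank line
        events.append((name, "\n".join(data)))
    return events

sample = 'event: result\ndata: {"status": "success"}\n\nevent: log\ndata: step 1 done\n\n'
events = parse_sse_events(sample)
```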
&lt;p>&lt;strong>Advantages&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Cross-Network Communication&lt;/strong>: Can easily connect to servers anywhere.&lt;/li>
&lt;li>&lt;strong>Firewall Friendly&lt;/strong>: Based on the standard HTTP(S) protocol, it passes through most proxies and firewalls with good network compatibility.&lt;/li>
&lt;li>&lt;strong>Server-Side Push&lt;/strong>: Suitable for scenarios requiring server-initiated notifications.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Applicable Scenarios&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Encapsulating third-party cloud service APIs (such as weather, maps, payments).&lt;/li>
&lt;li>Shared tools that need centralized management and deployment.&lt;/li>
&lt;li>Building publicly accessible tool service ecosystems.&lt;/li>
&lt;/ul>
&lt;h2 id="4-mcp-message-format-breakdown-the-protocols-common-language">4. MCP Message Format Breakdown: The Protocol's &amp;ldquo;Common Language&amp;rdquo;&lt;/h2>
&lt;p>The core of MCP is its XML-based message format that is both human-readable and machine-parsable. Models express their intentions by generating XML fragments in these specific formats.&lt;/p>
&lt;h3 id="41-usemcptool-calling-a-tool">4.1. &lt;code>&amp;lt;use_mcp_tool&amp;gt;&lt;/code>: Calling a Tool&lt;/h3>
&lt;p>This is the protocol's central message type, used to request the execution of a defined tool.&lt;/p>
&lt;p>&lt;strong>Structure Example&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-xml">&amp;lt;use_mcp_tool&amp;gt;
&amp;lt;server_name&amp;gt;weather-server&amp;lt;/server_name&amp;gt;
&amp;lt;tool_name&amp;gt;get_forecast&amp;lt;/tool_name&amp;gt;
&amp;lt;arguments&amp;gt;
{
&amp;quot;city&amp;quot;: &amp;quot;San Francisco&amp;quot;,
&amp;quot;days&amp;quot;: 5
}
&amp;lt;/arguments&amp;gt;
&amp;lt;/use_mcp_tool&amp;gt;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Field Details&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>&amp;lt;server_name&amp;gt;&lt;/code> (Required)&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Purpose&lt;/strong>: Unique identifier of the MCP server.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>: The client uses this name to look up corresponding server information (whether it's a local process or remote URL) in its internal service registry, deciding whether to use Stdio or SSE for communication. This is key to implementing routing.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>&lt;code>&amp;lt;tool_name&amp;gt;&lt;/code> (Required)&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Purpose&lt;/strong>: Name of the tool to call.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>: After receiving the request, the MCP server uses this name to find and execute the corresponding function in its internal tool mapping table.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>&lt;code>&amp;lt;arguments&amp;gt;&lt;/code> (Required)&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Purpose&lt;/strong>: Parameters needed to call the tool.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>: The content is typically a &lt;strong>JSON string&lt;/strong>. The server needs to first parse this string, convert it to a language-native object or dictionary, and then pass it to the specific tool function. This design leverages JSON's powerful data expression capabilities and cross-language universality.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="42-accessmcpresource-accessing-a-resource">4.2. &lt;code>&amp;lt;access_mcp_resource&amp;gt;&lt;/code>: Accessing a Resource&lt;/h3>
&lt;p>In addition to actively &amp;ldquo;executing&amp;rdquo; tools, MCP also supports passively &amp;ldquo;accessing&amp;rdquo; data sources.&lt;/p>
&lt;p>&lt;strong>Structure Example&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-xml">&amp;lt;access_mcp_resource&amp;gt;
&amp;lt;server_name&amp;gt;internal-docs&amp;lt;/server_name&amp;gt;
&amp;lt;uri&amp;gt;doc://product/specs/version-3.md&amp;lt;/uri&amp;gt;
&amp;lt;/access_mcp_resource&amp;gt;
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Field Details&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>&amp;lt;server_name&amp;gt;&lt;/code> (Required)&lt;/strong>: Same as above, used for routing.&lt;/li>
&lt;li>&lt;strong>&lt;code>&amp;lt;uri&amp;gt;&lt;/code> (Required)&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Purpose&lt;/strong>: Uniform Resource Identifier for the resource.&lt;/li>
&lt;li>&lt;strong>Underlying Details&lt;/strong>: The format of the URI (&lt;code>scheme://path&lt;/code>) is defined and interpreted by the server itself. For example:
&lt;ul>
&lt;li>&lt;code>file:///path/to/local/file&lt;/code>: Access a local file.&lt;/li>
&lt;li>&lt;code>db://customers/id/123&lt;/code>: Query a database.&lt;/li>
&lt;li>&lt;code>api://v1/users?active=true&lt;/code>: Access a REST API endpoint.
The server needs to parse this URI and execute the appropriate resource retrieval logic based on its scheme and path.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
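&lt;p>Scheme-based dispatch of resource URIs can be sketched with the standard library's &lt;code>urllib.parse&lt;/code>. The &lt;code>doc://&lt;/code> handler below is a hypothetical stand-in; each server defines its own schemes and retrieval logic:&lt;/p>

```python
from urllib.parse import urlparse

# Illustrative resource handler; a real server registers its own schemes.
def handle_doc(path: str) -> str:
    return f"contents of document {path}"

RESOURCE_HANDLERS = {
    "doc": handle_doc,
}

def access_resource(uri: str) -> str:
    """Dispatch an access_mcp_resource URI to the handler for its scheme."""
    parsed = urlparse(uri)
    handler = RESOURCE_HANDLERS.get(parsed.scheme)
    if handler is None:
        raise ValueError(f"Unsupported URI scheme: {parsed.scheme}")
    # netloc and path together identify the resource within the scheme.
    return handler(parsed.netloc + parsed.path)

result = access_resource("doc://product/specs/version-3.md")
```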
&lt;h2 id="5-building-an-mcp-server-from-concept-to-code-skeleton">5. Building an MCP Server: From Concept to Code Skeleton&lt;/h2>
&lt;p>To make the concept more concrete, below is a minimalist Python pseudocode skeleton showing how to implement an MCP server that responds to Stdio communication.&lt;/p>
&lt;pre>&lt;code class="language-python">import sys
import json
import xml.etree.ElementTree as ET
# 1. Define specific tool functions
def get_weather(city: str, days: int = 1):
&amp;quot;&amp;quot;&amp;quot;A simulated weather tool&amp;quot;&amp;quot;&amp;quot;
# In the real world, this would call a weather API
return {&amp;quot;city&amp;quot;: city, &amp;quot;forecast&amp;quot;: f&amp;quot;Sunny for the next {days} days&amp;quot;}
# Map tool names to function objects
AVAILABLE_TOOLS = {
&amp;quot;get_weather&amp;quot;: get_weather
}
# 2. MCP protocol processing main loop
def main_loop():
&amp;quot;&amp;quot;&amp;quot;Read requests from stdin, process them, and write results to stdout&amp;quot;&amp;quot;&amp;quot;
for line in sys.stdin:
request_xml = line.strip()
if not request_xml:
continue
try:
# 3. Parse MCP request
root = ET.fromstring(request_xml)
if root.tag == &amp;quot;use_mcp_tool&amp;quot;:
tool_name = root.find(&amp;quot;tool_name&amp;quot;).text
args_str = root.find(&amp;quot;arguments&amp;quot;).text
args = json.loads(args_str)
# 4. Find and execute the tool
tool_function = AVAILABLE_TOOLS.get(tool_name)
if tool_function:
result = tool_function(**args)
# 5. Encapsulate successful result and write back to stdout
response = {&amp;quot;status&amp;quot;: &amp;quot;success&amp;quot;, &amp;quot;data&amp;quot;: result}
sys.stdout.write(json.dumps(response) + &amp;quot;\n&amp;quot;)
else:
raise ValueError(f&amp;quot;Tool '{tool_name}' not found.&amp;quot;)
# (Logic for handling access_mcp_resource can be added here)
except Exception as e:
# 6. Write error information back to stderr
error_response = {&amp;quot;status&amp;quot;: &amp;quot;error&amp;quot;, &amp;quot;message&amp;quot;: str(e)}
sys.stderr.write(json.dumps(error_response) + &amp;quot;\n&amp;quot;)
# Flush buffers in real-time to ensure the client receives immediately
sys.stdout.flush()
sys.stderr.flush()
if __name__ == &amp;quot;__main__&amp;quot;:
main_loop()
&lt;/code>&lt;/pre>
&lt;p>This skeleton clearly demonstrates the core responsibilities of an MCP server: listening for input, parsing the protocol, executing logic, and returning results.&lt;/p>
&lt;h2 id="6-practical-exercise-using-the-mcpdriven-context7-server-to-answer-technical-questions">6. Practical Exercise: Using the MCP-Driven context7 Server to Answer Technical Questions&lt;/h2>
&lt;p>After theory and skeleton, let's look at a real, end-to-end example to see how MCP works in practical applications.&lt;/p>
&lt;p>&lt;strong>Scenario&lt;/strong>: We're building an AI programming assistant. When a user asks a specific programming question, we want the AI to provide the most authoritative and accurate answer by querying the latest official documentation, rather than relying on its potentially outdated internal knowledge.&lt;/p>
&lt;p>In this scenario, the &lt;code>context7&lt;/code> MCP server is our &amp;ldquo;external document library.&amp;rdquo;&lt;/p>
&lt;p>Here's the complete interaction flow:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant User
participant Agent as AI Programming Assistant (Model+Client)
participant Context7 as context7 MCP Server
User-&amp;gt;&amp;gt;+Agent: Ask about React Hooks differences
Note over Agent: 1. Analyze question, decide to call tool
Agent--&amp;gt;&amp;gt;+Context7: 2. Send MCP request (get-library-docs)
Note over Context7: 3. Query document library
Context7--&amp;gt;&amp;gt;-Agent: 4. Return document summary (key differences)
Note over Agent: 5. Understand and summarize authoritative material
Agent--&amp;gt;&amp;gt;-User: 6. Generate final answer based on documentation
&lt;/code>&lt;/pre>
&lt;h3 id="process-breakdown-and-mcp-value-demonstration">Process Breakdown and MCP Value Demonstration&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Intent to Protocol Conversion&lt;/strong>: The model (LLM) successfully converts the user's natural language question into a structured, standardized MCP request. It not only identifies the need to call a tool but also accurately fills in the &lt;code>server_name&lt;/code>, &lt;code>tool_name&lt;/code>, and &lt;code>arguments&lt;/code>, which is the core capability of an MCP-driven Agent.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Decoupling Advantage&lt;/strong>: The AI programming assistant (client) doesn't need to know at all how the &lt;code>context7&lt;/code> server is implemented. It could be a complex system connected to multiple data sources. But for the assistant, it's just a service endpoint that follows the MCP protocol and can be accessed through the name &lt;code>context7&lt;/code>. This decoupling makes replacing or upgrading the document source extremely simple without needing to modify the Agent's core logic.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Scalability from Standardization&lt;/strong>: Now, if we want to add the ability to query NPM package dependencies to this AI assistant, we just need to develop or integrate another MCP server named &lt;code>npm-analyzer&lt;/code>. The learning cost for the Agent is almost zero because it only needs to learn to generate a new &lt;code>&amp;lt;use_mcp_tool&amp;gt;&lt;/code> request pointing to the new &lt;code>server_name&lt;/code>. The entire system's capabilities can be infinitely expanded like building with Lego blocks.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>This example clearly demonstrates how MCP evolves from a simple &amp;ldquo;function call&amp;rdquo; concept to a powerful, scalable service-oriented architecture, providing a solid foundation for building complex AI applications.&lt;/p>
&lt;h2 id="7-conclusion-mcps-value-and-futurebuilding-the-internet-of-ai">7. Conclusion: MCP's Value and Future—Building the &amp;ldquo;Internet&amp;rdquo; of AI&lt;/h2>
&lt;p>General tool calling gives LLMs the ability to &amp;ldquo;speak&amp;rdquo; and &amp;ldquo;act,&amp;rdquo; while the &lt;strong>Model Context Protocol (MCP) defines the grammar and traffic rules for these abilities&lt;/strong>. Through standardization, decoupling, and service-oriented design principles, MCP transforms isolated AI applications and tools into a potential, interoperable massive network.&lt;/p>
&lt;p>The true value of MCP isn't that it defines another type of RPC (Remote Procedure Call), but that it's specifically tailored for the unique scenario of &lt;strong>AI Agent interaction with the external world&lt;/strong>. It's simple enough for LLMs to easily generate protocol messages, yet powerful enough to support complex, distributed application ecosystems.&lt;/p>
&lt;p>In the future, as the MCP ecosystem matures, we can envision an &amp;ldquo;Internet of AI tools&amp;rdquo;:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Tool Marketplace&lt;/strong>: Developers can publish and sell standardized MCP servers, and other applications can purchase and integrate them as needed.&lt;/li>
&lt;li>&lt;strong>Agent Interoperability&lt;/strong>: Intelligent agents developed by different companies based on different underlying models can call each other's capabilities and collaborate on more complex tasks as long as they all &amp;ldquo;speak&amp;rdquo; the MCP language.&lt;/li>
&lt;li>&lt;strong>Dynamic Service Discovery&lt;/strong>: More advanced Agents might be able to dynamically discover and learn new MCP services, continuously expanding their capability boundaries without requiring reprogramming.&lt;/li>
&lt;/ul>
&lt;p>Therefore, understanding and mastering MCP is not just about learning a specific technology, but a key step in gaining insight into and planning for the next generation of AI application architecture.&lt;/p></description></item><item><title>LLM Tool Calling: The Key Technology Breaking AI Capability Boundaries</title><link>https://ziyanglin.netlify.app/en/post/llm-tool-calling/</link><pubDate>Mon, 30 Jun 2025 07:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/llm-tool-calling/</guid><description>&lt;h2 id="1-macro-overview-why-tool-calling-is-llms-super-plugin">1. Macro Overview: Why Tool Calling is LLM's &amp;ldquo;Super Plugin&amp;rdquo;&lt;/h2>
&lt;p>The emergence of Large Language Models (LLMs) has fundamentally changed how we interact with machines. However, LLMs have an inherent, unavoidable &amp;ldquo;ceiling&amp;rdquo;: they are essentially &amp;ldquo;probability prediction machines&amp;rdquo; trained on massive text data, with their knowledge frozen at the time their training data ends. This means an LLM cannot know &amp;ldquo;what's the weather like today?&amp;rdquo;, cannot access your company's internal database, and cannot book a flight ticket for you.&lt;/p>
&lt;p>The &lt;strong>LLM Tool Calling / Function Calling&lt;/strong> mechanism emerged precisely to break through this ceiling. It gives LLMs an unprecedented ability: &lt;strong>calling external tools (APIs, functions, databases, etc.) to obtain real-time information, perform specific tasks, or interact with the external world&lt;/strong> when needed.&lt;/p>
&lt;p>In simple terms, the tool calling mechanism upgrades LLMs from &amp;ldquo;knowledgeable conversationalists&amp;rdquo; to capable &amp;ldquo;intelligent agents.&amp;rdquo; It allows LLMs to:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Obtain real-time information&lt;/strong>: By calling weather APIs, news APIs, search engines, etc., to get the latest information beyond the model's training data.&lt;/li>
&lt;li>&lt;strong>Operate external systems&lt;/strong>: Connect to enterprise CRM/ERP systems to query data, or connect to IoT devices to control smart home appliances.&lt;/li>
&lt;li>&lt;strong>Execute complex tasks&lt;/strong>: Break down complex user instructions (like &amp;ldquo;help me find and book a cheap flight to Shanghai next week&amp;rdquo;) and complete them by calling multiple APIs in combination.&lt;/li>
&lt;li>&lt;strong>Provide more precise, verifiable answers&lt;/strong>: For queries requiring exact calculations or structured data, LLMs can call calculators or databases instead of relying on their potentially inaccurate internal knowledge.&lt;/li>
&lt;/ul>
&lt;p>Therefore, tool calling is not just a simple extension of LLM functionality, but a core foundation for building truly powerful AI applications that deeply integrate with both the physical and digital worlds.&lt;/p>
&lt;h2 id="2-core-concepts-and-workflow-how-do-llms-learn-to-use-tools">2. Core Concepts and Workflow: How Do LLMs &amp;ldquo;Learn&amp;rdquo; to Use Tools?&lt;/h2>
&lt;p>To understand the underlying logic of tool calling, we need to view it as an elegant process involving three core roles working together:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Large Language Model (LLM)&lt;/strong>: The brain and decision-maker.&lt;/li>
&lt;li>&lt;strong>Tool Definitions&lt;/strong>: A detailed &amp;ldquo;tool instruction manual.&amp;rdquo;&lt;/li>
&lt;li>&lt;strong>Developer/Client-side Code&lt;/strong>: The ultimate &amp;ldquo;executor.&amp;rdquo;&lt;/li>
&lt;/ol>
&lt;p>The LLM itself &lt;strong>never actually executes any code&lt;/strong>. Its only task, after understanding the user's intent and the &amp;ldquo;tool manual&amp;rdquo; it has, is to &lt;strong>generate a JSON data structure that precisely describes which tool should be called and with what parameters&lt;/strong>.&lt;/p>
&lt;p>Below is a visual explanation of this process:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant User
participant Client as Client/Application Layer
participant LLM as Large Language Model
participant Tools as External Tools/APIs
User-&amp;gt;&amp;gt;+Client: &amp;quot;What's the weather in Beijing today?&amp;quot;
Client-&amp;gt;&amp;gt;+LLM: Submit user request + Tool Definitions
Note over LLM: 1. Understand user intent&amp;lt;br/&amp;gt;2. Match most appropriate tool (get_weather)&amp;lt;br/&amp;gt;3. Extract required parameters (location: &amp;quot;Beijing&amp;quot;)
LLM--&amp;gt;&amp;gt;-Client: Return JSON: {&amp;quot;tool_calls&amp;quot;: [{&amp;quot;function&amp;quot;: {&amp;quot;name&amp;quot;: &amp;quot;get_weather&amp;quot;, &amp;quot;arguments&amp;quot;: &amp;quot;{\&amp;quot;location\&amp;quot;: \&amp;quot;Beijing\&amp;quot;}&amp;quot;}}]}
Client-&amp;gt;&amp;gt;+Tools: 2. Based on LLM's JSON, call the actual get_weather(&amp;quot;Beijing&amp;quot;) function
Tools--&amp;gt;&amp;gt;-Client: Return weather data (e.g.: {&amp;quot;temperature&amp;quot;: &amp;quot;25°C&amp;quot;, &amp;quot;condition&amp;quot;: &amp;quot;sunny&amp;quot;})
Client-&amp;gt;&amp;gt;+LLM: 3. Submit tool execution result back to LLM
Note over LLM: 4. Understand the data returned by the tool
LLM--&amp;gt;&amp;gt;-Client: 5. Generate user-friendly natural language response
Client-&amp;gt;&amp;gt;-User: &amp;quot;The weather in Beijing today is sunny with a temperature of 25 degrees Celsius.&amp;quot;
&lt;/code>&lt;/pre>
&lt;h3 id="process-breakdown">Process Breakdown:&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Define &amp;amp; Describe&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Developers first need to define available tools in a structured way (typically using JSON Schema). This &amp;ldquo;manual&amp;rdquo; is crucial to the entire process and must clearly tell the LLM:
&lt;ul>
&lt;li>&lt;strong>Tool name&lt;/strong> (&lt;code>name&lt;/code>): For example, &lt;code>get_weather&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Tool function description&lt;/strong> (&lt;code>description&lt;/code>): For example, &amp;ldquo;Get real-time weather information for a specified city.&amp;rdquo; This is the most important basis for the LLM to understand the tool's purpose.&lt;/li>
&lt;li>&lt;strong>Tool parameters&lt;/strong> (&lt;code>parameters&lt;/code>): Detailed definition of what inputs the tool needs, including each input's name, type (string, number, boolean, etc.), whether it's required, and parameter descriptions.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Intent Recognition &amp;amp; Parameter Extraction&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>When a user makes a request (e.g., &amp;ldquo;Check the weather in Beijing&amp;rdquo;), the developer's application sends the user's original request &lt;strong>along with all the tool definitions from step 1&lt;/strong> to the LLM.&lt;/li>
&lt;li>The LLM's core task is to do two things:
&lt;ul>
&lt;li>&lt;strong>Intent Recognition&lt;/strong>: Among all available tools, determine which tool's function description best matches the user's request. In this example, it would match &lt;code>get_weather&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Parameter Extraction&lt;/strong>: From the user's request, identify and extract values that satisfy the tool's parameter requirements. Here, it would recognize that the &lt;code>location&lt;/code> parameter value is &amp;ldquo;Beijing&amp;rdquo;.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>After completing these two steps, the LLM generates one or more &lt;code>tool_calls&lt;/code> objects, essentially saying &amp;ldquo;I suggest you call the function named &lt;code>get_weather&lt;/code> and pass in the parameter &lt;code>{ &amp;quot;location&amp;quot;: &amp;quot;Beijing&amp;quot; }&lt;/code>&amp;rdquo;.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Execute &amp;amp; Observe&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>The developer's application code receives the JSON returned by the LLM and parses this &amp;ldquo;call suggestion.&amp;rdquo;&lt;/li>
&lt;li>The application code &lt;strong>actually executes&lt;/strong> the &lt;code>get_weather(&amp;quot;Beijing&amp;quot;)&lt;/code> function locally or on the server side.&lt;/li>
&lt;li>After execution, it gets a real return result, such as a JSON object containing weather information.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Summarize &amp;amp; Respond&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>To complete the loop, the application layer needs to submit the actual execution result from the previous step back to the LLM.&lt;/li>
&lt;li>This time, the LLM's task is to understand this raw data returned by the tool (e.g., &lt;code>{&amp;quot;temperature&amp;quot;: &amp;quot;25°C&amp;quot;, &amp;quot;condition&amp;quot;: &amp;quot;sunny&amp;quot;}&lt;/code>) and convert it into a fluent, natural, user-friendly response.&lt;/li>
&lt;li>Finally, the user receives the reply &amp;ldquo;The weather in Beijing today is sunny with a temperature of 25 degrees Celsius,&amp;rdquo; and the entire process is complete.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
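&lt;p>Steps 3 and 4 above (execute the suggested call, then hand the result back) can be sketched as a small dispatch loop. The &lt;code>tool_calls&lt;/code> shape mirrors the OpenAI-style JSON from the diagram; packaging the result as a message with role &lt;code>tool&lt;/code> follows that convention, although a real integration would also pass back the call's id:&lt;/p>

```python
import json

# A local implementation backing the get_weather tool definition.
def get_weather(location: str) -> dict:
    # Stand-in for a real weather API call.
    return {"location": location, "temperature": "25°C", "condition": "sunny"}

LOCAL_FUNCTIONS = {"get_weather": get_weather}

def execute_tool_calls(tool_calls: list) -> list:
    """Run each suggested call locally and package results as tool messages."""
    messages = []
    for call in tool_calls:
        fn = LOCAL_FUNCTIONS[call["function"]["name"]]
        # The arguments field arrives as a JSON string and must be parsed first.
        args = json.loads(call["function"]["arguments"])
        result = fn(**args)
        # The raw result goes back to the LLM as a message with role "tool".
        messages.append({"role": "tool", "content": json.dumps(result)})
    return messages

# Simulated LLM output matching the tool_calls structure in the diagram above.
llm_output = [{"function": {"name": "get_weather", "arguments": '{"location": "Beijing"}'}}]
tool_messages = execute_tool_calls(llm_output)
```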
&lt;p>This process elegantly combines the LLM's natural language understanding with the external tool's execution capability, making the whole greater than the sum of its parts.&lt;/p>
&lt;h2 id="3-technical-deep-dive-analyzing-the-industry-standard-openai-tool-calling">3. Technical Deep Dive: Analyzing the Industry Standard (OpenAI Tool Calling)&lt;/h2>
&lt;p>OpenAI's API is currently the de facto standard in the field of LLM tool calling, and its design is widely emulated. Understanding its implementation details is crucial for any developer looking to integrate LLM tool calling into their applications.&lt;/p>
&lt;h3 id="31-core-api-parameters">3.1. Core API Parameters&lt;/h3>
&lt;p>When calling OpenAI's Chat Completions API, there are two main parameters related to tool calling: &lt;code>tools&lt;/code> and &lt;code>tool_choice&lt;/code>.&lt;/p>
&lt;h4 id="tools-parameter-your-toolbox">&lt;code>tools&lt;/code> Parameter: Your &amp;ldquo;Toolbox&amp;rdquo;&lt;/h4>
&lt;p>The &lt;code>tools&lt;/code> parameter is an array where you can define one or more tools. Each tool follows a fixed structure, with the core being a &lt;code>function&lt;/code> object defined based on the &lt;strong>JSON Schema&lt;/strong> specification.&lt;/p>
&lt;p>&lt;strong>Example: Defining a weather tool and a flight booking tool&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-json">[
{
&amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;,
&amp;quot;function&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;get_current_weather&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;Get real-time weather information for a specified location&amp;quot;,
&amp;quot;parameters&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
&amp;quot;properties&amp;quot;: {
&amp;quot;location&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;City and state/province name, e.g., 'San Francisco, CA'&amp;quot;
},
&amp;quot;unit&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;,
&amp;quot;enum&amp;quot;: [&amp;quot;celsius&amp;quot;, &amp;quot;fahrenheit&amp;quot;],
&amp;quot;description&amp;quot;: &amp;quot;Temperature unit&amp;quot;
}
},
&amp;quot;required&amp;quot;: [&amp;quot;location&amp;quot;]
}
}
},
{
&amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;,
&amp;quot;function&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;book_flight&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;Book a flight ticket for the user from departure to destination&amp;quot;,
&amp;quot;parameters&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
&amp;quot;properties&amp;quot;: {
&amp;quot;departure&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;Departure airport or city&amp;quot;
},
&amp;quot;destination&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;Destination airport or city&amp;quot;
},
&amp;quot;date&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;Desired departure date in YYYY-MM-DD format&amp;quot;
}
},
&amp;quot;required&amp;quot;: [&amp;quot;departure&amp;quot;, &amp;quot;destination&amp;quot;, &amp;quot;date&amp;quot;]
}
}
}
]
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Key Points Analysis&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>type&lt;/code>&lt;/strong>: Currently fixed as &lt;code>&amp;quot;function&amp;quot;&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;code>function.name&lt;/code>&lt;/strong>: Function name. Must be a combination of letters, numbers, and underscores, not exceeding 64 characters. This is the key for your code to identify which function to call.&lt;/li>
&lt;li>&lt;strong>&lt;code>function.description&lt;/code>&lt;/strong>: &lt;strong>Critically important&lt;/strong>. This is the main basis for the LLM to decide whether to select this tool. The description should clearly, accurately, and unambiguously explain what the function does. A good description can greatly improve the LLM's call accuracy.&lt;/li>
&lt;li>&lt;strong>&lt;code>function.parameters&lt;/code>&lt;/strong>: A standard JSON Schema object.
&lt;ul>
&lt;li>&lt;strong>&lt;code>type&lt;/code>&lt;/strong>: Must be &lt;code>&amp;quot;object&amp;quot;&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;code>properties&lt;/code>&lt;/strong>: Defines each parameter's name, type (&lt;code>string&lt;/code>, &lt;code>number&lt;/code>, &lt;code>boolean&lt;/code>, &lt;code>array&lt;/code>, &lt;code>object&lt;/code>), and description. The parameter description is equally important as it helps the LLM understand what information to extract from user input to fill this parameter.&lt;/li>
&lt;li>&lt;strong>&lt;code>required&lt;/code>&lt;/strong>: An array of strings listing which parameters are mandatory. If the user request lacks necessary information, the LLM might ask follow-up questions or choose not to call the tool.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
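&lt;p>Because &lt;code>function.parameters&lt;/code> is plain JSON Schema, you can validate the arguments the model produces before executing anything. Below is a minimal, illustrative sketch (not part of the OpenAI SDK) that checks &lt;code>required&lt;/code> fields, basic types, and &lt;code>enum&lt;/code> constraints in pure Python; for production use, a full validator such as the &lt;code>jsonschema&lt;/code> package is preferable:&lt;/p>

```python
import json

# Maps the JSON Schema type names used above to Python types.
TYPE_MAP = {"string": str, "number": (int, float), "boolean": bool,
            "array": list, "object": dict}

def validate_arguments(schema, arguments_json):
    """Return a list of problems; an empty list means the arguments pass."""
    problems = []
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as e:
        return [f"arguments are not valid JSON: {e}"]
    # Check that every required parameter is present
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required parameter: {name}")
    # Check each supplied parameter against its declared spec
    for name, value in args.items():
        spec = schema.get("properties", {}).get(name)
        if spec is None:
            problems.append(f"unexpected parameter: {name}")
            continue
        expected = TYPE_MAP.get(spec.get("type"))
        if expected and not isinstance(value, expected):
            problems.append(f"parameter {name} should be {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"parameter {name} must be one of {spec['enum']}")
    return problems
```

&lt;p>For example, validating &lt;code>{"location": "London"}&lt;/code> against the weather schema above returns an empty list, while omitting &lt;code>location&lt;/code> or passing an out-of-enum &lt;code>unit&lt;/code> yields a describable error you can feed back to the model.&lt;/p>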
&lt;h4 id="toolchoice-parameter-controlling-the-llms-choice">&lt;code>tool_choice&lt;/code> Parameter: Controlling the LLM's Choice&lt;/h4>
&lt;p>By default, the LLM decides on its own whether to respond with text or call one or more tools based on the user's input. The &lt;code>tool_choice&lt;/code> parameter allows you to control this behavior more precisely.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>&amp;quot;none&amp;quot;&lt;/code>&lt;/strong>: Forces the LLM not to call any tools and directly return a text response.&lt;/li>
&lt;li>&lt;strong>&lt;code>&amp;quot;auto&amp;quot;&lt;/code>&lt;/strong> (default): The LLM can freely choose whether to respond with text or call tools.&lt;/li>
&lt;li>&lt;strong>&lt;code>{&amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;, &amp;quot;function&amp;quot;: {&amp;quot;name&amp;quot;: &amp;quot;my_function&amp;quot;}}&lt;/code>&lt;/strong>: Forces the LLM to call this specific tool named &lt;code>my_function&lt;/code>.&lt;/li>
&lt;/ul>
&lt;p>This parameter is very useful in scenarios where you need to enforce a specific process or limit the LLM's capabilities.&lt;/p>
&lt;h3 id="32-requestresponse-lifecycle">3.2. Request-Response Lifecycle&lt;/h3>
&lt;p>A complete tool calling interaction involves at least two API requests.&lt;/p>
&lt;p>&lt;strong>First Request: From User to LLM&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-python"># request
response = client.chat.completions.create(
model=&amp;quot;gpt-4o&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Please book me a flight from New York to London tomorrow&amp;quot;}],
tools=my_tools, # The tool list defined above
tool_choice=&amp;quot;auto&amp;quot;
)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>First Response: LLM's &amp;ldquo;Call Suggestion&amp;rdquo;&lt;/strong>&lt;/p>
&lt;p>If the LLM decides to call a tool, the API response's &lt;code>finish_reason&lt;/code> will be &lt;code>tool_calls&lt;/code>, and the &lt;code>message&lt;/code> object will contain a &lt;code>tool_calls&lt;/code> array.&lt;/p>
&lt;pre>&lt;code class="language-json">{
&amp;quot;choices&amp;quot;: [
{
&amp;quot;finish_reason&amp;quot;: &amp;quot;tool_calls&amp;quot;,
&amp;quot;message&amp;quot;: {
&amp;quot;role&amp;quot;: &amp;quot;assistant&amp;quot;,
&amp;quot;content&amp;quot;: null,
&amp;quot;tool_calls&amp;quot;: [
{
&amp;quot;id&amp;quot;: &amp;quot;call_abc123&amp;quot;,
&amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;,
&amp;quot;function&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;book_flight&amp;quot;,
&amp;quot;arguments&amp;quot;: &amp;quot;{\&amp;quot;departure\&amp;quot;:\&amp;quot;New York\&amp;quot;,\&amp;quot;destination\&amp;quot;:\&amp;quot;London\&amp;quot;,\&amp;quot;date\&amp;quot;:\&amp;quot;2025-07-01\&amp;quot;}&amp;quot;
}
}
]
}
}
],
...
}
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Key Points Analysis&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>finish_reason&lt;/code>&lt;/strong>: A value of &lt;code>&amp;quot;tool_calls&amp;quot;&lt;/code> indicates that the LLM wants you to execute a tool call, rather than ending the conversation.&lt;/li>
&lt;li>&lt;strong>&lt;code>message.role&lt;/code>&lt;/strong>: &lt;code>assistant&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;code>message.tool_calls&lt;/code>&lt;/strong>: This is an array, meaning the LLM can request multiple tool calls at once.
&lt;ul>
&lt;li>&lt;strong>&lt;code>id&lt;/code>&lt;/strong>: A unique call ID. In subsequent requests, you'll need to use this ID to associate the tool's execution results.&lt;/li>
&lt;li>&lt;strong>&lt;code>function.name&lt;/code>&lt;/strong>: The function name the LLM suggests calling.&lt;/li>
&lt;li>&lt;strong>&lt;code>function.arguments&lt;/code>&lt;/strong>: &lt;strong>A JSON object in string form&lt;/strong>. You need to parse this string to get the specific parameters needed to call the function.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
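&lt;p>Since &lt;code>function.arguments&lt;/code> arrives as a string and the model is not guaranteed to emit well-formed JSON every time, it is worth parsing it defensively. A small helper sketch (names are our own, not from the SDK):&lt;/p>

```python
import json

def parse_tool_arguments(raw_arguments):
    """Parse the `function.arguments` string from a tool call.

    Returns (arguments, error): on success, error is None; on failure,
    arguments is None and error describes the problem, so it can be sent
    back to the model as a tool-role message instead of crashing.
    """
    try:
        parsed = json.loads(raw_arguments)
    except json.JSONDecodeError as e:
        return None, f"arguments were not valid JSON: {e}"
    if not isinstance(parsed, dict):
        return None, "arguments must decode to a JSON object"
    return parsed, None
```

&lt;p>On malformed input this returns an error string you can place in a &lt;code>tool&lt;/code> message, prompting the model to retry the call with corrected arguments.&lt;/p>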
&lt;p>&lt;strong>Second Request: Returning Tool Results to the LLM&lt;/strong>&lt;/p>
&lt;p>After executing the tool in your code, you need to send the results back to the LLM to complete the conversation. At this point, you need to construct a new &lt;code>messages&lt;/code> list that includes:&lt;/p>
&lt;ol>
&lt;li>The original user message.&lt;/li>
&lt;li>The &lt;code>assistant&lt;/code> message returned by the LLM in the previous step (containing &lt;code>tool_calls&lt;/code>).&lt;/li>
&lt;li>A new message with the &lt;code>tool&lt;/code> role, containing the tool's execution results.&lt;/li>
&lt;/ol>
&lt;pre>&lt;code class="language-python"># message history
messages = [
{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;Please book me a flight from New York to London tomorrow&amp;quot;},
response.choices[0].message, # Assistant's 'tool_calls' message
{
&amp;quot;tool_call_id&amp;quot;: &amp;quot;call_abc123&amp;quot;, # Must match the ID from the previous step
&amp;quot;role&amp;quot;: &amp;quot;tool&amp;quot;,
&amp;quot;name&amp;quot;: &amp;quot;book_flight&amp;quot;,
&amp;quot;content&amp;quot;: &amp;quot;{\&amp;quot;status\&amp;quot;: \&amp;quot;success\&amp;quot;, \&amp;quot;ticket_id\&amp;quot;: \&amp;quot;TICKET-45678\&amp;quot;}&amp;quot; # Actual return value from the tool
}
]
# second request
second_response = client.chat.completions.create(
model=&amp;quot;gpt-4o&amp;quot;,
messages=messages
)
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Second Response: LLM's Final Reply&lt;/strong>&lt;/p>
&lt;p>This time, the LLM will generate a natural language response for the user based on the tool's returned results.&lt;/p>
&lt;pre>&lt;code class="language-json">{
&amp;quot;choices&amp;quot;: [
{
&amp;quot;finish_reason&amp;quot;: &amp;quot;stop&amp;quot;,
&amp;quot;message&amp;quot;: {
&amp;quot;role&amp;quot;: &amp;quot;assistant&amp;quot;,
&amp;quot;content&amp;quot;: &amp;quot;Great! I've booked your flight from New York to London for tomorrow. Your ticket ID is TICKET-45678.&amp;quot;
}
}
],
...
}
&lt;/code>&lt;/pre>
&lt;p>With this, a complete tool calling cycle is finished.&lt;/p>
&lt;h2 id="4-code-implementation-a-complete-python-example">4. Code Implementation: A Complete Python Example&lt;/h2>
&lt;p>Below is an end-to-end Python example using OpenAI's Python library to demonstrate how to implement a weather query feature.&lt;/p>
&lt;pre>&lt;code class="language-python">import os
import json
from openai import OpenAI
from dotenv import load_dotenv
# --- 1. Initial Setup ---
load_dotenv() # Load environment variables from .env file
client = OpenAI(api_key=os.getenv(&amp;quot;OPENAI_API_KEY&amp;quot;))
# --- 2. Define Our Local Tool Functions ---
# This is a mock function; in a real application, it would call an actual weather API
def get_current_weather(location, unit=&amp;quot;celsius&amp;quot;):
&amp;quot;&amp;quot;&amp;quot;Get real-time weather information for a specified location&amp;quot;&amp;quot;&amp;quot;
if &amp;quot;New York&amp;quot; in location:
return json.dumps({
&amp;quot;location&amp;quot;: &amp;quot;New York&amp;quot;,
&amp;quot;temperature&amp;quot;: &amp;quot;10&amp;quot;,
&amp;quot;unit&amp;quot;: unit,
&amp;quot;forecast&amp;quot;: [&amp;quot;sunny&amp;quot;, &amp;quot;light breeze&amp;quot;]
})
elif &amp;quot;London&amp;quot; in location:
return json.dumps({
&amp;quot;location&amp;quot;: &amp;quot;London&amp;quot;,
&amp;quot;temperature&amp;quot;: &amp;quot;15&amp;quot;,
&amp;quot;unit&amp;quot;: unit,
&amp;quot;forecast&amp;quot;: [&amp;quot;light rain&amp;quot;, &amp;quot;northeast wind&amp;quot;]
})
else:
return json.dumps({&amp;quot;location&amp;quot;: location, &amp;quot;temperature&amp;quot;: &amp;quot;unknown&amp;quot;})
# --- 3. Main Execution Flow ---
def run_conversation(user_prompt: str):
print(f&amp;quot;👤 User: {user_prompt}&amp;quot;)
# Step 1: Send the user's message and tool definitions to the LLM
messages = [{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: user_prompt}]
tools = [
{
&amp;quot;type&amp;quot;: &amp;quot;function&amp;quot;,
&amp;quot;function&amp;quot;: {
&amp;quot;name&amp;quot;: &amp;quot;get_current_weather&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;Get real-time weather information for a specified city&amp;quot;,
&amp;quot;parameters&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
&amp;quot;properties&amp;quot;: {
&amp;quot;location&amp;quot;: {
&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;,
&amp;quot;description&amp;quot;: &amp;quot;City name, e.g., New York City&amp;quot;,
},
&amp;quot;unit&amp;quot;: {&amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;, &amp;quot;enum&amp;quot;: [&amp;quot;celsius&amp;quot;, &amp;quot;fahrenheit&amp;quot;]},
},
&amp;quot;required&amp;quot;: [&amp;quot;location&amp;quot;],
},
},
}
]
response = client.chat.completions.create(
model=&amp;quot;gpt-4o&amp;quot;,
messages=messages,
tools=tools,
tool_choice=&amp;quot;auto&amp;quot;,
)
response_message = response.choices[0].message
tool_calls = response_message.tool_calls
# Step 2: Check if the LLM decided to call a tool
if tool_calls:
print(f&amp;quot;🤖 LLM decided to call tool: {tool_calls[0].function.name}&amp;quot;)
# Add the LLM's reply to the message history
messages.append(response_message)
# Step 3: Execute the tool call
# Note: This example only handles the first tool call
tool_call = tool_calls[0]
function_name = tool_call.function.name
function_to_call = globals().get(function_name) # Get the function from the global scope
if not function_to_call:
print(f&amp;quot;❌ Error: Function {function_name} is not defined&amp;quot;)
return
function_args = json.loads(tool_call.function.arguments)
# Call the function and get the result
function_response = function_to_call(
location=function_args.get(&amp;quot;location&amp;quot;),
unit=function_args.get(&amp;quot;unit&amp;quot;),
)
print(f&amp;quot;🛠️ Tool '{function_name}' returned: {function_response}&amp;quot;)
# Step 4: Return the tool's execution result to the LLM
messages.append(
{
&amp;quot;tool_call_id&amp;quot;: tool_call.id,
&amp;quot;role&amp;quot;: &amp;quot;tool&amp;quot;,
&amp;quot;name&amp;quot;: function_name,
&amp;quot;content&amp;quot;: function_response,
}
)
print(&amp;quot;🗣️ Submitting tool result back to LLM, generating final response...&amp;quot;)
second_response = client.chat.completions.create(
model=&amp;quot;gpt-4o&amp;quot;,
messages=messages,
)
final_response = second_response.choices[0].message.content
print(f&amp;quot;🤖 LLM final response: {final_response}&amp;quot;)
return final_response
else:
# If the LLM didn't call any tools, directly return its text content
final_response = response_message.content
print(f&amp;quot;🤖 LLM direct response: {final_response}&amp;quot;)
return final_response
# --- Run Examples ---
if __name__ == &amp;quot;__main__&amp;quot;:
run_conversation(&amp;quot;What's the weather like in London today?&amp;quot;)
print(&amp;quot;\n&amp;quot; + &amp;quot;=&amp;quot;*50 + &amp;quot;\n&amp;quot;)
run_conversation(&amp;quot;How are you?&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>This example clearly demonstrates the entire process from defining tools, sending requests, handling &lt;code>tool_calls&lt;/code>, executing local functions, to sending results back to the model to get the final answer.&lt;/p>
&lt;h2 id="5-advanced-topics-and-best-practices">5. Advanced Topics and Best Practices&lt;/h2>
&lt;p>After mastering the basic process, we need to understand some advanced usage and design principles to build more robust and reliable tool calling systems.&lt;/p>
&lt;h3 id="51-parallel-tool-calling">5.1. Parallel Tool Calling&lt;/h3>
&lt;p>Newer models (like &lt;code>gpt-4o&lt;/code>) support parallel tool calling. This means the model can request multiple different, independent tools to be called in a single response.&lt;/p>
&lt;p>&lt;strong>Scenario Example&lt;/strong>: User asks: &amp;ldquo;What's the weather like in New York and London today?&amp;rdquo;&lt;/p>
&lt;p>The model might return a response containing two &lt;code>tool_calls&lt;/code>:&lt;/p>
&lt;ol>
&lt;li>&lt;code>get_current_weather(location=&amp;quot;New York&amp;quot;)&lt;/code>&lt;/li>
&lt;li>&lt;code>get_current_weather(location=&amp;quot;London&amp;quot;)&lt;/code>&lt;/li>
&lt;/ol>
&lt;p>Your code needs to be able to iterate through each &lt;code>tool_call&lt;/code> object in the &lt;code>message.tool_calls&lt;/code> array, execute them separately, collect all results, and then submit these results together in a new request to the model.&lt;/p>
&lt;p>&lt;strong>Code Handling Logic&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-python"># ... (received response_message containing multiple tool_calls)
messages.append(response_message) # Add assistant's reply to messages
# Execute functions for each tool call and collect results
tool_outputs = []
for tool_call in tool_calls:
function_name = tool_call.function.name
function_to_call = available_functions[function_name]
function_args = json.loads(tool_call.function.arguments)
output = function_to_call(**function_args)
tool_outputs.append({
&amp;quot;tool_call_id&amp;quot;: tool_call.id,
&amp;quot;role&amp;quot;: &amp;quot;tool&amp;quot;,
&amp;quot;name&amp;quot;: function_name,
&amp;quot;content&amp;quot;: output,
})
# Add all tool outputs to the message history
messages.extend(tool_outputs)
# Call the model again
second_response = client.chat.completions.create(
model=&amp;quot;gpt-4o&amp;quot;,
messages=messages
)
&lt;/code>&lt;/pre>
&lt;h3 id="52-error-handling">5.2. Error Handling&lt;/h3>
&lt;p>Tool calls are not always successful. APIs might time out, databases might be unreachable, or the function execution itself might throw exceptions. Gracefully handling these errors is crucial.&lt;/p>
&lt;p>When a tool execution fails, you should catch the exception and return structured information describing the error as the result of the tool call to the LLM.&lt;/p>
&lt;p>&lt;strong>Example&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-python">try:
# Try to call the API
result = some_flaky_api()
content = json.dumps({&amp;quot;status&amp;quot;: &amp;quot;success&amp;quot;, &amp;quot;data&amp;quot;: result})
except Exception as e:
# If it fails, return error information
content = json.dumps({&amp;quot;status&amp;quot;: &amp;quot;error&amp;quot;, &amp;quot;message&amp;quot;: f&amp;quot;API call failed: {str(e)}&amp;quot;})
# Return the result (whether successful or failed) to the LLM
messages.append({
&amp;quot;tool_call_id&amp;quot;: tool_call.id,
&amp;quot;role&amp;quot;: &amp;quot;tool&amp;quot;,
&amp;quot;name&amp;quot;: function_name,
&amp;quot;content&amp;quot;: content,
})
&lt;/code>&lt;/pre>
&lt;p>When the LLM receives error information, it typically responds to the user with an apologetic answer that reflects the problem (e.g., &amp;ldquo;Sorry, I'm currently unable to retrieve weather information. Please try again later.&amp;rdquo;) rather than causing the entire application to crash.&lt;/p>
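&lt;p>Transient failures (timeouts, rate limits) often succeed on a second attempt, so before reporting an error back to the LLM it can be worth retrying with exponential backoff. A generic sketch of this idea; the wrapper name and retry policy are our own choices, not a library API:&lt;/p>

```python
import json
import time

def call_with_retries(fn, *args, attempts=3, base_delay=0.5, **kwargs):
    """Call fn, retrying with exponential backoff on any exception.

    Returns a JSON string suitable as the content of a tool-role message,
    mirroring the {"status": ...} shape used in the example above.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            result = fn(*args, **kwargs)
            return json.dumps({"status": "success", "data": result})
        except Exception as e:
            last_error = e
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    # All attempts exhausted: report a structured error to the LLM
    return json.dumps({"status": "error",
                       "message": f"failed after {attempts} attempts: {last_error}"})
```

&lt;p>The returned JSON string can be dropped directly into the &lt;code>content&lt;/code> field of the &lt;code>tool&lt;/code> message, so the LLM sees either the data or a clear failure description.&lt;/p>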
&lt;h3 id="53-designing-effective-tool-descriptions">5.3. Designing Effective Tool Descriptions&lt;/h3>
&lt;p>&lt;strong>The quality of the tool description (&lt;code>description&lt;/code>) directly determines the LLM's call accuracy.&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Clear and Specific&lt;/strong>: Avoid using vague terms.
&lt;ul>
&lt;li>&lt;strong>Bad&lt;/strong>: &amp;ldquo;Get data&amp;rdquo;&lt;/li>
&lt;li>&lt;strong>Good&lt;/strong>: &amp;ldquo;Query the user's order history from the company's CRM system based on user ID&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Include Key Information and Limitations&lt;/strong>: If the tool has specific limitations, be sure to mention them in the description.
&lt;ul>
&lt;li>&lt;strong>Example&lt;/strong>: &amp;ldquo;Query flight information. Note: This tool can only query flights within the next 30 days and cannot query historical flights.&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Start with a Verb&lt;/strong>: Use a clear verb to describe the core functionality of the function.&lt;/li>
&lt;li>&lt;strong>Clear Parameter Descriptions&lt;/strong>: The &lt;code>description&lt;/code> of parameters is equally important; it guides the LLM on how to correctly extract information from user conversations.
&lt;ul>
&lt;li>&lt;strong>Bad&lt;/strong>: &lt;code>&amp;quot;date&amp;quot;: &amp;quot;A date&amp;quot;&lt;/code>&lt;/li>
&lt;li>&lt;strong>Good&lt;/strong>: &lt;code>&amp;quot;date&amp;quot;: &amp;quot;Booking date, must be a string in YYYY-MM-DD format&amp;quot;&lt;/code>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
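&lt;p>Putting these guidelines together, here is a hypothetical tool definition (the tool and its behavior are illustrative, not from the article's examples) that uses a verb-first description, states its limitation up front, and format-constrains every parameter:&lt;/p>

```python
# Hypothetical tool definition following the description guidelines above
flight_search_tool = {
    "type": "function",
    "function": {
        "name": "search_flights",
        # Verb-first, specific, and states the tool's limitation explicitly
        "description": (
            "Search available flights between two cities. "
            "Note: this tool only supports departure dates within the next 30 days."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {
                    "type": "string",
                    "description": "Departure city name, e.g., 'New York'",
                },
                "destination": {
                    "type": "string",
                    "description": "Arrival city name, e.g., 'London'",
                },
                "date": {
                    "type": "string",
                    "description": "Departure date, must be a string in YYYY-MM-DD format",
                },
            },
            "required": ["origin", "destination", "date"],
        },
    },
}
```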
&lt;h3 id="54-security-considerations">5.4. Security Considerations&lt;/h3>
&lt;p>Giving LLMs the ability to call code is a double-edged sword and must be handled with caution.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Never Execute Code Generated by LLMs&lt;/strong>: The LLM's output is a &amp;ldquo;call suggestion,&amp;rdquo; not executable code. Never use &lt;code>eval()&lt;/code> or similar methods to directly execute strings generated by LLMs. You should parse the suggested function name and parameters, then call your pre-defined, safe, and trusted local functions.&lt;/li>
&lt;li>&lt;strong>Confirmation and Authorization&lt;/strong>: For operations with serious consequences (like deleting data, sending emails, making payments), implement a confirmation mechanism before execution. This could be forcing user confirmation at the code level or having the LLM generate a confirmation message after generating the call suggestion.&lt;/li>
&lt;li>&lt;strong>Principle of Least Privilege&lt;/strong>: Only provide the LLM with the minimum tools necessary to complete its task. Don't expose your entire codebase or irrelevant APIs.&lt;/li>
&lt;/ul>
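&lt;p>The first two points can be combined into a small dispatch layer: an explicit allowlist of callables (never &lt;code>eval()&lt;/code> or unrestricted &lt;code>globals()&lt;/code> lookups) plus a confirmation gate for destructive operations. A minimal sketch; the tool names below are hypothetical:&lt;/p>

```python
# Hypothetical local tools for illustration
def get_order_status(order_id):
    return {"order_id": order_id, "status": "shipped"}

def delete_order(order_id):
    return {"order_id": order_id, "deleted": True}

# Explicit allowlist: only these functions can ever be invoked,
# regardless of what name the model suggests.
SAFE_TOOLS = {"get_order_status": get_order_status, "delete_order": delete_order}
REQUIRES_CONFIRMATION = {"delete_order"}  # operations with serious consequences

def dispatch_tool_call(name, arguments, confirmed=False):
    fn = SAFE_TOOLS.get(name)
    if fn is None:
        # Unknown or disallowed tool name: refuse, never look it up dynamically
        return {"status": "error", "message": f"unknown tool: {name}"}
    if name in REQUIRES_CONFIRMATION and not confirmed:
        # Surface a confirmation request to the user instead of executing
        return {"status": "needs_confirmation", "tool": name, "arguments": arguments}
    return {"status": "success", "data": fn(**arguments)}
```

&lt;p>With this pattern, a suggested call to anything outside &lt;code>SAFE_TOOLS&lt;/code> fails closed, and destructive tools execute only after an explicit confirmation flag is set by your application logic.&lt;/p>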
&lt;h2 id="6-conclusion-and-future-outlook">6. Conclusion and Future Outlook&lt;/h2>
&lt;p>LLM tool calling is one of the most breakthrough advances in artificial intelligence in recent years. It transforms LLMs from closed &amp;ldquo;language brains&amp;rdquo; into open, extensible &amp;ldquo;intelligent agent&amp;rdquo; cores capable of interacting with the world. By combining the powerful natural language understanding capabilities of LLMs with the unlimited functionality of external tools, we can build unprecedented intelligent applications.&lt;/p>
&lt;p>From querying weather and booking hotels to controlling smart homes, analyzing corporate financial reports, and automating software development processes, tool calling is unlocking countless possibilities. As model capabilities continue to strengthen, tool description understanding will become more precise, multi-tool coordination will become more complex and intelligent, and error handling and self-correction capabilities will become stronger.&lt;/p>
&lt;p>In the future, we may see more complex Agentic architectures where LLMs not only call tools but can dynamically create, combine, and even optimize tools. Mastering the principles and practices of LLM tool calling is not only an essential skill to keep up with the current AI technology wave but also a key to future intelligent application development.&lt;/p></description></item><item><title>TensorRT In-Depth: High-Performance Deep Learning Inference Engine</title><link>https://ziyanglin.netlify.app/en/post/tensorrt-documentation/</link><pubDate>Mon, 30 Jun 2025 06:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/tensorrt-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>NVIDIA® TensorRT™ is a software development kit (SDK) for high-performance deep learning inference on NVIDIA GPUs. It is designed to optimize and accelerate trained neural networks, enabling them to run in production environments with low latency and high throughput. TensorRT takes models from mainstream deep learning frameworks (such as TensorFlow, PyTorch, ONNX, etc.), applies a series of sophisticated optimization techniques, and generates a highly optimized runtime engine.&lt;/p>
&lt;p>This document will provide an in-depth yet accessible introduction to TensorRT's core concepts, key features, workflow, and latest functionalities (including TensorRT-LLM specifically designed for accelerating large language models), helping developers fully leverage its powerful performance advantages.&lt;/p>
&lt;h2 id="2-core-concepts">2. Core Concepts&lt;/h2>
&lt;p>Understanding TensorRT's core components is the first step to using it effectively.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Engine&lt;/strong>: The core of TensorRT. It is an optimized model representation that includes a computation graph and weights generated for a specific GPU architecture and configuration (such as batch size, precision). The Engine is immutable and is the final product for deployment.&lt;/li>
&lt;li>&lt;strong>Builder (&lt;code>IBuilder&lt;/code>)&lt;/strong>: This is the main interface for creating an Engine. The Builder takes a network definition and applies various optimizations, ultimately generating an optimized plan for the target GPU, which can be serialized into an Engine.&lt;/li>
&lt;li>&lt;strong>Network Definition (&lt;code>INetworkDefinition&lt;/code>)&lt;/strong>: This is where you define the model structure. You can build the network manually from scratch or import it from a model file using a Parser.&lt;/li>
&lt;li>&lt;strong>Parser&lt;/strong>: Used to parse models from different frameworks (primarily ONNX format) and convert them into TensorRT's network definition. TensorRT provides a powerful ONNX parser.&lt;/li>
&lt;li>&lt;strong>Profiler (&lt;code>IProfiler&lt;/code>)&lt;/strong>: An optional interface that allows you to collect and query information about layer performance during the build process. This helps with debugging and understanding which layers are performance bottlenecks.&lt;/li>
&lt;li>&lt;strong>Execution Context (&lt;code>IExecutionContext&lt;/code>)&lt;/strong>: This is the main interface for executing inference. An Engine can have multiple Execution Contexts, allowing concurrent execution of inference tasks. Each context maintains its own inputs, outputs, and state.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Model Building Offline&amp;quot;
A[Original Model&amp;lt;br&amp;gt;TensorFlow/PyTorch] --&amp;gt; B{ONNX Parser};
B --&amp;gt; C[Network Definition];
C --&amp;gt; D[Builder];
D -- Optimization Config --&amp;gt; E[Optimized Plan];
E --&amp;gt; F((Engine));
end
subgraph &amp;quot;Inference Deployment Online&amp;quot;
F --&amp;gt; G[Execution Context];
H[Input Data] --&amp;gt; G;
G --&amp;gt; I[Output Results];
end
style F fill:#f9f,stroke:#333,stroke-width:2px
style G fill:#ccf,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;h2 id="3-key-features-and-optimization-techniques">3. Key Features and Optimization Techniques&lt;/h2>
&lt;p>TensorRT's high performance stems from its advanced optimization techniques.&lt;/p>
&lt;h3 id="31-precision-calibration--quantization">3.1. Precision Calibration &amp;amp; Quantization&lt;/h3>
&lt;p>TensorRT supports multiple precisions for inference, including FP32, FP16, INT8, and the latest FP8. Among these, INT8 quantization is a key technology for improving performance and reducing memory usage.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Post-Training Quantization (PTQ)&lt;/strong>: Determines the scaling factors needed to convert FP32 weights and activation values to INT8 through a calibration dataset, without retraining the model.&lt;/li>
&lt;li>&lt;strong>Quantization-Aware Training (QAT)&lt;/strong>: Simulates quantization operations during training, making the model more robust to quantization errors, thus achieving higher accuracy when converted to INT8.&lt;/li>
&lt;/ul>
&lt;p>You can use &lt;code>QuantizationSpec&lt;/code> to precisely control which layers or types of layers need to be quantized.&lt;/p>
&lt;pre>&lt;code class="language-python"># Example: Only quantize 'Conv2D' type layers
q_spec = QuantizationSpec()
q_spec.add(name='Conv2D', is_keras_class=True)
q_model = quantize_model(model, quantization_mode='partial', quantization_spec=q_spec)
&lt;/code>&lt;/pre>
&lt;h3 id="32-layer--tensor-fusion">3.2. Layer &amp;amp; Tensor Fusion&lt;/h3>
&lt;p>TensorRT intelligently merges multiple independent layers into a single, more complex layer. This reduces the number of CUDA kernel launches and memory reads/writes, significantly lowering latency.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Vertical Fusion&lt;/strong>: Merges consecutive layers with the same data dependencies (such as Conv, Bias, ReLU) into a single CBR layer.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Before Fusion&amp;quot;
A[Input] --&amp;gt; B(Conv);
B --&amp;gt; C(Bias);
C --&amp;gt; D(ReLU);
D --&amp;gt; E[Output];
end
subgraph &amp;quot;After Fusion&amp;quot;
A2[Input] --&amp;gt; F((Conv + Bias + ReLU));
F --&amp;gt; E2[Output];
end
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Horizontal Fusion&lt;/strong>: Merges parallel layers that have the same input but perform different operations.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Before Fusion&amp;quot;
A[Input] --&amp;gt; B(Conv A);
A --&amp;gt; C(Conv B);
B --&amp;gt; D[Output A];
C --&amp;gt; E[Output B];
end
subgraph &amp;quot;After Fusion&amp;quot;
A2[Input] --&amp;gt; F((Conv A + Conv B));
F --&amp;gt; D2[Output A];
F --&amp;gt; E2[Output B];
end
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h3 id="33-kernel-autotuning">3.3. Kernel Auto-Tuning&lt;/h3>
&lt;p>For specific target GPU architectures, TensorRT selects the optimal CUDA kernel for each layer from a library containing multiple implementations. It tests different algorithms and implementations based on the current batch size, input dimensions, and parameters to find the fastest one.&lt;/p>
&lt;h3 id="34-dynamic-shapes">3.4. Dynamic Shapes&lt;/h3>
&lt;p>TensorRT can handle models with input tensor dimensions that vary at runtime. When building an Engine, you can specify an optimization profile that includes minimum, optimal, and maximum dimensions for inputs. TensorRT will generate an Engine that can efficiently handle any input dimensions within the specified range.&lt;/p>
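&lt;p>With &lt;code>trtexec&lt;/code>, such an optimization profile can be specified directly on the command line via its shape flags; here &lt;code>input&lt;/code> stands in for your model's actual input tensor name, and the dimensions are illustrative:&lt;/p>

```bash
# Build an engine whose batch dimension may vary from 1 to 32 at runtime,
# with 8 as the shape TensorRT optimizes for
trtexec --onnx=model.onnx --saveEngine=model.engine \
  --minShapes=input:1x3x224x224 \
  --optShapes=input:8x3x224x224 \
  --maxShapes=input:32x3x224x224
```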
&lt;h3 id="35-plugins">3.5. Plugins&lt;/h3>
&lt;p>For custom or special layers not natively supported by TensorRT, you can implement your own logic through the plugin API (&lt;code>IPluginV2&lt;/code>). This provides great extensibility for TensorRT.&lt;/p>
&lt;p>The latest versions of TensorRT have greatly simplified the plugin registration process through decorators, especially for the Python API.&lt;/p>
&lt;pre>&lt;code class="language-python"># Example: Register a simple element-wise addition plugin
import tensorrt.plugin as trtp
@trtp.register(&amp;quot;sample::elemwise_add_plugin&amp;quot;)
def add_plugin_desc(inp0: trtp.TensorDesc, block_size: int) -&amp;gt; trtp.TensorDesc:
return inp0.like()
&lt;/code>&lt;/pre>
&lt;h3 id="36-sparsity">3.6. Sparsity&lt;/h3>
&lt;p>TensorRT supports leveraging structured sparsity features on NVIDIA Ampere and higher architecture GPUs. If your model weights have a 2:4 sparsity pattern, TensorRT can utilize sparse tensor cores to further accelerate computation, nearly doubling performance.&lt;/p>
&lt;h2 id="4-workflow">4. Workflow&lt;/h2>
&lt;p>A typical TensorRT deployment workflow is as follows:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant D as Developer
participant TF as TensorFlow/PyTorch
participant ONNX
participant Poly as Polygraphy
participant TRT as TensorRT (trtexec/API)
participant App as Application
D-&amp;gt;&amp;gt;TF: Train Model
TF--&amp;gt;&amp;gt;D: Generate Trained Model
D-&amp;gt;&amp;gt;ONNX: Export to ONNX Format
ONNX--&amp;gt;&amp;gt;D: .onnx File
D-&amp;gt;&amp;gt;Poly: Use Polygraphy to Check and Optimize
Poly--&amp;gt;&amp;gt;D: Optimized .onnx File
D-&amp;gt;&amp;gt;TRT: Build Engine (FP16/INT8)
TRT--&amp;gt;&amp;gt;D: Generate .engine File
D-&amp;gt;&amp;gt;App: Deploy Engine
App-&amp;gt;&amp;gt;App: Load Engine and Create Execution Context
loop Inference Loop
App-&amp;gt;&amp;gt;App: Prepare Input Data
App-&amp;gt;&amp;gt;App: Execute Inference
App-&amp;gt;&amp;gt;App: Get Output Results
end
&lt;/code>&lt;/pre>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Model Export&lt;/strong>: Export your trained model from your training framework (such as PyTorch or TensorFlow) to ONNX format. ONNX is an open model exchange format that serves as a bridge between training and inference.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Model Inspection and Optimization (Polygraphy)&lt;/strong>: Before building an Engine, it is strongly recommended to use the &lt;strong>Polygraphy&lt;/strong> toolkit to inspect, modify, and optimize your ONNX model. Polygraphy is a powerful tool that can:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Inspect Models&lt;/strong>: Display information about the model's layers, inputs, outputs, etc.&lt;/li>
&lt;li>&lt;strong>Constant Folding&lt;/strong>: Pre-compute constant expressions in the model, simplifying the computation graph.
&lt;pre>&lt;code class="language-bash">polygraphy surgeon sanitize model.onnx -o folded.onnx --fold-constants
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Compare Outputs from Different Frameworks&lt;/strong>: Verify that TensorRT's output is consistent with the original framework (such as ONNX Runtime) to troubleshoot precision issues.
&lt;pre>&lt;code class="language-bash">polygraphy run model.onnx --trt --onnxrt
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Handle Data-Dependent Shapes (DDS)&lt;/strong>: Identify and set upper bounds for tensors with data-dependent shapes.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Build Engine&lt;/strong>: Use the &lt;code>trtexec&lt;/code> command-line tool or TensorRT's C++/Python API to build an Engine.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;code>trtexec&lt;/code>&lt;/strong>: A convenient command-line tool for quickly building an Engine from an ONNX file and conducting performance benchmarking.
&lt;pre>&lt;code class="language-bash">trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>API&lt;/strong>: Provides more flexible control, such as defining optimization profiles for dynamic shapes, configuring plugins, etc.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Deployment and Inference&lt;/strong>: Load the serialized Engine file into your application and use an Execution Context to perform inference.&lt;/p>
&lt;pre>&lt;code class="language-python"># Using Polygraphy's TrtRunner for inference
from polygraphy.backend.trt import TrtRunner, EngineFromBytes
# Load Engine
engine = EngineFromBytes(open(&amp;quot;model.engine&amp;quot;, &amp;quot;rb&amp;quot;).read())
with TrtRunner(engine) as runner:
# Prepare input data
feed_dict = {&amp;quot;input_name&amp;quot;: input_data}
# Execute inference
outputs = runner.infer(feed_dict=feed_dict)
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ol>
&lt;h2 id="5-latest-feature-highlights">5. Latest Feature Highlights&lt;/h2>
&lt;p>TensorRT is rapidly iterating, and here are some of the latest important features:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Polygraphy Tool Enhancements&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Simplified CLI Syntax&lt;/strong>: Allows specifying both script and function name in a single parameter (&lt;code>my_script.py:my_func&lt;/code>).&lt;/li>
&lt;li>&lt;strong>Improved Input Specification&lt;/strong>: Uses a new list-style syntax (&lt;code>--input-shapes input0:[x,y,z]&lt;/code>) to avoid ambiguity.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Quickly Deployable Plugins&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>The Python API has introduced the &lt;code>@trtp.register&lt;/code> and &lt;code>@trt.plugin.autotune&lt;/code> decorators, making it unprecedentedly simple to define, register, and auto-tune plugins without writing C++ code.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>CUDA Graphs&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Through the &lt;code>--use-cuda-graph&lt;/code> flag, TensorRT can leverage CUDA Graphs to capture the entire inference process, further reducing CPU overhead and kernel launch latency, particularly suitable for scenarios with fixed model structures.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>FP8 Support&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>On Hopper and higher architecture GPUs, TensorRT supports FP8 inference, providing higher performance and lower memory usage for large language models and other applications.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="6-appendix-common-commands">6. Appendix: Common Commands&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Install Polygraphy&lt;/strong>:
&lt;pre>&lt;code class="language-bash">python3 -m pip install polygraphy --extra-index-url https://pypi.ngc.nvidia.com
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Build and Install TensorRT Open Source Components&lt;/strong>:
&lt;pre>&lt;code class="language-bash"># From source directory
make install
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Run pytest Tests&lt;/strong>:
&lt;pre>&lt;code class="language-bash">pytest --verbose
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h2 id="7-tensorrtllm-born-for-large-language-model-inference">7. TensorRT-LLM: Born for Large Language Model Inference&lt;/h2>
&lt;p>As the scale and complexity of large language models (LLMs) grow exponentially, traditional inference optimization methods face unprecedented challenges. To address these challenges, NVIDIA has introduced TensorRT-LLM, an open-source library specifically designed to accelerate and optimize LLM inference. It is built on top of TensorRT and encapsulates a series of cutting-edge optimization techniques for LLMs.&lt;/p>
&lt;h3 id="71-what-is-tensorrtllm">7.1. What is TensorRT-LLM?&lt;/h3>
&lt;p>TensorRT-LLM can be thought of as an &amp;ldquo;LLM expert version&amp;rdquo; of TensorRT. It provides a Python API that allows developers to easily define LLM models and automatically apply various state-of-the-art optimizations. Ultimately, it generates a high-performance TensorRT engine that can be directly deployed.&lt;/p>
&lt;p>Unlike general TensorRT which mainly handles static graphs, TensorRT-LLM specifically addresses the dynamic characteristics in LLM inference, such as:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Autoregressive Generation&lt;/strong>: Each newly generated token depends on the previous tokens, resulting in dynamically changing input sequence lengths.&lt;/li>
&lt;li>&lt;strong>Enormous Model Scale&lt;/strong>: Model parameters often number in the billions or even hundreds of billions, making it impossible to deploy on a single GPU.&lt;/li>
&lt;li>&lt;strong>Massive KV Cache&lt;/strong>: The inference process requires storing a large number of key-value pairs (Key-Value Cache), placing extremely high demands on memory bandwidth and capacity.&lt;/li>
&lt;/ul>
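&lt;p>To make the KV cache pressure concrete, here is a back-of-envelope calculation (illustrative numbers roughly matching a Llama-2-7B-style configuration; exact figures depend on the model):&lt;/p>
&lt;pre>&lt;code class="language-python"># Rough KV-cache size for a single sequence.
# The factor of 2 accounts for the separate Key and Value tensors.
def kv_cache_bytes(layers, heads, head_dim, seq_len, dtype_bytes=2):
    return 2 * layers * heads * head_dim * seq_len * dtype_bytes

# 32 layers, 32 heads, head_dim 128, FP16 (2 bytes), 4096-token context
size = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=4096)
print(size / 2**30, 'GiB per sequence')  # 2.0 GiB per sequence
&lt;/code>&lt;/pre>
&lt;p>At roughly 2 GiB per 4096-token sequence, even a modest batch quickly exhausts GPU memory, which is exactly the pressure that techniques like the Paged KV Cache are designed to relieve.&lt;/p>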
&lt;h3 id="72-core-architecture-and-components">7.2. Core Architecture and Components&lt;/h3>
&lt;p>TensorRT-LLM's architecture is divided into frontend and backend:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Python API (&lt;code>tensorrt_llm&lt;/code>)&lt;/strong>: This is the main interface for user interaction. It defines models in a declarative way (similar to PyTorch), allowing developers to avoid dealing with the complex underlying TensorRT C++ API.&lt;/li>
&lt;li>&lt;strong>C++ Backend&lt;/strong>: This is the core that actually performs the optimization, containing pre-written, highly optimized CUDA kernels, LLM-specific optimization passes, and a runtime that can efficiently handle LLM tasks.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-mermaid">graph TD;
subgraph &amp;quot;Frontend (Python API)&amp;quot;
A[Hugging Face / Custom Model] --&amp;gt;|Weights| B(Model Definition&amp;lt;br&amp;gt;tensorrt_llm.Module);
B --&amp;gt; C{Builder};
C -- Generate Network and Config --&amp;gt; D[Network Definition];
end
subgraph &amp;quot;Backend (C++ Runtime)&amp;quot;
D --&amp;gt; E[TensorRT-LLM Optimization];
E --&amp;gt; F((LLM Optimized Engine));
end
subgraph &amp;quot;Inference&amp;quot;
F --&amp;gt; G[C++/Python Runtime];
H[Input Prompts] --&amp;gt; G;
G --&amp;gt; I[Output Tokens];
end
style F fill:#c9f,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;h3 id="73-key-optimization-techniques-llmspecific">7.3. Key Optimization Techniques (LLM-Specific)&lt;/h3>
&lt;p>The magic of TensorRT-LLM lies in its optimization techniques specifically designed for LLMs.&lt;/p>
&lt;h4 id="731-inflight-batching-also-known-as-continuous-batching">7.3.1. In-Flight Batching (also known as Continuous Batching)&lt;/h4>
&lt;p>&lt;strong>Problem&lt;/strong>: Traditional static batching requires all requests to wait until a batch is formed before processing them together. Due to the varying generation lengths of each request, this leads to significant GPU idle time (&amp;ldquo;bubbles&amp;rdquo;), as the batch must wait for the slowest request to complete.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: In-Flight Batching allows the server to dynamically add new requests while the GPU is running. Once a request completes, its computational resources are immediately released and allocated to new requests in the waiting queue. This greatly improves GPU utilization and overall system throughput.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">gantt
title GPU Utilization Comparison
dateFormat X
axisFormat %S
section Static Batching
Request A: 0, 6
Request B: 0, 3
Request C: 0, 5
GPU Waiting : 3, 3
GPU Waiting : 5, 1
section In-Flight Batching
Request A : 0, 6
Request B : 0, 3
Request C : 0, 5
New Request D : 3, 4
&lt;/code>&lt;/pre>
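&lt;p>The scheduling difference can be illustrated with a toy step-counting simulation (a sketch for intuition only, not TensorRT-LLM code: each request needs a given number of decode steps, and the GPU serves up to &lt;code>slots&lt;/code> requests per step):&lt;/p>
&lt;pre>&lt;code class="language-python">def static_batching_steps(lengths, slots):
    # Fixed batches: each batch holds the GPU until its longest request finishes.
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total

def inflight_batching_steps(lengths, slots):
    # A finished request's slot is refilled immediately from the waiting queue.
    queue, active, steps = list(lengths), [], 0
    while queue or active:
        while queue and slots - len(active):   # refill any free slots
            active.append(queue.pop(0))
        steps += 1
        active = [r - 1 for r in active if r - 1]  # drop finished requests
    return steps

lengths = [6, 3, 5, 4]  # decode steps needed by requests A, B, C, D
print(static_batching_steps(lengths, slots=3))    # 10
print(inflight_batching_steps(lengths, slots=3))  # 7
&lt;/code>&lt;/pre>
&lt;p>With the same workload, in-flight batching finishes in 7 steps instead of 10, because the slots freed by the short requests B and C are immediately reused for D rather than idling until the whole batch completes.&lt;/p>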
&lt;h4 id="732-paged-kv-cache--attention">7.3.2. Paged KV Cache &amp;amp; Attention&lt;/h4>
&lt;p>&lt;strong>Problem&lt;/strong>: In the autoregressive generation process, the KV cache grows linearly with sequence length, consuming large amounts of GPU memory. The traditional approach is to pre-allocate a continuous memory block for each request that can accommodate the maximum sequence length, leading to severe memory fragmentation and waste.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: Inspired by operating system virtual memory paging, TensorRT-LLM introduced Paged KV Cache. It divides the KV cache into fixed-size &amp;ldquo;blocks&amp;rdquo; and allocates them as needed.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Non-contiguous Storage&lt;/strong>: KV caches for logically continuous tokens can be stored in physically non-contiguous blocks.&lt;/li>
&lt;li>&lt;strong>Memory Sharing&lt;/strong>: For complex scenarios (such as parallel sampling, Beam Search), different sequences can share the same KV cache blocks (e.g., sharing the cache for the prompt portion), significantly saving memory.&lt;/li>
&lt;li>&lt;strong>Optimized Attention Kernels&lt;/strong>: TensorRT-LLM uses specially optimized Attention kernels such as FlashAttention and MQA/GQA that can directly operate on these non-contiguous cache blocks, avoiding data copy overhead.&lt;/li>
&lt;/ul>
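&lt;p>The core bookkeeping of a paged cache can be sketched in a few lines (a simplified illustration of the idea, not the actual TensorRT-LLM implementation; real block sizes are larger, e.g. dozens of tokens):&lt;/p>
&lt;pre>&lt;code class="language-python">class PagedKVCache:
    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.table = {}    # maps a sequence id to its list of physical block ids
        self.length = {}   # maps a sequence id to its number of cached tokens

    def append_token(self, seq_id):
        # Allocate a new physical block only when the current one is full.
        n = self.length.get(seq_id, 0)
        if n % self.block_size == 0:
            self.table.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.length[seq_id] = n + 1

    def free(self, seq_id):
        # A finished sequence returns its blocks to the pool immediately.
        self.free_blocks.extend(self.table.pop(seq_id, []))
        self.length.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(5):
    cache.append_token('seq-A')    # 5 tokens span 2 blocks of size 4
print(cache.table['seq-A'])        # two block ids, possibly non-contiguous
cache.free('seq-A')
print(len(cache.free_blocks))      # 8
&lt;/code>&lt;/pre>
&lt;p>Logically adjacent tokens of &lt;code>seq-A&lt;/code> live in whichever physical blocks happened to be free, and freeing the sequence returns them to the pool with no fragmentation; this block indirection is also what lets multiple sequences share the same prompt blocks.&lt;/p>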
&lt;h4 id="733-tensor--pipeline-parallelism">7.3.3. Tensor &amp;amp; Pipeline Parallelism&lt;/h4>
&lt;p>For large models that cannot fit on a single GPU, TensorRT-LLM has built-in seamless support for tensor parallelism and pipeline parallelism. Developers only need to specify the parallelism degree (&lt;code>tp_size&lt;/code>, &lt;code>pp_size&lt;/code>) during building, and TensorRT-LLM will automatically handle model splitting and cross-GPU communication.&lt;/p>
&lt;pre>&lt;code class="language-bash"># Example: Build a Llama model with 2-way tensor parallelism
python3 examples/llama/convert_checkpoint.py \
--model_dir ./llama-7b-hf \
--output_dir ./tllm_checkpoint_tp2 \
--dtype float16 \
--tp_size 2
&lt;/code>&lt;/pre>
&lt;h4 id="734-advanced-quantization-support-fp8int4int8">7.3.4. Advanced Quantization Support (FP8/INT4/INT8)&lt;/h4>
&lt;p>The enormous parameter count of LLMs makes them ideal candidates for quantization. TensorRT-LLM supports various advanced quantization schemes:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>FP8&lt;/strong>: On NVIDIA Hopper and higher architecture GPUs, FP8 provides precision close to FP16 while significantly improving performance and reducing memory usage.&lt;/li>
&lt;li>&lt;strong>INT8 SmoothQuant&lt;/strong>: A technique that quantizes both activations and weights, achieving INT8 acceleration while maintaining high precision.&lt;/li>
&lt;li>&lt;strong>INT4/INT8 Weight-Only Quantization (W4A16/W8A16)&lt;/strong>: This is a very popular technique that only quantizes model weights (the largest part of parameters) to INT4 or INT8, while keeping activations in FP16. This greatly reduces memory usage with minimal impact on accuracy.&lt;/li>
&lt;/ul>
&lt;pre>&lt;code class="language-bash"># Example: Build a model with INT4 weight-only quantization
python convert_checkpoint.py --model_dir ./gpt-j-6b \
--dtype float16 \
--use_weight_only \
--weight_only_precision int4 \
--output_dir ./trt_ckpt/gptj_int4wo_tp1/
&lt;/code>&lt;/pre>
&lt;h3 id="74-tensorrtllm-workflow">7.4. TensorRT-LLM Workflow&lt;/h3>
&lt;p>A typical TensorRT-LLM workflow is as follows:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">sequenceDiagram
participant D as Developer
participant HF as Hugging Face Hub
participant Conv as convert_checkpoint.py
participant Build as trtllm-build
participant App as Inference Application (Python/C++)
D-&amp;gt;&amp;gt;HF: Download Model Weights
HF--&amp;gt;&amp;gt;D: model_dir
D-&amp;gt;&amp;gt;Conv: Run Conversion Script (Specify Precision, Parallelism, etc.)
Conv--&amp;gt;&amp;gt;D: Generate TensorRT-LLM Checkpoint
D-&amp;gt;&amp;gt;Build: Run Build Command (Specify Plugins, BatchSize, etc.)
Build--&amp;gt;&amp;gt;D: Generate Optimized .engine File
D-&amp;gt;&amp;gt;App: Load Engine and Run Inference
App--&amp;gt;&amp;gt;D: Return Generation Results
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>End-to-End Example (Using Llama-7B)&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Convert Weights&lt;/strong>:
&lt;pre>&lt;code class="language-bash">git clone https://huggingface.co/meta-llama/Llama-2-7b-hf
python3 examples/llama/convert_checkpoint.py \
--model_dir ./Llama-2-7b-hf \
--output_dir ./tllm_checkpoint_1gpu \
--dtype float16
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Build Engine&lt;/strong>:
&lt;pre>&lt;code class="language-bash">trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu \
--output_dir ./trt_engines/llama_7b \
--gpt_attention_plugin float16 \
--gemm_plugin float16
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Run Inference&lt;/strong>:
&lt;pre>&lt;code class="language-bash">python3 examples/run.py --max_output_len=100 \
--tokenizer_dir ./Llama-2-7b-hf \
--engine_dir=./trt_engines/llama_7b
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ol>
&lt;h3 id="75-convenient-highlevel-api-llm">7.5. Convenient High-Level API (&lt;code>LLM&lt;/code>)&lt;/h3>
&lt;p>To further simplify the development process, TensorRT-LLM provides a high-level API called &lt;code>LLM&lt;/code>. This interface encapsulates model loading, building, saving, and inference into a simple class, allowing developers to complete all operations in just a few lines of code.&lt;/p>
&lt;pre>&lt;code class="language-python">from tensorrt_llm import LLM
# 1. Initialize the LLM object; if no engine exists yet, one is built automatically from the Hugging Face model.
# Optimizations such as In-Flight Batching and Paged KV-Cache are applied here.
llm = LLM(
model=&amp;quot;meta-llama/Llama-2-7b-hf&amp;quot;,
tensor_parallel_size=1,
)
# 2. (Optional) Save the built engine for later use
llm.save(&amp;quot;llama_engine_dir&amp;quot;)
# 3. Run inference
prompt = &amp;quot;NVIDIA TensorRT-LLM is&amp;quot;
for output in llm.generate([prompt], max_new_tokens=50):
print(output)
&lt;/code>&lt;/pre>
&lt;p>This high-level API is ideal for rapid prototyping and deployment.&lt;/p>
&lt;h3 id="76-conclusion">7.6. Conclusion&lt;/h3>
&lt;p>TensorRT-LLM is not simply applying TensorRT to LLMs, but a comprehensive solution fundamentally redesigned for LLM inference, containing multiple state-of-the-art optimizations. Through In-Flight Batching, Paged KV-Cache, native parallel support, and advanced quantization schemes, it can maximize the hardware performance of NVIDIA GPUs, providing a solid foundation for deploying high-performance, high-throughput LLM services.&lt;/p></description></item><item><title>RAG Data Augmentation Techniques: Key Methods for Bridging the Semantic Gap</title><link>https://ziyanglin.netlify.app/en/post/rag-data-augmentation/</link><pubDate>Sat, 28 Jun 2025 16:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/rag-data-augmentation/</guid><description>&lt;h2 id="1-introduction-why-rag-needs-data-augmentation">1. Introduction: Why RAG Needs Data Augmentation?&lt;/h2>
&lt;h3 id="11-understanding-the-semantic-gap">1.1 Understanding the &amp;ldquo;Semantic Gap&amp;rdquo;&lt;/h3>
&lt;p>The core of Retrieval-Augmented Generation (RAG) lies in the &amp;ldquo;retrieval&amp;rdquo; component. However, in practical applications, the retrieval step often becomes the bottleneck of the entire system. The root cause is the &lt;strong>&amp;ldquo;Semantic Gap&amp;rdquo;&lt;/strong> or &lt;strong>&amp;ldquo;Retrieval Mismatch&amp;rdquo;&lt;/strong>.&lt;/p>
&lt;p>Specifically, this problem manifests in:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Diversity and Uncertainty of User Queries&lt;/strong>: Users ask questions in countless ways, potentially using colloquial language, abbreviations, typos, or describing the same issue from different angles.&lt;/li>
&lt;li>&lt;strong>Fixed and Formal Nature of Knowledge Base Documents&lt;/strong>: Documents in knowledge bases are typically structured and formal, with relatively fixed terminology.&lt;/li>
&lt;/ul>
&lt;p>This leads to a situation where the user's query vector and the document chunk vectors in the knowledge base may be far apart in vector space, even when they are semantically related.&lt;/p>
&lt;p>&lt;strong>For example:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Knowledge Base Document&lt;/strong>: &lt;code># ThinkPad X1 Carbon Cooling Guide\n\nIf your ThinkPad X1 Carbon is experiencing overheating issues, you can try cleaning the fan, updating the BIOS, or selecting balanced mode in power management...&lt;/code>&lt;/li>
&lt;li>&lt;strong>Possible User Queries&lt;/strong>:
&lt;ul>
&lt;li>&amp;ldquo;My laptop is too hot, what should I do?&amp;rdquo;&lt;/li>
&lt;li>&amp;ldquo;Is my Lenovo laptop fan noise due to overheating?&amp;rdquo; (Even though the brand doesn't exactly match, the issue is essentially similar)&lt;/li>
&lt;li>&amp;ldquo;Computer gets very hot, games are lagging&amp;rdquo;&lt;/li>
&lt;li>&amp;ldquo;How can I cool down my ThinkPad?&amp;rdquo;&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>In a standard RAG workflow, these queries might fail to accurately retrieve the cooling guide mentioned above because their literal expressions and vector representations are too different.&lt;/p>
&lt;h3 id="12-standard-rag-workflow">1.2 Standard RAG Workflow&lt;/h3>
&lt;p>To better understand the problem, let's first look at the standard RAG workflow.&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[User Input Query] --&amp;gt; B{Encoder};
B --&amp;gt; C[Query Vector];
C --&amp;gt; D{Vector Database};
E[Knowledge Base Documents] --&amp;gt; F{Encoder};
F --&amp;gt; G[Document Chunk Vectors];
G --&amp;gt; D;
D -- Vector Similarity Search --&amp;gt; H[Top-K Relevant Document Chunks];
A --&amp;gt; I((LLM));
H --&amp;gt; I;
I --&amp;gt; J[Generate Final Answer];
style A fill:#f9f,stroke:#333,stroke-width:2px
style J fill:#ccf,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;p>&lt;em>Figure 1: Standard RAG System Workflow&lt;/em>&lt;/p>
&lt;p>As shown above, the entire retrieval process heavily relies on the similarity between the &lt;code>Query Vector&lt;/code> and &lt;code>Chunk Vectors&lt;/code>. If there is a &amp;ldquo;semantic gap&amp;rdquo; between them, the retrieval effectiveness will be significantly reduced.&lt;/p>
&lt;p>The core objective of &lt;strong>Data Augmentation/Generalization&lt;/strong> is to proactively generate a large number of potential, semantically equivalent but expressively diverse &amp;ldquo;virtual queries&amp;rdquo; or &amp;ldquo;equivalent descriptions&amp;rdquo; for each document chunk in the knowledge base, thereby preemptively bridging this gap on the knowledge base side.&lt;/p>
&lt;h2 id="2-llmbased-data-augmentationgeneralization-techniques-deep-dive-into-details">2. LLM-Based Data Augmentation/Generalization Techniques: Deep Dive into Details&lt;/h2>
&lt;p>Leveraging the powerful language understanding and generation capabilities of Large Language Models (LLMs) is the most efficient and mainstream approach to data augmentation/generalization. The core idea is: &lt;strong>Let the LLM play the role of users and generate various possible questions and expressions for each knowledge chunk.&lt;/strong>&lt;/p>
&lt;p>There are two main technical implementation paths: &lt;strong>Hypothetical Questions Generation&lt;/strong> and &lt;strong>Summarization &amp;amp; Paraphrasing&lt;/strong>.&lt;/p>
&lt;h3 id="21-technical-path-one-hypothetical-questions-generation">2.1 Technical Path One: Hypothetical Questions Generation&lt;/h3>
&lt;p>This is the most direct and effective method. For each document chunk in the knowledge base, we have the LLM generate a set of questions that can be answered by this document chunk.&lt;/p>
&lt;h4 id="technical-implementation-details">Technical Implementation Details:&lt;/h4>
&lt;ol>
&lt;li>&lt;strong>Document Chunking&lt;/strong>: First, split the original document into meaningful, appropriately sized knowledge chunks. This is the foundation of all RAG systems.&lt;/li>
&lt;li>&lt;strong>Generate Questions for Each Chunk&lt;/strong>:
&lt;ul>
&lt;li>Iterate through each chunk.&lt;/li>
&lt;li>Feed the content of the chunk as context to an LLM.&lt;/li>
&lt;li>Use a carefully designed prompt (see Chapter 3) to instruct the LLM to generate N questions closely related to the chunk's content.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Data Organization and Indexing&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Key Step&lt;/strong>: Associate the N generated questions with the original chunk. One approach is to treat each generated &amp;ldquo;question-original text pair&amp;rdquo; as the unit of indexing: concatenate the question and the original text before vectorizing, or attach the questions as metadata to the original chunk's vector.&lt;/li>
&lt;li>A more common practice is to store &lt;strong>both the vectors of the generated questions&lt;/strong> and &lt;strong>the vector of the original chunk&lt;/strong> in the vector database, all pointing to the same original chunk ID. This way, when a user queries, whether they match the original chunk or one of the generated questions, they can ultimately locate the correct original text.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Store in Vector Database&lt;/strong>: Store the processed data (original chunk vectors, generated question vectors) and their metadata (such as original ID) in a vector database (like ChromaDB, Milvus, Qdrant, etc.).&lt;/li>
&lt;/ol>
&lt;h4 id="workflow-diagram">Workflow Diagram:&lt;/h4>
&lt;pre>&lt;code class="language-mermaid">graph TD
subgraph &amp;quot;Offline Processing&amp;quot;
A[Original Document] --&amp;gt; B(Chunking);
B --&amp;gt; C{Iterate Each Chunk};
C --&amp;gt; D[LLM Generator];
D -- &amp;quot;Generate for Chunk n&amp;quot; --&amp;gt; E[Generated Multiple Questions];
Chunk_n --&amp;gt; F{Encoder};
F --&amp;gt; G[Vector of Chunk_n];
G -- &amp;quot;Points to Chunk_n ID&amp;quot; --&amp;gt; H((Vector Database));
E --&amp;gt; I{Encoder};
I --&amp;gt; J[Vectors of All Generated Questions];
J -- &amp;quot;All Point to Chunk_n ID&amp;quot; --&amp;gt; H;
subgraph &amp;quot;Original Knowledge&amp;quot;
direction LR
Chunk_n(Chunk n);
end
end
subgraph &amp;quot;Online Retrieval&amp;quot;
K[User Query] --&amp;gt; L{Encoder};
L --&amp;gt; M[Query Vector];
M --&amp;gt; H;
H -- &amp;quot;Vector Retrieval&amp;quot; --&amp;gt; N{Top-K Results};
N --&amp;gt; O[Get Original Chunk by ID];
end
style D fill:#c7f4c8,stroke:#333,stroke-width:2px;
style H fill:#f8d7da,stroke:#333,stroke-width:2px;
style E fill:#f9e79f,stroke:#333,stroke-width:2px;
&lt;/code>&lt;/pre>
&lt;p>&lt;em>Figure 2: Data-Augmented RAG Workflow with Hypothetical Questions Generation&lt;/em>&lt;/p>
&lt;p>This method greatly enriches the &amp;ldquo;retrievability&amp;rdquo; of each knowledge chunk, essentially creating multiple different &amp;ldquo;entry points&amp;rdquo; for each piece of knowledge.&lt;/p>
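&lt;p>The offline indexing and online retrieval flow above can be sketched with a toy in-memory index. The &lt;code>embed&lt;/code> function below is a deliberately crude letter-count stand-in for a real embedding model; in production you would use a vector database and a proper encoder, but the wiring (every question vector points back to the same chunk ID) is the same:&lt;/p>
&lt;pre>&lt;code class="language-python">def embed(text):
    # Toy embedding: letter counts. Crude, but enough to show the wiring.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

index = []   # (vector, chunk_id) entries: chunk vectors AND question vectors
chunks = {}  # maps chunk_id to the original text

def add_chunk(chunk_id, text, generated_questions):
    chunks[chunk_id] = text
    index.append((embed(text), chunk_id))   # vector of the original chunk
    for q in generated_questions:           # every question points to the same ID
        index.append((embed(q), chunk_id))

def retrieve(query, top_k=1):
    qv = embed(query)
    ranked = sorted(index, key=lambda e: cosine(qv, e[0]), reverse=True)
    seen, hits = set(), []
    for _, cid in ranked:
        if cid not in seen:
            seen.add(cid)
            hits.append(chunks[cid])
        if len(hits) == top_k:
            break
    return hits

add_chunk('c1', 'ThinkPad X1 Carbon cooling guide: clean the fan, update BIOS',
          ['my laptop is too hot what should i do'])
add_chunk('c2', 'pasta recipe with tomato and basil', [])
print(retrieve('laptop too hot'))  # the generated-question vector matches, so c1 wins
&lt;/code>&lt;/pre>
&lt;p>The query matches the generated question far better than the formal document text itself, yet the retrieval still returns the original chunk, which is exactly the extra &amp;ldquo;entry point&amp;rdquo; effect described above.&lt;/p>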
&lt;h3 id="22-technical-path-two-summarization--paraphrasing">2.2 Technical Path Two: Summarization &amp;amp; Paraphrasing&lt;/h3>
&lt;p>Besides generating questions, we can also generate summaries of knowledge chunks or rewrite them in different ways.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Summarization&lt;/strong>: For a relatively long knowledge chunk, an LLM can generate a concise core summary. This summary can serve as a &amp;ldquo;coarse-grained&amp;rdquo; retrieval entry point. When a user's query is relatively broad, it might more easily match with the summary.&lt;/li>
&lt;li>&lt;strong>Paraphrasing&lt;/strong>: Have the LLM rewrite the core content of the same knowledge chunk using different sentence structures and vocabulary. This also creates new vectors that are different from the original text vector but semantically consistent.&lt;/li>
&lt;/ul>
&lt;h4 id="technical-implementation-details1">Technical Implementation Details:&lt;/h4>
&lt;p>The implementation method is similar to hypothetical question generation, except that the prompt's goal changes from &amp;ldquo;generating questions&amp;rdquo; to &amp;ldquo;generating summaries&amp;rdquo; or &amp;ldquo;paraphrasing&amp;rdquo;. The generated data is similarly associated with the original chunk, and its vector is stored in the database.&lt;/p>
&lt;p>In practice, &lt;strong>hypothetical question generation is usually more popular than summarization/paraphrasing&lt;/strong> because it more directly simulates the user's &amp;ldquo;questioning&amp;rdquo; behavior, aligning better with the essence of the retrieval task.&lt;/p>
&lt;h2 id="3-prompt-engineering-for-data-generalization-an-excellent-example">3. Prompt Engineering for Data Generalization: An Excellent Example&lt;/h2>
&lt;p>The quality of the prompt directly determines the quality of the generated data. A good prompt should be like a precise scalpel, guiding the LLM to generate the data we want.&lt;/p>
&lt;p>Below is a well-considered prompt example designed for the &amp;ldquo;hypothetical questions generation&amp;rdquo; task:&lt;/p>
&lt;pre>&lt;code class="language-text">### Role and Goal
You are an advanced AI assistant tasked with generating a set of high-quality, diverse questions for a given knowledge text (Context). These questions should be fully answerable by the provided text. Your goal is to help build a smarter Q&amp;amp;A system that can find answers regardless of how users phrase their questions, as long as they relate to the text content.
### Instructions
Based on the `[Original Text]` provided below, please generate **5** different questions.
### Requirements
1. **Diversity**: The generated questions must differ in sentence structure, wording, and intent. Try to ask from different angles, for example:
* **How-to type**: How to operate...?
* **Why type**: Why does...happen?
* **What is type**: What does...mean?
* **Comparison type**: What's the difference between...and...?
* **What-if type**: What if...?
2. **Persona**: Imagine you are different types of users asking questions:
* A **Beginner** who knows nothing about this field.
* An **Expert** seeking in-depth technical details.
* A **Student** looking for answers for an assignment.
3. **Fully Answerable**: Ensure each generated question can be fully and only answered using information from the `[Original Text]`. Don't ask questions that require external knowledge.
4. **Language Style**: Questions should be natural, clear, and conform to conversational English.
### Output Format
Please output strictly in the following JSON format, without any additional explanations or text:
```json
{
&amp;quot;generated_questions&amp;quot;: [
{
&amp;quot;persona&amp;quot;: &amp;quot;beginner&amp;quot;,
&amp;quot;question&amp;quot;: &amp;quot;First question here&amp;quot;
},
{
&amp;quot;persona&amp;quot;: &amp;quot;expert&amp;quot;,
&amp;quot;question&amp;quot;: &amp;quot;Second question here&amp;quot;
},
{
&amp;quot;persona&amp;quot;: &amp;quot;student&amp;quot;,
&amp;quot;question&amp;quot;: &amp;quot;Third question here&amp;quot;
},
// ... more questions
]
}
```
&lt;/code>&lt;/pre>
&lt;h3 id="original-text">[Original Text]&lt;/h3>
&lt;p>{context_chunk}&lt;/p>
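&lt;p>In an automated pipeline, the template above is filled with each chunk and sent to an LLM, and the strictly-JSON reply is parsed; malformed replies are dropped rather than indexed. A minimal sketch (&lt;code>call_llm&lt;/code> is a stand-in for whatever provider API you use):&lt;/p>
&lt;pre>&lt;code class="language-python">import json

def generate_questions(chunk, call_llm, prompt_template):
    reply = call_llm(prompt_template.format(context_chunk=chunk))
    try:
        data = json.loads(reply)
        return [item['question'] for item in data['generated_questions']]
    except (json.JSONDecodeError, KeyError, TypeError):
        return []  # discard malformed or off-format output instead of indexing it

# Example with a faked LLM reply:
fake_reply = json.dumps({'generated_questions': [
    {'persona': 'beginner', 'question': 'How can I cool down my ThinkPad?'}]})
questions = generate_questions('chunk text', lambda p: fake_reply, '{context_chunk}')
print(questions)  # ['How can I cool down my ThinkPad?']
&lt;/code>&lt;/pre>
&lt;p>Because the prompt mandates strict JSON, a &lt;code>json.loads&lt;/code> plus a key check is usually all the validation the offline loop needs.&lt;/p>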
&lt;pre>&lt;code>
#### Prompt Design Analysis:
* **Role and Goal**: Gives the LLM a clear positioning, helping it understand the significance of the task, rather than just mechanically executing it.
* **Diversity Requirements**: This is the most critical part. It guides the LLM to think from different dimensions, avoiding generating a large number of homogeneous questions (e.g., simply turning statements into questions).
* **Persona Role-Playing**: This instruction greatly enriches the diversity of questions. A beginner's questions might be broader and more colloquial, while an expert's questions might be more specific and technical.
* **Fully Answerable**: This is an important constraint, ensuring the strong relevance of generated questions to the original text, avoiding introducing noise.
* **JSON Output Format**: Forced structured output makes the LLM's return results easily parsable and processable by programs, an essential element in automated workflows.
## 4. Effect Validation: How to Measure the Effectiveness of Data Augmentation?
Data augmentation is not a process that is &amp;quot;automatically good once done&amp;quot;; a scientific evaluation system must be established to verify its effectiveness. Evaluation should be conducted from two aspects: **recall rate** and **final answer quality**.
### 4.1 Retrieval Evaluation
This is the core metric for evaluating improvements in the retrieval component.
#### Steps:
1. **Build an Evaluation Dataset**: This is the most critical step. You need to create a test set containing `(question, corresponding correct original Chunk_ID)` pairs. The questions in this test set should be as diverse as possible, simulating real user queries.
2. **Conduct Two Tests**:
* **Experimental Group A (Without Data Augmentation)**: Use the standard RAG process to retrieve with questions from the test set, recording the Top-K Chunk IDs recalled.
* **Experimental Group B (With Data Augmentation)**: Use a knowledge base integrated with data augmentation, retrieve with the same questions, and record the Top-K Chunk IDs recalled.
3. **Calculate Evaluation Metrics**:
* **Recall@K**: What proportion of questions in the test set had their corresponding correct Chunk_ID appear in the top K of the recall results? This is the most important metric. `Recall@K = (Number of correctly recalled questions) / (Total number of questions)`.
* **Precision@K**: How many of the top K results recalled are correct? For a single question, if there is only one correct answer, then Precision@K is either 1/K or 0.
* **MRR (Mean Reciprocal Rank)**: The average of the reciprocal of the rank of the correct answer in the recall list. This metric not only cares about whether it was recalled but also how high it was ranked. The higher the ranking, the higher the score. `MRR = (1/N) * Σ(1 / rank_i)`, where `N` is the total number of questions, and `rank_i` is the rank of the correct answer for the i-th question.
By comparing the `Recall@K` and `MRR` metrics of experimental groups A and B, you can quantitatively determine whether data augmentation has improved recall performance.
### 4.2 Generation Quality Evaluation
Improved recall rate is a prerequisite, but it doesn't completely equate to improved user experience. We also need to evaluate the final answers generated by the RAG system end-to-end.
#### Method One: Human Evaluation
This is the most reliable but most costly method.
1. **Design Evaluation Dimensions**:
* **Relevance**: Does the generated answer get to the point and address the user's question?
* **Accuracy/Factuality**: Is the information in the answer accurate and based on the retrieved knowledge?
* **Fluency**: Is the language of the answer natural and smooth?
2. **Conduct Blind Evaluation**: Have evaluators score (e.g., 1-5 points) or compare (A is better/B is better/tie) two sets of answers without knowing which answer comes from which system (before/after enhancement).
3. **Statistical Analysis**: Determine whether data augmentation has a positive impact on the final answer quality through statistical scores or win rates.
#### Method Two: LLM-based Automatic Evaluation
This is a more efficient alternative, using a more powerful, advanced LLM (such as GPT-4o, Claude 3.5 Sonnet) as a &amp;quot;judge&amp;quot;.
1. **Design Evaluation Prompt**: Create a prompt asking the judge LLM to compare answers generated by different systems.
* **Input**: User question, retrieved context, System A's answer, System B's answer.
* **Instructions**: Ask the LLM to analyze from dimensions such as relevance and accuracy, determine which answer is better, and output scores and reasons in JSON format.
2. **Batch Execution and Analysis**: Run this evaluation process for all questions in the test set, then calculate win rates.
This method allows for large-scale, low-cost evaluation, making rapid iteration possible.
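The win-rate aggregation in step 2 is simple once the judge's verdicts are parsed; a sketch assuming each verdict has been normalized to 'A', 'B', or 'tie' (the verdicts below are mocked, not real judge output):
```python
from collections import Counter

def win_rates(verdicts):
    '''Aggregate per-question judge verdicts into win/tie rates.'''
    counts = Counter(verdicts)
    n = len(verdicts)
    return {label: counts[label] / n for label in ('A', 'B', 'tie')}

# Mocked verdicts for a 10-question test set (in practice, parsed from
# the judge LLM's JSON output for each question).
verdicts = ['B', 'B', 'A', 'tie', 'B', 'B', 'A', 'B', 'tie', 'B']
print(win_rates(verdicts))  # {'A': 0.2, 'B': 0.6, 'tie': 0.2}
```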
## 5. Conclusion and Future Outlook
**In summary, LLM-based data augmentation/generalization is a key technology for enhancing RAG system performance, especially for solving the &amp;quot;semantic gap&amp;quot; problem.** By pre-generating a large number of &amp;quot;virtual questions&amp;quot; or equivalent descriptions in the offline phase, it greatly enriches the retrievability of the knowledge base, making the system more adaptable to the diversity of user queries in the real world.
**Practical Considerations:**
* **Balance Between Cost and Quality**: Generating data incurs LLM API call costs and index storage costs. Decide how many questions to generate per chunk based on budget and the performance improvement it actually buys.
* **Cleaning Generated Data**: LLM generation is not 100% perfect and may produce low-quality or irrelevant questions. Consider adding a validation step to filter out poor-quality data.
**Future Outlook:**
* **Combination with Rerankers**: Data augmentation aims to improve &amp;quot;recall,&amp;quot; while reranker models aim to optimize &amp;quot;ranking.&amp;quot; Combining the two—ensuring relevant content is recalled through data augmentation, then fine-ranking through reranker models—is the golden combination for RAG optimization.
* **Multimodal Data Augmentation**: With the development of multimodal large models, future RAG will process more than just text. How to perform data augmentation for image and audio/video knowledge (e.g., generating text questions about image content) will be an interesting research direction.
* **Adaptive Data Augmentation**: Future systems might automatically discover recall failure cases based on real user queries online, and perform targeted data augmentation for relevant knowledge chunks, forming a continuously optimizing closed loop.&lt;/code>&lt;/pre></description></item><item><title>Ollama Practical Guide: Local Deployment and Management of Large Language Models</title><link>https://ziyanglin.netlify.app/en/post/ollama-documentation/</link><pubDate>Fri, 27 Jun 2025 02:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/ollama-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>Ollama is a powerful open-source tool designed to allow users to easily download, run, and manage large language models (LLMs) in local environments. Its core advantage lies in simplifying the deployment and use of complex models, enabling developers, researchers, and enthusiasts to experience and utilize state-of-the-art artificial intelligence technology on personal computers without specialized hardware or complex configurations.&lt;/p>
&lt;p>&lt;strong>Key Advantages:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Ease of Use:&lt;/strong> Complete model download, running, and interaction through simple command-line instructions.&lt;/li>
&lt;li>&lt;strong>Cross-Platform Support:&lt;/strong> Supports macOS, Windows, and Linux.&lt;/li>
&lt;li>&lt;strong>Rich Model Library:&lt;/strong> Supports numerous popular open-source models such as Llama 3, Mistral, Gemma, Phi-3, and more.&lt;/li>
&lt;li>&lt;strong>Highly Customizable:&lt;/strong> Through &lt;code>Modelfile&lt;/code>, users can easily customize model behavior, system prompts, and parameters.&lt;/li>
&lt;li>&lt;strong>API-Driven:&lt;/strong> Provides a REST API for easy integration with other applications and services.&lt;/li>
&lt;li>&lt;strong>Open Source Community:&lt;/strong> Has an active community continuously contributing new models and features.&lt;/li>
&lt;/ul>
&lt;p>This document will provide a comprehensive introduction to Ollama's various features, from basic fundamentals to advanced applications, helping you fully master this powerful tool.&lt;/p>
&lt;hr>
&lt;h2 id="2-quick-start">2. Quick Start&lt;/h2>
&lt;p>This section will guide you through installing and basic usage of Ollama.&lt;/p>
&lt;h3 id="21-installation">2.1 Installation&lt;/h3>
&lt;p>Visit the &lt;a href="https://ollama.com/">Ollama official website&lt;/a> to download and install the package suitable for your operating system.&lt;/p>
&lt;h3 id="22-running-your-first-model">2.2 Running Your First Model&lt;/h3>
&lt;p>After installation, open a terminal (or command prompt) and use the &lt;code>ollama run&lt;/code> command to download and run a model. For example, to run the Llama 3 model:&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama run llama3
&lt;/code>&lt;/pre>
&lt;p>On first run, Ollama will automatically download the required model files from the model library. Once the download is complete, you can directly converse with the model in the terminal.&lt;/p>
&lt;h3 id="23-managing-local-models">2.3 Managing Local Models&lt;/h3>
&lt;p>You can use the following commands to manage locally downloaded models:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>List Local Models:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama list
&lt;/code>&lt;/pre>
&lt;p>This command displays the name, ID, size, and modification time of all downloaded models.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Remove Local Models:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama rm &amp;lt;model_name&amp;gt;
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="3-core-concepts">3. Core Concepts&lt;/h2>
&lt;h3 id="31-modelfile">3.1 Modelfile&lt;/h3>
&lt;p>&lt;code>Modelfile&lt;/code> is one of Ollama's core features. It's a configuration file similar to &lt;code>Dockerfile&lt;/code> that allows you to define and create custom models. Through &lt;code>Modelfile&lt;/code>, you can:&lt;/p>
&lt;ul>
&lt;li>Specify a base model.&lt;/li>
&lt;li>Set model parameters (such as temperature, top_p, etc.).&lt;/li>
&lt;li>Define the model's system prompt.&lt;/li>
&lt;li>Customize the model's interaction template.&lt;/li>
&lt;li>Apply LoRA adapters.&lt;/li>
&lt;/ul>
&lt;p>A simple &lt;code>Modelfile&lt;/code> example:&lt;/p>
&lt;pre>&lt;code class="language-Modelfile"># Specify base model
FROM llama3
# Set model temperature
PARAMETER temperature 0.8
# Set system prompt
SYSTEM &amp;quot;&amp;quot;&amp;quot;
You are a helpful AI assistant. Your name is Roo.
&amp;quot;&amp;quot;&amp;quot;
&lt;/code>&lt;/pre>
&lt;p>Use the &lt;code>ollama create&lt;/code> command to create a new model based on a &lt;code>Modelfile&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama create my-custom-model -f ./Modelfile
&lt;/code>&lt;/pre>
&lt;h3 id="32-model-import">3.2 Model Import&lt;/h3>
&lt;p>Ollama supports importing models from external file systems, particularly from &lt;code>Safetensors&lt;/code> format weight files.&lt;/p>
&lt;p>In a &lt;code>Modelfile&lt;/code>, use the &lt;code>FROM&lt;/code> directive and provide the directory path containing &lt;code>safetensors&lt;/code> files:&lt;/p>
&lt;pre>&lt;code class="language-Modelfile">FROM /path/to/safetensors/directory
&lt;/code>&lt;/pre>
&lt;p>Then use the &lt;code>ollama create&lt;/code> command to create the model.&lt;/p>
&lt;h3 id="33-multimodal-models">3.3 Multimodal Models&lt;/h3>
&lt;p>Ollama supports multimodal models (such as LLaVA) that can process both text and image inputs simultaneously.&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama run llava &amp;quot;What's in this image? /path/to/image.png&amp;quot;
&lt;/code>&lt;/pre>
&lt;hr>
&lt;h2 id="4-api-reference">4. API Reference&lt;/h2>
&lt;p>Ollama provides a set of REST APIs for programmatically interacting with models. The default service address is &lt;code>http://localhost:11434&lt;/code>.&lt;/p>
&lt;h3 id="41-apigenerate">4.1 &lt;code>/api/generate&lt;/code>&lt;/h3>
&lt;p>Generate text.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Request (Streaming):&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/generate -d '{
&amp;quot;model&amp;quot;: &amp;quot;llama3&amp;quot;,
&amp;quot;prompt&amp;quot;: &amp;quot;Why is the sky blue?&amp;quot;
}'
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Request (Non-streaming):&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/generate -d '{
&amp;quot;model&amp;quot;: &amp;quot;llama3&amp;quot;,
&amp;quot;prompt&amp;quot;: &amp;quot;Why is the sky blue?&amp;quot;,
&amp;quot;stream&amp;quot;: false
}'
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
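&lt;p>In streaming mode, the endpoint returns one JSON object per line, and the final object carries &lt;code>done: true&lt;/code>. As a minimal sketch, the helper below reassembles a full answer from such lines (the sample stream is constructed locally with &lt;code>json.dumps&lt;/code> rather than fetched from the server):&lt;/p>
&lt;pre>&lt;code class="language-python">import json

def collect_stream(lines):
    '''Concatenate the partial responses of a streamed NDJSON reply.'''
    parts = []
    for line in lines:
        obj = json.loads(line)
        parts.append(obj.get('response', ''))
        if obj.get('done'):
            break
    return ''.join(parts)

# Locally constructed stand-in for what /api/generate streams line by line.
stream = [
    json.dumps({'response': 'The sky ', 'done': False}),
    json.dumps({'response': 'is blue.', 'done': True}),
]
print(collect_stream(stream))  # The sky is blue.
&lt;/code>&lt;/pre>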
&lt;h3 id="42-apichat">4.2 &lt;code>/api/chat&lt;/code>&lt;/h3>
&lt;p>Conduct multi-turn conversations.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Request:&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/chat -d '{
&amp;quot;model&amp;quot;: &amp;quot;llama3&amp;quot;,
&amp;quot;messages&amp;quot;: [
{
&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;,
&amp;quot;content&amp;quot;: &amp;quot;why is the sky blue?&amp;quot;
}
],
&amp;quot;stream&amp;quot;: false
}'
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h3 id="43-apiembed">4.3 &lt;code>/api/embed&lt;/code>&lt;/h3>
&lt;p>Generate embedding vectors for text.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Request:&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/embed -d '{
&amp;quot;model&amp;quot;: &amp;quot;all-minilm&amp;quot;,
&amp;quot;input&amp;quot;: [&amp;quot;Why is the sky blue?&amp;quot;, &amp;quot;Why is the grass green?&amp;quot;]
}'
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
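&lt;p>The returned vectors are typically compared with cosine similarity to rank documents against a query. The sketch below uses made-up 3-dimensional vectors purely for illustration (real &lt;code>all-minilm&lt;/code> embeddings have 384 dimensions):&lt;/p>
&lt;pre>&lt;code class="language-python">import math

def cosine_similarity(a, b):
    '''Cosine similarity between two equal-length vectors.'''
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d vectors standing in for real embedding outputs.
query = [0.1, 0.3, 0.5]
doc_a = [0.1, 0.3, 0.5]   # identical direction
doc_b = [0.5, 0.1, 0.0]   # mostly unrelated
print(round(cosine_similarity(query, doc_a), 3))  # 1.0
print(round(cosine_similarity(query, doc_b), 3))
&lt;/code>&lt;/pre>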
&lt;h3 id="44-apitags">4.4 &lt;code>/api/tags&lt;/code>&lt;/h3>
&lt;p>List all locally available models.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Request:&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/tags
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="5-command-line-tools-cli">5. Command Line Tools (CLI)&lt;/h2>
&lt;p>Ollama provides a rich set of command-line tools for managing models and interacting with the service.&lt;/p>
&lt;ul>
&lt;li>&lt;code>ollama run &amp;lt;model&amp;gt;&lt;/code>: Run a model.&lt;/li>
&lt;li>&lt;code>ollama create &amp;lt;model&amp;gt; -f &amp;lt;Modelfile&amp;gt;&lt;/code>: Create a model from a Modelfile.&lt;/li>
&lt;li>&lt;code>ollama pull &amp;lt;model&amp;gt;&lt;/code>: Pull a model from a remote repository.&lt;/li>
&lt;li>&lt;code>ollama push &amp;lt;model&amp;gt;&lt;/code>: Push a model to a remote repository.&lt;/li>
&lt;li>&lt;code>ollama list&lt;/code>: List local models.&lt;/li>
&lt;li>&lt;code>ollama cp &amp;lt;source_model&amp;gt; &amp;lt;dest_model&amp;gt;&lt;/code>: Copy a model.&lt;/li>
&lt;li>&lt;code>ollama rm &amp;lt;model&amp;gt;&lt;/code>: Delete a model.&lt;/li>
&lt;li>&lt;code>ollama ps&lt;/code>: View running models and their resource usage.&lt;/li>
&lt;li>&lt;code>ollama stop &amp;lt;model&amp;gt;&lt;/code>: Stop a running model and unload it from memory.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="6-advanced-features">6. Advanced Features&lt;/h2>
&lt;h3 id="61-openai-api-compatibility">6.1 OpenAI API Compatibility&lt;/h3>
&lt;p>Ollama provides an endpoint compatible with the OpenAI API, allowing you to seamlessly migrate existing OpenAI applications to Ollama. The default address is &lt;code>http://localhost:11434/v1&lt;/code>.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>List Models (Python):&lt;/strong>
&lt;pre>&lt;code class="language-python">from openai import OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama', # required, but unused
)
response = client.models.list()
print(response)
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h3 id="62-structured-output">6.2 Structured Output&lt;/h3>
&lt;p>By combining the OpenAI-compatible API with Pydantic, you can force the model to output JSON with a specific structure.&lt;/p>
&lt;pre>&lt;code class="language-python">from pydantic import BaseModel
from openai import OpenAI
client = OpenAI(base_url=&amp;quot;http://localhost:11434/v1&amp;quot;, api_key=&amp;quot;ollama&amp;quot;)
class UserInfo(BaseModel):
name: str
age: int
try:
completion = client.beta.chat.completions.parse(
model=&amp;quot;llama3.1:8b&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;My name is John and I am 30 years old.&amp;quot;}],
response_format=UserInfo,
)
print(completion.choices[0].message.parsed)
except Exception as e:
print(f&amp;quot;Error: {e}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h3 id="63-performance-tuning">6.3 Performance Tuning&lt;/h3>
&lt;p>You can adjust Ollama's performance and resource management through environment variables:&lt;/p>
&lt;ul>
&lt;li>&lt;code>OLLAMA_KEEP_ALIVE&lt;/code>: Set how long models remain active in memory. For example, &lt;code>10m&lt;/code>, &lt;code>24h&lt;/code>, or &lt;code>-1&lt;/code> (permanent).&lt;/li>
&lt;li>&lt;code>OLLAMA_MAX_LOADED_MODELS&lt;/code>: Maximum number of models loaded into memory simultaneously.&lt;/li>
&lt;li>&lt;code>OLLAMA_NUM_PARALLEL&lt;/code>: Number of requests each model can process in parallel.&lt;/li>
&lt;/ul>
&lt;h3 id="64-lora-adapters">6.4 LoRA Adapters&lt;/h3>
&lt;p>Use the &lt;code>ADAPTER&lt;/code> directive in a &lt;code>Modelfile&lt;/code> to apply a LoRA (Low-Rank Adaptation) adapter, changing the model's behavior without modifying the base model weights.&lt;/p>
&lt;pre>&lt;code class="language-Modelfile">FROM llama3
ADAPTER /path/to/your-lora-adapter.safetensors
&lt;/code>&lt;/pre>
&lt;hr>
&lt;h2 id="7-appendix">7. Appendix&lt;/h2>
&lt;h3 id="71-troubleshooting">7.1 Troubleshooting&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Check CPU Features:&lt;/strong> On Linux, you can use the following command to check if your CPU supports instruction sets like AVX, which are crucial for the performance of certain models.
&lt;pre>&lt;code class="language-shell">cat /proc/cpuinfo | grep flags | head -1
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h3 id="72-contribution-guidelines">7.2 Contribution Guidelines&lt;/h3>
&lt;p>Ollama is an open-source project, and community contributions are welcome. When submitting code, please follow good commit message formats, for example:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Good:&lt;/strong> &lt;code>llm/backend/mlx: support the llama architecture&lt;/code>&lt;/li>
&lt;li>&lt;strong>Bad:&lt;/strong> &lt;code>feat: add more emoji&lt;/code>&lt;/li>
&lt;/ul>
&lt;h3 id="73-related-links">7.3 Related Links&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Official Website:&lt;/strong> &lt;a href="https://ollama.com/">https://ollama.com/&lt;/a>&lt;/li>
&lt;li>&lt;strong>GitHub Repository:&lt;/strong> &lt;a href="https://github.com/ollama/ollama">https://github.com/ollama/ollama&lt;/a>&lt;/li>
&lt;li>&lt;strong>Model Library:&lt;/strong> &lt;a href="https://ollama.com/library">https://ollama.com/library&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>