API Integration | Ziyang Lin

LLM Tool Calling: The Key Technology Breaking AI Capability Boundaries

Mon, 30 Jun 2025 07:00:00 +0000

1. Macro Overview: Why Tool Calling is LLM's “Super Plugin”

The emergence of Large Language Models (LLMs) has fundamentally changed how we interact with machines. However, LLMs have an inherent, unavoidable “ceiling”: they are essentially “probability prediction machines” trained on massive text data, with their knowledge frozen at the time their training data ends. This means that, on its own, an LLM cannot know “what's the weather like today?”, cannot access your company's internal database, and cannot book a flight ticket for you.

The LLM Tool Calling / Function Calling mechanism emerged precisely to break through this ceiling. It gives LLMs an unprecedented ability: calling external tools (APIs, functions, databases, etc.) to obtain real-time information, perform specific tasks, or interact with the external world when needed.

In simple terms, the tool calling mechanism upgrades LLMs from “knowledgeable conversationalists” to capable “intelligent agents.” It allows LLMs to:

Obtain real-time information: By calling weather APIs, news APIs, search engines, etc., to get the latest information beyond the model's training data.
Operate external systems: Connect to enterprise CRM/ERP systems to query data, or connect to IoT devices to control smart home appliances.
Execute complex tasks: Break down complex user instructions (like “help me find and book a cheap flight to Shanghai next week”) and complete them by calling multiple APIs in combination.
Provide more precise, verifiable answers: For queries requiring exact calculations or structured data, LLMs can call calculators or databases instead of relying on their potentially inaccurate internal knowledge.

Therefore, tool calling is not just a simple extension of LLM functionality, but a core foundation for building truly powerful AI applications that deeply integrate with both the physical and digital worlds.

2. Core Concepts and Workflow: How Do LLMs “Learn” to Use Tools?

To understand the underlying logic of tool calling, we need to view it as an elegant process involving three core roles working together:

Large Language Model (LLM): The brain and decision-maker.
Tool Definitions: A detailed “tool instruction manual.”
Developer/Client-side Code: The ultimate “executor.”

The LLM itself never actually executes any code. Its only task, after understanding the user's intent and the “tool manual” it has, is to generate a JSON data structure that precisely describes which tool should be called and with what parameters.

Below is a visual explanation of this process:

sequenceDiagram
participant User
participant Client as Client/Application Layer
participant LLM as Large Language Model
participant Tools as External Tools/APIs
User->>+Client: "What's the weather in Beijing today?"
Client->>+LLM: Submit user request + Tool Definitions
Note over LLM: 1. Understand user intent<br/>2. Match most appropriate tool (get_weather)<br/>3. Extract required parameters (location: "Beijing")
LLM-->>-Client: Return JSON: {"tool_calls": [{"function": {"name": "get_weather", "arguments": "{\"location\": \"Beijing\"}"}}]}
Client->>+Tools: 2. Based on LLM's JSON, call the actual get_weather("Beijing") function
Tools-->>-Client: Return weather data (e.g.: {"temperature": "25°C", "condition": "sunny"})
Client->>+LLM: 3. Submit tool execution result back to LLM
Note over LLM: 4. Understand the data returned by the tool
LLM-->>-Client: 5. Generate user-friendly natural language response
Client->>-User: "The weather in Beijing today is sunny with a temperature of 25 degrees Celsius."

Process Breakdown:

Define & Describe:
- Developers first need to define available tools in a structured way (typically using JSON Schema). This “manual” is crucial to the entire process and must clearly tell the LLM:
  - Tool name (name): For example, get_weather.
  - Tool function description (description): For example, “Get real-time weather information for a specified city.” This is the most important basis for the LLM to understand the tool's purpose.
  - Tool parameters (parameters): Detailed definition of what inputs the tool needs, including each input's name, type (string, number, boolean, etc.), whether it's required, and parameter descriptions.
Intent Recognition & Parameter Extraction:
- When a user makes a request (e.g., “Check the weather in Beijing”), the developer's application sends the user's original request along with all the tool definitions from the Define & Describe step to the LLM.
- The LLM's core task is to do two things:
  - Intent Recognition: Among all available tools, determine which tool's function description best matches the user's request. In this example, it would match get_weather.
  - Parameter Extraction: From the user's request, identify and extract values that satisfy the tool's parameter requirements. Here, it would recognize that the location parameter value is “Beijing”.
- After completing these two steps, the LLM generates one or more tool_calls objects, essentially saying “I suggest you call the function named get_weather and pass in the parameter { "location": "Beijing" }”.
Execute & Observe:
- The developer's application code receives the JSON returned by the LLM and parses this “call suggestion.”
- The application code actually executes the get_weather("Beijing") function locally or on the server side.
- After execution, it gets a real return result, such as a JSON object containing weather information.
Summarize & Respond:
- To complete the loop, the application layer needs to submit the actual execution result from the previous step back to the LLM.
- This time, the LLM's task is to understand this raw data returned by the tool (e.g., {"temperature": "25°C", "condition": "sunny"}) and convert it into a fluent, natural, user-friendly response.
- Finally, the user receives the reply “The weather in Beijing today is sunny with a temperature of 25 degrees Celsius,” and the entire process is complete.

This process elegantly combines the LLM's powerful natural language understanding ability with the external tool's powerful functional execution capability, achieving a 1+1>2 effect.

3. Technical Deep Dive: Analyzing the Industry Standard (OpenAI Tool Calling)

OpenAI's API is currently the de facto standard in the field of LLM tool calling, and its design is widely emulated. Understanding its implementation details is crucial for any developer looking to integrate LLM tool calling into their applications.

3.1. Core API Parameters

When calling OpenAI's Chat Completions API, there are two main parameters related to tool calling: tools and tool_choice.

`tools` Parameter: Your “Toolbox”

The tools parameter is an array where you can define one or more tools. Each tool follows a fixed structure, with the core being a function object defined based on the JSON Schema specification.

Example: Defining a weather tool and a flight booking tool

[
  {
    "type": "function",
    "function": {
      "name": "get_current_weather",
      "description": "Get real-time weather information for a specified location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "City and state/province name, e.g., 'San Francisco, CA'"
          },
          "unit": {
            "type": "string",
            "enum": ["celsius", "fahrenheit"],
            "description": "Temperature unit"
          }
        },
        "required": ["location"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "book_flight",
      "description": "Book a flight ticket for the user from departure to destination",
      "parameters": {
        "type": "object",
        "properties": {
          "departure": {
            "type": "string",
            "description": "Departure airport or city"
          },
          "destination": {
            "type": "string",
            "description": "Destination airport or city"
          },
          "date": {
            "type": "string",
            "description": "Desired departure date in YYYY-MM-DD format"
          }
        },
        "required": ["departure", "destination", "date"]
      }
    }
  }
]

Key Points Analysis:

type: Currently fixed as "function".
function.name: Function name. May contain only letters, numbers, underscores, and hyphens, and must not exceed 64 characters. This is the key for your code to identify which function to call.
function.description: Critically important. This is the main basis for the LLM to decide whether to select this tool. The description should clearly, accurately, and unambiguously explain what the function does. A good description can greatly improve the LLM's call accuracy.
function.parameters: A standard JSON Schema object.
- type: Must be "object".
- properties: Defines each parameter's name, type (string, number, boolean, array, object), and description. The parameter description is equally important as it helps the LLM understand what information to extract from user input to fill this parameter.
- required: An array of strings listing which parameters are mandatory. If the user request lacks necessary information, the LLM might ask follow-up questions or choose not to call the tool.
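Because required and enum are plain data, the application can check the model's suggested arguments against the same schema before running anything. The following is a minimal sketch, not a full JSON Schema validator: validate_arguments and weather_schema are hypothetical names introduced for illustration, and only required, unknown-parameter, and enum checks are covered.

```python
def validate_arguments(parameters_schema, arguments):
    """Return a list of problems; an empty list means the arguments look usable."""
    problems = []
    properties = parameters_schema.get("properties", {})
    # Every name listed under "required" must be present.
    for name in parameters_schema.get("required", []):
        if name not in arguments:
            problems.append(f"missing required parameter: {name}")
    # Every supplied argument must be declared and respect its enum, if any.
    for name, value in arguments.items():
        spec = properties.get(name)
        if spec is None:
            problems.append(f"unexpected parameter: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name} must be one of {spec['enum']}")
    return problems


# Same shape as the "parameters" object of the weather tool above.
weather_schema = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

print(validate_arguments(weather_schema, {"location": "Paris", "unit": "celsius"}))
print(validate_arguments(weather_schema, {"unit": "kelvin"}))
```

The first call reports no problems; the second reports the missing location and the invalid unit. For type checks and nested objects, a dedicated validator such as the third-party jsonschema package is likely a better fit than this hand-rolled check.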

`tool_choice` Parameter: Controlling the LLM's Choice

By default, the LLM decides on its own whether to respond with text or call one or more tools based on the user's input. The tool_choice parameter allows you to control this behavior more precisely.

"none": Forces the LLM not to call any tools and directly return a text response.
"auto" (default when tools are provided): The LLM can freely choose whether to respond with text or call tools.
"required": Forces the LLM to call at least one tool, while leaving the choice of tool to the model.
{"type": "function", "function": {"name": "my_function"}}: Forces the LLM to call this specific tool named my_function.

This parameter is very useful in scenarios where you need to enforce a specific process or limit the LLM's capabilities.

3.2. Request-Response Lifecycle

A complete tool calling interaction involves at least two API requests.

First Request: From User to LLM

# request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Please book me a flight from New York to London tomorrow"}],
    tools=my_tools,  # The tool list defined above
    tool_choice="auto"
)

First Response: LLM's “Call Suggestion”

If the LLM decides to call a tool, the API response's finish_reason will be tool_calls, and the message object will contain a tool_calls array.

{
  "choices": [
    {
      "finish_reason": "tool_calls",
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "book_flight",
              "arguments": "{\"departure\":\"New York\",\"destination\":\"London\",\"date\":\"2025-07-01\"}"
            }
          }
        ]
      }
    }
  ],
  ...
}

Key Points Analysis:

finish_reason: A value of "tool_calls" indicates that the LLM wants you to execute a tool call, rather than ending the conversation.
message.role: assistant.
message.tool_calls: This is an array, meaning the LLM can request multiple tool calls at once.
- id: A unique call ID. In subsequent requests, you'll need to use this ID to associate the tool's execution results.
- function.name: The function name the LLM suggests calling.
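- function.arguments: A JSON object in string form. You need to parse this string to get the specific parameters needed to call the function. Because the model generates this string, it may occasionally be invalid JSON or contain parameters that your schema does not define, so validate it before calling the function.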
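Since function.arguments arrives as a string, parsing it is the first place a tool call can go wrong. Below is a minimal sketch of defensive parsing; parse_tool_arguments is a hypothetical helper name, and returning an (arguments, error) pair is just one possible convention.

```python
import json


def parse_tool_arguments(raw_arguments):
    """Return (arguments, error); exactly one of the two is None."""
    try:
        parsed = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON in arguments: {exc.msg}"
    # Tool arguments are expected to be a JSON object, not a list or scalar.
    if not isinstance(parsed, dict):
        return None, "arguments must be a JSON object"
    return parsed, None


arguments, error = parse_tool_arguments(
    '{"departure":"New York","destination":"London","date":"2025-07-01"}'
)
print(arguments, error)
```

When error is set, the application can send it back to the model as the tool result instead of raising, which gives the model a chance to correct its call.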

Second Request: Returning Tool Results to the LLM

After executing the tool in your code, you need to send the results back to the LLM to complete the conversation. At this point, you need to construct a new messages list that includes:

The original user message.
The assistant message returned by the LLM in the previous step (containing tool_calls).
A new message with the tool role, containing the tool's execution results.

# message history
messages = [
    {"role": "user", "content": "Please book me a flight from New York to London tomorrow"},
    response.choices[0].message,  # Assistant's 'tool_calls' message
    {
        "tool_call_id": "call_abc123",  # Must match the ID from the previous step
        "role": "tool",
        "name": "book_flight",
        "content": "{\"status\": \"success\", \"ticket_id\": \"TICKET-45678\"}"  # Actual return value from the tool
    }
]
# second request
second_response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)

Second Response: LLM's Final Reply

This time, the LLM will generate a natural language response for the user based on the tool's returned results.

{
  "choices": [
    {
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "Great! I've booked your flight from New York to London for tomorrow. Your ticket ID is TICKET-45678."
      }
    }
  ],
  ...
}

With this, a complete tool calling cycle is finished.

4. Code Implementation: A Complete Python Example

Below is an end-to-end Python example using OpenAI's Python library to demonstrate how to implement a weather query feature.

import os
import json
from openai import OpenAI
from dotenv import load_dotenv
# --- 1. Initial Setup ---
load_dotenv()  # Load environment variables from .env file
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
# --- 2. Define Our Local Tool Functions ---
# This is a mock function; in a real application, it would call an actual weather API
def get_current_weather(location, unit="celsius"):
    """Get real-time weather information for a specified location"""
    if "New York" in location:
        return json.dumps({
            "location": "New York",
            "temperature": "10",
            "unit": unit,
            "forecast": ["sunny", "light breeze"]
        })
    elif "London" in location:
        return json.dumps({
            "location": "London",
            "temperature": "15",
            "unit": unit,
            "forecast": ["light rain", "northeast wind"]
        })
    else:
        return json.dumps({"location": location, "temperature": "unknown"})
# Explicit allow-list: only functions registered here can be called by name
available_functions = {"get_current_weather": get_current_weather}
# --- 3. Main Execution Flow ---
def run_conversation(user_prompt: str):
    print(f"👤 User: {user_prompt}")
    # Step 1: Send the user's message and tool definitions to the LLM
    messages = [{"role": "user", "content": user_prompt}]
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get real-time weather information for a specified city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "City name, e.g., New York City",
                        },
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                    },
                    "required": ["location"],
                },
            },
        }
    ]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
        tool_choice="auto",
    )
    response_message = response.choices[0].message
    tool_calls = response_message.tool_calls
    # Step 2: Check if the LLM decided to call a tool
    if tool_calls:
        print(f"🤖 LLM decided to call tool: {tool_calls[0].function.name}")
        # Add the LLM's reply to the message history
        messages.append(response_message)
        # Step 3: Execute the tool call
        # Note: This example only handles the first tool call. If the model returns
        # several, each one needs its own "tool" message (see Parallel Tool Calling).
        tool_call = tool_calls[0]
        function_name = tool_call.function.name
        function_to_call = available_functions.get(function_name)  # Look up the function in the allow-list
        if not function_to_call:
            print(f"❌ Error: Function {function_name} is not defined")
            return
        function_args = json.loads(tool_call.function.arguments)
        # Call the function and get the result
        function_response = function_to_call(
            location=function_args.get("location"),
            # "unit" is optional, so fall back to the default instead of passing None
            unit=function_args.get("unit", "celsius"),
        )
        print(f"🛠️ Tool '{function_name}' returned: {function_response}")
        # Step 4: Return the tool's execution result to the LLM
        messages.append(
            {
                "tool_call_id": tool_call.id,
                "role": "tool",
                "name": function_name,
                "content": function_response,
            }
        )
        print("🗣️ Submitting tool result back to LLM, generating final response...")
        second_response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
        )
        final_response = second_response.choices[0].message.content
        print(f"🤖 LLM final response: {final_response}")
        return final_response
    else:
        # If the LLM didn't call any tools, directly return its text content
        final_response = response_message.content
        print(f"🤖 LLM direct response: {final_response}")
        return final_response
# --- Run Examples ---
if __name__ == "__main__":
    run_conversation("What's the weather like in London today?")
    print("\n" + "="*50 + "\n")
    run_conversation("How are you?")

This example clearly demonstrates the entire process from defining tools, sending requests, handling tool_calls, executing local functions, to sending results back to the model to get the final answer.

5. Advanced Topics and Best Practices

After mastering the basic process, we need to understand some advanced usage and design principles to build more robust and reliable tool calling systems.

5.1. Parallel Tool Calling

Newer models (like gpt-4o) support parallel tool calling. This means the model can request multiple independent tool calls, to the same tool or to different tools, in a single response.

Scenario Example: User asks: “What's the weather like in New York and London today?”

The model might return a response containing two tool_calls:

get_current_weather(location="New York")
get_current_weather(location="London")

Your code needs to be able to iterate through each tool_call object in the message.tool_calls array, execute them separately, collect all results, and then submit these results together in a new request to the model.

Code Handling Logic:

# ... (received response_message containing multiple tool_calls)
messages.append(response_message)  # Add assistant's reply to messages
# Execute functions for each tool call and collect results
# (available_functions maps each tool name to the local Python function)
tool_outputs = []
for tool_call in tool_calls:
    function_name = tool_call.function.name
    function_to_call = available_functions[function_name]
    function_args = json.loads(tool_call.function.arguments)
    output = function_to_call(**function_args)
    tool_outputs.append({
        "tool_call_id": tool_call.id,
        "role": "tool",
        "name": function_name,
        "content": output,
    })
# Add all tool outputs to the message history
messages.extend(tool_outputs)
# Call the model again
second_response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)
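Because the requested calls are independent, the application may also run them concurrently rather than one after another, which can help when each tool waits on a network request. The sketch below uses a thread pool from the standard library. The fake_tool_calls objects are hypothetical stand-ins shaped like the SDK's tool call objects, and get_current_weather is a mock.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from types import SimpleNamespace


def get_current_weather(location, unit="celsius"):
    # Mock tool; a real one would perform a (slow) network request here.
    return json.dumps({"location": location, "unit": unit})


available_functions = {"get_current_weather": get_current_weather}


def run_tool_call(tool_call):
    """Execute one tool call and build the matching `tool` message."""
    function = available_functions[tool_call.function.name]
    arguments = json.loads(tool_call.function.arguments)
    return {
        "tool_call_id": tool_call.id,
        "role": "tool",
        "content": function(**arguments),
    }


def run_tool_calls_concurrently(tool_calls, max_workers=4):
    # executor.map preserves input order, so results line up with tool_calls.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(run_tool_call, tool_calls))


# Hypothetical tool calls, shaped like the objects in message.tool_calls.
fake_tool_calls = [
    SimpleNamespace(id="call_1", function=SimpleNamespace(
        name="get_current_weather", arguments='{"location": "New York"}')),
    SimpleNamespace(id="call_2", function=SimpleNamespace(
        name="get_current_weather", arguments='{"location": "London"}')),
]

tool_messages = run_tool_calls_concurrently(fake_tool_calls)
print(tool_messages)
```

Threads suit I/O-bound tools. For CPU-bound tools or an async codebase, a process pool or asyncio would be the more natural choice.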

5.2. Error Handling

Tool calls are not always successful. APIs might time out, databases might be unreachable, or the function execution itself might throw exceptions. Gracefully handling these errors is crucial.

When a tool execution fails, you should catch the exception and return structured information describing the error as the result of the tool call to the LLM.

Example:

try:
    # Try to call the API
    result = some_flaky_api()
    content = json.dumps({"status": "success", "data": result})
except Exception as e:
    # If it fails, return error information
    content = json.dumps({"status": "error", "message": f"API call failed: {str(e)}"})
# Return the result (whether successful or failed) to the LLM
messages.append({
    "tool_call_id": tool_call.id,
    "role": "tool",
    "name": function_name,
    "content": content,
})

When the LLM receives error information, it typically responds to the user with an apologetic answer that reflects the problem (e.g., “Sorry, I'm currently unable to retrieve weather information. Please try again later.”) rather than causing the entire application to crash.

5.3. Designing Effective Tool Descriptions

The quality of the tool description (description) directly determines the LLM's call accuracy.

Clear and Specific: Avoid using vague terms.
- Bad: “Get data”
- Good: “Query the user's order history from the company's CRM system based on user ID”
Include Key Information and Limitations: If the tool has specific limitations, be sure to mention them in the description.
- Example: “Query flight information. Note: This tool can only query flights within the next 30 days and cannot query historical flights.”
Start with a Verb: Use a clear verb to describe the core functionality of the function.
Clear Parameter Descriptions: The description of parameters is equally important; it guides the LLM on how to correctly extract information from user conversations.
- Bad: "date": "A date"
- Good: "date": "Booking date, must be a string in YYYY-MM-DD format"

5.4. Security Considerations

Giving LLMs the ability to call code is a double-edged sword and must be handled with caution.

Never Execute Code Generated by LLMs: The LLM's output is a “call suggestion,” not executable code. Never use eval() or similar methods to directly execute strings generated by LLMs. You should parse the suggested function name and parameters, then call your pre-defined, safe, and trusted local functions.
Confirmation and Authorization: For operations with serious consequences (like deleting data, sending emails, making payments), implement a confirmation mechanism before execution. This could be forcing user confirmation at the code level or having the LLM generate a confirmation message after generating the call suggestion.
Principle of Least Privilege: Only provide the LLM with the minimum tools necessary to complete its task. Don't expose your entire codebase or irrelevant APIs.
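The three principles above can be combined in a single dispatch function: an explicit registry acts as the allow-list, and sensitive tools are gated behind a confirmation callback. This is a minimal sketch under assumed names; TOOL_REGISTRY, dispatch, get_order_status, and delete_order are all hypothetical, and a real confirmation step would prompt the user instead of calling a lambda.

```python
import json


def get_order_status(order_id):
    return json.dumps({"order_id": order_id, "status": "shipped"})


def delete_order(order_id):
    return json.dumps({"order_id": order_id, "deleted": True})


# Only functions registered here can ever be invoked from a model suggestion.
TOOL_REGISTRY = {
    "get_order_status": {"function": get_order_status, "needs_confirmation": False},
    "delete_order": {"function": delete_order, "needs_confirmation": True},
}


def dispatch(name, arguments, confirm=lambda name, arguments: False):
    """Run a registered tool; `confirm` decides whether a sensitive call proceeds."""
    entry = TOOL_REGISTRY.get(name)
    if entry is None:
        # Unknown names are rejected; nothing is looked up dynamically or eval'd.
        return json.dumps({"status": "error", "message": f"unknown tool: {name}"})
    if entry["needs_confirmation"] and not confirm(name, arguments):
        return json.dumps({"status": "denied", "message": "user confirmation required"})
    return entry["function"](**arguments)


print(dispatch("get_order_status", {"order_id": "A1"}))
print(dispatch("delete_order", {"order_id": "A1"}))  # denied by default
print(dispatch("delete_order", {"order_id": "A1"}, confirm=lambda n, a: True))
```

Denying by default means a forgotten confirmation hook fails safe: the destructive call is refused rather than silently executed.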

6. Conclusion and Future Outlook

LLM tool calling is one of the most significant advances in artificial intelligence in recent years. It transforms LLMs from closed “language brains” into open, extensible “intelligent agent” cores capable of interacting with the world. By combining the powerful natural language understanding capabilities of LLMs with the broad functionality of external tools, we can build intelligent applications that were previously out of reach.

From querying weather and booking hotels to controlling smart homes, analyzing corporate financial reports, and automating software development processes, tool calling is unlocking countless possibilities. As model capabilities continue to strengthen, tool description understanding will become more precise, multi-tool coordination will become more complex and intelligent, and error handling and self-correction capabilities will become stronger.

In the future, we may see more complex Agentic architectures where LLMs not only call tools but can dynamically create, combine, and even optimize tools. Mastering the principles and practices of LLM tool calling is not only an essential skill to keep up with the current AI technology wave but also a key to future intelligent application development.

LLM Hyperparameter Tuning Guide: A Comprehensive Analysis from Generation to Deployment

Fri, 27 Jun 2025 03:00:00 +0000

Introduction

Behind the powerful capabilities of large language models (LLMs) is a series of complex hyperparameters working silently. Whether you're deploying a local inference service like vLLM or calling OpenAI's API, precisely tuning these parameters is crucial for achieving ideal performance, cost, and output quality. This document provides a detailed analysis of two key categories of hyperparameters: Generation (Sampling) Parameters and Deployment (Serving) Parameters, helping you fully master their functions, values, impacts, and best practices across different scenarios.

Part 1: Generation (Sampling) Parameters — Controlling Model Creativity and Determinism

Generation parameters directly control the model's behavior when generating the next token. They primarily revolve around a core question: how to select from thousands of possible next words in the probability distribution provided by the model.

1. `temperature`

In one sentence: Controls the randomness of generated text. Higher temperature increases randomness, making responses more creative and diverse; lower temperature decreases randomness, making responses more deterministic and conservative.

Underlying Principle: When generating the next token, the model calculates logits (raw, unnormalized prediction scores) for all words in the vocabulary. Typically, we use the Softmax function to convert these logits into a probability distribution. The temperature parameter is introduced before the Softmax calculation, “smoothing” or “sharpening” this probability distribution.

The standard Softmax formula is: P(i) = exp(logit_i) / Σ_j(exp(logit_j))

With temperature (T) introduced, the formula becomes: P(i) = exp(logit_i / T) / Σ_j(exp(logit_j / T))
- When T -> 0, the differences in logit_i / T become dramatically amplified. The token with the highest logit approaches a probability of 1, while all other tokens approach 0. This causes the model to almost always choose the most likely word, behaving very deterministically and “greedily.”
- When T = 1, the formula reverts to standard Softmax, and the model behaves in its “original” state.
- When T > 1, the differences in logit_i / T are reduced. Tokens with originally lower probabilities get boosted, making the entire probability distribution “flatter.” This increases the chance of selecting less common words, introducing more randomness and creativity.
Value Range and Recommendations:
- Range: [0.0, 2.0] (theoretically can be higher, but OpenAI API typically limits to 2.0).
- temperature = 0.0: Suitable for scenarios requiring deterministic, reproducible, and highly accurate outputs. Examples: code generation, factual Q&A, text classification, data extraction. With identical inputs, outputs will be almost identical (unless the model itself is updated).
- Low temperature (e.g., 0.1 - 0.4): Suitable for semi-creative tasks requiring rigor and fidelity to source material. Examples: article summarization, translation, customer service bots. Outputs will vary slightly but remain faithful to core content.
- Medium temperature (e.g., 0.5 - 0.8): A good balance between creativity and consistency, recommended as the default for most applications. Examples: writing emails, marketing copy, brainstorming.
- High temperature (e.g., 0.9 - 1.5): Suitable for highly creative tasks. Examples: poetry writing, story creation, dialogue script generation. Outputs will be very diverse and sometimes surprising, but may occasionally produce meaningless or incoherent content.
Note:
- It's generally not recommended to modify both temperature and top_p simultaneously; it's better to adjust just one. OpenAI's documentation explicitly states that modifying only one is typically advised.
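The temperature-scaled Softmax formula from this section can be tried out on a toy vocabulary. This is a minimal sketch in plain Python; softmax_with_temperature and the three example logits are illustrative assumptions, not code from any particular inference engine.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Compute P(i) = exp(logit_i / T) / sum_j exp(logit_j / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat T = 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow without changing the result
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At T = 1.0 the distribution is roughly 0.665, 0.245, 0.090. Lowering T concentrates almost all probability on the first token, and raising T spreads it more evenly across the three.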

2. `top_p` (Nucleus Sampling)

In one sentence: Controls generation diversity by dynamically determining the sampling pool size through a cumulative probability threshold (p) of the highest probability tokens.

Underlying Principle: top_p is a more intelligent sampling strategy than temperature, also known as Nucleus Sampling. Instead of adjusting all token probabilities, it directly defines a “core” candidate set.

The specific steps are as follows:
1. The model calculates the probability distribution for all candidate tokens.
2. All tokens are sorted by probability from highest to lowest.
3. Starting from the highest probability token, their probabilities are cumulatively added until this sum reaches or exceeds the set top_p threshold.
4. All tokens included in this cumulative sum form the “nucleus” for sampling.
5. The model will only sample from this nucleus (typically renormalizing their probabilities), and all other tokens are ignored.
Example: Assume top_p = 0.9.
- If the highest probability token “the” has a probability of 0.95, then the nucleus will contain only “the”, and the model will choose it with 100% probability.
- If “the” has a probability of 0.5, “a” has 0.3, and “an” has 0.1, then the cumulative probability of these three words is 0.9. The nucleus will contain {“the”, “a”, “an”}. The model will sample from these three words according to their (renormalized) probabilities.
Value Range and Recommendations:
- Range: (0.0, 1.0].
- top_p = 1.0: Means the model considers all tokens without any truncation (equivalent to no top_p).
- High top_p (e.g., 0.9 - 1.0): Allows for more diverse choices, suitable for creative tasks, similar in effect to higher temperature.
- Low top_p (e.g., 0.1 - 0.3): Greatly restricts the model's range of choices, making its output very deterministic and conservative, similar in effect to extremely low temperature.
- General Recommended Value: 0.9 is a very common default value as it maintains high quality while allowing for some diversity.
top_p vs temperature:
- top_p is more dynamic and adaptive. When the model is very confident about the next step (sharp probability distribution), top_p automatically narrows the candidate set, ensuring quality. When the model is less confident (flat distribution), it expands the candidate set, increasing diversity.
- temperature adjusts the entire distribution “equally,” regardless of whether the distribution itself is sharp or flat.
- Therefore, top_p is generally considered a safer and more robust method for controlling diversity than temperature.
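The nucleus construction described in this section can be sketched in a few lines. The function names and the toy distribution are assumptions made for illustration, and the small tolerance in the comparison is there only to absorb floating-point rounding when the cumulative sum lands exactly on the threshold.

```python
import random


def top_p_filter(probs, top_p):
    """Return the renormalized nucleus as {token: probability}."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p - 1e-12:  # threshold reached: stop growing the nucleus
            break
    total = sum(p for _, p in nucleus)
    return {token: p / total for token, p in nucleus}


def sample(distribution, rng):
    """Draw one token according to the given probabilities."""
    tokens = list(distribution)
    weights = list(distribution.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


probs = {"the": 0.5, "a": 0.3, "an": 0.1, "this": 0.06, "that": 0.04}
nucleus = top_p_filter(probs, top_p=0.9)
print(nucleus)                            # only "the", "a", "an" remain
print(sample(nucleus, random.Random(0)))  # one token drawn from the nucleus
```

With a sharper distribution such as {"the": 0.95, "a": 0.05}, the same call keeps only "the", which is the adaptive behavior the comparison with temperature refers to.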

3. `top_k`

In one sentence: Simply and directly samples only from the k tokens with the highest probabilities.

Underlying Principle: This is the simplest truncation sampling method. It directly selects the k tokens with the highest probabilities to form the candidate set, then samples from these k tokens. All other tokens are ignored.
Value Range and Recommendations:
- Range: Integers, such as 1, 10, 50.
- top_k = 1: Equivalent to greedy search, always choosing the most likely word.
- Recommendation: top_k is typically not the preferred sampling strategy because it's too “rigid.” In cases where the probability distribution is very flat, it might accidentally exclude many reasonable words; while in cases where the distribution is very sharp, it might include many extremely low-probability, useless words. top_p is usually a better choice.
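The rigidity mentioned in the recommendation is easy to see on two toy distributions. This is a minimal sketch: top_k_filter and both example distributions are hypothetical, chosen only to contrast a flat and a sharp case.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}


flat = {"red": 0.22, "blue": 0.21, "green": 0.20, "black": 0.19, "white": 0.18}
sharp = {"the": 0.97, "a": 0.01, "an": 0.01, "this": 0.01}

print(top_k_filter(flat, 3))   # drops "black" and "white", although both are plausible
print(top_k_filter(sharp, 3))  # keeps two tokens that had only 1% probability each
print(top_k_filter(sharp, 1))  # k = 1 is greedy decoding
```

The same k cuts too much in the flat case and too little in the sharp one, whereas a cumulative-probability cutoff adapts to the shape of the distribution.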

4. `repetition_penalty`

In one sentence: Applies a penalty to tokens that have already appeared in the context, reducing their probability of being selected again, thereby reducing repetitive content.

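Underlying Principle: After calculating logits but before Softmax, this parameter iterates through all candidate tokens. If a token has already appeared in the previous context, its logit value is pushed down. In common implementations (for example, Hugging Face Transformers), positive logits are divided by the value of repetition_penalty and negative logits are multiplied by it, because dividing a negative logit would make the token more likely rather than less.

new_logit = logit / penalty (if token has appeared and logit > 0)
new_logit = logit * penalty (if token has appeared and logit < 0)
new_logit = logit (if token has not appeared)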

This way, the final probability of words that have already appeared decreases.
Value Range and Recommendations:
- Range: 1.0 to 2.0 is common.
- 1.0: No penalty applied (default value).
- 1.1 - 1.3: A relatively safe range that can effectively reduce unnecessary repetition without overly affecting normal language expression (such as necessary articles like “the”).
- Too High Values: May cause the model to deliberately avoid common words, producing unnatural or even strange sentences.
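A minimal sketch of the penalty on a list of logits follows. It uses one common convention (found, for example, in Hugging Face Transformers): positive logits are divided by the penalty and negative logits are multiplied by it, so a penalized token always becomes less likely. Plain division alone would raise a negative logit. The function name and the sample values are illustrative.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Return a copy of logits with already-seen tokens made less likely."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty  # shrink a positive logit toward zero
        else:
            adjusted[token_id] *= penalty  # push a negative logit further down
    return adjusted


logits = [2.0, -1.0, 0.5]  # indexed by token id
print(apply_repetition_penalty(logits, seen_token_ids=[0, 1], penalty=2.0))
# → [1.0, -2.0, 0.5]
```

Token 2 has not appeared, so its logit is unchanged. With penalty = 1.0 the function returns the logits as they were.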

5. `frequency_penalty` & `presence_penalty`

These two parameters are more refined versions of repetition_penalty.

presence_penalty:
- Function: Applies a fixed penalty to all tokens that have appeared at least once in the context. It doesn't care how many times the token has appeared; as long as it has appeared, it gets penalized.
- Underlying Principle: new_logit = logit - presence_penalty (if token has appeared at least once).
- Scenario: This parameter is useful when you want to encourage the model to introduce entirely new concepts and vocabulary, rather than repeatedly discussing topics that have already been mentioned.
- Range: -2.0 to 2.0 (in the OpenAI API). Positive values penalize tokens that have already appeared, which encourages new tokens; negative values make already-used tokens more likely.
frequency_penalty:
- Function: The penalty is proportional to the frequency of the token in the context. The more times a word appears, the heavier the penalty it receives.
- Underlying Principle: new_logit = logit - count(token) * frequency_penalty.
- Scenario: This parameter is effective when you find the model tends to repeatedly use certain specific high-frequency words (even if they are necessary), leading to monotonous language.
- Range: -2.0 to 2.0 (in the OpenAI API).
Summary: presence_penalty addresses the question of “whether it has appeared,” while frequency_penalty addresses “how many times it has appeared.”
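Both formulas are additive, so they can be applied in one pass over the tokens generated so far. The sketch below combines them; apply_additive_penalties is a hypothetical helper, and the values are chosen so the arithmetic is easy to follow by hand.

```python
from collections import Counter


def apply_additive_penalties(logits, generated_token_ids,
                             presence_penalty=0.0, frequency_penalty=0.0):
    """new_logit = logit - count * frequency_penalty - presence_penalty (if count > 0)."""
    counts = Counter(generated_token_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= count * frequency_penalty + presence_penalty
    return adjusted


logits = [1.0, 1.0, 1.0]   # indexed by token id
generated = [0, 0, 0, 1]   # token 0 appeared three times, token 1 once
print(apply_additive_penalties(logits, generated,
                               presence_penalty=0.5, frequency_penalty=0.25))
# → [-0.25, 0.25, 1.0]
```

Token 0 pays the presence penalty once plus three units of frequency penalty, token 1 pays the presence penalty plus one unit, and token 2 is untouched because it has not appeared.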

6. `seed`

In one sentence: By providing a fixed seed, you can make the model's output reproducible when other parameters (such as temperature) remain the same.

Function: In machine learning, many operations that seem random are actually “pseudo-random,” determined by an initial “seed.” Setting the same seed will produce the same sequence of random numbers. In LLMs, this means the sampling step itself becomes repeatable. Hosted APIs may still show small variations between calls; OpenAI, for example, describes seeded determinism as best effort.
Scenarios:
- Debugging and Testing: When you need to verify whether a change has affected the output, fixing the seed can eliminate randomness interference.
- Reproducible Research: Reproducibility is crucial in academic research.
- Generating Consistent Content: When you need the model to consistently produce outputs in the same style for the same input.
Note: For complete reproduction, all generation parameters (prompt, model, temperature, top_p, etc.) must be identical.
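The idea can be illustrated with a toy sampler built on Python's standard `random` module. This is only a sketch of seeded pseudo-random sampling, not how any particular inference engine implements it; `sample_tokens` and the logit values are assumptions made up for the example.

```python
import math
import random


def sample_tokens(logits, temperature, seed, num_samples=5):
    """Draw tokens from softmax(logits / temperature) using a seeded generator."""
    rng = random.Random(seed)  # private generator, so the global state is untouched
    tokens = list(logits)
    # Unnormalized softmax weights; random.choices normalizes them internally.
    weights = [math.exp(logits[t] / temperature) for t in tokens]
    return [rng.choices(tokens, weights=weights, k=1)[0] for _ in range(num_samples)]


toy_logits = {"sunny": 1.2, "rainy": 0.8, "cloudy": 0.5}
run_a = sample_tokens(toy_logits, temperature=0.7, seed=42)
run_b = sample_tokens(toy_logits, temperature=0.7, seed=42)
print(run_a == run_b)  # same seed and same parameters give the same draws
```

Changing the seed, the temperature, or the logits changes the draws, which is why all other generation parameters must match for a run to be reproduced.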

Part 2: Deployment (Serving) Parameters — Optimizing Service Performance and Capacity

Deployment parameters determine how an LLM inference service manages GPU resources, handles concurrent requests, and optimizes overall throughput and latency. These parameters are particularly important in high-performance inference engines like vLLM.

1. `gpu_memory_utilization`

In one sentence: Controls the proportion of GPU memory that vLLM can use, with the core purpose of reserving space for the KV Cache.

Underlying Principle (PagedAttention): The core of vLLM is the PagedAttention mechanism. Traditional inference systems pre-allocate a contiguous, maximum-length memory region for each request to store the Key-Value (KV) Cache. This wastes a great deal of memory, because most requests are far shorter than the maximum length.

PagedAttention manages the KV Cache like virtual memory in an operating system:
1. It breaks down each sequence's KV Cache into many small, fixed-size “blocks.”
2. These blocks can be stored non-contiguously in GPU memory.
3. A central “Block Manager” is responsible for allocating and releasing these blocks.
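The three steps above can be sketched with a toy allocator. `ToyBlockManager` is a hypothetical class written for illustration; it tracks only block IDs and is not vLLM's actual block manager.

```python
class ToyBlockManager:
    """Hands out fixed-size KV Cache blocks on demand (illustrative only)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> physical block ids, possibly non-contiguous

    def blocks_needed(self, num_tokens):
        return -(-num_tokens // self.block_size)  # ceiling division

    def allocate(self, seq_id, num_tokens):
        """Grow a sequence's block table to cover num_tokens; False if out of blocks."""
        table = self.block_tables.get(seq_id, [])
        extra = self.blocks_needed(num_tokens) - len(table)
        if extra > len(self.free_blocks):
            return False
        new_blocks = [self.free_blocks.pop() for _ in range(max(extra, 0))]
        self.block_tables[seq_id] = table + new_blocks
        return True

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))


manager = ToyBlockManager(num_blocks=8, block_size=16)
ok_a = manager.allocate("a", 40)        # 40 tokens need 3 blocks
ok_b = manager.allocate("b", 70)        # 70 tokens need 5 blocks, pool is now empty
grow_fails = manager.allocate("a", 50)  # a 4th block for "a" is not available
manager.free("b")
grow_ok = manager.allocate("a", 50)     # succeeds once "b" has released its blocks
```

A sequence only holds as many blocks as its current length requires, which is what removes the waste of maximum-length pre-allocation.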
gpu_memory_utilization tells vLLM: “You may use this fraction of the total GPU memory for your own management (mainly model weights and the physical blocks of the KV Cache).”
Value Range and Impact:
- Range: (0.0, 1.0].
- Default Value: 0.9 (i.e., 90%).
- Higher Values (e.g., 0.95):
  - Advantage: vLLM has more memory for KV Cache, supporting longer contexts and larger batch sizes, thereby increasing throughput.
  - Risk: If set too high, there might not be enough spare memory for CUDA kernels, drivers, or other system processes, easily leading to OOM (Out of Memory) errors.
- Lower Values (e.g., 0.8):
  - Advantage: Safer, less prone to OOM, reserves more memory for the system and other applications.
  - Disadvantage: Reduced available space for KV Cache, potentially causing vLLM to struggle with high concurrency or long sequence requests, degrading performance. When KV Cache is insufficient, vLLM triggers preemption: it evicts some running sequences (either swapping their KV Cache out or discarding it for later recomputation) and resumes them once enough space frees up, which severely affects latency. vLLM's warning log "there is not enough KV cache space. This can affect the end-to-end performance." is reminding you of this issue.
Recommendations:
- Start with the default value of 0.9.
- If you encounter OOM, gradually lower this value.
- If you encounter many preemption warnings and confirm no other processes are occupying large amounts of GPU memory, you can gradually increase this value.
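A back-of-envelope calculation shows how this setting trades safety margin for KV Cache space. The numbers below (an 80 GiB GPU, 16 GiB of weights, 2 GiB of activation overhead) are assumptions chosen for illustration, and `kv_cache_budget_gib` is a hypothetical helper, not a vLLM function.

```python
def kv_cache_budget_gib(total_gpu_gib, gpu_memory_utilization, weights_gib, overhead_gib=0.0):
    """Estimate the memory left for KV Cache after weights and overhead."""
    usable = total_gpu_gib * gpu_memory_utilization  # the share vLLM may manage
    return max(usable - weights_gib - overhead_gib, 0.0)


# Assumed setup: 80 GiB GPU, 16 GiB of weights, 2 GiB of activation overhead.
budgets = {
    utilization: kv_cache_budget_gib(80, utilization, weights_gib=16, overhead_gib=2)
    for utilization in (0.8, 0.9, 0.95)
}
for utilization, budget in budgets.items():
    print(f"gpu_memory_utilization={utilization}: ~{budget:.1f} GiB for KV Cache")
```

Under these assumptions, moving from 0.8 to 0.95 adds about 12 GiB of KV Cache space, at the cost of leaving only about 4 GiB outside vLLM's control.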

2. `max_num_seqs`

In one sentence: Limits the maximum number of sequences (requests) that the vLLM scheduler can process in one iteration (or one batch).

Underlying Principle: vLLM's scheduler selects a batch of requests from the waiting queue in each processing cycle. This parameter directly limits the size of this “batch.” Together with max_num_batched_tokens (which limits the total number of tokens across all sequences in a batch), it determines the scale of batch processing.
Value Range and Impact:
- Range: Positive integers, such as 16, 64, 256.
- Higher Values:
  - Advantage: Allows for higher concurrency, potentially improving GPU utilization and overall throughput.
  - Disadvantage: Requires more intermediate memory (e.g., for storing logits and sampling states) and may increase the latency of individual batches. If set too high, even if KV Cache still has space, OOM might occur due to insufficient temporary memory.
- Lower Values:
  - Advantage: More memory-friendly, potentially lower latency for individual batches.
  - Disadvantage: Limits concurrency capability, potentially leading to underutilization of GPU and decreased throughput.
Recommendations:
- This value needs to be adjusted based on your GPU memory size, model size, and expected concurrent load.
- For high-concurrency scenarios, try gradually increasing this value while monitoring GPU utilization and memory usage.
- For interactive, low-latency scenarios, consider setting this value lower.
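How the two limits interact can be sketched with a simplified first-come, first-served batch picker. This is a toy model of the idea only; vLLM's real scheduler is considerably more involved, and `pick_batch` is a hypothetical function.

```python
from collections import deque


def pick_batch(waiting, max_num_seqs, max_num_batched_tokens):
    """Pop requests in FIFO order until either limit would be exceeded."""
    batch, token_total = [], 0
    while waiting and len(batch) < max_num_seqs:
        request_id, num_tokens = waiting[0]
        if token_total + num_tokens > max_num_batched_tokens:
            break  # token budget for this batch is exhausted
        waiting.popleft()
        batch.append(request_id)
        token_total += num_tokens
    return batch, token_total


requests = [("r1", 300), ("r2", 500), ("r3", 400), ("r4", 100)]

# The sequence limit binds first: only two requests are admitted.
seq_limited = pick_batch(deque(requests), max_num_seqs=2, max_num_batched_tokens=2048)

# The token limit binds first: the fourth request would exceed 1250 tokens.
token_limited = pick_batch(deque(requests), max_num_seqs=8, max_num_batched_tokens=1250)

print(seq_limited, token_limited)
```

Raising max_num_seqs only helps while the token budget and memory still have headroom, which is why the two values are usually tuned together.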

3. `max_model_len`

In one sentence: Sets the maximum context length the model can process (including both prompt and generated tokens).

Underlying Principle: This parameter directly determines how much logical space vLLM needs to reserve for the KV Cache. For example, if max_model_len = 4096, vLLM must ensure its memory management mechanism can support storing KV pairs for up to 4096 tokens per sequence. This affects vLLM's memory planning at startup, such as the size of Position Embeddings.
Value Range and Impact:
- Range: Positive integers, cannot exceed the maximum length the model was originally trained on.
- Higher Values:
  - Advantage: Can handle longer documents and more complex contexts.
  - Disadvantage: Significantly increases worst-case memory consumption. Each token needs to store KV Cache entries, so doubling the length roughly doubles the KV Cache a single full-length sequence can occupy. Even if current requests are short, vLLM must be able to accommodate a request of this length, so fewer long sequences can be served concurrently from the same pool of KV Cache blocks.
- Lower Values:
  - Advantage: Significantly saves GPU memory. If you know your application scenario will never exceed 1024 tokens, setting this value to 1024 instead of the model's default (often 4096 or 8192) lowers the KV Cache needed per sequence, supporting higher concurrency.
  - Disadvantage: Any requests exceeding this length will be rejected or truncated.
Recommendations:
- Set as needed! This is one of the most effective parameters for optimizing vLLM memory usage. Based on your actual application scenario, set this value to a reasonable maximum with some margin.
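The effect of this setting can be estimated with simple arithmetic. The model shape below (32 layers, 8 KV heads, head dimension 128, FP16) and the 40 GiB KV Cache budget are assumptions picked for round numbers, and both helper functions are hypothetical. Because blocks are allocated on demand, the result is a worst-case bound for full-length sequences rather than a hard concurrency limit.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV Cache one token occupies across all layers."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_full_length_seqs(kv_budget_bytes, max_model_len, bytes_per_token):
    """How many sequences of max_model_len tokens fit in the KV Cache budget."""
    return kv_budget_bytes // (max_model_len * bytes_per_token)


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
budget = 40 * 2**30  # assumed 40 GiB available for KV Cache

capacity = {
    length: max_full_length_seqs(budget, length, per_token)
    for length in (8192, 4096, 1024)
}
print(per_token, capacity)
```

With these assumed numbers, each token costs 128 KiB, so an 8192-token sequence needs 1 GiB, and cutting max_model_len to 1024 lets eight times as many full-length sequences fit in the same budget.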

4. `tensor_parallel_size` & `pipeline_parallel_size`

These two parameters are used for deploying extremely large models across multiple GPUs or nodes.

tensor_parallel_size:
- Function: Divides each layer of the model (such as a large weight matrix) into N parts (N = tensor_parallel_size), placing them on N different GPUs. During computation, each GPU only processes its own portion of the data, then exchanges necessary results through high-speed interconnects (like NVLink) via All-Reduce operations, finally merging to get the complete output.
- Scenario: Used when a single model's size exceeds the memory of a single GPU. For example, a 70B model in FP16 needs roughly 140 GB for its weights alone, so it cannot fit on a single 80GB A100, but it can be deployed across four of them by setting tensor_parallel_size=4.
- Impact:
  - Advantage: Achieves model parallelism, solving the problem of models not fitting on a single card.
  - Disadvantage: Introduces significant cross-GPU communication overhead, potentially affecting latency. Requires high-speed interconnects between GPUs.
pipeline_parallel_size:
- Function: Assigns different layers of the model to different GPUs or nodes. For example, placing layers 1-10 on GPU 1, layers 11-20 on GPU 2, and so on. Data flows through these GPUs like a pipeline.
- Scenario: Used when the model is extremely large and needs to be deployed across multiple nodes (machines).
- Impact:
  - Advantage: Can scale the model across many GPUs/nodes (up to the number of layers), with less communication between stages than tensor parallelism requires.
  - Disadvantage: Creates “pipeline bubbles” as additional overhead, where some GPUs are idle during the start and end phases of the pipeline, reducing utilization.
Combined Use: vLLM supports using both parallelism strategies simultaneously for efficient deployment of giant models on large clusters.
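Both strategies can be sketched in plain Python. The first function shows one common tensor-parallel scheme (splitting the input dimension and summing partial results, which plays the role of All-Reduce); the second shows a contiguous layer assignment for pipeline stages. All function names are hypothetical, and this is not necessarily how vLLM shards any particular layer.

```python
def matvec(x, weight):
    """Compute x @ weight for a vector x of length k and a k-by-n matrix."""
    return [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(len(weight[0]))]


def row_parallel_matvec(x, weight, tensor_parallel_size):
    """Shard the input dimension across 'devices', then sum the partial outputs."""
    shard = len(x) // tensor_parallel_size  # assumes the dimension divides evenly
    partials = []
    for rank in range(tensor_parallel_size):
        lo, hi = rank * shard, (rank + 1) * shard
        # Each device holds only its slice of the weights and the input.
        partials.append(matvec(x[lo:hi], weight[lo:hi]))
    # All-Reduce step: element-wise sum of every device's partial result.
    return [sum(values) for values in zip(*partials)]


def assign_layers(num_layers, pipeline_parallel_size):
    """Give each pipeline stage a contiguous, near-equal slice of layers."""
    base, extra = divmod(num_layers, pipeline_parallel_size)
    stages, start = [], 0
    for stage in range(pipeline_parallel_size):
        size = base + (1 if stage < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages


x = [1, 2, 3, 4]
weight = [[1, 0], [0, 1], [1, 1], [2, 0]]
full = matvec(x, weight)
sharded = row_parallel_matvec(x, weight, tensor_parallel_size=2)
stages = assign_layers(num_layers=10, pipeline_parallel_size=4)
print(full, sharded, stages)
```

The sharded result matches the single-device result, but only after the summation step, which is the communication cost that tensor parallelism pays on every layer.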

Summary and Best Practices

Scenario	`temperature`	`top_p`	`repetition_penalty`	`gpu_memory_utilization`	`max_num_seqs`	`max_model_len`
Code Generation/Factual Q&A	`0.0` - `0.2`	(Not recommended to modify)	`1.0`	`0.9` (Default)	Adjust based on concurrency	Set as needed
Article Summarization/Translation	`0.2` - `0.5`	(Not recommended to modify)	`1.1`	`0.9`	Adjust based on concurrency	Set to maximum possible document length
General Chat/Copywriting	`0.7` (Default)	`0.9` (Recommended)	`1.1` - `1.2`	`0.9`	Adjust based on concurrency	Set as needed, e.g., `4096`
Creative Writing/Brainstorming	`0.8` - `1.2`	`0.95`	`1.0`	`0.9`	Adjust based on concurrency	Set as needed
High Concurrency Throughput Optimization	(Task dependent)	(Task dependent)	(Task dependent)	Try `0.9` - `0.95`	Gradually increase	Set to the minimum value that meets business needs
Low Latency Interaction Optimization	(Task dependent)	(Task dependent)	(Task dependent)	`0.9` (Default)	Set to lower values (e.g., `16-64`)	Set as needed
Extremely Memory Constrained	(Task dependent)	(Task dependent)	(Task dependent)	Lower to `0.8`	Set to lower values	Set to the minimum value that meets business needs

Final Recommendations:

Start with Generation Parameters: First adjust temperature or top_p to achieve satisfactory output quality.
Set Deployment Parameters as Needed: When deploying, first set max_model_len to a reasonable minimum value based on your application scenario.
Monitor and Iterate: Start with the default gpu_memory_utilization=0.9 and a moderate max_num_seqs. Observe memory usage and preemption situations through monitoring tools (such as nvidia-smi and vLLM logs), then gradually adjust these values to find the optimal balance for your specific hardware and workload.
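As a starting point, the recommendations above can be collected into a single set of engine arguments. The helper `build_engine_args`, the model name, and the specific max_num_seqs values (32 and 256) are assumptions for illustration; the keyword names themselves match the vLLM parameters discussed in this article.

```python
def build_engine_args(model, expected_max_tokens, high_concurrency=False, num_gpus=1):
    """Assemble a conservative starting configuration (hypothetical helper)."""
    return {
        "model": model,
        # Smallest context length that still covers the application's needs.
        "max_model_len": expected_max_tokens,
        # Begin with the default and adjust after observing OOMs or preemption.
        "gpu_memory_utilization": 0.9,
        # Assumed starting points: lower for interactive use, higher for throughput.
        "max_num_seqs": 256 if high_concurrency else 32,
        "tensor_parallel_size": num_gpus,
    }


args = build_engine_args("my-org/my-model", expected_max_tokens=4096)
print(args)

# With vLLM installed, these keyword arguments can be passed to the engine:
#   from vllm import LLM
#   llm = LLM(**args)
```

From here, adjust one value at a time while watching nvidia-smi and the vLLM logs, so the effect of each change stays attributable.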