<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hyperparameter Tuning | Ziyang Lin</title><link>https://ziyanglin.netlify.app/en/tags/hyperparameter-tuning/</link><atom:link href="https://ziyanglin.netlify.app/en/tags/hyperparameter-tuning/index.xml" rel="self" type="application/rss+xml"/><description>Hyperparameter Tuning</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Fri, 27 Jun 2025 03:00:00 +0000</lastBuildDate><image><url>https://ziyanglin.netlify.app/img/icon-192.png</url><title>Hyperparameter Tuning</title><link>https://ziyanglin.netlify.app/en/tags/hyperparameter-tuning/</link></image><item><title>LLM Hyperparameter Tuning Guide: A Comprehensive Analysis from Generation to Deployment</title><link>https://ziyanglin.netlify.app/en/post/llm-hyperparameters-documentation/</link><pubDate>Fri, 27 Jun 2025 03:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/llm-hyperparameters-documentation/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;h2 id="span-stylefontsize-09embehind-the-powerful-capabilities-of-large-language-models-llms-is-a-series-of-complex-hyperparameters-working-silently-whether-youre-deploying-a-local-inference-service-like-vllm-or-calling-openais-api-precisely-tuning-these-parameters-is-crucial-for-achieving-ideal-performance-cost-and-output-quality-this-document-provides-a-detailed-analysis-of-two-key-categories-of-hyperparameters-generation-sampling-parameters-and-deployment-serving-parameters-helping-you-fully-master-their-functions-values-impacts-and-best-practices-across-different-scenariosspan">&lt;span style="font-size: 0.9em;">Behind the powerful capabilities of large language models (LLMs) is a series of complex hyperparameters working silently. Whether you're deploying a local inference service like vLLM or calling OpenAI's API, precisely tuning these parameters is crucial for achieving ideal performance, cost, and output quality. This document provides a detailed analysis of two key categories of hyperparameters: &lt;strong>Generation (Sampling) Parameters&lt;/strong> and &lt;strong>Deployment (Serving) Parameters&lt;/strong>, helping you fully master their functions, values, impacts, and best practices across different scenarios.&lt;/span>&lt;/h2>
&lt;h3 id="part-1-generation-sampling-parameters--controlling-model-creativity-and-determinism">Part 1: Generation (Sampling) Parameters — Controlling Model Creativity and Determinism&lt;/h3>
&lt;p>Generation parameters directly control the model's behavior when generating the next token. They primarily revolve around a core question: how to select from thousands of possible next words in the probability distribution provided by the model.&lt;/p>
&lt;h3 id="1-temperature">1. &lt;code>temperature&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Controls the randomness of generated text. Higher &lt;code>temperature&lt;/code> increases randomness, making responses more creative and diverse; lower &lt;code>temperature&lt;/code> decreases randomness, making responses more deterministic and conservative.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle:&lt;/strong>
When generating the next token, the model calculates &lt;code>logits&lt;/code> (raw, unnormalized prediction scores) for all words in the vocabulary. Typically, we use the &lt;code>Softmax&lt;/code> function to convert these &lt;code>logits&lt;/code> into a probability distribution. The &lt;code>temperature&lt;/code> parameter is introduced before the &lt;code>Softmax&lt;/code> calculation, &amp;ldquo;smoothing&amp;rdquo; or &amp;ldquo;sharpening&amp;rdquo; this probability distribution.&lt;/p>
&lt;p>The standard Softmax formula is: &lt;code>P(i) = exp(logit_i) / Σ_j(exp(logit_j))&lt;/code>&lt;/p>
&lt;p>With &lt;code>temperature&lt;/code> (T) introduced, the formula becomes: &lt;code>P(i) = exp(logit_i / T) / Σ_j(exp(logit_j / T))&lt;/code>&lt;/p>
&lt;ul>
&lt;li>When &lt;code>T&lt;/code> -&amp;gt; 0, the differences in &lt;code>logit_i / T&lt;/code> become dramatically amplified. The token with the highest logit approaches a probability of 1, while all other tokens approach 0. This causes the model to almost always choose the most likely word, behaving very deterministically and &amp;ldquo;greedily.&amp;rdquo;&lt;/li>
&lt;li>When &lt;code>T&lt;/code> = 1, the formula reverts to standard Softmax, and the model behaves in its &amp;ldquo;original&amp;rdquo; state.&lt;/li>
&lt;li>When &lt;code>T&lt;/code> &amp;gt; 1, the differences in &lt;code>logit_i / T&lt;/code> are reduced. Tokens with originally lower probabilities get boosted, making the entire probability distribution &amp;ldquo;flatter.&amp;rdquo; This increases the chance of selecting less common words, introducing more randomness and creativity.&lt;/li>
&lt;/ul>
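&lt;p>To make the effect concrete, here is a minimal NumPy sketch of temperature-scaled softmax (illustrative only, not any inference engine's internal implementation):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    # Divide the logits by T before softmax; T must be strictly positive.
    scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = [4.0, 2.5, 1.0]  # toy logits for three candidate tokens
for T in (0.1, 1.0, 2.0):
    print(T, softmax_with_temperature(logits, T).round(3))
# T=0.1 is nearly one-hot, T=1.0 is the standard softmax, T=2.0 is visibly flatter&lt;/code>&lt;/pre>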
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> &lt;code>[0.0, 2.0]&lt;/code> (theoretically can be higher, but OpenAI API typically limits to 2.0).&lt;/li>
&lt;li>&lt;strong>&lt;code>temperature&lt;/code> = 0.0:&lt;/strong> Suitable for scenarios requiring deterministic, reproducible, and highly accurate outputs. Examples: code generation, factual Q&amp;amp;A, text classification, data extraction. With identical inputs, outputs will be almost identical (unless the model itself is updated).&lt;/li>
&lt;li>&lt;strong>Low &lt;code>temperature&lt;/code> (e.g., &lt;code>0.1&lt;/code> - &lt;code>0.4&lt;/code>):&lt;/strong> Suitable for semi-creative tasks requiring rigor and fidelity to source material. Examples: article summarization, translation, customer service bots. Outputs will vary slightly but remain faithful to core content.&lt;/li>
&lt;li>&lt;strong>Medium &lt;code>temperature&lt;/code> (e.g., &lt;code>0.5&lt;/code> - &lt;code>0.8&lt;/code>):&lt;/strong> A good balance between creativity and consistency, recommended as the default for most applications. Examples: writing emails, marketing copy, brainstorming.&lt;/li>
&lt;li>&lt;strong>High &lt;code>temperature&lt;/code> (e.g., &lt;code>0.9&lt;/code> - &lt;code>1.5&lt;/code>):&lt;/strong> Suitable for highly creative tasks. Examples: poetry writing, story creation, dialogue script generation. Outputs will be very diverse and sometimes surprising, but may occasionally produce meaningless or incoherent content.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Note:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>It's generally not recommended to tune both &lt;code>temperature&lt;/code> and &lt;code>top_p&lt;/code> at the same time; OpenAI's documentation explicitly advises altering only one of them.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="2-topp-nucleus-sampling">2. &lt;code>top_p&lt;/code> (Nucleus Sampling)&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Controls generation diversity by dynamically determining the sampling pool size through a cumulative probability threshold (&lt;code>p&lt;/code>) of the highest probability tokens.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle:&lt;/strong>
&lt;code>top_p&lt;/code> is a more intelligent sampling strategy than &lt;code>temperature&lt;/code>, also known as &lt;strong>Nucleus Sampling&lt;/strong>. Instead of adjusting all token probabilities, it directly defines a &amp;ldquo;core&amp;rdquo; candidate set.&lt;/p>
&lt;p>The specific steps are as follows:&lt;/p>
&lt;ol>
&lt;li>The model calculates the probability distribution for all candidate tokens.&lt;/li>
&lt;li>All tokens are sorted by probability from highest to lowest.&lt;/li>
&lt;li>Starting from the highest probability token, their probabilities are cumulatively added until this sum exceeds the set &lt;code>top_p&lt;/code> threshold.&lt;/li>
&lt;li>All tokens included in this cumulative sum form the &amp;ldquo;nucleus&amp;rdquo; for sampling.&lt;/li>
&lt;li>The model will only sample from this nucleus (typically renormalizing their probabilities), and all other tokens are ignored.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Example:&lt;/strong> Assume &lt;code>top_p&lt;/code> = &lt;code>0.9&lt;/code>.&lt;/p>
&lt;ul>
&lt;li>If the highest probability token &amp;ldquo;the&amp;rdquo; has a probability of &lt;code>0.95&lt;/code>, then the nucleus will contain only &amp;ldquo;the&amp;rdquo;, and the model will choose it 100%.&lt;/li>
&lt;li>If &amp;ldquo;the&amp;rdquo; has a probability of &lt;code>0.5&lt;/code>, &amp;ldquo;a&amp;rdquo; has &lt;code>0.3&lt;/code>, and &amp;ldquo;an&amp;rdquo; has &lt;code>0.1&lt;/code>, then the cumulative probability of these three words is &lt;code>0.9&lt;/code>. The nucleus will contain {&amp;ldquo;the&amp;rdquo;, &amp;ldquo;a&amp;rdquo;, &amp;ldquo;an&amp;rdquo;}. The model will sample from these three words according to their (renormalized) probabilities.&lt;/li>
&lt;/ul>
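&lt;p>A minimal NumPy sketch of these steps (illustrative only; production engines operate on logits and typically combine this with &lt;code>temperature&lt;/code>):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def nucleus_sample(probs, top_p=0.9):
    """Sample one token index from the smallest set of top tokens whose cumulative probability reaches top_p."""
    rng = np.random.default_rng()
    order = np.argsort(probs)[::-1]                        # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    nucleus_size = np.searchsorted(cumulative, top_p) + 1  # how many tokens form the nucleus
    nucleus = order[:nucleus_size]
    renormalized = probs[nucleus] / probs[nucleus].sum()   # renormalize inside the nucleus
    return rng.choice(nucleus, p=renormalized)

probs = np.array([0.5, 0.3, 0.1, 0.07, 0.03])  # e.g. "the", "a", "an", ...
print(nucleus_sample(probs, top_p=0.9))         # only the first three tokens can be returned&lt;/code>&lt;/pre>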
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> &lt;code>(0.0, 1.0]&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;code>top_p&lt;/code> = 1.0:&lt;/strong> Means the model considers all tokens without any truncation (equivalent to no &lt;code>top_p&lt;/code>).&lt;/li>
&lt;li>&lt;strong>High &lt;code>top_p&lt;/code> (e.g., &lt;code>0.9&lt;/code> - &lt;code>1.0&lt;/code>):&lt;/strong> Allows for more diverse choices, suitable for creative tasks, similar in effect to higher &lt;code>temperature&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Low &lt;code>top_p&lt;/code> (e.g., &lt;code>0.1&lt;/code> - &lt;code>0.3&lt;/code>):&lt;/strong> Greatly restricts the model's range of choices, making its output very deterministic and conservative, similar in effect to extremely low &lt;code>temperature&lt;/code>.&lt;/li>
&lt;li>&lt;strong>General Recommended Value:&lt;/strong> &lt;code>0.9&lt;/code> is a very common default value as it maintains high quality while allowing for some diversity.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>top_p&lt;/code> vs &lt;code>temperature&lt;/code>:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;code>top_p&lt;/code> is more dynamic and adaptive. When the model is very confident about the next step (sharp probability distribution), &lt;code>top_p&lt;/code> automatically narrows the candidate set, ensuring quality. When the model is less confident (flat distribution), it expands the candidate set, increasing diversity.&lt;/li>
&lt;li>&lt;code>temperature&lt;/code> adjusts the entire distribution &amp;ldquo;equally,&amp;rdquo; regardless of whether the distribution itself is sharp or flat.&lt;/li>
&lt;li>Therefore, &lt;code>top_p&lt;/code> is generally considered a safer and more robust method for controlling diversity than &lt;code>temperature&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="3-topk">3. &lt;code>top_k&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Simply and directly samples only from the &lt;code>k&lt;/code> tokens with the highest probabilities.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle:&lt;/strong> This is the simplest truncation sampling method. It directly selects the &lt;code>k&lt;/code> tokens with the highest probabilities to form the candidate set, then samples from these &lt;code>k&lt;/code> tokens. All other tokens are ignored.&lt;/p>
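&lt;p>A corresponding one-function sketch (again illustrative only), which simply zeroes out everything outside the &lt;code>k&lt;/code> most probable tokens:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def top_k_filter(probs, k=50):
    # Keep only the k most probable tokens and renormalize; all others get probability 0.
    keep = np.argsort(probs)[-k:]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()&lt;/code>&lt;/pre>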
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> Integers, such as &lt;code>1&lt;/code>, &lt;code>10&lt;/code>, &lt;code>50&lt;/code>.&lt;/li>
&lt;li>&lt;strong>&lt;code>top_k&lt;/code> = 1:&lt;/strong> Equivalent to greedy search, always choosing the most likely word.&lt;/li>
&lt;li>&lt;strong>Recommendation:&lt;/strong> &lt;code>top_k&lt;/code> is typically not the preferred sampling strategy because it's too &amp;ldquo;rigid.&amp;rdquo; In cases where the probability distribution is very flat, it might accidentally exclude many reasonable words; while in cases where the distribution is very sharp, it might include many extremely low-probability, useless words. &lt;code>top_p&lt;/code> is usually a better choice.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="4-repetitionpenalty">4. &lt;code>repetition_penalty&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Applies a penalty to tokens that have already appeared in the context, reducing their probability of being selected again, thereby reducing repetitive content.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle:&lt;/strong> After the &lt;code>logits&lt;/code> are computed but before &lt;code>Softmax&lt;/code>, this parameter scans the candidate tokens. Any token that has already appeared in the previous context gets its &lt;code>logit&lt;/code> adjusted so that it becomes less likely. In the common CTRL-style implementation (used, for example, by Hugging Face Transformers), positive logits are divided by &lt;code>repetition_penalty&lt;/code> and negative logits are multiplied by it, so the penalty always pushes the token's probability down regardless of the logit's sign.&lt;/p>
&lt;p>&lt;code>new_logit = logit / penalty&lt;/code> (if the token has appeared and &lt;code>logit &amp;gt; 0&lt;/code>)
&lt;code>new_logit = logit * penalty&lt;/code> (if the token has appeared and &lt;code>logit &amp;lt; 0&lt;/code>)
&lt;code>new_logit = logit&lt;/code> (if the token has not appeared)&lt;/p>
&lt;p>Either way, the final probability of words that have already appeared decreases.&lt;/p>
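&lt;p>A small sketch of this logit adjustment (assuming a NumPy &lt;code>logits&lt;/code> vector and a list of already-seen token ids; not tied to any particular library's exact code):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def apply_repetition_penalty(logits, seen_token_ids, penalty=1.2):
    # CTRL-style penalty: positive logits are divided, negative logits are multiplied,
    # so previously seen tokens always become less likely.
    logits = np.array(logits, dtype=np.float64)
    for token_id in set(seen_token_ids):
        if logits[token_id] > 0:
            logits[token_id] /= penalty
        else:
            logits[token_id] *= penalty
    return logits&lt;/code>&lt;/pre>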
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> &lt;code>1.0&lt;/code> to &lt;code>2.0&lt;/code> is common.&lt;/li>
&lt;li>&lt;strong>&lt;code>1.0&lt;/code>:&lt;/strong> No penalty applied (default value).&lt;/li>
&lt;li>&lt;strong>&lt;code>1.1&lt;/code> - &lt;code>1.3&lt;/code>:&lt;/strong> A relatively safe range that can effectively reduce unnecessary repetition without overly affecting normal language expression (such as necessary articles like &amp;ldquo;the&amp;rdquo;).&lt;/li>
&lt;li>&lt;strong>Too High Values:&lt;/strong> May cause the model to deliberately avoid common words, producing unnatural or even strange sentences.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="5-frequencypenalty--presencepenalty">5. &lt;code>frequency_penalty&lt;/code> &amp;amp; &lt;code>presence_penalty&lt;/code>&lt;/h3>
&lt;p>These two parameters are more refined versions of &lt;code>repetition_penalty&lt;/code>.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>&lt;code>presence_penalty&lt;/code>:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Function:&lt;/strong> Applies a fixed penalty to all tokens that have &lt;strong>appeared at least once&lt;/strong> in the context. It doesn't care how many times the token has appeared; as long as it has appeared, it gets penalized.&lt;/li>
&lt;li>&lt;strong>Underlying Principle:&lt;/strong> &lt;code>new_logit = logit - presence_penalty&lt;/code> (if token has appeared at least once).&lt;/li>
&lt;li>&lt;strong>Scenario:&lt;/strong> This parameter is useful when you want to encourage the model to introduce entirely new concepts and vocabulary, rather than repeatedly discussing topics that have already been mentioned.&lt;/li>
&lt;li>&lt;strong>Range:&lt;/strong> &lt;code>-2.0&lt;/code> to &lt;code>2.0&lt;/code> (OpenAI API). Positive values penalize tokens that have already appeared, encouraging the model to move on to new topics; negative values encourage repetition.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>frequency_penalty&lt;/code>:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Function:&lt;/strong> The penalty is proportional to the &lt;strong>frequency&lt;/strong> of the token in the context. The more times a word appears, the heavier the penalty it receives.&lt;/li>
&lt;li>&lt;strong>Underlying Principle:&lt;/strong> &lt;code>new_logit = logit - count(token) * frequency_penalty&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Scenario:&lt;/strong> This parameter is effective when you find the model tends to repeatedly use certain specific high-frequency words (even if they are necessary), leading to monotonous language.&lt;/li>
&lt;li>&lt;strong>Range:&lt;/strong> &lt;code>-2.0&lt;/code> to &lt;code>2.0&lt;/code> (OpenAI API). Positive values reduce verbatim repetition in proportion to frequency; negative values encourage it.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Summary:&lt;/strong> &lt;code>presence_penalty&lt;/code> addresses the question of &amp;ldquo;whether it has appeared,&amp;rdquo; while &lt;code>frequency_penalty&lt;/code> addresses &amp;ldquo;how many times it has appeared.&amp;rdquo;&lt;/p>
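&lt;p>A combined sketch of both penalties, following the additive formula described in OpenAI's API documentation (variable names are illustrative):&lt;/p>
&lt;pre>&lt;code class="language-python">from collections import Counter
import numpy as np

def apply_presence_frequency_penalties(logits, generated_token_ids,
                                       presence_penalty=0.0, frequency_penalty=0.0):
    # For every token that has appeared: subtract a one-off presence penalty
    # plus a frequency penalty scaled by how often it has appeared.
    counts = Counter(generated_token_ids)
    logits = np.array(logits, dtype=np.float64)
    for token_id, count in counts.items():
        logits[token_id] -= presence_penalty + frequency_penalty * count
    return logits&lt;/code>&lt;/pre>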
&lt;/li>
&lt;/ul>
&lt;h3 id="6-seed">6. &lt;code>seed&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> By providing a fixed &lt;code>seed&lt;/code>, you can make the model's output reproducible when other parameters (such as &lt;code>temperature&lt;/code>) remain the same.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Function:&lt;/strong> In machine learning, many operations that seem random are actually &amp;ldquo;pseudo-random,&amp;rdquo; determined by an initial &amp;ldquo;seed.&amp;rdquo; Setting the same seed will produce the same sequence of random numbers. In LLMs, this means the sampling process will be completely deterministic.&lt;/li>
&lt;li>&lt;strong>Scenarios:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Debugging and Testing:&lt;/strong> When you need to verify whether a change has affected the output, fixing the &lt;code>seed&lt;/code> can eliminate randomness interference.&lt;/li>
&lt;li>&lt;strong>Reproducible Research:&lt;/strong> Reproducibility is crucial in academic research.&lt;/li>
&lt;li>&lt;strong>Generating Consistent Content:&lt;/strong> When you need the model to consistently produce outputs in the same style for the same input.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Note:&lt;/strong> For complete reproduction, &lt;strong>all&lt;/strong> generation parameters (&lt;code>prompt&lt;/code>, &lt;code>model&lt;/code>, &lt;code>temperature&lt;/code>, &lt;code>top_p&lt;/code>, etc.) must be identical, as in the example below.&lt;/li>
&lt;/ul>
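&lt;p>A minimal example with the official &lt;code>openai&lt;/code> Python SDK (v1-style client; the model name is just a placeholder, and even with a fixed &lt;code>seed&lt;/code> the API only promises best-effort determinism):&lt;/p>
&lt;pre>&lt;code class="language-python">from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use any chat model that supports the seed parameter
    messages=[{"role": "user", "content": "Explain the KV Cache in one sentence."}],
    temperature=0.0,
    seed=12345,           # same seed + identical parameters should yield (near-)identical output
)
print(response.choices[0].message.content)
print(response.system_fingerprint)  # if this changes between calls, the backend itself changed&lt;/code>&lt;/pre>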
&lt;hr>
&lt;h3 id="part-2-deployment-serving-parameters--optimizing-service-performance-and-capacity">Part 2: Deployment (Serving) Parameters — Optimizing Service Performance and Capacity&lt;/h3>
&lt;p>Deployment parameters determine how an LLM inference service manages GPU resources, handles concurrent requests, and optimizes overall throughput and latency. These parameters are particularly important in high-performance inference engines like vLLM.&lt;/p>
&lt;h3 id="1-gpumemoryutilization">1. &lt;code>gpu_memory_utilization&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Controls the proportion of GPU memory that vLLM can use, with the core purpose of reserving space for the &lt;strong>KV Cache&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle (PagedAttention):&lt;/strong>
The core of vLLM is the PagedAttention mechanism. Traditional attention mechanisms pre-allocate a continuous, maximum-length memory space for each request to store the Key-Value (KV) Cache. This leads to severe memory waste, as most requests are far shorter than the maximum length.&lt;/p>
&lt;p>PagedAttention manages the KV Cache like virtual memory in an operating system:&lt;/p>
&lt;ol>
&lt;li>It breaks down each sequence's KV Cache into many small, fixed-size &amp;ldquo;blocks.&amp;rdquo;&lt;/li>
&lt;li>These blocks can be stored non-contiguously in GPU memory.&lt;/li>
&lt;li>A central &amp;ldquo;Block Manager&amp;rdquo; is responsible for allocating and releasing these blocks.&lt;/li>
&lt;/ol>
&lt;p>&lt;code>gpu_memory_utilization&lt;/code> tells vLLM: &amp;ldquo;You can use this much proportion of the total GPU memory for free management (mainly storing model weights and physical blocks of KV Cache).&amp;rdquo;&lt;/p>
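&lt;p>A back-of-the-envelope calculation (assuming a Llama-2-7B-like architecture in FP16; the numbers are illustrative) shows why this memory budget matters:&lt;/p>
&lt;pre>&lt;code class="language-python"># Per-token KV Cache size: 2 (K and V) x layers x KV heads x head_dim x bytes per value
num_layers, num_kv_heads, head_dim, bytes_per_value = 32, 32, 128, 2
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token / 1024**2, "MiB per token")  # 0.5 MiB

# If roughly 20 GiB of the reserved memory is left for KV Cache after loading the weights:
print(20 * 1024**3 // kv_bytes_per_token, "tokens of KV Cache, shared across all running sequences")&lt;/code>&lt;/pre>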
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Impact:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> &lt;code>(0.0, 1.0]&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Default Value:&lt;/strong> &lt;code>0.9&lt;/code> (i.e., 90%).&lt;/li>
&lt;li>&lt;strong>Higher Values (e.g., &lt;code>0.95&lt;/code>):&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> vLLM has more memory for KV Cache, supporting longer contexts and larger batch sizes, thereby increasing throughput.&lt;/li>
&lt;li>&lt;strong>Risk:&lt;/strong> If set too high, there might not be enough spare memory for CUDA kernels, drivers, or other system processes, easily leading to &lt;strong>OOM (Out of Memory)&lt;/strong> errors.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Lower Values (e.g., &lt;code>0.8&lt;/code>):&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> Safer, less prone to OOM, reserves more memory for the system and other applications.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> Reduced available space for KV Cache, potentially causing vLLM to struggle with high concurrency or long sequence requests, degrading performance. When KV Cache is insufficient, vLLM triggers &lt;strong>Preemption&lt;/strong>, swapping out some running sequences and waiting to swap them back in when there's enough space, severely affecting latency. vLLM's warning log &lt;code>&amp;quot;there is not enough KV cache space. This can affect the end-to-end performance.&amp;quot;&lt;/code> is reminding you of this issue.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Start with the default value of &lt;code>0.9&lt;/code>.&lt;/li>
&lt;li>If you encounter OOM, gradually lower this value.&lt;/li>
&lt;li>If you encounter many preemption warnings and confirm no other processes are occupying large amounts of GPU memory, you can gradually increase this value.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="2-maxnumseqs">2. &lt;code>max_num_seqs&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Limits the maximum number of sequences (requests) that the vLLM scheduler can process &lt;strong>in one iteration (or one batch)&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle:&lt;/strong>
vLLM's scheduler selects a batch of requests from the waiting queue in each processing cycle. This parameter directly limits the size of this &amp;ldquo;batch.&amp;rdquo; Together with &lt;code>max_num_batched_tokens&lt;/code> (which limits the total number of tokens across all sequences in a batch), it determines the scale of batch processing.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Impact:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> Positive integers, such as &lt;code>16&lt;/code>, &lt;code>64&lt;/code>, &lt;code>256&lt;/code>.&lt;/li>
&lt;li>&lt;strong>Higher Values:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> Allows for higher concurrency, potentially improving GPU utilization and overall throughput.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> Requires more intermediate memory (e.g., for storing &lt;code>logits&lt;/code> and sampling states) and may increase the latency of individual batches. If set too high, even if KV Cache still has space, OOM might occur due to insufficient temporary memory.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Lower Values:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> More memory-friendly, potentially lower latency for individual batches.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> Limits concurrency capability, potentially leading to underutilization of GPU and decreased throughput.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>This value needs to be adjusted based on your GPU memory size, model size, and expected concurrent load.&lt;/li>
&lt;li>For high-concurrency scenarios, try gradually increasing this value while monitoring GPU utilization and memory usage.&lt;/li>
&lt;li>For interactive, low-latency scenarios, consider setting this value lower.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="3-maxmodellen">3. &lt;code>max_model_len&lt;/code>&lt;/h3>
&lt;p>&lt;strong>In one sentence:&lt;/strong> Sets the &lt;strong>maximum context length&lt;/strong> the model can process (including both prompt and generated tokens).&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Underlying Principle:&lt;/strong>
This parameter directly determines how much logical space vLLM needs to reserve for the KV Cache. For example, if &lt;code>max_model_len&lt;/code> = &lt;code>4096&lt;/code>, vLLM must ensure its memory management mechanism can support storing KV pairs for up to &lt;code>4096&lt;/code> tokens per sequence.
This affects vLLM's memory planning at startup, for example how much activation memory to budget for the longest possible sequence and, for some models, how the position-embedding (RoPE) scaling is configured.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Value Range and Impact:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Range:&lt;/strong> Positive integers, cannot exceed the maximum length the model was originally trained on.&lt;/li>
&lt;li>&lt;strong>Higher Values:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> Can handle longer documents and more complex contexts.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> &lt;strong>Significantly increases&lt;/strong> the worst-case memory demand. Every token of context needs its own KV Cache entries, so doubling the maximum length roughly doubles the per-sequence worst case. Even if current requests are short, vLLM must plan for requests up to this length (at startup it verifies that the KV Cache can hold at least one maximum-length sequence).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Lower Values:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> &lt;strong>Significantly saves&lt;/strong> GPU memory. If you know your application scenario will never exceed 1024 tokens, setting this value to 1024 instead of the default 4096 or 8192 will free up a large amount of KV Cache space, supporting higher concurrency.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> Any requests exceeding this length will be rejected or truncated.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Recommendations:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Set as needed!&lt;/strong> This is one of the most effective parameters for optimizing vLLM memory usage. Based on your actual application scenario, set this value to a reasonable maximum with some margin. A combined configuration sketch follows below.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
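&lt;p>Putting the three memory-related parameters together, a minimal vLLM offline-inference sketch might look like this (the model name and all values are placeholders to be tuned for your own GPU and workload):&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model
    gpu_memory_utilization=0.9,             # fraction of GPU memory vLLM may manage
    max_model_len=4096,                     # hard cap on prompt + generated tokens per sequence
    max_num_seqs=64,                        # upper bound on sequences per scheduler iteration
)

outputs = llm.generate(
    ["Summarize PagedAttention in two sentences."],
    SamplingParams(temperature=0.3, top_p=0.9, max_tokens=128),
)
print(outputs[0].outputs[0].text)&lt;/code>&lt;/pre>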
&lt;h3 id="4-tensorparallelsize--pipelineparallelsize">4. &lt;code>tensor_parallel_size&lt;/code> &amp;amp; &lt;code>pipeline_parallel_size&lt;/code>&lt;/h3>
&lt;p>These two parameters are used for deploying extremely large models across multiple GPUs or nodes.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>&lt;code>tensor_parallel_size&lt;/code>:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Function:&lt;/strong> Divides &lt;strong>each layer&lt;/strong> of the model (such as a large weight matrix) into &lt;code>N&lt;/code> parts (&lt;code>N&lt;/code> = &lt;code>tensor_parallel_size&lt;/code>), placing them on &lt;code>N&lt;/code> different GPUs. During computation, each GPU only processes its own portion of the data, then exchanges necessary results through high-speed interconnects (like NVLink) via All-Reduce operations, finally merging to get the complete output.&lt;/li>
&lt;li>&lt;strong>Scenario:&lt;/strong> Used when a single model's weights exceed the memory of a single GPU. For example, a 70B model in FP16 needs roughly 140 GB just for its weights, so it cannot fit on one 80GB A100; setting &lt;code>tensor_parallel_size=2&lt;/code> spreads it across two such GPUs (or &lt;code>tensor_parallel_size=4&lt;/code> across four 40GB A100s), leaving the remaining memory for the KV Cache.&lt;/li>
&lt;li>&lt;strong>Impact:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> Achieves model parallelism, solving the problem of models not fitting on a single card.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> Introduces significant cross-GPU communication overhead, potentially affecting latency. Requires high-speed interconnects between GPUs.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>&lt;code>pipeline_parallel_size&lt;/code>:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Function:&lt;/strong> Assigns &lt;strong>different layers&lt;/strong> of the model to different GPUs or nodes. For example, placing layers 1-10 on GPU 1, layers 11-20 on GPU 2, and so on. Data flows through these GPUs like a pipeline.&lt;/li>
&lt;li>&lt;strong>Scenario:&lt;/strong> Used when the model is extremely large and needs to be deployed across multiple nodes (machines).&lt;/li>
&lt;li>&lt;strong>Impact:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Advantage:&lt;/strong> Can scale the model to any number of GPUs/nodes.&lt;/li>
&lt;li>&lt;strong>Disadvantage:&lt;/strong> Introduces &amp;ldquo;pipeline bubbles&amp;rdquo;: some GPUs sit idle while the pipeline fills and drains, which reduces overall utilization.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Combined Use:&lt;/strong>
vLLM supports using both parallelism strategies simultaneously for efficient deployment of giant models on large clusters; a configuration sketch follows below.&lt;/p>
&lt;/li>
&lt;/ul>
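&lt;p>A sketch of a multi-GPU configuration (values are illustrative; whether pipeline parallelism is available for your setup depends on the vLLM version and launcher):&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # placeholder 70B model
    tensor_parallel_size=4,                  # shard every layer across 4 GPUs within a node
    pipeline_parallel_size=2,                # split the layer stack into 2 stages (e.g. 2 nodes)
)&lt;/code>&lt;/pre>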
&lt;hr>
&lt;h3 id="summary-and-best-practices">Summary and Best Practices&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Scenario&lt;/th>
&lt;th align="left">&lt;code>temperature&lt;/code>&lt;/th>
&lt;th align="left">&lt;code>top_p&lt;/code>&lt;/th>
&lt;th align="left">&lt;code>repetition_penalty&lt;/code>&lt;/th>
&lt;th align="left">&lt;code>gpu_memory_utilization&lt;/code>&lt;/th>
&lt;th align="left">&lt;code>max_num_seqs&lt;/code>&lt;/th>
&lt;th align="left">&lt;code>max_model_len&lt;/code>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">&lt;strong>Code Generation/Factual Q&amp;amp;A&lt;/strong>&lt;/td>
&lt;td align="left">&lt;code>0.0&lt;/code> - &lt;code>0.2&lt;/code>&lt;/td>
&lt;td align="left">(Not recommended to modify)&lt;/td>
&lt;td align="left">&lt;code>1.0&lt;/code>&lt;/td>
&lt;td align="left">&lt;code>0.9&lt;/code> (Default)&lt;/td>
&lt;td align="left">Adjust based on concurrency&lt;/td>
&lt;td align="left">Set as needed&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Article Summarization/Translation&lt;/strong>&lt;/td>
&lt;td align="left">&lt;code>0.2&lt;/code> - &lt;code>0.5&lt;/code>&lt;/td>
&lt;td align="left">(Not recommended to modify)&lt;/td>
&lt;td align="left">&lt;code>1.1&lt;/code>&lt;/td>
&lt;td align="left">&lt;code>0.9&lt;/code>&lt;/td>
&lt;td align="left">Adjust based on concurrency&lt;/td>
&lt;td align="left">Set to maximum possible document length&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>General Chat/Copywriting&lt;/strong>&lt;/td>
&lt;td align="left">&lt;code>0.7&lt;/code> (Default)&lt;/td>
&lt;td align="left">&lt;code>0.9&lt;/code> (Recommended)&lt;/td>
&lt;td align="left">&lt;code>1.1&lt;/code> - &lt;code>1.2&lt;/code>&lt;/td>
&lt;td align="left">&lt;code>0.9&lt;/code>&lt;/td>
&lt;td align="left">Adjust based on concurrency&lt;/td>
&lt;td align="left">Set as needed, e.g., &lt;code>4096&lt;/code>|&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Creative Writing/Brainstorming&lt;/strong>&lt;/td>
&lt;td align="left">&lt;code>0.8&lt;/code> - &lt;code>1.2&lt;/code>&lt;/td>
&lt;td align="left">&lt;code>0.95&lt;/code>&lt;/td>
&lt;td align="left">&lt;code>1.0&lt;/code>&lt;/td>
&lt;td align="left">&lt;code>0.9&lt;/code>&lt;/td>
&lt;td align="left">Adjust based on concurrency&lt;/td>
&lt;td align="left">Set as needed&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>High Concurrency Throughput Optimization&lt;/strong>&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">Try &lt;code>0.9&lt;/code> - &lt;code>0.95&lt;/code>&lt;/td>
&lt;td align="left">Gradually increase&lt;/td>
&lt;td align="left">Set to the &lt;strong>minimum&lt;/strong> value that meets business needs&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Low Latency Interaction Optimization&lt;/strong>&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">&lt;code>0.9&lt;/code> (Default)&lt;/td>
&lt;td align="left">Set to lower values (e.g., &lt;code>16-64&lt;/code>)&lt;/td>
&lt;td align="left">Set as needed&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Extremely Memory Constrained&lt;/strong>&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">(Task dependent)&lt;/td>
&lt;td align="left">Lower to &lt;code>0.8&lt;/code>&lt;/td>
&lt;td align="left">Set to lower values&lt;/td>
&lt;td align="left">Set to the &lt;strong>minimum&lt;/strong> value that meets business needs&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>Final Recommendations:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Start with Generation Parameters:&lt;/strong> First adjust &lt;code>temperature&lt;/code> or &lt;code>top_p&lt;/code> to achieve satisfactory output quality.&lt;/li>
&lt;li>&lt;strong>Set Deployment Parameters as Needed:&lt;/strong> When deploying, first set &lt;code>max_model_len&lt;/code> to a reasonable minimum value based on your application scenario.&lt;/li>
&lt;li>&lt;strong>Monitor and Iterate:&lt;/strong> Start with the default &lt;code>gpu_memory_utilization=0.9&lt;/code> and a moderate &lt;code>max_num_seqs&lt;/code>. Observe memory usage and preemption situations through monitoring tools (such as &lt;code>nvidia-smi&lt;/code> and vLLM logs), then gradually adjust these values to find the optimal balance for your specific hardware and workload.&lt;/li>
&lt;/ol></description></item></channel></rss>