<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Parameter-Efficient Fine-Tuning | Ziyang Lin</title><link>https://ziyanglin.netlify.app/en/tags/parameter-efficient-fine-tuning/</link><atom:link href="https://ziyanglin.netlify.app/en/tags/parameter-efficient-fine-tuning/index.xml" rel="self" type="application/rss+xml"/><description>Parameter-Efficient Fine-Tuning</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Thu, 26 Jun 2025 00:00:00 +0000</lastBuildDate><image><url>https://ziyanglin.netlify.app/img/icon-192.png</url><title>Parameter-Efficient Fine-Tuning</title><link>https://ziyanglin.netlify.app/en/tags/parameter-efficient-fine-tuning/</link></image><item><title>LoRA Technical Guide: Parameter-Efficient Fine-Tuning for Large Models</title><link>https://ziyanglin.netlify.app/en/post/lora-documentation/</link><pubDate>Thu, 26 Jun 2025 00:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/lora-documentation/</guid><description>&lt;h2 id="1-introduction-why-lora">1. Introduction: Why LoRA?&lt;/h2>
&lt;p>In today's rapidly evolving landscape of Large Language Models (LLMs) and generative AI, we've witnessed an explosive growth in model sizes, ranging from hundreds of millions to trillions of parameters. These massive models demonstrate remarkable capabilities across various tasks. However, a significant challenge emerges: how can we fine-tune these models for specific downstream tasks?&lt;/p>
&lt;p>The traditional &lt;strong>Full Fine-Tuning&lt;/strong> approach, which updates all parameters of a model, faces severe challenges:&lt;/p>
&lt;ul>
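&lt;li>&lt;strong>High computational cost&lt;/strong>: Fine-tuning a model with billions of parameters requires enormous computational resources. With Adam in mixed precision, weights, gradients, and optimizer states together take roughly 16 bytes per parameter, so even a 7B-parameter model needs on the order of 100 GB of GPU memory before activations are counted. This is prohibitively expensive for most developers and small to medium-sized enterprises.&lt;/li>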
&lt;li>&lt;strong>Massive storage requirements&lt;/strong>: Each fine-tuned model for a specific task requires storing a complete model copy, leading to rapidly escalating storage costs.&lt;/li>
&lt;li>&lt;strong>Deployment difficulties&lt;/strong>: Maintaining and switching between multiple massive model copies for different tasks in a production environment is operationally costly, because each copy must be stored, loaded, and kept in GPU memory separately.&lt;/li>
&lt;/ul>
&lt;p>To address these pain points, &lt;strong>Parameter-Efficient Fine-Tuning (PEFT)&lt;/strong> techniques have emerged. The core idea is to freeze most or all of the pre-trained model's parameters during fine-tuning and train only a small set of newly added or selected parameters, often well under 1% of the total.&lt;/p>
&lt;p>Among the various PEFT techniques, &lt;strong>LoRA (Low-Rank Adaptation of Large Language Models)&lt;/strong> stands out for its excellent performance, efficiency, and implementation simplicity, becoming one of the most mainstream and widely applied solutions today. This document will provide an in-depth yet accessible introduction to the core principles of LoRA and offer detailed practical guidance.&lt;/p>
&lt;h2 id="2-core-principles-the-magic-of-lora">2. Core Principles: The Magic of LoRA&lt;/h2>
&lt;p>LoRA's core assumption is that &lt;strong>the weight changes in large language models when adapting to new tasks are low-rank&lt;/strong>. In other words, although the weight matrix &lt;code>W&lt;/code> of the pre-trained model is very large (e.g., &lt;code>d x d&lt;/code> dimensions), the weight change &lt;code>ΔW&lt;/code> during fine-tuning has a very low &amp;ldquo;intrinsic rank.&amp;rdquo;&lt;/p>
&lt;p>Based on this assumption, LoRA doesn't update &lt;code>W&lt;/code> directly. Instead, it represents &lt;code>ΔW&lt;/code> as the product of two much smaller matrices &lt;code>B&lt;/code> and &lt;code>A&lt;/code>, so that &lt;code>ΔW = BA&lt;/code>, and trains only these two matrices.&lt;/p>
&lt;ul>
&lt;li>&lt;code>W&lt;/code> is the pre-trained, frozen weight matrix.&lt;/li>
&lt;li>&lt;code>A&lt;/code> is an &lt;code>r x d&lt;/code> dimensional matrix, where &lt;code>r&lt;/code> is a rank much smaller than &lt;code>d&lt;/code>.&lt;/li>
&lt;li>&lt;code>B&lt;/code> is a &lt;code>d x r&lt;/code> dimensional matrix.&lt;/li>
&lt;/ul>
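&lt;p>During fine-tuning, only the parameters of matrices &lt;code>A&lt;/code> and &lt;code>B&lt;/code> are trainable. &lt;code>B&lt;/code> is initialized to zero, so &lt;code>BA = 0&lt;/code> at the start and training begins from the pre-trained model's behavior. The forward pass becomes:&lt;/p>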
&lt;p>&lt;code>h = Wx + BAx&lt;/code>&lt;/p>
&lt;p>Here's a diagram that illustrates this process more intuitively:&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[Input x] --&amp;gt; B(Pre-trained weights W);
A --&amp;gt; C(Low-rank matrix A);
C --&amp;gt; D(Low-rank matrix B);
B --&amp;gt; E[Wx];
D --&amp;gt; F[BAx];
E --&amp;gt; G((Sum));
F --&amp;gt; G;
G --&amp;gt; H[Final output h];
style B fill:#eee,stroke:#333,stroke-width:2px,stroke-dasharray: 5, 5
style C fill:#9cf,stroke:#333,stroke-width:2px
style D fill:#9cf,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;p>Where &lt;code>x&lt;/code> is the input and &lt;code>h&lt;/code> is the output. This approach greatly reduces the number of parameters that need to be trained. For example, if &lt;code>d = 4096&lt;/code> and &lt;code>r = 8&lt;/code>, the original matrix &lt;code>W&lt;/code> has &lt;code>4096 * 4096 ≈ 16.7M&lt;/code> parameters, while &lt;code>A&lt;/code> and &lt;code>B&lt;/code> together have only &lt;code>8 * 4096 + 4096 * 8 = 65,536&lt;/code> parameters, a 256-fold reduction in trainable parameters for that matrix.&lt;/p>
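&lt;p>The forward pass &lt;code>h = Wx + BAx&lt;/code> can be sketched in plain Python. This is a toy illustration with made-up 3 x 3 values and &lt;code>r = 1&lt;/code>, not how any library implements it; the helper names &lt;code>matvec&lt;/code> and &lt;code>lora_forward&lt;/code> are hypothetical, and &lt;code>B&lt;/code> is assumed to start at zero as in the original LoRA formulation.&lt;/p>
&lt;pre>&lt;code class="language-python">def matvec(M, v):
    # Multiply a matrix (list of rows) by a vector
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x):
    # h = Wx + B(Ax); Ax has only r entries, so the d x d product BA is never built
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]

# Toy sizes: d = 3, r = 1 (all values are arbitrary illustrations)
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]  # frozen, d x d
A = [[0.5, -1.0, 2.0]]                                   # trainable, r x d
B_zero = [[0.0], [0.0], [0.0]]                           # trainable, d x r, zero at start
B_trained = [[1.0], [0.0], [-2.0]]                       # pretend values after training
x = [1.0, 2.0, 3.0]

h_start = lora_forward(W, A, B_zero, x)      # equals Wx because BA = 0
h_trained = lora_forward(W, A, B_trained, x)
print(h_start)
print(h_trained)
&lt;/code>&lt;/pre>
&lt;p>Computing &lt;code>B(Ax)&lt;/code> rather than &lt;code>(BA)x&lt;/code> keeps the extra work proportional to &lt;code>r&lt;/code> instead of &lt;code>d&lt;/code>.&lt;/p>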
&lt;p>&lt;strong>Key parameter &lt;code>r&lt;/code>&lt;/strong>: The rank &lt;code>r&lt;/code> is the most important hyperparameter in LoRA. It controls the size of the low-rank matrices and directly determines the number of new parameters.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Smaller &lt;code>r&lt;/code>&lt;/strong>: Fewer trainable parameters, faster training speed, lower memory usage, but may not fully capture complex features of the task.&lt;/li>
&lt;li>&lt;strong>Larger &lt;code>r&lt;/code>&lt;/strong>: More trainable parameters, stronger model fitting capability, but increases computational cost and risk of overfitting.&lt;/li>
&lt;/ul>
&lt;p>In practice, &lt;code>r&lt;/code> is typically set to 8, 16, 32, or 64, which achieves a good balance between performance and efficiency.&lt;/p>
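&lt;p>To make the effect of &lt;code>r&lt;/code> concrete, the sketch below counts the trainable parameters for a single &lt;code>4096 x 4096&lt;/code> weight matrix at the commonly used ranks. The function name &lt;code>lora_params&lt;/code> is hypothetical and the count covers one matrix only, not a whole model.&lt;/p>
&lt;pre>&lt;code class="language-python">def lora_params(d_in, d_out, r):
    # A is r x d_in and B is d_out x r
    return r * d_in + d_out * r

d = 4096
full = d * d  # parameters in the frozen matrix W
for r in (8, 16, 32, 64):
    n = lora_params(d, d, r)
    # Print the rank, the trainable parameter count, and the reduction factor
    print(r, n, full // n)
&lt;/code>&lt;/pre>
&lt;p>The count grows linearly with &lt;code>r&lt;/code>, so doubling the rank doubles the adapter size and halves the reduction factor.&lt;/p>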
&lt;h2 id="3-significant-advantages-of-lora">3. Significant Advantages of LoRA&lt;/h2>
&lt;p>Compared to full fine-tuning, LoRA offers several practical advantages:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Extreme parameter efficiency&lt;/strong>: As mentioned above, LoRA only requires training a tiny fraction of parameters. We can see this directly with the &lt;code>print_trainable_parameters()&lt;/code> method of a PEFT model, which reports the trainable share of parameters; it is typically less than 1%.&lt;/li>
&lt;li>&lt;strong>Faster training speed&lt;/strong>: Gradients and optimizer states are only needed for the adapter parameters, so each step does less work and uses less memory, which shortens the iteration cycle. The forward and backward passes still run through the frozen base model, so the speedup is usually smaller than the reduction in parameter count.&lt;/li>
&lt;li>&lt;strong>Lower hardware requirements&lt;/strong>: LoRA significantly reduces GPU memory (VRAM) usage during training, making it possible to fine-tune models with several billion parameters on consumer-grade GPUs (such as RTX 3090/4090), and larger ones when combined with weight quantization (as in QLoRA).&lt;/li>
&lt;li>&lt;strong>Flexibility in deployment and management&lt;/strong>: This is one of LoRA's most attractive advantages. The pre-trained model remains unchanged and can be shared across all tasks. For each downstream task, we only need to save a lightweight (typically just a few MB to tens of MB) LoRA adapter (i.e., the weights of matrices A and B). During deployment, the appropriate adapter can be loaded dynamically according to needs, greatly simplifying model management and switching in multi-task scenarios.&lt;/li>
&lt;/ol>
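&lt;p>The adapter sizes quoted above follow from simple arithmetic. The sketch below estimates the size of a saved adapter under stated assumptions: a hypothetical 32-layer model with hidden size 4096, LoRA applied to two square projections per layer, &lt;code>r = 8&lt;/code>, and weights stored in 16-bit precision. The function name &lt;code>adapter_size_mib&lt;/code> is made up for this example.&lt;/p>
&lt;pre>&lt;code class="language-python">def adapter_size_mib(num_layers, d_model, modules_per_layer, r, bytes_per_param=2):
    # Each adapted d_model x d_model projection adds A (r x d_model) and B (d_model x r)
    params = num_layers * modules_per_layer * 2 * d_model * r
    return params, params * bytes_per_param / (1024 ** 2)

# Hypothetical model shape; real models differ in layer count and projection sizes
params, mib = adapter_size_mib(num_layers=32, d_model=4096, modules_per_layer=2, r=8)
print(params, mib)
&lt;/code>&lt;/pre>
&lt;p>Under these assumptions the adapter holds about 4.2 million parameters and takes 8 MiB on disk, compared with many gigabytes for a full copy of the base model.&lt;/p>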
&lt;h2 id="4-handson-practice-lora-training-methods">4. Hands-on Practice: LoRA Training Methods&lt;/h2>
&lt;p>Below, we'll demonstrate a complete example of how to fine-tune a large model using LoRA with the &lt;code>transformers&lt;/code>, &lt;code>peft&lt;/code>, and &lt;code>trl&lt;/code> libraries from the Hugging Face ecosystem.&lt;/p>
&lt;h3 id="step-1-environment-preparation">Step 1: Environment Preparation&lt;/h3>
&lt;p>First, ensure you have installed the necessary Python libraries:&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install transformers peft trl datasets torch
&lt;/code>&lt;/pre>
&lt;h3 id="step-2-load-model-tokenizer-and-dataset">Step 2: Load Model, Tokenizer, and Dataset&lt;/h3>
&lt;p>We select a pre-trained model as the foundation and load the corresponding tokenizer. At the same time, we load a dataset from the Hugging Face Hub for fine-tuning.&lt;/p>
&lt;pre>&lt;code class="language-python">from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from datasets import load_dataset
# Model ID, can be any supported Causal LM
model_id = &amp;quot;facebook/opt-350m&amp;quot;
# Load pre-trained model
model = AutoModelForCausalLM.from_pretrained(model_id)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load dataset (using English quotes dataset as an example)
dataset = load_dataset(&amp;quot;Abirate/english_quotes&amp;quot;, split=&amp;quot;train&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h3 id="step-3-configure-lora-loraconfig">Step 3: Configure LoRA (&lt;code>LoraConfig&lt;/code>)&lt;/h3>
&lt;p>This is the core step of LoRA fine-tuning. We need to create a &lt;code>LoraConfig&lt;/code> object to define the behavior of the LoRA adapter.&lt;/p>
&lt;pre>&lt;code class="language-python">from peft import LoraConfig
lora_config = LoraConfig(
    r=16,  # Rank of the low-rank matrices; common values are 8, 16, 32
    lora_alpha=32,  # Scaling factor: the update BA is multiplied by lora_alpha / r; often set to 2 * r
    target_modules=[&amp;quot;q_proj&amp;quot;, &amp;quot;v_proj&amp;quot;],  # Names of the layers to adapt; these match OPT's attention projections
    lora_dropout=0.05,  # Dropout probability for LoRA layers
    bias=&amp;quot;none&amp;quot;,  # Whether to train bias terms; &amp;quot;none&amp;quot; keeps them frozen
    task_type=&amp;quot;CAUSAL_LM&amp;quot;,  # Task type; here, causal language modeling
)
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;code>target_modules&lt;/code>: This parameter is crucial. It tells the PEFT library which modules (typically &lt;code>nn.Linear&lt;/code> layers) in the model should have LoRA applied. A common choice is the query and value projection layers of the attention mechanism, which are named &lt;code>q_proj&lt;/code> and &lt;code>v_proj&lt;/code> in many architectures. The names differ between model families (GPT-2, for example, uses a fused &lt;code>c_attn&lt;/code> layer), so print the &lt;code>model&lt;/code> object or iterate over &lt;code>model.named_modules()&lt;/code> to see which module names can be targeted.&lt;/li>
&lt;/ul>
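&lt;p>How a list of &lt;code>target_modules&lt;/code> selects layers can be sketched as a name-suffix match. This is a simplified approximation rather than PEFT's actual code, which handles further cases such as regular-expression strings. The function &lt;code>matches_target&lt;/code> and the module names below are illustrative assumptions.&lt;/p>
&lt;pre>&lt;code class="language-python">def matches_target(module_name, target_modules):
    # A module is selected if its full name equals a target or ends with '.' + target
    return any(
        module_name == t or module_name.endswith('.' + t)
        for t in target_modules
    )

# Hypothetical module names in the style printed for a decoder-only model
names = [
    'model.decoder.layers.0.self_attn.q_proj',
    'model.decoder.layers.0.self_attn.k_proj',
    'model.decoder.layers.0.self_attn.v_proj',
    'model.decoder.layers.0.fc1',
]
selected = [n for n in names if matches_target(n, ['q_proj', 'v_proj'])]
print(selected)
&lt;/code>&lt;/pre>
&lt;p>Because the match is on the final name component, one short entry such as &lt;code>q_proj&lt;/code> selects the corresponding layer in every Transformer block.&lt;/p>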
&lt;h3 id="step-4-apply-lora-and-train-with-sfttrainer">Step 4: Apply LoRA and Train with &lt;code>SFTTrainer&lt;/code>&lt;/h3>
&lt;p>The &lt;code>SFTTrainer&lt;/code> (Supervised Fine-tuning Trainer) provided by the &lt;code>trl&lt;/code> library greatly simplifies the fine-tuning process. It has built-in support for &lt;code>peft&lt;/code>, so we just need to pass the model, tokenizer, dataset, and &lt;code>peft_config&lt;/code> to it.&lt;/p>
&lt;pre>&lt;code class="language-python">from trl import SFTTrainer
# Define training parameters
training_args = TrainingArguments(
    output_dir=&amp;quot;./lora_finetuned_model&amp;quot;,  # Where checkpoints and the final adapter are written
    num_train_epochs=3,  # Number of training epochs
    per_device_train_batch_size=4,  # Training batch size per device
    logging_dir=&amp;quot;./logs&amp;quot;,  # Logging directory
    logging_steps=50,  # Log every 50 steps
    learning_rate=2e-4,  # Learning rate
)
# Initialize SFTTrainer.
# Note: this call follows the older SFTTrainer signature. Recent trl releases
# rename `tokenizer` to `processing_class` and expect `dataset_text_field`
# on an SFTConfig object, so adjust these two arguments to your installed version.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset,
    peft_config=lora_config,  # Pass in LoRA configuration
    dataset_text_field=&amp;quot;quote&amp;quot;,  # Field name containing text in the dataset
)
# Start training
trainer.train()
# Save the trained LoRA adapter to output_dir
trainer.save_model()
&lt;/code>&lt;/pre>
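&lt;p>As a rough guide to what these training settings imply, the sketch below computes the number of optimizer steps and log events. It assumes a single device and no gradient accumulation; the dataset size of 2500 examples is a made-up figure for illustration, and &lt;code>training_schedule&lt;/code> is a hypothetical helper.&lt;/p>
&lt;pre>&lt;code class="language-python">import math

def training_schedule(num_examples, batch_size, epochs, logging_steps):
    # One optimizer step per batch (single device, no gradient accumulation assumed)
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return total_steps, total_steps // logging_steps

# 2500 examples is an assumed dataset size, not a property of the dataset used above
total_steps, log_events = training_schedule(2500, 4, 3, 50)
print(total_steps, log_events)
&lt;/code>&lt;/pre>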
&lt;p>After training is complete, the &lt;code>output_dir&lt;/code> directory will contain an adapter weights file (&lt;code>adapter_model.safetensors&lt;/code> in recent &lt;code>peft&lt;/code> versions, &lt;code>adapter_model.bin&lt;/code> in older ones) and an &lt;code>adapter_config.json&lt;/code> file. Together these make up the lightweight LoRA adapter we've trained.&lt;/p>
&lt;h3 id="step-5-inference-with-the-trained-lora-adapter">Step 5: Inference with the Trained LoRA Adapter&lt;/h3>
&lt;p>For inference, we first load the original pre-trained model, then load the trained LoRA adapter weights.&lt;/p>
&lt;pre>&lt;code class="language-python">from peft import PeftModel
# Load the original, non-fine-tuned model
base_model = AutoModelForCausalLM.from_pretrained(model_id)
# Load the LoRA adapter
model_with_lora = PeftModel.from_pretrained(base_model, &amp;quot;./lora_finetuned_model&amp;quot;)
# model_with_lora now applies the adapter on top of the frozen base weights (not yet merged) and is ready for inference
prompt = &amp;quot;The best way to predict the future is to&amp;quot;
inputs = tokenizer(prompt, return_tensors=&amp;quot;pt&amp;quot;)
# Generate text
outputs = model_with_lora.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
&lt;/code>&lt;/pre>
&lt;h2 id="5-lora-model-deployment-from-static-to-dynamic">5. LoRA Model Deployment: From Static to Dynamic&lt;/h2>
&lt;p>After training, efficiently deploying LoRA models into production environments is the crucial next step. LoRA deployment strategies mainly fall into two categories: &lt;strong>Weight Merging (Static Deployment)&lt;/strong> and &lt;strong>Dynamic Adapter Loading (Dynamic Deployment)&lt;/strong>. The following flowcharts illustrate these two paths:&lt;/p>
&lt;p>&lt;strong>Option 1: Weight Merging (Static Deployment)&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[LoRA Training Complete] --&amp;gt; B[Base Model + LoRA Adapter];
B --&amp;gt; C[&amp;quot;Call merge_and_unload()&amp;quot;];
C --&amp;gt; D[Generate standalone full model];
D --&amp;gt; E[Standard deployment];
style D fill:#c9f,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;p>&lt;strong>Option 2: Dynamic Adapter Loading (Dynamic Deployment)&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-mermaid">graph TD
A[LoRA Training Complete] --&amp;gt; B[vLLM / TGI server];
B --&amp;gt; C[Load Base Model];
C --&amp;gt; D[Load multiple LoRA Adapters];
D --&amp;gt; E[On-demand inference combinations];
style E fill:#9cf,stroke:#333,stroke-width:2px
&lt;/code>&lt;/pre>
&lt;h3 id="option-1-weight-merging-and-standard-deployment-static">Option 1: Weight Merging and Standard Deployment (Static)&lt;/h3>
&lt;p>This is the simplest and most direct deployment approach. The core idea is to merge the lightweight LoRA adapter weights into the original base model weights, generating a new, standalone full model.&lt;/p>
&lt;p>&lt;strong>Method&lt;/strong>:
Load the base model and the adapter, then call the &lt;code>merge_and_unload()&lt;/code> method from the &lt;code>peft&lt;/code> library.&lt;/p>
&lt;pre>&lt;code class="language-python">from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
# Same base model and adapter directory as in the training steps above
model_id = &amp;quot;facebook/opt-350m&amp;quot;
lora_path = &amp;quot;./lora_finetuned_model&amp;quot;
base_model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model_with_lora = PeftModel.from_pretrained(base_model, lora_path)
# Merge the adapter weights into the base weights and drop the LoRA wrappers
merged_model = model_with_lora.merge_and_unload()
# Now merged_model is a standard Transformers model
# You can save it like any other model, together with its tokenizer
merged_model.save_pretrained(&amp;quot;./merged_lora_model&amp;quot;)
tokenizer.save_pretrained(&amp;quot;./merged_lora_model&amp;quot;)
&lt;/code>&lt;/pre>
&lt;p>Afterward, you can load and use this &lt;code>merged_lora_model&lt;/code> just like any regular Hugging Face model.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Advantages&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>No added inference latency&lt;/strong>: After merging, the inference process is identical to a standard model, with no additional computational overhead from the adapter.&lt;/li>
&lt;li>&lt;strong>Simple deployment&lt;/strong>: No need for any additional inference framework support, can be used directly with standard libraries like &lt;code>transformers&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Loss of flexibility&lt;/strong>: For each LoRA adapter, you need to save and load a complete model copy, defeating the lightweight purpose of LoRA.&lt;/li>
&lt;li>&lt;strong>High storage cost&lt;/strong>: If you have multiple adapters, the storage overhead is enormous.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
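&lt;p>What merging does to a single weight matrix can be sketched in plain Python. The example assumes the standard LoRA scaling of &lt;code>lora_alpha / r&lt;/code> and uses made-up 2 x 2 values; &lt;code>matmul&lt;/code>, &lt;code>merge&lt;/code>, and &lt;code>matvec&lt;/code> are hypothetical helpers, not &lt;code>peft&lt;/code> functions.&lt;/p>
&lt;pre>&lt;code class="language-python">def matmul(X, Y):
    # Plain-Python matrix product of X (m x k) and Y (k x n)
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def merge(W, A, B, scaling):
    # W_merged = W + scaling * (B @ A)
    BA = matmul(B, A)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

def matvec(M, v):
    # Multiply a matrix (list of rows) by a vector
    return [sum(m * x for m, x in zip(row, v)) for row in M]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight
A = [[1.0, -1.0]]              # r x d with r = 1
B = [[0.5], [2.0]]             # d x r
scaling = 2.0                  # lora_alpha / r, for example 32 / 16
x = [1.0, 3.0]

W_merged = merge(W, A, B, scaling)
h_merged = matvec(W_merged, x)
# Unmerged path: base output plus the scaled adapter output
h_adapter = [b + scaling * u for b, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]
print(W_merged, h_merged, h_adapter)
&lt;/code>&lt;/pre>
&lt;p>Both paths give the same output, which is why a merged model behaves like the adapted one while running as a single ordinary matrix multiplication.&lt;/p>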
&lt;h3 id="option-2-highperformance-dynamic-deployment-with-vllm-recommended">Option 2: High-Performance Dynamic Deployment with vLLM (Recommended)&lt;/h3>
&lt;p>For scenarios that need to serve multiple LoRA adapters at the same time, &lt;strong>vLLM&lt;/strong> is a widely used high-throughput inference and serving engine. Its &lt;strong>PagedAttention&lt;/strong> mechanism manages KV-cache memory efficiently, and its multi-LoRA support keeps a single copy of the base model on the GPU while loading adapters on demand and selecting one per request, so many adapters can be served with little loss in throughput.&lt;/p>
&lt;p>&lt;strong>Method&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Install vLLM&lt;/strong>:&lt;/p>
&lt;pre>&lt;code class="language-bash">pip install vllm
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Start vLLM server&lt;/strong>:
Use the &lt;code>vllm serve&lt;/code> command to start an OpenAI-compatible API server. The key is to enable LoRA support with &lt;code>--enable-lora&lt;/code> and optionally preload adapters with &lt;code>--lora-modules&lt;/code>.&lt;/p>
&lt;pre>&lt;code class="language-bash"># lora_path points to your trained adapter directory
vllm serve meta-llama/Llama-2-7b-hf \
--enable-lora \
--lora-modules my_sql_lora=/path/to/your/sql_lora_adapter
&lt;/code>&lt;/pre>
&lt;p>Here, we've preloaded an adapter named &lt;code>my_sql_lora&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Send inference requests&lt;/strong>:
You can send requests to the vLLM server using &lt;code>curl&lt;/code> or any HTTP client. Just specify the &lt;code>model&lt;/code> in the request body as the name of your loaded LoRA adapter.&lt;/p>
&lt;pre>&lt;code class="language-bash">curl http://localhost:8000/v1/completions \
-H &amp;quot;Content-Type: application/json&amp;quot; \
-d '{
&amp;quot;model&amp;quot;: &amp;quot;my_sql_lora&amp;quot;,
&amp;quot;prompt&amp;quot;: &amp;quot;Write a SQL query for all users.&amp;quot;,
&amp;quot;max_tokens&amp;quot;: 64
}'
&lt;/code>&lt;/pre>
&lt;p>vLLM will automatically route the request to the corresponding LoRA adapter for inference.&lt;/p>
&lt;/li>
&lt;/ol>
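&lt;p>The same request can be built from Python with only the standard library. This is a minimal sketch that mirrors the &lt;code>curl&lt;/code> call above; the helper names &lt;code>build_completion_request&lt;/code> and &lt;code>complete&lt;/code> are hypothetical, and the response is assumed to follow the OpenAI completions format that the server emulates.&lt;/p>
&lt;pre>&lt;code class="language-python">import json
import urllib.request

def build_completion_request(base_url, adapter_name, prompt, max_tokens=64):
    # The adapter is selected by passing its registered name as the model field
    payload = {'model': adapter_name, 'prompt': prompt, 'max_tokens': max_tokens}
    return urllib.request.Request(
        base_url + '/v1/completions',
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )

def complete(request):
    # Sends the request; this needs a running server at the target URL
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body['choices'][0]['text']

request = build_completion_request(
    'http://localhost:8000', 'my_sql_lora', 'Write a SQL query for all users.'
)
print(request.full_url)
&lt;/code>&lt;/pre>
&lt;p>With the server running, &lt;code>complete(request)&lt;/code> returns the generated text. Switching adapters only requires changing the &lt;code>adapter_name&lt;/code> argument.&lt;/p>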
&lt;p>&lt;strong>Using the Python API&lt;/strong>:
vLLM also provides a Python API for offline inference directly in code, without starting a server.&lt;/p>
&lt;pre>&lt;code class="language-python">from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
# Initialize LLM engine with LoRA support
llm = LLM(model=&amp;quot;meta-llama/Llama-2-7b-hf&amp;quot;, enable_lora=True)
sampling_params = SamplingParams(max_tokens=64)
# In the generate call, specify which adapter to use via lora_request.
# LoRARequest takes a name, a unique integer ID, and the adapter path.
outputs = llm.generate(
    &amp;quot;Write a SQL query for all users.&amp;quot;,
    sampling_params,
    lora_request=LoRARequest(&amp;quot;my_sql_lora&amp;quot;, 1, &amp;quot;/path/to/your/sql_lora_adapter&amp;quot;),
)
# Each result holds one or more completions; print the first one
for output in outputs:
    print(output.outputs[0].text)
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;strong>Advantages&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Extremely high throughput&lt;/strong>: Designed for large-scale concurrent inference.&lt;/li>
&lt;li>&lt;strong>Dynamic flexibility&lt;/strong>: Can serve many LoRA adapters from a single base model, loading them on demand, which suits multi-tenant scenarios.&lt;/li>
&lt;li>&lt;strong>Memory efficient&lt;/strong>: The base weights are shared across all adapters, and PagedAttention manages KV-cache memory in pages to reduce waste.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Disadvantages&lt;/strong>:
&lt;ul>
&lt;li>&lt;strong>Slightly more complex deployment&lt;/strong>: Requires additional learning and configuration of vLLM service.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="option-3-other-dynamic-deployment-options-eg-tgi">Option 3: Other Dynamic Deployment Options (e.g., TGI)&lt;/h3>
&lt;p>Hugging Face's own &lt;strong>Text Generation Inference (TGI)&lt;/strong> is another powerful production-grade inference server. Similar to vLLM, TGI also supports loading multiple LoRA adapters at startup and selecting one per request through an adapter identifier sent with the request (the &lt;code>adapter_id&lt;/code> generation parameter in current releases). It integrates closely with the Hugging Face ecosystem and is a strong alternative to vLLM.&lt;/p>
&lt;h3 id="deployment-options-comparison-summary">Deployment Options Comparison Summary&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th align="left">Feature&lt;/th>
&lt;th align="left">Weight Merging (Static)&lt;/th>
&lt;th align="left">vLLM (Dynamic)&lt;/th>
&lt;th align="left">TGI (Dynamic)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td align="left">&lt;strong>Performance/Throughput&lt;/strong>&lt;/td>
&lt;td align="left">Highest (lowest single request latency)&lt;/td>
&lt;td align="left">Very High&lt;/td>
&lt;td align="left">High&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Flexibility&lt;/strong>&lt;/td>
&lt;td align="left">Low (no dynamic capability)&lt;/td>
&lt;td align="left">Very High&lt;/td>
&lt;td align="left">High&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Deployment Complexity&lt;/strong>&lt;/td>
&lt;td align="left">Low&lt;/td>
&lt;td align="left">Medium&lt;/td>
&lt;td align="left">Medium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Memory Usage&lt;/strong>&lt;/td>
&lt;td align="left">Very High (N adapters = N full model copies)&lt;/td>
&lt;td align="left">Low (efficient sharing)&lt;/td>
&lt;td align="left">Low (efficient sharing)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td align="left">&lt;strong>Suitable Scenarios&lt;/strong>&lt;/td>
&lt;td align="left">Single, fixed tasks&lt;/td>
&lt;td align="left">Multi-tenant, high-concurrency, multi-task scenarios&lt;/td>
&lt;td align="left">Production deployment in Hugging Face ecosystem&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
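&lt;p>The memory row of the table can be made concrete with a back-of-the-envelope comparison. The numbers below are illustrative assumptions (13 GiB of base weights, 8 MiB per adapter, 16 adapters), the function names are hypothetical, and only weights are counted, not KV cache or activations.&lt;/p>
&lt;pre>&lt;code class="language-python">def merged_copies_gib(base_gib, num_adapters):
    # Static deployment: one full merged model per adapter
    return base_gib * num_adapters

def shared_base_gib(base_gib, adapter_mib, num_adapters):
    # Dynamic deployment: one base model plus N small adapters
    return base_gib + num_adapters * adapter_mib / 1024

# Illustrative numbers only
static_total = merged_copies_gib(13, 16)
dynamic_total = shared_base_gib(13, 8, 16)
print(static_total, dynamic_total)
&lt;/code>&lt;/pre>
&lt;p>Under these assumptions, static deployment needs 208 GiB of weights while the shared-base approach needs about 13.1 GiB.&lt;/p>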
&lt;h2 id="6-advanced-topics">6. Advanced Topics&lt;/h2>
&lt;ul>
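&lt;li>&lt;strong>Multi-adapter Management&lt;/strong>: PEFT supports attaching several adapters to a single base model and switching between them at runtime: &lt;code>model.add_adapter()&lt;/code> creates a new adapter from a config, &lt;code>model.load_adapter()&lt;/code> loads a saved one, &lt;code>model.set_adapter()&lt;/code> selects the active one, and the &lt;code>model.disable_adapter()&lt;/code> context manager temporarily runs the base model alone. This makes it straightforward to build flexible multi-task systems.&lt;/li>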
&lt;/ul>
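&lt;p>The idea behind adapter switching can be shown with a toy model in which a single scalar stands in for each weight matrix. The class &lt;code>ToyAdapterLayer&lt;/code> is hypothetical and is not part of PEFT. Its method names only echo the ones mentioned above, and its behavior is deliberately simplified.&lt;/p>
&lt;pre>&lt;code class="language-python">class ToyAdapterLayer:
    # Hypothetical stand-in for one adapted weight; not a peft class
    def __init__(self, base_weight):
        self.base_weight = base_weight   # frozen scalar standing in for W
        self.deltas = {}                 # adapter name mapped to a scalar standing in for BA
        self.active = None

    def add_adapter(self, name, delta):
        self.deltas[name] = delta

    def set_adapter(self, name):
        if name not in self.deltas:
            raise KeyError(name)
        self.active = name

    def disable_adapter(self):
        self.active = None

    def forward(self, x):
        # The base weight is shared; only the small delta changes per adapter
        delta = self.deltas[self.active] if self.active is not None else 0.0
        return (self.base_weight + delta) * x

layer = ToyAdapterLayer(base_weight=2.0)
layer.add_adapter('sql', 0.5)
layer.add_adapter('summarize', -1.0)
layer.set_adapter('sql')
out_sql = layer.forward(4.0)
layer.set_adapter('summarize')
out_sum = layer.forward(4.0)
layer.disable_adapter()
out_base = layer.forward(4.0)
print(out_sql, out_sum, out_base)
&lt;/code>&lt;/pre>
&lt;p>Switching tasks changes only which small delta is applied, while the base weight stays in memory once.&lt;/p>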
&lt;h2 id="7-conclusion">7. Conclusion&lt;/h2>
&lt;p>As a parameter-efficient fine-tuning technique, LoRA addresses the high cost of fine-tuning in the era of large models. By expressing weight updates as a low-rank decomposition, it greatly reduces compute and storage requirements while largely preserving fine-tuning quality. Combined with inference engines like vLLM, LoRA adapters can be deployed and served efficiently and flexibly, making it practical to apply large models to many specific scenarios.&lt;/p></description></item></channel></rss>