<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Local Deployment | Ziyang Lin</title><link>https://ziyanglin.netlify.app/en/tags/local-deployment/</link><atom:link href="https://ziyanglin.netlify.app/en/tags/local-deployment/index.xml" rel="self" type="application/rss+xml"/><description>Local Deployment</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Fri, 27 Jun 2025 02:00:00 +0000</lastBuildDate><image><url>https://ziyanglin.netlify.app/img/icon-192.png</url><title>Local Deployment</title><link>https://ziyanglin.netlify.app/en/tags/local-deployment/</link></image><item><title>Ollama Practical Guide: Local Deployment and Management of Large Language Models</title><link>https://ziyanglin.netlify.app/en/post/ollama-documentation/</link><pubDate>Fri, 27 Jun 2025 02:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/ollama-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>Ollama is a powerful open-source tool designed to allow users to easily download, run, and manage large language models (LLMs) in local environments. Its core advantage lies in simplifying the deployment and use of complex models, enabling developers, researchers, and enthusiasts to experience and utilize state-of-the-art artificial intelligence technology on personal computers without specialized hardware or complex configurations.&lt;/p>
&lt;p>&lt;strong>Key Advantages:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Ease of Use:&lt;/strong> Complete model download, running, and interaction through simple command-line instructions.&lt;/li>
&lt;li>&lt;strong>Cross-Platform Support:&lt;/strong> Supports macOS, Windows, and Linux.&lt;/li>
&lt;li>&lt;strong>Rich Model Library:&lt;/strong> Supports numerous popular open-source models such as Llama 3, Mistral, Gemma, Phi-3, and more.&lt;/li>
&lt;li>&lt;strong>Highly Customizable:&lt;/strong> Through &lt;code>Modelfile&lt;/code>, users can easily customize model behavior, system prompts, and parameters.&lt;/li>
&lt;li>&lt;strong>API-Driven:&lt;/strong> Provides a REST API for easy integration with other applications and services.&lt;/li>
&lt;li>&lt;strong>Open Source Community:&lt;/strong> Has an active community continuously contributing new models and features.&lt;/li>
&lt;/ul>
&lt;p>This document will provide a comprehensive introduction to Ollama's various features, from basic fundamentals to advanced applications, helping you fully master this powerful tool.&lt;/p>
&lt;hr>
&lt;h2 id="2-quick-start">2. Quick Start&lt;/h2>
&lt;p>This section will guide you through installing and basic usage of Ollama.&lt;/p>
&lt;h3 id="21-installation">2.1 Installation&lt;/h3>
&lt;p>Visit the &lt;a href="https://ollama.com/">Ollama official website&lt;/a> to download and install the package suitable for your operating system.&lt;/p>
&lt;h3 id="22-running-your-first-model">2.2 Running Your First Model&lt;/h3>
&lt;p>After installation, open a terminal (or command prompt) and use the &lt;code>ollama run&lt;/code> command to download and run a model. For example, to run the Llama 3 model:&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama run llama3
&lt;/code>&lt;/pre>
&lt;p>On first run, Ollama will automatically download the required model files from the model library. Once the download is complete, you can directly converse with the model in the terminal.&lt;/p>
&lt;h3 id="23-managing-local-models">2.3 Managing Local Models&lt;/h3>
&lt;p>You can use the following commands to manage locally downloaded models:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>List Local Models:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama list
&lt;/code>&lt;/pre>
&lt;p>This command displays the name, ID, size, and modification time of all downloaded models.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Remove Local Models:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama rm &amp;lt;model_name&amp;gt;
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="3-core-concepts">3. Core Concepts&lt;/h2>
&lt;h3 id="31-modelfile">3.1 Modelfile&lt;/h3>
&lt;p>&lt;code>Modelfile&lt;/code> is one of Ollama's core features. It's a configuration file similar to &lt;code>Dockerfile&lt;/code> that allows you to define and create custom models. Through &lt;code>Modelfile&lt;/code>, you can:&lt;/p>
&lt;ul>
&lt;li>Specify a base model.&lt;/li>
&lt;li>Set model parameters (such as temperature, top_p, etc.).&lt;/li>
&lt;li>Define the model's system prompt.&lt;/li>
&lt;li>Customize the model's interaction template.&lt;/li>
&lt;li>Apply LoRA adapters.&lt;/li>
&lt;/ul>
&lt;p>A simple &lt;code>Modelfile&lt;/code> example:&lt;/p>
&lt;pre>&lt;code class="language-Modelfile"># Specify base model
FROM llama3
# Set model temperature
PARAMETER temperature 0.8
# Set system prompt
SYSTEM &amp;quot;&amp;quot;&amp;quot;
You are a helpful AI assistant. Your name is Roo.
&amp;quot;&amp;quot;&amp;quot;
&lt;/code>&lt;/pre>
&lt;p>Use the &lt;code>ollama create&lt;/code> command to create a new model based on a &lt;code>Modelfile&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama create my-custom-model -f ./Modelfile
&lt;/code>&lt;/pre>
&lt;h3 id="32-model-import">3.2 Model Import&lt;/h3>
&lt;p>Ollama supports importing models from external file systems, particularly from &lt;code>Safetensors&lt;/code> format weight files.&lt;/p>
&lt;p>In a &lt;code>Modelfile&lt;/code>, use the &lt;code>FROM&lt;/code> directive and provide the directory path containing &lt;code>safetensors&lt;/code> files:&lt;/p>
&lt;pre>&lt;code class="language-Modelfile">FROM /path/to/safetensors/directory
&lt;/code>&lt;/pre>
&lt;p>Then use the &lt;code>ollama create&lt;/code> command to create the model.&lt;/p>
&lt;h3 id="33-multimodal-models">3.3 Multimodal Models&lt;/h3>
&lt;p>Ollama supports multimodal models (such as LLaVA) that can process both text and image inputs simultaneously.&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama run llava &amp;quot;What's in this image? /path/to/image.png&amp;quot;
&lt;/code>&lt;/pre>
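&lt;p>The same request can be made programmatically: the &lt;code>/api/generate&lt;/code> endpoint accepts base64-encoded images in an &lt;code>images&lt;/code> field. A minimal sketch using only the Python standard library (model name, prompt, and file path are placeholders):&lt;/p>
&lt;pre>&lt;code class="language-python">import base64
import json
import urllib.request

def vision_payload(model, prompt, image_bytes):
    # /api/generate takes base64-encoded images in the &amp;quot;images&amp;quot; field
    return {
        &amp;quot;model&amp;quot;: model,
        &amp;quot;prompt&amp;quot;: prompt,
        &amp;quot;images&amp;quot;: [base64.b64encode(image_bytes).decode(&amp;quot;ascii&amp;quot;)],
        &amp;quot;stream&amp;quot;: False,
    }

def describe_image(path, model=&amp;quot;llava&amp;quot;):
    with open(path, &amp;quot;rb&amp;quot;) as f:
        payload = vision_payload(model, &amp;quot;What's in this image?&amp;quot;, f.read())
    req = urllib.request.Request(
        &amp;quot;http://localhost:11434/api/generate&amp;quot;,
        data=json.dumps(payload).encode(&amp;quot;utf-8&amp;quot;),
        headers={&amp;quot;Content-Type&amp;quot;: &amp;quot;application/json&amp;quot;},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())[&amp;quot;response&amp;quot;]
&lt;/code>&lt;/pre>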
&lt;hr>
&lt;h2 id="4-api-reference">4. API Reference&lt;/h2>
&lt;p>Ollama provides a set of REST APIs for programmatically interacting with models. The default service address is &lt;code>http://localhost:11434&lt;/code>.&lt;/p>
&lt;h3 id="41-apigenerate">4.1 &lt;code>/api/generate&lt;/code>&lt;/h3>
&lt;p>Generate text.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Request (Streaming):&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/generate -d '{
&amp;quot;model&amp;quot;: &amp;quot;llama3&amp;quot;,
&amp;quot;prompt&amp;quot;: &amp;quot;Why is the sky blue?&amp;quot;
}'
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Request (Non-streaming):&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/generate -d '{
&amp;quot;model&amp;quot;: &amp;quot;llama3&amp;quot;,
&amp;quot;prompt&amp;quot;: &amp;quot;Why is the sky blue?&amp;quot;,
&amp;quot;stream&amp;quot;: false
}'
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
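&lt;p>The same endpoint is just as easy to call from Python. A minimal non-streaming sketch using only the standard library; the field names mirror the curl examples above, while the helper names are my own:&lt;/p>
&lt;pre>&lt;code class="language-python">import json
import urllib.request

def generate_body(prompt, model=&amp;quot;llama3&amp;quot;, stream=False):
    # Request body for /api/generate, mirroring the curl examples above
    return json.dumps({&amp;quot;model&amp;quot;: model, &amp;quot;prompt&amp;quot;: prompt, &amp;quot;stream&amp;quot;: stream})

def generate(prompt, model=&amp;quot;llama3&amp;quot;, host=&amp;quot;http://localhost:11434&amp;quot;):
    req = urllib.request.Request(host + &amp;quot;/api/generate&amp;quot;,
                                 data=generate_body(prompt, model).encode(&amp;quot;utf-8&amp;quot;),
                                 headers={&amp;quot;Content-Type&amp;quot;: &amp;quot;application/json&amp;quot;})
    with urllib.request.urlopen(req) as resp:
        # With stream=False the full completion arrives in one &amp;quot;response&amp;quot; field
        return json.loads(resp.read())[&amp;quot;response&amp;quot;]
&lt;/code>&lt;/pre>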
&lt;h3 id="42-apichat">4.2 &lt;code>/api/chat&lt;/code>&lt;/h3>
&lt;p>Conduct multi-turn conversations.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Request:&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/chat -d '{
&amp;quot;model&amp;quot;: &amp;quot;llama3&amp;quot;,
&amp;quot;messages&amp;quot;: [
{
&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;,
&amp;quot;content&amp;quot;: &amp;quot;why is the sky blue?&amp;quot;
}
],
&amp;quot;stream&amp;quot;: false
}'
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
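&lt;p>From Python, a multi-turn conversation amounts to resending the accumulated message history on each request and appending the assistant's reply to it. A stdlib sketch (helper names are my own):&lt;/p>
&lt;pre>&lt;code class="language-python">import json
import urllib.request

def chat_body(history, content, model=&amp;quot;llama3&amp;quot;):
    # The server is stateless: the full history is sent on every call
    history.append({&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: content})
    return json.dumps({&amp;quot;model&amp;quot;: model, &amp;quot;messages&amp;quot;: history, &amp;quot;stream&amp;quot;: False})

def chat_once(history, content, host=&amp;quot;http://localhost:11434&amp;quot;):
    req = urllib.request.Request(host + &amp;quot;/api/chat&amp;quot;,
                                 data=chat_body(history, content).encode(&amp;quot;utf-8&amp;quot;),
                                 headers={&amp;quot;Content-Type&amp;quot;: &amp;quot;application/json&amp;quot;})
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())[&amp;quot;message&amp;quot;]
    history.append(reply)  # keep the assistant turn for the next call
    return reply[&amp;quot;content&amp;quot;]
&lt;/code>&lt;/pre>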
&lt;h3 id="43-apiembed">4.3 &lt;code>/api/embed&lt;/code>&lt;/h3>
&lt;p>Generate embedding vectors for text.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Request:&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/embed -d '{
&amp;quot;model&amp;quot;: &amp;quot;all-minilm&amp;quot;,
&amp;quot;input&amp;quot;: [&amp;quot;Why is the sky blue?&amp;quot;, &amp;quot;Why is the grass green?&amp;quot;]
}'
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
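&lt;p>Embedding vectors are typically compared with cosine similarity. A stdlib sketch that wraps the endpoint above plus a small scoring helper (function names are my own):&lt;/p>
&lt;pre>&lt;code class="language-python">import json
import math
import urllib.request

def cosine(a, b):
    # Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def embed(inputs, model=&amp;quot;all-minilm&amp;quot;, host=&amp;quot;http://localhost:11434&amp;quot;):
    body = json.dumps({&amp;quot;model&amp;quot;: model, &amp;quot;input&amp;quot;: inputs}).encode(&amp;quot;utf-8&amp;quot;)
    req = urllib.request.Request(host + &amp;quot;/api/embed&amp;quot;, data=body,
                                 headers={&amp;quot;Content-Type&amp;quot;: &amp;quot;application/json&amp;quot;})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())[&amp;quot;embeddings&amp;quot;]
&lt;/code>&lt;/pre>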
&lt;h3 id="44-apitags">4.4 &lt;code>/api/tags&lt;/code>&lt;/h3>
&lt;p>List all locally available models.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Request:&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/tags
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
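&lt;p>A Python sketch that lists local model names via this endpoint, assuming the usual response shape &lt;code>{&amp;quot;models&amp;quot;: [{&amp;quot;name&amp;quot;: ...}, ...]}&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-python">import json
import urllib.request

def model_names(tags_response):
    # Each entry also carries size/modified fields; only the name is kept here
    return [m[&amp;quot;name&amp;quot;] for m in tags_response[&amp;quot;models&amp;quot;]]

def list_models(host=&amp;quot;http://localhost:11434&amp;quot;):
    with urllib.request.urlopen(host + &amp;quot;/api/tags&amp;quot;) as resp:
        return model_names(json.loads(resp.read()))
&lt;/code>&lt;/pre>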
&lt;hr>
&lt;h2 id="5-command-line-tools-cli">5. Command Line Tools (CLI)&lt;/h2>
&lt;p>Ollama provides a rich set of command-line tools for managing models and interacting with the service.&lt;/p>
&lt;ul>
&lt;li>&lt;code>ollama run &amp;lt;model&amp;gt;&lt;/code>: Run a model.&lt;/li>
&lt;li>&lt;code>ollama create &amp;lt;model&amp;gt; -f &amp;lt;Modelfile&amp;gt;&lt;/code>: Create a model from a Modelfile.&lt;/li>
&lt;li>&lt;code>ollama pull &amp;lt;model&amp;gt;&lt;/code>: Pull a model from a remote repository.&lt;/li>
&lt;li>&lt;code>ollama push &amp;lt;model&amp;gt;&lt;/code>: Push a model to a remote repository.&lt;/li>
&lt;li>&lt;code>ollama list&lt;/code>: List local models.&lt;/li>
&lt;li>&lt;code>ollama cp &amp;lt;source_model&amp;gt; &amp;lt;dest_model&amp;gt;&lt;/code>: Copy a model.&lt;/li>
&lt;li>&lt;code>ollama rm &amp;lt;model&amp;gt;&lt;/code>: Delete a model.&lt;/li>
&lt;li>&lt;code>ollama ps&lt;/code>: View running models and their resource usage.&lt;/li>
&lt;li>&lt;code>ollama stop &amp;lt;model&amp;gt;&lt;/code>: Stop a running model and unload it from memory.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="6-advanced-features">6. Advanced Features&lt;/h2>
&lt;h3 id="61-openai-api-compatibility">6.1 OpenAI API Compatibility&lt;/h3>
&lt;p>Ollama provides an endpoint compatible with the OpenAI API, allowing you to seamlessly migrate existing OpenAI applications to Ollama. The default address is &lt;code>http://localhost:11434/v1&lt;/code>.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>List Models (Python):&lt;/strong>
&lt;pre>&lt;code class="language-python">from openai import OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama', # required, but unused
)
response = client.models.list()
print(response)
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h3 id="62-structured-output">6.2 Structured Output&lt;/h3>
&lt;p>By combining the OpenAI-compatible API with Pydantic, you can force the model to output JSON with a specific structure.&lt;/p>
&lt;pre>&lt;code class="language-python">from pydantic import BaseModel
from openai import OpenAI
client = OpenAI(base_url=&amp;quot;http://localhost:11434/v1&amp;quot;, api_key=&amp;quot;ollama&amp;quot;)
class UserInfo(BaseModel):
name: str
age: int
try:
completion = client.beta.chat.completions.parse(
model=&amp;quot;llama3.1:8b&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;My name is John and I am 30 years old.&amp;quot;}],
response_format=UserInfo,
)
print(completion.choices[0].message.parsed)
except Exception as e:
print(f&amp;quot;Error: {e}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h3 id="63-performance-tuning">6.3 Performance Tuning&lt;/h3>
&lt;p>You can adjust Ollama's performance and resource management through environment variables:&lt;/p>
&lt;ul>
&lt;li>&lt;code>OLLAMA_KEEP_ALIVE&lt;/code>: Set how long models remain active in memory. For example, &lt;code>10m&lt;/code>, &lt;code>24h&lt;/code>, or &lt;code>-1&lt;/code> (permanent).&lt;/li>
&lt;li>&lt;code>OLLAMA_MAX_LOADED_MODELS&lt;/code>: Maximum number of models loaded into memory simultaneously.&lt;/li>
&lt;li>&lt;code>OLLAMA_NUM_PARALLEL&lt;/code>: Number of requests each model can process in parallel.&lt;/li>
&lt;/ul>
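&lt;p>Besides the environment variables, the keep-alive window can also be set per request: &lt;code>/api/generate&lt;/code> and &lt;code>/api/chat&lt;/code> accept a &lt;code>keep_alive&lt;/code> field. A sketch that preloads a model and pins it in memory (helper names are my own):&lt;/p>
&lt;pre>&lt;code class="language-python">import json
import urllib.request

def keep_alive_body(model, keep_alive):
    # A request without a prompt simply loads the model; keep_alive
    # controls how long it stays resident (e.g. &amp;quot;10m&amp;quot;, &amp;quot;24h&amp;quot;, or -1)
    return json.dumps({&amp;quot;model&amp;quot;: model, &amp;quot;keep_alive&amp;quot;: keep_alive})

def preload(model=&amp;quot;llama3&amp;quot;, keep_alive=&amp;quot;24h&amp;quot;, host=&amp;quot;http://localhost:11434&amp;quot;):
    req = urllib.request.Request(host + &amp;quot;/api/generate&amp;quot;,
                                 data=keep_alive_body(model, keep_alive).encode(&amp;quot;utf-8&amp;quot;),
                                 headers={&amp;quot;Content-Type&amp;quot;: &amp;quot;application/json&amp;quot;})
    urllib.request.urlopen(req).close()
&lt;/code>&lt;/pre>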
&lt;h3 id="64-lora-adapters">6.4 LoRA Adapters&lt;/h3>
&lt;p>Use the &lt;code>ADAPTER&lt;/code> directive in a &lt;code>Modelfile&lt;/code> to apply a LoRA (Low-Rank Adaptation) adapter, changing the model's behavior without modifying the base model weights.&lt;/p>
&lt;pre>&lt;code class="language-Modelfile">FROM llama3
ADAPTER /path/to/your-lora-adapter.safetensors
&lt;/code>&lt;/pre>
&lt;hr>
&lt;h2 id="7-appendix">7. Appendix&lt;/h2>
&lt;h3 id="71-troubleshooting">7.1 Troubleshooting&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Check CPU Features:&lt;/strong> On Linux, you can use the following command to check if your CPU supports instruction sets like AVX, which are crucial for the performance of certain models.
&lt;pre>&lt;code class="language-shell">cat /proc/cpuinfo | grep flags | head -1
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h3 id="72-contribution-guidelines">7.2 Contribution Guidelines&lt;/h3>
&lt;p>Ollama is an open-source project, and community contributions are welcome. When submitting code, please follow good commit message formats, for example:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Good:&lt;/strong> &lt;code>llm/backend/mlx: support the llama architecture&lt;/code>&lt;/li>
&lt;li>&lt;strong>Bad:&lt;/strong> &lt;code>feat: add more emoji&lt;/code>&lt;/li>
&lt;/ul>
&lt;h3 id="73-related-links">7.3 Related Links&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Official Website:&lt;/strong> &lt;a href="https://ollama.com/">https://ollama.com/&lt;/a>&lt;/li>
&lt;li>&lt;strong>GitHub Repository:&lt;/strong> &lt;a href="https://github.com/ollama/ollama">https://github.com/ollama/ollama&lt;/a>&lt;/li>
&lt;li>&lt;strong>Model Library:&lt;/strong> &lt;a href="https://ollama.com/library">https://ollama.com/library&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Llama.cpp Technical Guide: Lightweight LLM Inference Engine</title><link>https://ziyanglin.netlify.app/en/post/llama-cpp-documentation/</link><pubDate>Thu, 26 Jun 2025 01:06:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/llama-cpp-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>Llama.cpp is a high-performance, lightweight inference framework for large language models (LLMs) written in C/C++. It focuses on efficiently running LLMs on consumer-grade hardware, making local inference possible on ordinary laptops and even smartphones.&lt;/p>
&lt;p>&lt;strong>Core Advantages:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High Performance:&lt;/strong> Achieves extremely fast inference speeds through optimized C/C++ code, quantization techniques, and hardware acceleration support (such as Apple Metal, CUDA, OpenCL, SYCL).&lt;/li>
&lt;li>&lt;strong>Lightweight:&lt;/strong> Extremely low memory and computational resource consumption, eliminating the need for expensive GPUs.&lt;/li>
&lt;li>&lt;strong>Cross-Platform:&lt;/strong> Supports multiple platforms including macOS, Linux, Windows, Docker, Android, and iOS.&lt;/li>
&lt;li>&lt;strong>Open Ecosystem:&lt;/strong> Features an active community and rich ecosystem, including Python bindings, UI tools, and OpenAI-compatible servers.&lt;/li>
&lt;li>&lt;strong>Continuous Innovation:&lt;/strong> Quickly follows and implements the latest model architectures and inference optimization techniques.&lt;/li>
&lt;/ul>
&lt;h2 id="2-core-concepts">2. Core Concepts&lt;/h2>
&lt;h3 id="21-gguf-model-format">2.1. GGUF Model Format&lt;/h3>
&lt;p>GGUF is the core model file format used by &lt;code>llama.cpp&lt;/code>, the successor to the earlier GGML format (the &amp;quot;GG&amp;quot; in both names comes from the initials of their author, Georgi Gerganov). GGUF is a binary format designed for fast loading and memory mapping.&lt;/p>
&lt;p>&lt;strong>Key Features:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Unified File:&lt;/strong> Packages model metadata, vocabulary, and all tensors (weights) in a single file.&lt;/li>
&lt;li>&lt;strong>Extensibility:&lt;/strong> Allows adding new metadata without breaking compatibility.&lt;/li>
&lt;li>&lt;strong>Backward Compatibility:&lt;/strong> Guarantees compatibility with older versions of GGUF models.&lt;/li>
&lt;li>&lt;strong>Memory Efficiency:&lt;/strong> Supports memory mapping (mmap), allowing multiple processes to share the same model weights, thereby saving memory.&lt;/li>
&lt;/ul>
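&lt;p>Because everything lives in a single binary file, the header is easy to inspect. A minimal sketch that reads the first two fields of the documented layout (a 4-byte magic &lt;code>GGUF&lt;/code> followed by a little-endian uint32 version):&lt;/p>
&lt;pre>&lt;code class="language-python">import struct

GGUF_MAGIC = b&amp;quot;GGUF&amp;quot;

def read_gguf_version(path):
    # Metadata key-value pairs, the vocabulary, and the tensors follow this header
    with open(path, &amp;quot;rb&amp;quot;) as f:
        if f.read(4) != GGUF_MAGIC:
            raise ValueError(&amp;quot;not a GGUF file&amp;quot;)
        (version,) = struct.unpack(&amp;quot;&amp;lt;I&amp;quot;, f.read(4))
    return version
&lt;/code>&lt;/pre>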
&lt;h3 id="22-quantization">2.2. Quantization&lt;/h3>
&lt;p>Quantization is one of the core advantages of &lt;code>llama.cpp&lt;/code>. It is a technique that converts model weights from high-precision floating-point numbers (such as 32-bit or 16-bit) to low-precision integers (such as 4-bit, 5-bit, or 8-bit).&lt;/p>
&lt;p>&lt;strong>Main Benefits:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Reduced Model Size:&lt;/strong> Significantly reduces the size of model files, making them easier to distribute and store.&lt;/li>
&lt;li>&lt;strong>Lower Memory Usage:&lt;/strong> Reduces the RAM required to load the model into memory.&lt;/li>
&lt;li>&lt;strong>Faster Inference:&lt;/strong> Low-precision calculations are typically faster than high-precision ones, especially on CPUs.&lt;/li>
&lt;/ul>
&lt;p>&lt;code>llama.cpp&lt;/code> supports various quantization methods, particularly &lt;strong>k-quants&lt;/strong>, an advanced quantization technique that achieves extremely high compression rates while maintaining high model performance.&lt;/p>
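&lt;p>A back-of-envelope calculation shows the effect: weight storage scales linearly with bits per weight, so going from 16-bit to 4-bit cuts a file to roughly a quarter of its size:&lt;/p>
&lt;pre>&lt;code class="language-python">def approx_weight_gb(n_params, bits_per_weight):
    # Weights only; real files add metadata and, for k-quants,
    # per-block scale factors, so actual sizes run somewhat higher
    return n_params * bits_per_weight / 8 / 1e9

# For a 7B-parameter model:
#   16-bit: approx_weight_gb(7e9, 16) = 14.0 GB
#    4-bit: approx_weight_gb(7e9, 4)  =  3.5 GB
&lt;/code>&lt;/pre>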
&lt;h3 id="23-multimodal-support">2.3. Multimodal Support&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> is not limited to text models; it has evolved into a powerful multimodal inference engine that supports processing text, images, and even audio simultaneously.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Supported Models:&lt;/strong> Supports various mainstream multimodal models such as LLaVA, MobileVLM, Granite, Qwen2.5 Omni, InternVL, SmolVLM, etc.&lt;/li>
&lt;li>&lt;strong>Working Principle:&lt;/strong> Typically converts images into embedding vectors through a vision encoder (such as CLIP), and then inputs these vectors along with text embedding vectors into the LLM.&lt;/li>
&lt;li>&lt;strong>Tools:&lt;/strong> &lt;code>llama-mtmd-cli&lt;/code> and &lt;code>llama-server&lt;/code> provide native support for multimodal models.&lt;/li>
&lt;/ul>
&lt;h2 id="3-usage-methods">3. Usage Methods&lt;/h2>
&lt;h3 id="31-compilation">3.1. Compilation&lt;/h3>
&lt;p>Compiling &lt;code>llama.cpp&lt;/code> from source is very simple.&lt;/p>
&lt;pre>&lt;code class="language-bash">git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
make
&lt;/code>&lt;/pre>
&lt;p>For specific hardware acceleration (such as CUDA or Metal), use the corresponding compilation options:&lt;/p>
&lt;pre>&lt;code class="language-bash"># For CUDA
make LLAMA_CUDA=1
# For Metal (on macOS)
make LLAMA_METAL=1
&lt;/code>&lt;/pre>
&lt;h3 id="32-basic-inference">3.2. Basic Inference&lt;/h3>
&lt;p>After compilation, you can use the &lt;code>llama-cli&lt;/code> tool for inference.&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m ./models/7B/ggml-model-q4_0.gguf -p &amp;quot;Building a website can be done in 10 simple steps:&amp;quot; -n 400
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;code>-m&lt;/code>: Specifies the path to the GGUF model file.&lt;/li>
&lt;li>&lt;code>-p&lt;/code>: Specifies the prompt.&lt;/li>
&lt;li>&lt;code>-n&lt;/code>: Specifies the maximum number of tokens to generate.&lt;/li>
&lt;/ul>
&lt;h3 id="33-openai-compatible-server">3.3. OpenAI Compatible Server&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> provides a built-in HTTP server with an API compatible with OpenAI's API. This makes it easy to integrate with existing tools like LangChain and LlamaIndex.&lt;/p>
&lt;p>Starting the server:&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-server -m models/7B/ggml-model-q4_0.gguf -c 4096
&lt;/code>&lt;/pre>
&lt;p>You can then send requests to &lt;code>http://localhost:8080/v1/chat/completions&lt;/code> just like you would with the OpenAI API.&lt;/p>
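&lt;p>For example, sending a chat completion to the local server from Python with only the standard library (the &lt;code>model&lt;/code> value is arbitrary here, since the server already knows which model it loaded):&lt;/p>
&lt;pre>&lt;code class="language-python">import json
import urllib.request

def completion_body(prompt):
    return json.dumps({
        &amp;quot;model&amp;quot;: &amp;quot;local&amp;quot;,
        &amp;quot;messages&amp;quot;: [{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: prompt}],
    })

def ask(prompt, host=&amp;quot;http://localhost:8080&amp;quot;):
    req = urllib.request.Request(host + &amp;quot;/v1/chat/completions&amp;quot;,
                                 data=completion_body(prompt).encode(&amp;quot;utf-8&amp;quot;),
                                 headers={&amp;quot;Content-Type&amp;quot;: &amp;quot;application/json&amp;quot;})
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    # The response follows the OpenAI chat completion shape
    return data[&amp;quot;choices&amp;quot;][0][&amp;quot;message&amp;quot;][&amp;quot;content&amp;quot;]
&lt;/code>&lt;/pre>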
&lt;h2 id="4-advanced-features">4. Advanced Features&lt;/h2>
&lt;h3 id="41-speculative-decoding">4.1. Speculative Decoding&lt;/h3>
&lt;p>This is an advanced inference optimization technique that significantly accelerates generation speed by using a small &amp;ldquo;draft&amp;rdquo; model to predict the output of the main model.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Working Principle:&lt;/strong> The draft model quickly generates a draft token sequence, which is then validated all at once by the main model. If validated, it saves the time of generating tokens one by one.&lt;/li>
&lt;li>&lt;strong>Usage:&lt;/strong> Use the &lt;code>-md&lt;/code> / &lt;code>--model-draft&lt;/code> parameter in &lt;code>llama-cli&lt;/code> or &lt;code>llama-server&lt;/code> to specify a small, fast draft model.&lt;/li>
&lt;/ul>
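&lt;p>The accept/reject loop can be illustrated with a toy greedy sketch. This is purely conceptual: a real implementation scores all draft positions in a single batched forward pass of the main model rather than one at a time:&lt;/p>
&lt;pre>&lt;code class="language-python">def speculative_step(context, draft_next, main_next, k=4):
    # draft_next/main_next map a token sequence to the next (greedy) token.
    # The draft proposes k tokens; the main model keeps the longest matching
    # prefix, plus its own token at the first mismatch.
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in proposed:
        m = main_next(ctx)
        accepted.append(m)
        if m != t:
            break  # draft diverged; stop accepting its tokens
        ctx.append(t)
    return accepted
&lt;/code>&lt;/pre>
&lt;p>When the draft agrees with the main model, up to &lt;code>k&lt;/code> tokens are produced per verification step; when it diverges, the output is still exactly what greedy decoding with the main model alone would have produced.&lt;/p>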
&lt;h3 id="42-lora-support">4.2. LoRA Support&lt;/h3>
&lt;p>LoRA (Low-Rank Adaptation) allows fine-tuning a model's behavior by training a small adapter without modifying the original model weights. &lt;code>llama.cpp&lt;/code> supports loading one or more LoRA adapters during inference.&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m base-model.gguf --lora lora-adapter.gguf
&lt;/code>&lt;/pre>
&lt;p>You can even set different weights for different LoRA adapters:&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m base.gguf --lora-scaled lora_A.gguf 0.5 --lora-scaled lora_B.gguf 0.5
&lt;/code>&lt;/pre>
&lt;h3 id="43-grammars">4.3. Grammars&lt;/h3>
&lt;p>Grammars are a very powerful feature that allows you to force the model's output to follow a specific format, such as a strict JSON schema.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Format:&lt;/strong> Uses a format called GBNF (GGML BNF) to define grammar rules.&lt;/li>
&lt;li>&lt;strong>Application:&lt;/strong> By providing GBNF rules through the &lt;code>grammar&lt;/code> parameter in API requests, you can ensure that the model returns correctly formatted, directly parsable JSON data, avoiding output format errors and tedious post-processing.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example:&lt;/strong> Using a Pydantic model to generate a JSON Schema, then converting it to GBNF to ensure the model output conforms to the expected Python object structure.&lt;/p>
&lt;pre>&lt;code class="language-python">import json
from typing import List
from pydantic import BaseModel
class QAPair(BaseModel):
question: str
answer: str
class Summary(BaseModel):
key_facts: List[str]
qa_pairs: List[QAPair]
# Generate JSON Schema and print
schema = Summary.model_json_schema()
print(json.dumps(schema, indent=2))
&lt;/code>&lt;/pre>
&lt;h2 id="5-ecosystem">5. Ecosystem&lt;/h2>
&lt;p>The success of &lt;code>llama.cpp&lt;/code> has spawned a vibrant ecosystem:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://github.com/abetlen/llama-cpp-python">llama-cpp-python&lt;/a>:&lt;/strong> The most popular Python binding, providing interfaces to almost all features of &lt;code>llama.cpp&lt;/code> and deeply integrated with frameworks like LangChain and LlamaIndex.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://ollama.com/">Ollama&lt;/a>:&lt;/strong> A tool for packaging, distributing, and running models, using &lt;code>llama.cpp&lt;/code> under the hood, greatly simplifying the process of running LLMs locally.&lt;/li>
&lt;li>&lt;strong>Numerous UI Tools:&lt;/strong> The community has developed a large number of graphical interface tools, allowing non-technical users to easily interact with local models.&lt;/li>
&lt;/ul>
&lt;h2 id="6-conclusion">6. Conclusion&lt;/h2>
&lt;p>&lt;code>llama.cpp&lt;/code> is not just an inference engine; it has become a key force in driving the localization and popularization of LLMs. Through its excellent performance, highly optimized resource usage, and continuously expanding feature set (such as multimodality and grammar constraints), &lt;code>llama.cpp&lt;/code> provides developers and researchers with a powerful and flexible platform, enabling them to explore and deploy AI applications on various devices, ushering in a new era of low-cost, privacy-protecting local AI.&lt;/p></description></item></channel></rss>