<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Local Deployment | Ziyang Lin</title><link>https://ziyanglin.netlify.app/en/tags/local-deployment/</link><atom:link href="https://ziyanglin.netlify.app/en/tags/local-deployment/index.xml" rel="self" type="application/rss+xml"/><description>Local Deployment</description><generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><lastBuildDate>Fri, 27 Jun 2025 02:00:00 +0000</lastBuildDate><image><url>https://ziyanglin.netlify.app/img/icon-192.png</url><title>Local Deployment</title><link>https://ziyanglin.netlify.app/en/tags/local-deployment/</link></image><item><title>Ollama Practical Guide: Local Deployment and Management of Large Language Models</title><link>https://ziyanglin.netlify.app/en/post/ollama-documentation/</link><pubDate>Fri, 27 Jun 2025 02:00:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/ollama-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>Ollama is a powerful open-source tool designed to allow users to easily download, run, and manage large language models (LLMs) in local environments. Its core advantage lies in simplifying the deployment and use of complex models, enabling developers, researchers, and enthusiasts to experience and utilize state-of-the-art artificial intelligence technology on personal computers without specialized hardware or complex configurations.&lt;/p>
&lt;p>&lt;strong>Key Advantages:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Ease of Use:&lt;/strong> Complete model download, running, and interaction through simple command-line instructions.&lt;/li>
&lt;li>&lt;strong>Cross-Platform Support:&lt;/strong> Supports macOS, Windows, and Linux.&lt;/li>
&lt;li>&lt;strong>Rich Model Library:&lt;/strong> Supports numerous popular open-source models such as Llama 3, Mistral, Gemma, Phi-3, and more.&lt;/li>
&lt;li>&lt;strong>Highly Customizable:&lt;/strong> Through &lt;code>Modelfile&lt;/code>, users can easily customize model behavior, system prompts, and parameters.&lt;/li>
&lt;li>&lt;strong>API-Driven:&lt;/strong> Provides a REST API for easy integration with other applications and services.&lt;/li>
&lt;li>&lt;strong>Open Source Community:&lt;/strong> Has an active community continuously contributing new models and features.&lt;/li>
&lt;/ul>
&lt;p>This document will provide a comprehensive introduction to Ollama's various features, from basic fundamentals to advanced applications, helping you fully master this powerful tool.&lt;/p>
&lt;hr>
&lt;h2 id="2-quick-start">2. Quick Start&lt;/h2>
&lt;p>This section will guide you through installing and basic usage of Ollama.&lt;/p>
&lt;h3 id="21-installation">2.1 Installation&lt;/h3>
&lt;p>Visit the &lt;a href="https://ollama.com/">Ollama official website&lt;/a> to download and install the package suitable for your operating system.&lt;/p>
&lt;h3 id="22-running-your-first-model">2.2 Running Your First Model&lt;/h3>
&lt;p>After installation, open a terminal (or command prompt) and use the &lt;code>ollama run&lt;/code> command to download and run a model. For example, to run the Llama 3 model:&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama run llama3
&lt;/code>&lt;/pre>
&lt;p>On first run, Ollama will automatically download the required model files from the model library. Once the download is complete, you can directly converse with the model in the terminal.&lt;/p>
&lt;h3 id="23-managing-local-models">2.3 Managing Local Models&lt;/h3>
&lt;p>You can use the following commands to manage locally downloaded models:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>List Local Models:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama list
&lt;/code>&lt;/pre>
&lt;p>This command displays the name, ID, size, and modification time of all downloaded models.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Remove Local Models:&lt;/strong>&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama rm &amp;lt;model_name&amp;gt;
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="3-core-concepts">3. Core Concepts&lt;/h2>
&lt;h3 id="31-modelfile">3.1 Modelfile&lt;/h3>
&lt;p>&lt;code>Modelfile&lt;/code> is one of Ollama's core features. It's a configuration file similar to &lt;code>Dockerfile&lt;/code> that allows you to define and create custom models. Through &lt;code>Modelfile&lt;/code>, you can:&lt;/p>
&lt;ul>
&lt;li>Specify a base model.&lt;/li>
&lt;li>Set model parameters (such as temperature, top_p, etc.).&lt;/li>
&lt;li>Define the model's system prompt.&lt;/li>
&lt;li>Customize the model's interaction template.&lt;/li>
&lt;li>Apply LoRA adapters.&lt;/li>
&lt;/ul>
&lt;p>A simple &lt;code>Modelfile&lt;/code> example:&lt;/p>
&lt;pre>&lt;code class="language-Modelfile"># Specify base model
FROM llama3
# Set model temperature
PARAMETER temperature 0.8
# Set system prompt
SYSTEM &amp;quot;&amp;quot;&amp;quot;
You are a helpful AI assistant. Your name is Roo.
&amp;quot;&amp;quot;&amp;quot;
&lt;/code>&lt;/pre>
&lt;p>Use the &lt;code>ollama create&lt;/code> command to create a new model based on a &lt;code>Modelfile&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama create my-custom-model -f ./Modelfile
&lt;/code>&lt;/pre>
&lt;h3 id="32-model-import">3.2 Model Import&lt;/h3>
&lt;p>Ollama supports importing models from external file systems, particularly from &lt;code>Safetensors&lt;/code> format weight files.&lt;/p>
&lt;p>In a &lt;code>Modelfile&lt;/code>, use the &lt;code>FROM&lt;/code> directive and provide the directory path containing &lt;code>safetensors&lt;/code> files:&lt;/p>
&lt;pre>&lt;code class="language-Modelfile">FROM /path/to/safetensors/directory
&lt;/code>&lt;/pre>
&lt;p>Then use the &lt;code>ollama create&lt;/code> command to create the model.&lt;/p>
&lt;h3 id="33-multimodal-models">3.3 Multimodal Models&lt;/h3>
&lt;p>Ollama supports multimodal models (such as LLaVA) that can process both text and image inputs simultaneously.&lt;/p>
&lt;pre>&lt;code class="language-shell">ollama run llava &amp;quot;What's in this image? /path/to/image.png&amp;quot;
&lt;/code>&lt;/pre>
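&lt;p>The same request can be made programmatically: the &lt;code>/api/generate&lt;/code> endpoint accepts base64-encoded images in an &lt;code>images&lt;/code> field. A minimal sketch using only the Python standard library (model name, prompt, and file path are placeholders):&lt;/p>
&lt;pre>&lt;code class="language-python">import base64
import json
import urllib.request

def vision_payload(model, prompt, image_bytes):
    # /api/generate takes base64-encoded images in the &amp;quot;images&amp;quot; field
    return {
        &amp;quot;model&amp;quot;: model,
        &amp;quot;prompt&amp;quot;: prompt,
        &amp;quot;images&amp;quot;: [base64.b64encode(image_bytes).decode(&amp;quot;ascii&amp;quot;)],
        &amp;quot;stream&amp;quot;: False,
    }

def describe_image(path, model=&amp;quot;llava&amp;quot;):
    with open(path, &amp;quot;rb&amp;quot;) as f:
        payload = vision_payload(model, &amp;quot;What's in this image?&amp;quot;, f.read())
    req = urllib.request.Request(
        &amp;quot;http://localhost:11434/api/generate&amp;quot;,
        data=json.dumps(payload).encode(&amp;quot;utf-8&amp;quot;),
        headers={&amp;quot;Content-Type&amp;quot;: &amp;quot;application/json&amp;quot;},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())[&amp;quot;response&amp;quot;]
&lt;/code>&lt;/pre>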
&lt;hr>
&lt;h2 id="4-api-reference">4. API Reference&lt;/h2>
&lt;p>Ollama provides a set of REST APIs for programmatically interacting with models. The default service address is &lt;code>http://localhost:11434&lt;/code>.&lt;/p>
&lt;h3 id="41-apigenerate">4.1 &lt;code>/api/generate&lt;/code>&lt;/h3>
&lt;p>Generate text.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Request (Streaming):&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/generate -d '{
&amp;quot;model&amp;quot;: &amp;quot;llama3&amp;quot;,
&amp;quot;prompt&amp;quot;: &amp;quot;Why is the sky blue?&amp;quot;
}'
&lt;/code>&lt;/pre>
&lt;/li>
&lt;li>&lt;strong>Request (Non-streaming):&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/generate -d '{
&amp;quot;model&amp;quot;: &amp;quot;llama3&amp;quot;,
&amp;quot;prompt&amp;quot;: &amp;quot;Why is the sky blue?&amp;quot;,
&amp;quot;stream&amp;quot;: false
}'
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
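&lt;p>The same endpoint is just as easy to call from Python. A minimal non-streaming sketch using only the standard library; the field names mirror the curl examples above, while the helper names are my own:&lt;/p>
&lt;pre>&lt;code class="language-python">import json
import urllib.request

def generate_body(prompt, model=&amp;quot;llama3&amp;quot;, stream=False):
    # Request body for /api/generate, mirroring the curl examples above
    return json.dumps({&amp;quot;model&amp;quot;: model, &amp;quot;prompt&amp;quot;: prompt, &amp;quot;stream&amp;quot;: stream})

def generate(prompt, model=&amp;quot;llama3&amp;quot;, host=&amp;quot;http://localhost:11434&amp;quot;):
    req = urllib.request.Request(host + &amp;quot;/api/generate&amp;quot;,
                                 data=generate_body(prompt, model).encode(&amp;quot;utf-8&amp;quot;),
                                 headers={&amp;quot;Content-Type&amp;quot;: &amp;quot;application/json&amp;quot;})
    with urllib.request.urlopen(req) as resp:
        # With stream=False the full completion arrives in one &amp;quot;response&amp;quot; field
        return json.loads(resp.read())[&amp;quot;response&amp;quot;]
&lt;/code>&lt;/pre>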
&lt;h3 id="42-apichat">4.2 &lt;code>/api/chat&lt;/code>&lt;/h3>
&lt;p>Conduct multi-turn conversations.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Request:&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/chat -d '{
&amp;quot;model&amp;quot;: &amp;quot;llama3&amp;quot;,
&amp;quot;messages&amp;quot;: [
{
&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;,
&amp;quot;content&amp;quot;: &amp;quot;why is the sky blue?&amp;quot;
}
],
&amp;quot;stream&amp;quot;: false
}'
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
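&lt;p>From Python, a multi-turn conversation amounts to resending the accumulated message history on each request and appending the assistant's reply to it. A stdlib sketch (helper names are my own):&lt;/p>
&lt;pre>&lt;code class="language-python">import json
import urllib.request

def chat_body(history, content, model=&amp;quot;llama3&amp;quot;):
    # The server is stateless: the full history is sent on every call
    history.append({&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: content})
    return json.dumps({&amp;quot;model&amp;quot;: model, &amp;quot;messages&amp;quot;: history, &amp;quot;stream&amp;quot;: False})

def chat_once(history, content, host=&amp;quot;http://localhost:11434&amp;quot;):
    req = urllib.request.Request(host + &amp;quot;/api/chat&amp;quot;,
                                 data=chat_body(history, content).encode(&amp;quot;utf-8&amp;quot;),
                                 headers={&amp;quot;Content-Type&amp;quot;: &amp;quot;application/json&amp;quot;})
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())[&amp;quot;message&amp;quot;]
    history.append(reply)  # keep the assistant turn for the next call
    return reply[&amp;quot;content&amp;quot;]
&lt;/code>&lt;/pre>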
&lt;h3 id="43-apiembed">4.3 &lt;code>/api/embed&lt;/code>&lt;/h3>
&lt;p>Generate embedding vectors for text.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Request:&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/embed -d '{
&amp;quot;model&amp;quot;: &amp;quot;all-minilm&amp;quot;,
&amp;quot;input&amp;quot;: [&amp;quot;Why is the sky blue?&amp;quot;, &amp;quot;Why is the grass green?&amp;quot;]
}'
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
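&lt;p>Embedding vectors are typically compared with cosine similarity. A stdlib sketch that wraps the endpoint above plus a small scoring helper (function names are my own):&lt;/p>
&lt;pre>&lt;code class="language-python">import json
import math
import urllib.request

def cosine(a, b):
    # Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def embed(inputs, model=&amp;quot;all-minilm&amp;quot;, host=&amp;quot;http://localhost:11434&amp;quot;):
    body = json.dumps({&amp;quot;model&amp;quot;: model, &amp;quot;input&amp;quot;: inputs}).encode(&amp;quot;utf-8&amp;quot;)
    req = urllib.request.Request(host + &amp;quot;/api/embed&amp;quot;, data=body,
                                 headers={&amp;quot;Content-Type&amp;quot;: &amp;quot;application/json&amp;quot;})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())[&amp;quot;embeddings&amp;quot;]
&lt;/code>&lt;/pre>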
&lt;h3 id="44-apitags">4.4 &lt;code>/api/tags&lt;/code>&lt;/h3>
&lt;p>List all locally available models.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Request:&lt;/strong>
&lt;pre>&lt;code class="language-shell">curl http://localhost:11434/api/tags
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
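&lt;p>A Python sketch that lists local model names via this endpoint, assuming the usual response shape &lt;code>{&amp;quot;models&amp;quot;: [{&amp;quot;name&amp;quot;: ...}, ...]}&lt;/code>:&lt;/p>
&lt;pre>&lt;code class="language-python">import json
import urllib.request

def model_names(tags_response):
    # Each entry also carries size/modified fields; only the name is kept here
    return [m[&amp;quot;name&amp;quot;] for m in tags_response[&amp;quot;models&amp;quot;]]

def list_models(host=&amp;quot;http://localhost:11434&amp;quot;):
    with urllib.request.urlopen(host + &amp;quot;/api/tags&amp;quot;) as resp:
        return model_names(json.loads(resp.read()))
&lt;/code>&lt;/pre>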
&lt;hr>
&lt;h2 id="5-command-line-tools-cli">5. Command Line Tools (CLI)&lt;/h2>
&lt;p>Ollama provides a rich set of command-line tools for managing models and interacting with the service.&lt;/p>
&lt;ul>
&lt;li>&lt;code>ollama run &amp;lt;model&amp;gt;&lt;/code>: Run a model.&lt;/li>
&lt;li>&lt;code>ollama create &amp;lt;model&amp;gt; -f &amp;lt;Modelfile&amp;gt;&lt;/code>: Create a model from a Modelfile.&lt;/li>
&lt;li>&lt;code>ollama pull &amp;lt;model&amp;gt;&lt;/code>: Pull a model from a remote repository.&lt;/li>
&lt;li>&lt;code>ollama push &amp;lt;model&amp;gt;&lt;/code>: Push a model to a remote repository.&lt;/li>
&lt;li>&lt;code>ollama list&lt;/code>: List local models.&lt;/li>
&lt;li>&lt;code>ollama cp &amp;lt;source_model&amp;gt; &amp;lt;dest_model&amp;gt;&lt;/code>: Copy a model.&lt;/li>
&lt;li>&lt;code>ollama rm &amp;lt;model&amp;gt;&lt;/code>: Delete a model.&lt;/li>
&lt;li>&lt;code>ollama ps&lt;/code>: View running models and their resource usage.&lt;/li>
&lt;li>&lt;code>ollama stop &amp;lt;model&amp;gt;&lt;/code>: Stop a running model and unload it from memory.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="6-advanced-features">6. Advanced Features&lt;/h2>
&lt;h3 id="61-openai-api-compatibility">6.1 OpenAI API Compatibility&lt;/h3>
&lt;p>Ollama provides an endpoint compatible with the OpenAI API, allowing you to seamlessly migrate existing OpenAI applications to Ollama. The default address is &lt;code>http://localhost:11434/v1&lt;/code>.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>List Models (Python):&lt;/strong>
&lt;pre>&lt;code class="language-python">from openai import OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama', # required, but unused
)
response = client.models.list()
print(response)
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h3 id="62-structured-output">6.2 Structured Output&lt;/h3>
&lt;p>By combining the OpenAI-compatible API with Pydantic, you can force the model to output JSON with a specific structure.&lt;/p>
&lt;pre>&lt;code class="language-python">from pydantic import BaseModel
from openai import OpenAI
client = OpenAI(base_url=&amp;quot;http://localhost:11434/v1&amp;quot;, api_key=&amp;quot;ollama&amp;quot;)
class UserInfo(BaseModel):
name: str
age: int
try:
completion = client.beta.chat.completions.parse(
model=&amp;quot;llama3.1:8b&amp;quot;,
messages=[{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: &amp;quot;My name is John and I am 30 years old.&amp;quot;}],
response_format=UserInfo,
)
print(completion.choices[0].message.parsed)
except Exception as e:
print(f&amp;quot;Error: {e}&amp;quot;)
&lt;/code>&lt;/pre>
&lt;h3 id="63-performance-tuning">6.3 Performance Tuning&lt;/h3>
&lt;p>You can adjust Ollama's performance and resource management through environment variables:&lt;/p>
&lt;ul>
&lt;li>&lt;code>OLLAMA_KEEP_ALIVE&lt;/code>: Set how long models remain active in memory. For example, &lt;code>10m&lt;/code>, &lt;code>24h&lt;/code>, or &lt;code>-1&lt;/code> (permanent).&lt;/li>
&lt;li>&lt;code>OLLAMA_MAX_LOADED_MODELS&lt;/code>: Maximum number of models loaded into memory simultaneously.&lt;/li>
&lt;li>&lt;code>OLLAMA_NUM_PARALLEL&lt;/code>: Number of requests each model can process in parallel.&lt;/li>
&lt;/ul>
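&lt;p>Besides the environment variables, the keep-alive window can also be set per request: &lt;code>/api/generate&lt;/code> and &lt;code>/api/chat&lt;/code> accept a &lt;code>keep_alive&lt;/code> field. A sketch that preloads a model and pins it in memory (helper names are my own):&lt;/p>
&lt;pre>&lt;code class="language-python">import json
import urllib.request

def keep_alive_body(model, keep_alive):
    # A request without a prompt simply loads the model; keep_alive
    # controls how long it stays resident (e.g. &amp;quot;10m&amp;quot;, &amp;quot;24h&amp;quot;, or -1)
    return json.dumps({&amp;quot;model&amp;quot;: model, &amp;quot;keep_alive&amp;quot;: keep_alive})

def preload(model=&amp;quot;llama3&amp;quot;, keep_alive=&amp;quot;24h&amp;quot;, host=&amp;quot;http://localhost:11434&amp;quot;):
    req = urllib.request.Request(host + &amp;quot;/api/generate&amp;quot;,
                                 data=keep_alive_body(model, keep_alive).encode(&amp;quot;utf-8&amp;quot;),
                                 headers={&amp;quot;Content-Type&amp;quot;: &amp;quot;application/json&amp;quot;})
    urllib.request.urlopen(req).close()
&lt;/code>&lt;/pre>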
&lt;h3 id="64-lora-adapters">6.4 LoRA Adapters&lt;/h3>
&lt;p>Use the &lt;code>ADAPTER&lt;/code> directive in a &lt;code>Modelfile&lt;/code> to apply a LoRA (Low-Rank Adaptation) adapter, changing the model's behavior without modifying the base model weights.&lt;/p>
&lt;pre>&lt;code class="language-Modelfile">FROM llama3
ADAPTER /path/to/your-lora-adapter.safetensors
&lt;/code>&lt;/pre>
&lt;hr>
&lt;h2 id="7-appendix">7. Appendix&lt;/h2>
&lt;h3 id="71-troubleshooting">7.1 Troubleshooting&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Check CPU Features:&lt;/strong> On Linux, you can use the following command to check if your CPU supports instruction sets like AVX, which are crucial for the performance of certain models.
&lt;pre>&lt;code class="language-shell">cat /proc/cpuinfo | grep flags | head -1
&lt;/code>&lt;/pre>
&lt;/li>
&lt;/ul>
&lt;h3 id="72-contribution-guidelines">7.2 Contribution Guidelines&lt;/h3>
&lt;p>Ollama is an open-source project, and community contributions are welcome. When submitting code, please follow good commit message formats, for example:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Good:&lt;/strong> &lt;code>llm/backend/mlx: support the llama architecture&lt;/code>&lt;/li>
&lt;li>&lt;strong>Bad:&lt;/strong> &lt;code>feat: add more emoji&lt;/code>&lt;/li>
&lt;/ul>
&lt;h3 id="73-related-links">7.3 Related Links&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Official Website:&lt;/strong> &lt;a href="https://ollama.com/">https://ollama.com/&lt;/a>&lt;/li>
&lt;li>&lt;strong>GitHub Repository:&lt;/strong> &lt;a href="https://github.com/ollama/ollama">https://github.com/ollama/ollama&lt;/a>&lt;/li>
&lt;li>&lt;strong>Model Library:&lt;/strong> &lt;a href="https://ollama.com/library">https://ollama.com/library&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Llama.cpp Technical Guide: Lightweight LLM Inference Engine</title><link>https://ziyanglin.netlify.app/en/post/llama-cpp-documentation/</link><pubDate>Thu, 26 Jun 2025 01:06:00 +0000</pubDate><guid>https://ziyanglin.netlify.app/en/post/llama-cpp-documentation/</guid><description>&lt;h2 id="1-introduction">1. Introduction&lt;/h2>
&lt;p>Llama.cpp is a high-performance, lightweight inference framework for large language models (LLMs) written in C/C++. It focuses on efficiently running LLMs on consumer-grade hardware, making local inference possible on ordinary laptops and even smartphones.&lt;/p>
&lt;p>&lt;strong>Core Advantages:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High Performance:&lt;/strong> Achieves extremely fast inference speeds through optimized C/C++ code, quantization techniques, and hardware acceleration support (such as Apple Metal, CUDA, OpenCL, SYCL).&lt;/li>
&lt;li>&lt;strong>Lightweight:&lt;/strong> Extremely low memory and computational resource consumption, eliminating the need for expensive GPUs.&lt;/li>
&lt;li>&lt;strong>Cross-Platform:&lt;/strong> Supports multiple platforms including macOS, Linux, Windows, Docker, Android, and iOS.&lt;/li>
&lt;li>&lt;strong>Open Ecosystem:&lt;/strong> Features an active community and rich ecosystem, including Python bindings, UI tools, and OpenAI-compatible servers.&lt;/li>
&lt;li>&lt;strong>Continuous Innovation:&lt;/strong> Quickly follows and implements the latest model architectures and inference optimization techniques.&lt;/li>
&lt;/ul>
&lt;h2 id="2-core-concepts">2. Core Concepts&lt;/h2>
&lt;h3 id="21-gguf-model-format">2.1. GGUF Model Format&lt;/h3>
&lt;p>GGUF is the core model file format used by &lt;code>llama.cpp&lt;/code>, the successor to the earlier GGML format (the &amp;quot;GG&amp;quot; in both names comes from the initials of their author, Georgi Gerganov). GGUF is a binary format designed for fast loading and memory mapping.&lt;/p>
&lt;p>&lt;strong>Key Features:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Unified File:&lt;/strong> Packages model metadata, vocabulary, and all tensors (weights) in a single file.&lt;/li>
&lt;li>&lt;strong>Extensibility:&lt;/strong> Allows adding new metadata without breaking compatibility.&lt;/li>
&lt;li>&lt;strong>Backward Compatibility:&lt;/strong> Guarantees compatibility with older versions of GGUF models.&lt;/li>
&lt;li>&lt;strong>Memory Efficiency:&lt;/strong> Supports memory mapping (mmap), allowing multiple processes to share the same model weights, thereby saving memory.&lt;/li>
&lt;/ul>
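&lt;p>Because everything lives in a single binary file, the header is easy to inspect. A minimal sketch that reads the first two fields of the documented layout (a 4-byte magic &lt;code>GGUF&lt;/code> followed by a little-endian uint32 version):&lt;/p>
&lt;pre>&lt;code class="language-python">import struct

GGUF_MAGIC = b&amp;quot;GGUF&amp;quot;

def read_gguf_version(path):
    # Metadata key-value pairs, the vocabulary, and the tensors follow this header
    with open(path, &amp;quot;rb&amp;quot;) as f:
        if f.read(4) != GGUF_MAGIC:
            raise ValueError(&amp;quot;not a GGUF file&amp;quot;)
        (version,) = struct.unpack(&amp;quot;&amp;lt;I&amp;quot;, f.read(4))
    return version
&lt;/code>&lt;/pre>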
&lt;h3 id="22-quantization">2.2. Quantization&lt;/h3>
&lt;p>Quantization is one of the core advantages of &lt;code>llama.cpp&lt;/code>. It is a technique that converts model weights from high-precision floating-point numbers (such as 32-bit or 16-bit) to low-precision integers (such as 4-bit, 5-bit, or 8-bit).&lt;/p>
&lt;p>&lt;strong>Main Benefits:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Reduced Model Size:&lt;/strong> Significantly reduces the size of model files, making them easier to distribute and store.&lt;/li>
&lt;li>&lt;strong>Lower Memory Usage:&lt;/strong> Reduces the RAM required to load the model into memory.&lt;/li>
&lt;li>&lt;strong>Faster Inference:&lt;/strong> Low-precision calculations are typically faster than high-precision ones, especially on CPUs.&lt;/li>
&lt;/ul>
&lt;p>&lt;code>llama.cpp&lt;/code> supports various quantization methods, particularly &lt;strong>k-quants&lt;/strong>, an advanced quantization technique that achieves extremely high compression rates while maintaining high model performance.&lt;/p>
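&lt;p>A back-of-envelope calculation shows the effect: weight storage scales linearly with bits per weight, so going from 16-bit to 4-bit cuts a file to roughly a quarter of its size:&lt;/p>
&lt;pre>&lt;code class="language-python">def approx_weight_gb(n_params, bits_per_weight):
    # Weights only; real files add metadata and, for k-quants,
    # per-block scale factors, so actual sizes run somewhat higher
    return n_params * bits_per_weight / 8 / 1e9

# For a 7B-parameter model:
#   16-bit: approx_weight_gb(7e9, 16) = 14.0 GB
#    4-bit: approx_weight_gb(7e9, 4)  =  3.5 GB
&lt;/code>&lt;/pre>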
&lt;h3 id="23-multimodal-support">2.3. Multimodal Support&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> is not limited to text models; it has evolved into a powerful multimodal inference engine that supports processing text, images, and even audio simultaneously.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Supported Models:&lt;/strong> Supports various mainstream multimodal models such as LLaVA, MobileVLM, Granite, Qwen2.5 Omni, InternVL, SmolVLM, etc.&lt;/li>
&lt;li>&lt;strong>Working Principle:&lt;/strong> Typically converts images into embedding vectors through a vision encoder (such as CLIP), and then inputs these vectors along with text embedding vectors into the LLM.&lt;/li>
&lt;li>&lt;strong>Tools:&lt;/strong> &lt;code>llama-mtmd-cli&lt;/code> and &lt;code>llama-server&lt;/code> provide native support for multimodal models.&lt;/li>
&lt;/ul>
&lt;h2 id="3-usage-methods">3. Usage Methods&lt;/h2>
&lt;h3 id="31-compilation">3.1. Compilation&lt;/h3>
&lt;p>Compiling &lt;code>llama.cpp&lt;/code> from source is very simple.&lt;/p>
&lt;pre>&lt;code class="language-bash">git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
make
&lt;/code>&lt;/pre>
&lt;p>For specific hardware acceleration (such as CUDA or Metal), use the corresponding compilation options:&lt;/p>
&lt;pre>&lt;code class="language-bash"># For CUDA
make LLAMA_CUDA=1
# For Metal (on macOS)
make LLAMA_METAL=1
&lt;/code>&lt;/pre>
&lt;h3 id="32-basic-inference">3.2. Basic Inference&lt;/h3>
&lt;p>After compilation, you can use the &lt;code>llama-cli&lt;/code> tool for inference.&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m ./models/7B/ggml-model-q4_0.gguf -p &amp;quot;Building a website can be done in 10 simple steps:&amp;quot; -n 400
&lt;/code>&lt;/pre>
&lt;ul>
&lt;li>&lt;code>-m&lt;/code>: Specifies the path to the GGUF model file.&lt;/li>
&lt;li>&lt;code>-p&lt;/code>: Specifies the prompt.&lt;/li>
&lt;li>&lt;code>-n&lt;/code>: Specifies the maximum number of tokens to generate.&lt;/li>
&lt;/ul>
&lt;h3 id="33-openai-compatible-server">3.3. OpenAI Compatible Server&lt;/h3>
&lt;p>&lt;code>llama.cpp&lt;/code> provides a built-in HTTP server with an API compatible with OpenAI's API. This makes it easy to integrate with existing tools like LangChain and LlamaIndex.&lt;/p>
&lt;p>Starting the server:&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-server -m models/7B/ggml-model-q4_0.gguf -c 4096
&lt;/code>&lt;/pre>
&lt;p>You can then send requests to &lt;code>http://localhost:8080/v1/chat/completions&lt;/code> just like you would with the OpenAI API.&lt;/p>
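&lt;p>For example, sending a chat completion to the local server from Python with only the standard library (the &lt;code>model&lt;/code> value is arbitrary here, since the server already knows which model it loaded):&lt;/p>
&lt;pre>&lt;code class="language-python">import json
import urllib.request

def completion_body(prompt):
    return json.dumps({
        &amp;quot;model&amp;quot;: &amp;quot;local&amp;quot;,
        &amp;quot;messages&amp;quot;: [{&amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;, &amp;quot;content&amp;quot;: prompt}],
    })

def ask(prompt, host=&amp;quot;http://localhost:8080&amp;quot;):
    req = urllib.request.Request(host + &amp;quot;/v1/chat/completions&amp;quot;,
                                 data=completion_body(prompt).encode(&amp;quot;utf-8&amp;quot;),
                                 headers={&amp;quot;Content-Type&amp;quot;: &amp;quot;application/json&amp;quot;})
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    # The response follows the OpenAI chat completion shape
    return data[&amp;quot;choices&amp;quot;][0][&amp;quot;message&amp;quot;][&amp;quot;content&amp;quot;]
&lt;/code>&lt;/pre>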
&lt;h2 id="4-advanced-features">4. Advanced Features&lt;/h2>
&lt;h3 id="41-speculative-decoding">4.1. Speculative Decoding&lt;/h3>
&lt;p>This is an advanced inference optimization technique that significantly accelerates generation speed by using a small &amp;ldquo;draft&amp;rdquo; model to predict the output of the main model.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Working Principle:&lt;/strong> The draft model quickly generates a draft token sequence, which is then validated all at once by the main model. If validated, it saves the time of generating tokens one by one.&lt;/li>
&lt;li>&lt;strong>Usage:&lt;/strong> Use the &lt;code>-md&lt;/code> / &lt;code>--model-draft&lt;/code> parameter in &lt;code>llama-cli&lt;/code> or &lt;code>llama-server&lt;/code> to specify a small, fast draft model.&lt;/li>
&lt;/ul>
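&lt;p>The accept/reject loop can be illustrated with a toy greedy sketch. This is purely conceptual: a real implementation scores all draft positions in a single batched forward pass of the main model rather than one at a time:&lt;/p>
&lt;pre>&lt;code class="language-python">def speculative_step(context, draft_next, main_next, k=4):
    # draft_next/main_next map a token sequence to the next (greedy) token.
    # The draft proposes k tokens; the main model keeps the longest matching
    # prefix, plus its own token at the first mismatch.
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in proposed:
        m = main_next(ctx)
        accepted.append(m)
        if m != t:
            break  # draft diverged; stop accepting its tokens
        ctx.append(t)
    return accepted
&lt;/code>&lt;/pre>
&lt;p>When the draft agrees with the main model, up to &lt;code>k&lt;/code> tokens are produced per verification step; when it diverges, the output is still exactly what greedy decoding with the main model alone would have produced.&lt;/p>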
&lt;h3 id="42-lora-support">4.2. LoRA Support&lt;/h3>
&lt;p>LoRA (Low-Rank Adaptation) allows fine-tuning a model's behavior by training a small adapter without modifying the original model weights. &lt;code>llama.cpp&lt;/code> supports loading one or more LoRA adapters during inference.&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m base-model.gguf --lora lora-adapter.gguf
&lt;/code>&lt;/pre>
&lt;p>You can even set different weights for different LoRA adapters:&lt;/p>
&lt;pre>&lt;code class="language-bash">./llama-cli -m base.gguf --lora-scaled lora_A.gguf 0.5 --lora-scaled lora_B.gguf 0.5
&lt;/code>&lt;/pre>
&lt;h3 id="43-grammars">4.3. Grammars&lt;/h3>
&lt;p>Grammars are a very powerful feature that allows you to force the model's output to follow a specific format, such as a strict JSON schema.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Format:&lt;/strong> Uses a format called GBNF (GGML BNF) to define grammar rules.&lt;/li>
&lt;li>&lt;strong>Application:&lt;/strong> By providing GBNF rules through the &lt;code>grammar&lt;/code> parameter in API requests, you can ensure that the model returns correctly formatted, directly parsable JSON data, avoiding output format errors and tedious post-processing.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example:&lt;/strong> Using a Pydantic model to generate a JSON Schema, then converting it to GBNF to ensure the model output conforms to the expected Python object structure.&lt;/p>
&lt;pre>&lt;code class="language-python">import json
from typing import List
from pydantic import BaseModel
class QAPair(BaseModel):
question: str
answer: str
class Summary(BaseModel):
key_facts: List[str]
qa_pairs: List[QAPair]
# Generate JSON Schema and print
schema = Summary.model_json_schema()
print(json.dumps(schema, indent=2))
&lt;/code>&lt;/pre>
&lt;h2 id="5-ecosystem">5. Ecosystem&lt;/h2>
&lt;p>The success of &lt;code>llama.cpp&lt;/code> has spawned a vibrant ecosystem:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://github.com/abetlen/llama-cpp-python">llama-cpp-python&lt;/a>:&lt;/strong> The most popular Python binding, providing interfaces to almost all features of &lt;code>llama.cpp&lt;/code> and deeply integrated with frameworks like LangChain and LlamaIndex.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://ollama.com/">Ollama&lt;/a>:&lt;/strong> A tool for packaging, distributing, and running models, using &lt;code>llama.cpp&lt;/code> under the hood, greatly simplifying the process of running LLMs locally.&lt;/li>
&lt;li>&lt;strong>Numerous UI Tools:&lt;/strong> The community has developed a large number of graphical interface tools, allowing non-technical users to easily interact with local models.&lt;/li>
&lt;/ul>
&lt;h2 id="6-conclusion">6. Conclusion&lt;/h2>
&lt;p>&lt;code>llama.cpp&lt;/code> is not just an inference engine; it has become a key force in driving the localization and popularization of LLMs. Through its excellent performance, highly optimized resource usage, and continuously expanding feature set (such as multimodality and grammar constraints), &lt;code>llama.cpp&lt;/code> provides developers and researchers with a powerful and flexible platform, enabling them to explore and deploy AI applications on various devices, ushering in a new era of low-cost, privacy-protecting local AI.&lt;/p></description></item></channel></rss>