| title | 🦙 Ollama — Setup Guide |
|---|---|
| nav_order | 4 |
Ollama is an open-source runtime for running large language models (LLMs) locally, without any cloud dependency. It runs as a background service and exposes a REST API on port 11434.
Minimum requirements: 8 GB RAM (16 GB+ recommended) · Internet connection for initial model download · Compatible GPU for accelerated inference (optional but recommended)
- Download the installer from https://ollama.com/download
- Run the
.exefile and follow the on-screen instructions - The installer automatically adds
ollamato the systemPATH - After installation, Ollama appears in the system tray and starts as a background service
Verify installation:
ollama --versionRequires macOS 14 Sonoma or later.
- Download the
.zipfrom https://ollama.com/download - Unzip the archive and drag
Ollama.appinto/Applications - Launch the application — an icon will appear in the menu bar
- On first launch, install the CLI tools when prompted
Alternatively, via Homebrew:
brew install ollamaVerify installation:
ollama --versionOne-line installation (recommended):
curl -fsSL https://ollama.com/install.sh | shThe script installs the Ollama binary and automatically registers a systemd service that starts on system boot.
Check service status:
sudo systemctl status ollamaStart / stop / restart manually:
sudo systemctl start ollama
sudo systemctl stop ollama
sudo systemctl restart ollamaVerify installation:
ollama --version# Download a model without running it
ollama pull qwen2.5:14b
# Download a specific version / quantization
ollama pull llama3.2:3b-instruct-q4_K_MModels are stored locally and available offline after the initial download.
ollama run qwen2.5:14bIf the model has not been pulled yet,
runwill automatically pull it first.
ollama list # List all downloaded models
ollama show qwen2.5:14b # Show model details (architecture, parameters)
ollama ps # Show models currently loaded in memory
ollama stop qwen2.5:14b # Unload a model from memory
ollama rm qwen2.5:14b # Delete a model from diskOllama automatically detects an available GPU and uses it when conditions are met. GPU acceleration significantly increases inference speed.
Requirements: NVIDIA drivers + CUDA Toolkit
# Verify GPU availability
nvidia-smi
# Check whether Ollama is using the GPU
ollama ps
# The PROCESSOR column should show "GPU" or "GPU+CPU"Recommended: Install the CUDA Toolkit for your driver version before installing Ollama.
For multi-GPU setups, Ollama automatically distributes model layers across GPUs when a model doesn't fit on a single card. To control which GPUs are used:
export CUDA_VISIBLE_DEVICES=0 # Use only the first GPU
export CUDA_VISIBLE_DEVICES=0,1 # Use both GPUsOn Macs with M-series chips (M1/M2/M3/M4), Ollama automatically uses Metal GPU acceleration with no additional configuration required. The unified memory architecture makes Metal particularly efficient — there is no VRAM/RAM split.
The Ollama server listens on http://localhost:11434 by default. All model interactions can be performed via HTTP requests.
curl http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5:14b",
"prompt": "Extract symptoms from: Patient reports fever and headache.",
"stream": false
}'curl http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5:14b",
"messages": [
{ "role": "system", "content": "You are a medical NLP assistant." },
{ "role": "user", "content": "Extract symptoms from the following text..." }
],
"stream": false
}'import requests
response = requests.post(
"http://localhost:11434/api/chat",
json={
"model": "qwen2.5:14b",
"messages": [
{"role": "system", "content": "You are a medical NLP assistant."},
{"role": "user", "content": "Extract symptoms from: ..."}
],
"stream": False
}
)
data = response.json()
print(data["message"]["content"])Or via the OpenAI-compatible SDK (Ollama exposes an OpenAI-compatible endpoint):
pip install openaifrom openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # required by the SDK, but ignored by Ollama
)
response = client.chat.completions.create(
model="qwen2.5:14b",
messages=[
{"role": "system", "content": "You are a medical NLP assistant."},
{"role": "user", "content": "Extract symptoms from: ..."}
]
)
print(response.choices[0].message.content)When using Ollama as an OpenAI-compatible server, only a subset of environment variables are relevant. The variables below affect the server itself — network exposure, model storage, memory behavior, and GPU acceleration — all of which directly impact how your OpenAI SDK client connects and performs.
Variables related to internal concurrency (
OLLAMA_NUM_PARALLEL,OLLAMA_MAX_QUEUE,OLLAMA_MAX_LOADED_MODELS) and low-level cache tuning (OLLAMA_KV_CACHE_TYPE,OLLAMA_ORIGINS) are server-side concerns and are not needed in this context.
| Variable | Default | Description |
|---|---|---|
OLLAMA_HOST |
127.0.0.1:11434 |
IP address and port the server listens on — set to 0.0.0.0:11434 to expose to other devices |
OLLAMA_MODELS |
~/.ollama/models |
Path to the directory where models are stored |
OLLAMA_KEEP_ALIVE |
5m |
How long a model stays loaded in memory after the last request — increase to reduce cold-start latency |
OLLAMA_FLASH_ATTENTION |
0 |
Enable Flash Attention to reduce VRAM usage (set to 1) |
CUDA_VISIBLE_DEVICES |
all | Which NVIDIA GPUs are visible to Ollama |
OLLAMA_DEBUG |
0 |
Enable verbose server logging — useful when diagnosing connection issues from the client (set to 1) |
sudo systemctl edit ollamaIn the editor that opens, add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_KEEP_ALIVE=10m"
Environment="OLLAMA_FLASH_ATTENTION=1"Apply the changes:
sudo systemctl daemon-reload
sudo systemctl restart ollamalaunchctl setenv OLLAMA_HOST "0.0.0.0"
launchctl setenv OLLAMA_KEEP_ALIVE "10m"Then restart the Ollama application from the Applications folder.
- Quit Ollama from the system tray (right-click → Quit)
- Open Settings → search for "environment variables"
- Click Edit environment variables for your account
- Add or edit the desired variables (e.g.
OLLAMA_HOST=0.0.0.0:11434) - Click OK and relaunch Ollama from the Start menu
⚠️ Security warning: Ollama has no built-in authentication. Exposing it on0.0.0.0makes the API accessible to anyone on the same network — or the internet if your firewall is open. Do not expose Ollama publicly without a reverse proxy (e.g. Nginx) that enforces authentication and TLS. For local development, prefer keepingOLLAMA_HOST=127.0.0.1:11434and using SSH tunneling to access it remotely.
For Ollama to be reachable outside the local machine, the server must listen on all interfaces:
# Linux (systemd)
Environment="OLLAMA_HOST=0.0.0.0:11434"
# macOS
launchctl setenv OLLAMA_HOST "0.0.0.0"
# Windows — set OLLAMA_HOST=0.0.0.0:11434 as an environment variableThe API will then be accessible at http://<SERVER_IP>:11434.
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=qwen2.5:14b
OLLAMA_TIMEOUT=120import os
from dotenv import load_dotenv
import requests
load_dotenv()
BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
MODEL = os.getenv("OLLAMA_MODEL", "qwen2.5:14b")
def extract_symptoms(text: str) -> dict:
response = requests.post(
f"{BASE_URL}/api/chat",
json={
"model": MODEL,
"messages": [
{"role": "system", "content": "You are a medical NLP assistant."},
{"role": "user", "content": text}
],
"stream": False
},
timeout=int(os.getenv("OLLAMA_TIMEOUT", 120))
)
response.raise_for_status()
return response.json()Before sending requests, it is recommended to verify that the service is running:
import requests
def is_ollama_running(base_url: str = "http://localhost:11434") -> bool:
try:
r = requests.get(base_url, timeout=3)
return r.status_code == 200
except requests.ConnectionError:
return False| Problem | Cause | Solution |
|---|---|---|
command not found: ollama |
Not in PATH | Restart the terminal or add Ollama to PATH manually |
connection refused on port 11434 |
Service is not running | Run ollama serve or systemctl start ollama |
| Model generates very slowly | Running on CPU instead of GPU | Check ollama ps — the PROCESSOR column should show GPU |
503 server overloaded |
Too many parallel requests | Increase OLLAMA_MAX_QUEUE |
| Model unloads immediately | OLLAMA_KEEP_ALIVE too low |
Set to 30m or -1 to keep it loaded indefinitely |
| Not enough VRAM | Model too large for the GPU | Use a quantized version (e.g. q4_K_M) |