# AI RAG Helper

A FastAPI service that supports RAG (Retrieval-Augmented Generation) pipelines. It manages approved embedding models, provides an API to preload/unload models, computes embeddings with optional Redis caching, and exposes basic monitoring hooks. The project is designed to run locally with Python/uv or via Docker Compose, and supports CPU, CUDA, and ROCm PyTorch builds via optional dependency extras.
- Overview
- Features
- Quickstart
- Prerequisites
- Installation (uv)
- Configuration (.env and models list)
- Run locally
- Run with Docker
- API Reference (summary)
- Settings (environment variables)
- Development
## Overview

The service wraps sentence-transformers/Hugging Face embedding models behind a simple API. It can list approved models, preload them to a local cache, compute embeddings (optionally cached in Redis), and manage model lifecycle in memory.
## Features

- FastAPI-based API with JSON or ORJSON responses
- Embedding endpoint with batch support and Redis result caching
- Approved model allowlist via YAML file
- Model lifecycle management: load, unload, list loaded, list available, get properties
- Works with CPU/CUDA/ROCm PyTorch wheels (choose via extras)
- Docker Compose setup with Redis
### ModelManager

The ModelManager is the core component responsible for managing embedding and reranking models throughout their lifecycle. It provides:
- Model Loading & Unloading: Dynamically loads models on-demand and manages memory by unloading inactive models
- Automatic Cleanup: Background task that unloads models after a configurable timeout period of inactivity
- GPU Monitoring: Tracks GPU memory usage for NVIDIA (via nvidia-smi) and AMD ROCm devices
- Prometheus Metrics: Exposes model usage metrics including loaded model count, GPU memory per model, and inference times
- Thread-Safe Operations: Uses async locks to ensure safe concurrent access to models
- Device Detection: Automatically detects and uses available hardware (CPU, CUDA, or ROCm)
The ModelManager instantiates models using either SentenceTransformer (for embedding models) or CrossEncoder (for reranking models) from the sentence-transformers library. Each model is wrapped in a ModelInstance that tracks usage statistics and handles inference requests. Models are loaded lazily when first requested and can be preloaded to disk cache via the /models/preload endpoint.
ModelManager behavior is controlled by environment variables:
- `MODEL_MANAGER_TIMEOUT`: seconds of inactivity before auto-unloading (default: 600)
- `GPU_MONITOR_LOOP_DELAY`: GPU monitoring interval in seconds (default: 5)
- `PROMETHEUS_LOOP_DELAY`: Prometheus metrics update interval in seconds (default: 15)
- `PRE_IMPORT_ON_BOOT`: whether to import model libraries at startup (default: false)
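The inactivity-timeout cleanup described above can be illustrated with a small plain-Python sketch. `ModelEntry` and `expired` here are hypothetical stand-ins for the internal bookkeeping, not the real API:

```python
import time

# Plain-Python sketch of the auto-unload check described above.
# ModelEntry and expired() are illustrative stand-ins, not the real API.
MODEL_MANAGER_TIMEOUT = 600  # seconds, mirrors the env var's default


class ModelEntry:
    """Tracks when a loaded model was last used."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.last_used = time.monotonic()

    def touch(self) -> None:
        # Called on every inference request to reset the idle clock.
        self.last_used = time.monotonic()


def expired(entry: ModelEntry, now: float, timeout: float = MODEL_MANAGER_TIMEOUT) -> bool:
    # A background cleanup task would run this check periodically and
    # unload any model whose idle time exceeds the timeout.
    return now - entry.last_used > timeout
```

The real manager additionally guards this bookkeeping with async locks so concurrent requests stay safe.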
The manager is injected into route handlers via FastAPI's dependency injection system using `get_model_manager()`.
## Quickstart

### Prerequisites

- Python 3.12+ (tested with 3.14 in Docker args)
- uv (Python package/dependency manager): https://docs.astral.sh/uv/
- Redis (local or via Docker; docker-compose.yaml provides one)
### Installation (uv)

- Clone the repo and change directory into it.
- Choose one extra for PyTorch (CPU is the default):
  - `cpu` (default)
  - `cu128` (CUDA 12.8)
  - `rocm` (ROCm 6.4; not supported on Windows)
  - `docker` (skips installing torch; useful when torch is provided by the container, e.g. the `rocm/pytorch` image)
By default, the project’s uv configuration installs the `default_extras = ["ai-rag-helper[cpu]"]` group.
Sync dependencies:

```shell
uv sync
```

For dev dependencies:

```shell
uv sync --dev
```

To switch extras explicitly:

```shell
# CPU
uv sync --extra cpu

# CUDA 12.8
uv sync --extra cu128

# ROCm (non-Windows)
uv sync --extra rocm

# No torch (use Docker instead with the 'rocm/pytorch' image)
uv sync --extra docker

# Install monitoring packages such as Prometheus
uv sync --extra monitoring
```
### Configuration (.env and models list)

- Environment variables
  - Copy `dot_env.example` to `.env` and adjust values.
  - Important keys: `APP_PORT`, `API_ACCESS_KEY`, `REDIS_URL`, `HF_TOKEN`, etc. See the Settings section below.
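A minimal `.env` might look like the following; the values are illustrative, and `dot_env.example` remains the authoritative template:

```dotenv
APP_PORT=8000
API_ACCESS_KEY=change-me
REDIS_URL=redis://localhost:6379/0
HF_TOKEN=
LOG_LEVEL=INFO
```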
- Approved models list
  - The service reads the model allowlist and properties from `src/config/.models.yaml`.
  - An example file is provided: `src/config/dot.models_example.yaml`.
  - Create your config by copying and editing:

```shell
cp src/config/dot.models_example.yaml src/config/.models.yaml
```
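The example file defines the authoritative schema; purely as a hypothetical illustration (the field names below are guesses based on the properties exposed by `/models/properties`), an entry might resemble:

```yaml
# Hypothetical shape only — copy dot.models_example.yaml for the real format.
models:
  - name: sentence-transformers/all-MiniLM-L6-v2
    type: embed
    dimensions: 384
    max_tokens: 256
    batch_size: 32
```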
### Run locally

Start the API with uvicorn:
```shell
uv run uvicorn src.main:app --host 0.0.0.0 --port 8000
```
Then open the docs:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
### Run with Docker

The repository includes a `Dockerfile` and `docker-compose.yaml`.
```shell
# Default (CPU) build
docker compose up -d --build && docker compose logs -f api

# CUDA 12.8 build
docker compose build --build-arg EXTRA=cu128 api && docker compose up -d && docker compose logs -f api
```
Build arguments:

- For CUDA 12.8: `EXTRA=cu128`
- For the `rocm/pytorch` Docker image: `EXTRA=docker REPO_BUILDER=rocm/pytorch:latest`. You can also use `docker-compose.rocm.yaml` (with a predefined `ROCM_VERSION`) via `docker compose -f docker-compose.rocm.yaml ...`.
- For a different Python version: `PYTHON_VERSION=3.12`

Then run the general command:

```shell
docker compose up -d --build && docker compose logs -f api
```

Notes:

- The compose file starts a Redis service and the API. The API service mounts `./src` and `./data/models` into the container for live development and a persisted model cache.
- The torch extra can be controlled via the build-arg `EXTRA` (defaults to `cpu`).
- `.env` is passed into the container; set `APP_PORT` there (default 8000). The API will be available at http://localhost:8000.
- For ROCm, see `docker-compose.rocm.yaml` and the related comments under `dockers/` if applicable to your hardware.
## API Reference (summary)

Base prefix: `/api/v1`
Models:

- `GET /models/available` → list[str] of approved model names
- `GET /models/loaded` → list of loaded models
- `GET /models/load?model_name=...` → load a model
- `GET /models/unload?model_name=...` → unload a model
- `GET /models/properties?model_name=...` → return model properties (dimensions, max_tokens, batch_size, …)
- `GET /models/preload` → preload all available models to disk cache (requires API key; see auth dependency)
Embeddings:

- `POST /embed/` → compute embeddings
- Request body (`schemas/embedding.py`):

```json
{
  "texts": ["hello", "world"],
  "model": "sentence-transformers/all-MiniLM-L6-v2",
  "batch_size": 32
}
```

- Response (`EmbeddingResponse`): vectors and metadata
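As a sketch, the embedding endpoint can be called with only the Python standard library. The base URL assumes a local deployment, and `build_embed_payload`/`embed` are illustrative helpers, not part of the project:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/api/v1"  # assumed local deployment


def build_embed_payload(texts: list[str], model: str, batch_size: int = 32) -> dict:
    """Build a request body matching the /embed/ schema shown above."""
    return {"texts": texts, "model": model, "batch_size": batch_size}


def embed(texts: list[str], model: str, batch_size: int = 32) -> dict:
    """POST to /embed/ and return the parsed EmbeddingResponse as a dict."""
    body = json.dumps(build_embed_payload(texts, model, batch_size)).encode()
    req = urllib.request.Request(
        f"{API_URL}/embed/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```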
Rerank:

- `POST /rerank/` → rerank documents based on query relevance
- Request body (`schemas/rerank.py`):

```json
{
  "query": "What is machine learning?",
  "candidates": ["ML is a subset of AI", "The sky is blue", "Neural networks learn patterns"],
  "model": "sentence-transformers/all-MiniLM-L6-v2"
}
```

- Response (`RerankResponse`): ranked candidates with relevance scores
- Uses cross-encoder models to score query-document pairs for better relevance ranking in RAG pipelines
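On the client side, consuming a rerank response typically means pairing candidates with their scores and sorting by relevance. The helper below is a hypothetical sketch of that step, not the service's actual response handling:

```python
# Sketch: pair each candidate with its relevance score and sort descending,
# as a client might do after receiving a RerankResponse.
def rank(candidates: list[str], scores: list[float]) -> list[tuple[str, float]]:
    return sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)


print(rank(["ML is a subset of AI", "The sky is blue"], [0.92, 0.03]))
```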
Cache:

- `POST /cache/set` with body `{ "key": "k", "value": "v" }`
- `GET /cache/get?key=...`
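For example, a client could build the `/cache/get` URL with the standard library. The base URL is an assumption for a local run, and `cache_get_url` is an illustrative helper:

```python
from urllib.parse import urlencode

API_URL = "http://localhost:8000/api/v1"  # assumed local deployment


def cache_get_url(key: str) -> str:
    # urlencode handles keys that need percent-escaping.
    return f"{API_URL}/cache/get?{urlencode({'key': key})}"


print(cache_get_url("my key"))  # → .../cache/get?key=my+key
```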
## Settings (environment variables)

Defined in `src/config/settings.py` (Pydantic BaseSettings). Key values include:
- `REDIS_URL` (env: `redis_url`) default: `redis://redis:6379/0`
- `LOG_LEVEL` (env: `log_level`) one of: DEBUG, INFO, WARNING, ERROR, CRITICAL
- `APP_NAME` (env: `app_name`) default: `AI RAG Helper`
- `APP_VERSION` (env: `app_version`) default: `0.0.1`
- `DEBUG` (env: `debug`) default: `false`
- `API_PREFIX` (env: `api_prefix`) default: `/api/v1`
- `CORS_ORIGINS` (env: `cors_origins`) default: `[*]`
- `API_ACCESS_KEY` (env: `api_access_key`) optional; when set, certain endpoints require it
- `MODEL_MANAGER_TIMEOUT` (env: `model_manager_timeout`) default: `600` seconds
- `GPU_MONITOR_LOOP_DELAY` (env: `gpu_monitor_loop_delay`) default: `5` seconds
- `PROMETHEUS_LOOP_DELAY` (env: `prometheus_loop_delay`) default: `15` seconds
- `PRE_IMPORT_ON_BOOT` (env: `pre_import_on_boot`) default: `false`
- `APPROVED_MODELS_CONFIG_PATH` (env: `approved_models_config_path`) default: `config/.models.yaml` (resolved under `src/`)
- `MODEL_CACHE_FOLDER` (env: `model_cache_folder`) default: `models` (resolved to `data/models` under the repo root)
- `MODEL_CACHE_FOLDER_ONLY_LOCAL` (env: `model_cache_folder_only_local`) default: `false`
- `EMBEDDING_CACHE_RESULTS` (env: `embedding_cache_results`) default: `true`
- `EMBEDDING_CACHE_TTL` (env: `embedding_cache_ttl`) default: one week
- `HF_TOKEN` (env: `hf_token`) optional; set for private models or rate limits
- `DEFAULT_MODEL_NAMES` (env: `default_model_names`) JSON mapping, e.g. `{ "embed": "…", "rerank": "…" }`
## Development

- Code style: black, isort, and flake8 settings in `pyproject.toml` (line length 120)
- Linting/formatting: install dev tools with `uv sync --group dev`
- Pre-commit: `pre-commit install && pre-commit run -a`
- There is an example HTTP test file, `tests/test_main.http`, which you can run with an IDE HTTP client or other REST tools.
## Project structure

```
ai-rag-helper/
├─ src/
│  ├─ main.py                    # FastAPI app, CORS, exception handlers, routers
│  ├─ lifespan.py                # app lifespan hooks
│  ├─ logger_config.py           # logging setup helpers
│  ├─ routers/
│  │  ├─ embedding.py            # /api/v1/embed endpoint
│  │  ├─ models.py               # /api/v1/models endpoints
│  │  └─ cache.py                # /api/v1/cache endpoints
│  ├─ config/
│  │  ├─ settings.py             # Pydantic BaseSettings and paths
│  │  ├─ models_list_config.py   # YAML loader/validator for approved models
│  │  ├─ dot.models_example.yaml # example for .models.yaml
│  │  └─ .models.yaml            # your models config (ignored if not committed)
│  ├─ model_manager.py           # model lifecycle and cache mgmt
│  ├─ schemas/                   # request/response models
│  └─ handlers/                  # business logic for routers
├─ data/models/                  # on-disk model cache (created on first run)
├─ docker-compose.yaml           # API + Redis
├─ docker-compose.rocm.yaml      # ROCm-oriented compose (if using AMD)
├─ Dockerfile
├─ dot_env.example               # copy to .env
├─ pyproject.toml                # dependencies & tooling; uv configuration
└─ uv.lock
```
## License

This repository’s license is MIT.