A lightweight wrapper around the Moondream2 transformers implementation, providing clean REST API endpoints for vision-language model capabilities.
This replaces the complex Moondream Station binary with a simple, maintainable Python service that exposes the core Moondream capabilities through a REST API.
```
moondream-server/
├── app/                  # FastAPI application
│   ├── __init__.py       # Package initialization
│   ├── app.py            # Main FastAPI application
│   ├── requirements.txt  # Python dependencies
│   └── test_api.py       # API test suite
├── charts/               # Helm chart for Kubernetes deployment
│   ├── templates/        # Kubernetes manifests
│   ├── Chart.yaml        # Helm chart metadata
│   └── values.yaml       # Default configuration values
├── Dockerfile            # Container image definition
├── Makefile              # Build and deployment automation
└── README.md             # This file
```
Build and run with Docker:

```bash
docker build -t moondream-api .
docker run -p 8080:8080 -v moondream-api-models:/root/.cache/huggingface moondream-api
```

Or install the dependencies and run locally:

```bash
pip install -r app/requirements.txt
python -m app.app
```

The service will be available at http://localhost:8080.
When running in Docker on macOS, the service will automatically use CPU mode since Docker doesn't support MPS (Metal Performance Shaders) passthrough yet. This is normal and expected behavior.
- Local development: Uses MPS for GPU acceleration on Apple Silicon
- Docker: Falls back to CPU mode automatically
- Linux with NVIDIA GPU: Uses CUDA when available
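The fallback order above can be sketched as follows (a hypothetical illustration; the actual selection logic in `app.py` may differ):

```python
def pick_device() -> str:
    """Pick the best available torch device, falling back to CPU.

    Mirrors the behavior described above: CUDA on Linux hosts with an
    NVIDIA GPU, MPS on Apple Silicon, and CPU everywhere else
    (including Docker on macOS, where MPS passthrough is unavailable).
    """
    try:
        import torch
    except ImportError:
        return "cpu"  # no torch installed: CPU-only

    if torch.cuda.is_available():
        return "cuda"
    # MPS is only reported available on native macOS, never inside Docker
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```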
To eliminate lengthy startup times when pods are rescheduled, the Helm chart includes a model pre-loading feature:
- Kubernetes Job downloads the model to a shared PersistentVolumeClaim during installation
- Application pods mount the same PVC, accessing pre-downloaded models instantly
- No startup delay when pods move between nodes or restart
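The chart's actual manifest lives in `charts/templates/`; as a simplified, hypothetical sketch, such a pre-load Job could look roughly like this (image, claim name, and download command are assumptions for illustration):

```yaml
# Hypothetical sketch; the real manifest is in charts/templates/
apiVersion: batch/v1
kind: Job
metadata:
  name: moondream-model-cache
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: download
          image: moondream-api # assumed: reuses the service image
          command:
            - python
            - -c
            - "from huggingface_hub import snapshot_download; snapshot_download('vikhyatk/moondream2', revision='2025-06-21')"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: moondream-models # assumed claim name
```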
The model pre-loading is configured under the unified `persistence` section in `values.yaml`:
```yaml
persistence:
  enabled: true
  storageClass: "" # Storage class for PVC
  accessMode: ReadWriteMany # RWX for rolling deployments and model sharing
  size: 20Gi
  existingClaim: "" # Optional: use an existing PVC instead of creating a new one
  modelCache:
    enabled: true # Download models to the PVC
    modelRevision: "2025-06-21" # Model version to download
    waitForCache: true # Wait for the model download job to complete before starting pods
```

The service provides the following endpoints:
- `GET /health` - Health check
- `GET /v1` - API info
- `POST /v1/caption` - Generate image captions
- `POST /v1/query` - Answer questions about images
- `POST /v1/detect` - Detect objects in images
- `POST /v1/point` - Locate objects in images
```bash
# Health check
curl http://localhost:8080/health

# Generate caption
curl -X POST "http://localhost:8080/v1/caption" \
  -F "image=@your_image.jpg" \
  -F "length=short"

# Ask a question
curl -X POST "http://localhost:8080/v1/query" \
  -F "image=@your_image.jpg" \
  -F "question=What do you see in this image?"

# Detect objects
curl -X POST "http://localhost:8080/v1/detect" \
  -F "image=@your_image.jpg" \
  -F "object_name=person"

# Point to objects
curl -X POST "http://localhost:8080/v1/point" \
  -F "image=@your_image.jpg" \
  -F "object_name=car"
```

Run the comprehensive test suite:

```bash
# Test all endpoints with a real image
python app/test_api.py
```

The test suite downloads a real image from Unsplash and tests all API endpoints to ensure proper functionality.
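The same endpoints can also be called from Python. A minimal, hypothetical client sketch (assumes a server on localhost:8080 and the third-party `requests` library; the `endpoint` and `caption` helpers are illustrative, not part of the project):

```python
BASE_URL = "http://localhost:8080"  # assumed local deployment

def endpoint(name: str) -> str:
    """Build the full URL for a v1 endpoint."""
    return f"{BASE_URL}/v1/{name}"

def caption(image_path: str, length: str = "short") -> dict:
    """POST an image to /v1/caption as a multipart form upload."""
    import requests  # third-party; pip install requests

    with open(image_path, "rb") as f:
        resp = requests.post(
            endpoint("caption"),
            files={"image": f},
            data={"length": length},
        )
    resp.raise_for_status()
    return resp.json()

# Usage (requires a running server):
#   print(caption("your_image.jpg"))
```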
Install from the OCI registry:

```bash
helm install moondream-server oci://ghcr.io/mad-deecent/charts/moondream-server \
  --namespace moondream --create-namespace
```

Or install from source:

```bash
git clone https://github.com/Mad-Deecent/moondream-server.git
cd moondream-server
helm install moondream-server ./charts --namespace moondream --create-namespace
```

To deploy on GPU nodes, configure node selection in your values:
```yaml
nodeSelector:
  nvidia.com/gpu.product: "NVIDIA-GeForce-RTX-3060"
```

The default configuration requests 1 GPU and 2Gi of memory. Adjust based on your needs:
```yaml
resources:
  requests:
    cpu: 1000m
    memory: 2Gi
    nvidia.com/gpu: 1
  limits:
    cpu: 4000m
    memory: 8Gi
    nvidia.com/gpu: 1
```

Then install with your customized values:

```bash
helm install moondream-server ./charts \
  --namespace <my-namespace> \
  --create-namespace
```

For issues related to the Helm chart or containerization, open an issue in this repository.
For Moondream Station itself, visit moondream.ai or its official documentation.