A comprehensive Edge AI platform with LLM (Ollama) and ML (ONNX Runtime) serving capabilities, monitoring, model conversion, and benchmarking tools.
This project includes a powerful command-line interface (CLI) for converting and validating machine learning models, with a focus on ONNX format.
# Install the package in development mode
pip install -e .
# Install with TensorFlow support (for Keras/SavedModel conversion)
pip install -e .[tensorflow]
# Install with PyTorch support
pip install -e .[torch]
# Install with all dependencies
pip install -e .[all]
Benchmark ONNX models for performance metrics:
# Benchmark a single model
wronai_edge benchmark path/to/model.onnx --input-shape 1,3,224,224
# Compare multiple models
wronai_edge benchmark model1.onnx model2.onnx --compare --input-shape 1,3,224,224
# Customize benchmark parameters
wronai_edge benchmark model.onnx --warmup 20 --runs 200 --cpu
Options:
- `--input-shape`, `-i`: Input shape (can be specified multiple times for multiple inputs)
- `--warmup`: Number of warmup runs (default: 10)
- `--runs`: Number of benchmark runs (default: 100)
- `--cpu`/`--gpu`: Force CPU or GPU usage (default: GPU if available)
- `--compare`: Compare multiple models side by side
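To build intuition for what the `--warmup` and `--runs` options measure, here is a minimal hand-rolled timing sketch that calls ONNX Runtime directly. It is not the CLI's implementation, and the model path and float32 input are placeholder assumptions:

```python
import time

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("path/to/model.onnx")  # placeholder path
input_meta = session.get_inputs()[0]
# Replace dynamic dimensions (strings/None) with 1 for a dummy input.
shape = [d if isinstance(d, int) else 1 for d in input_meta.shape]
feed = {input_meta.name: np.random.rand(*shape).astype(np.float32)}  # assumes float32 input

for _ in range(10):          # warmup runs, discarded
    session.run(None, feed)

timings = []
for _ in range(100):         # timed benchmark runs
    start = time.perf_counter()
    session.run(None, feed)
    timings.append(time.perf_counter() - start)

print(f"mean: {1000 * np.mean(timings):.2f} ms, "
      f"p95: {1000 * np.percentile(timings, 95):.2f} ms")
```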
Validate an ONNX model:
wronai_edge test-model path/to/model.onnx
Options:
- `--output-json`: Save validation results to a JSON file
- `--verbose`, `-v`: Enable verbose output
Example:
wronai_edge test-model models/simple-model.onnx --output-json validation_results.json --verbose
Convert models between different formats using the convert command group.
PyTorch to ONNX:
wronai_edge convert pytorch model.pt output.onnx --input-shape 1,3,224,224
Keras to ONNX:
wronai_edge convert keras model.h5 output.onnx --input-shape 1,224,224,3
TensorFlow SavedModel to ONNX:
wronai_edge convert saved-model saved_model_dir output.onnx
Common options for conversion:
- `--opset`: ONNX opset version (default: 13)
- `--verbose`, `-v`: Enable verbose output
You can also use the conversion and validation tools programmatically:
from wronai_edge import validate_model, convert_to_onnx
# Validate a model
results = validate_model("model.onnx")
print(f"Model validation passed: {results['validation_summary']['passed']}")
# Convert a PyTorch model to ONNX
convert_to_onnx(
model_path="model.pt",
output_path="output.onnx",
input_shapes=[(1, 3, 224, 224)],
opset_version=13
)
For more examples, see the examples directory.
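The two calls can be chained, for example to validate a model immediately after converting it. A minimal sketch, assuming the same signatures and result structure shown above (paths are placeholders):

```python
from wronai_edge import convert_to_onnx, validate_model

# Convert a PyTorch checkpoint, then validate the resulting ONNX file.
convert_to_onnx(
    model_path="model.pt",            # placeholder path
    output_path="output.onnx",
    input_shapes=[(1, 3, 224, 224)],
    opset_version=13,
)

results = validate_model("output.onnx")
if not results["validation_summary"]["passed"]:
    raise RuntimeError("Converted model failed validation")
print("Converted and validated output.onnx")
```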
For detailed documentation about the Edge AI platform, including LLM serving and monitoring, see the sections below.
- 📖 Overview
- 🚀 Quick Start
- 📊 Architecture
- 🔧 Services
- 📈 Monitoring
- 🔍 Examples
- 🧩 API Reference
- 🧪 Testing
- 🧹 Cleanup
- Docker and Docker Compose
- Python 3.8+ (for running tests and examples)
- At least 8GB RAM (16GB recommended for running LLMs)
- `curl` and `jq` (for testing and examples)
1. Clone the repository:
   git clone https://github.com/wronai/edge.git
   cd edge
2. Start all services:
   docker-compose up -d
3. Verify services are running:
   docker-compose ps
   All services should show as "healthy" or "running".
4. Run the test suite to verify everything is working:
   ./test_services.sh
- Ollama API: http://localhost:11435
- ONNX Runtime: http://localhost:8001
- Nginx Gateway: http://localhost:30080
- Grafana: http://localhost:3007 (admin/admin)
- Prometheus: http://localhost:9090
# Check ONNX Runtime status
make onnx-status
# List available ONNX models
make onnx-models
# Load a new model
make onnx-load MODEL=simple-model MODEL_SOURCE=./models/simple-model.onnx
# Test inference with a sample request
make onnx-test
For detailed ONNX Runtime documentation, see docs/onnx-runtime.md.
Here's how to use the ONNX Runtime service for model inference:
1. Check service health:
   curl http://localhost:8001/health
   # Expected response: {"status": "OK"}
2. List available models:
   curl http://localhost:8001/v1/models
   # Example response: {"models": ["model1.onnx", "model2.onnx"]}
3. Run inference (using Python):
   import requests
   import numpy as np

   # Sample input data (adjust based on your model's expected input)
   input_data = {
       "model_name": "wronai.onnx",
       "input": {
           "input_1": np.random.rand(1, 224, 224, 3).tolist()  # Example for image input
       }
   }

   # Send inference request
   response = requests.post(
       "http://localhost:8001/v1/models/your_model:predict",
       json=input_data
   )

   # Process the response
   if response.status_code == 200:
       predictions = response.json()
       print("Inference successful!")
       print(f"Predictions: {predictions}")
   else:
       print(f"Error: {response.status_code}")
       print(response.text)
4. Using cURL for simple inference:
   curl -X POST http://localhost:8001/v1/models/your_model:predict \
     -H "Content-Type: application/json" \
     -d '{"input": [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]]}'
For more advanced usage, refer to the API Reference.
To stop all services:
docker-compose down
To remove all data (including models and metrics):
docker-compose down -v
graph TD
A[Client] -->|HTTP/HTTPS| B[Nginx Gateway]
B -->|/api/ollama/*| C[Ollama Service]
B -->|/api/onnx/*| D[ONNX Runtime]
B -->|/grafana| E[Grafana]
B -->|/prometheus| F[Prometheus]
G[Prometheus] -->|Scrape Metrics| H[Services]
E -->|Query| G
C -->|Store Models| I[(Ollama Models)]
D -->|Load Models| J[(ONNX Models)]
┌─────────────────┬──────────┬──────────────────────────────────────────┐
│ Service │ Port │ Description │
├─────────────────┼──────────┼──────────────────────────────────────────┤
│ Nginx Gateway │ 30080 │ API Gateway and reverse proxy │
│ Ollama │ 11435 │ LLM serving (compatible with OpenAI API) │
│ ONNX Runtime │ 8001 │ ML model inference │
│ Prometheus │ 9090 │ Metrics collection and alerting │
│ Grafana │ 3007 │ Monitoring dashboards │
└─────────────────┴──────────┴──────────────────────────────────────────┘
Access the monitoring dashboards:
- Grafana: http://localhost:3007 (admin/admin)
- Prometheus: http://localhost:9090
- Ollama API: http://localhost:11435
- ONNX Runtime: http://localhost:8001
We provide test scripts to verify all services are functioning correctly:
1. Basic Service Tests - Verifies all core services are running and accessible:
   # Run all tests
   make test
   # Or run individual tests
   ./test_services.sh
2. ONNX Runtime Tests - Test ONNX Runtime functionality:
   # Check ONNX Runtime status
   make onnx-status
   # Test with a sample request
   make onnx-test
3. ONNX Model Test - Validates ONNX model loading and inference (requires Python dependencies):
   python3 -m pip install -r requirements-test.txt
   python3 test_onnx_model.py
4. API Endpoint Tests - Comprehensive API tests (requires Python dependencies):
   python3 test_endpoints.py
When all services are running correctly, you should see output similar to:
=== Testing Direct Endpoints ===
Testing Ollama API (http://localhost:11435/api/tags)... PASS (Status: 200)
Testing ONNX Runtime (http://localhost:8001/v1/)... PASS (Status: 405)
=== Testing Through Nginx Gateway ===
Testing Nginx -> Ollama (http://localhost:30080/api/tags)... PASS (Status: 200)
Testing Nginx -> ONNX Runtime (http://localhost:30080/v1/)... PASS (Status: 405)
Testing Nginx Health Check (http://localhost:30080/health)... PASS (Status: 200)
=== Testing Monitoring ===
Testing Prometheus (http://localhost:9090)... PASS (Status: 302)
Testing Prometheus Graph (http://localhost:9090/graph)... PASS (Status: 200)
Testing Grafana (http://localhost:3007)... PASS (Status: 302)
Testing Grafana Login (http://localhost:3007/login)... PASS (Status: 200)
Note: A 405 status for ONNX Runtime is expected for GET requests to /v1/ as it requires POST requests for inference. The 302 status codes for Prometheus and Grafana are expected redirects to their respective UIs.
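The same checks can be scripted. Below is a minimal sketch (not the project's test_endpoints.py) that probes the endpoints above, assuming the default ports and the expected status codes from the sample output:

```python
import requests

# Endpoint -> expected HTTP status, taken from the sample test output above.
CHECKS = {
    "http://localhost:11435/api/tags": 200,   # Ollama API
    "http://localhost:8001/v1/": 405,         # ONNX Runtime (GET not allowed)
    "http://localhost:30080/api/tags": 200,   # Nginx -> Ollama
    "http://localhost:30080/health": 200,     # Nginx health check
    "http://localhost:9090/graph": 200,       # Prometheus UI
    "http://localhost:3007/login": 200,       # Grafana login page
}

for url, expected in CHECKS.items():
    try:
        status = requests.get(url, timeout=5, allow_redirects=False).status_code
    except requests.RequestException as exc:
        print(f"{url}: FAIL ({exc})")
        continue
    result = "PASS" if status == expected else f"FAIL (got {status})"
    print(f"{url}: {result} (expected {expected})")
```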
# Stop all services
make stop
# Remove all containers and volumes
make clean
# Remove all unused Docker resources
make prune
# List loaded models
make onnx-models
# To remove models, simply delete them from the models/ directory
rm models/*.onnx
This project is licensed under the Apache Software License - see the LICENSE file for details.
- Multi-Model Serving: Run multiple AI/ML models simultaneously
- Optimized Inference: ONNX Runtime for high-performance model execution
- LLM Support: Ollama integration for local LLM deployment
- Monitoring: Built-in Prometheus and Grafana for observability
- Scalable: Kubernetes-native design for easy scaling
- Developer-Friendly: Simple CLI and comprehensive API
- Overview - Platform architecture and components
- Quick Start - Get up and running in minutes
- Installation Guide - Detailed setup instructions
- Ollama Basic Usage - Running LLM models
- ONNX Runtime Guide - Deploying custom ONNX models
- API Reference - Complete API documentation
- Model Optimization - Performance tuning
- Monitoring - Setting up alerts and dashboards
- Security - Best practices for secure deployment
- Docker and Docker Compose
- 8GB+ RAM (16GB recommended)
- 20GB free disk space
# Clone the repository
git clone https://github.com/wronai/edge.git
cd edge
# Start all services
make up
# Check service status
make status
- API Gateway: http://localhost:30080
- Grafana: http://localhost:3007 (admin/admin)
- Prometheus: http://localhost:9090
edge/
├── docs/ # Documentation
├── configs/ # Configuration files
├── k8s/ # Kubernetes manifests
├── scripts/ # Utility scripts
├── terraform/ # Infrastructure as Code
├── docker-compose.yml # Local development
└── Makefile # Common tasks
# Start services
make up
# Stop services
make down
# View logs
make logs
# Access monitoring
make monitor
# Run tests
make test
Contributions are welcome! Please see our Contributing Guide for details.
This project is licensed under the Apache Software License - see the LICENSE file for details.
For support or questions, please open an issue in the repository.
- Docker Desktop (running)
- Terraform >= 1.6
- kubectl >= 1.28
- 8GB RAM minimum
# Clone and deploy
git clone https://github.com/wronai/edge.git
cd edge
# Make script executable and deploy everything
chmod +x scripts/deploy.sh
./scripts/deploy.sh
🎯 Result: Complete edge AI platform with monitoring in ~3-5 minutes
Example docker compose ps output:
docker compose ps
NAME IMAGE COMMAND SERVICE CREATED STATUS PORTS
edge-grafana-1 grafana/grafana:latest "/run.sh" grafana 3 days ago Up 8 minutes 0.0.0.0:3007->3000/tcp, :::3007->3000/tcp
edge-ollama-1 ollama/ollama:latest "/bin/sh -c 'sleep 1…" ollama 3 days ago Up 8 minutes 0.0.0.0:11435->11434/tcp, :::11435->11434/tcp
edge-prometheus-1    prom/prometheus:latest   "/bin/prometheus --c…"   prometheus   3 days ago   Up 8 minutes   0.0.0.0:9090->9090/tcp, :::9090->9090/tcp
- 🤖 AI Gateway: http://localhost:30080
- 📊 Grafana: http://localhost:30030 (admin/admin)
- 📈 Prometheus: http://localhost:30090
wronai_edge-portfolio/
├── terraform/main.tf        # Infrastructure (K3s + Docker)
├── k8s/ai-platform.yaml     # AI workloads (ONNX + Ollama)
├── k8s/monitoring.yaml      # Monitoring (Prometheus + Grafana)
├── configs/Modelfile        # Custom LLM configuration
├── scripts/deploy.sh        # Automation (single script)
└── README.md                # Complete documentation
graph TB
U[User] --> G[AI Gateway :30080]
G --> O[ONNX Runtime]
G --> L[Ollama LLM]
P[Prometheus :30090] --> O
P --> L
P --> G
GR[Grafana :30030] --> P
subgraph "K3s Cluster"
O
L
G
P
GR
end
subgraph "Infrastructure"
T[Terraform] --> K[K3s]
K --> O
K --> L
end
| Layer | Technology | Purpose |
|---|---|---|
| Infrastructure | Terraform + Docker | IaC provisioning |
| Orchestration | K3s (Lightweight Kubernetes) | Container management |
| AI Inference | ONNX Runtime + Ollama | Model serving |
| Load Balancing | Nginx Gateway | Traffic routing |
| Monitoring | Prometheus + Grafana | Observability |
| Automation | Bash + YAML | Deployment scripts |
# Check if the ONNX Runtime service is healthy
curl -X GET http://localhost:8001/
# Expected Response: "Healthy"
# List available models in the models directory
make onnx-models
# Check model status
make onnx-model-status
# Get model metadata
make onnx-model-metadata
# Make a prediction using the default model (complex-cnn-model)
make onnx-predict
# Or use curl directly
curl -X POST http://localhost:8001/v1/models/complex-cnn-model/versions/1:predict \
-H "Content-Type: application/json" \
-d '{"instances": [{"data": [1.0, 2.0, 3.0, 4.0]}]}'
# Example with Python
python3 -c "
import requests
import json
response = requests.post(
'http://localhost:8001/v1/models/complex-cnn-model/versions/1:predict',
json={"instances": [{"data": [1.0, 2.0, 3.0, 4.0]}]}
)
print(json.dumps(response.json(), indent=2))
"# Run a benchmark with 100 requests
make onnx-benchmark
# Customize model and version
make onnx-benchmark MODEL_NAME=my-model MODEL_VERSION=2
- The server automatically loads models from the `/models` directory in the container.
- To use a different model:
  - Place your `.onnx` model file in the `./models` directory.
  - Update the model name/version in your requests or set environment variables (see the Python sketch after this list):
    export MODEL_NAME=your-model
    export MODEL_VERSION=1
  - Or specify them when running commands:
    make onnx-predict MODEL_NAME=your-model MODEL_VERSION=1
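A minimal Python sketch that picks up MODEL_NAME and MODEL_VERSION from the environment and calls the versioned predict endpoint shown earlier; the payload is the same toy example used above and should be adjusted to your model's input:

```python
import json
import os

import requests

model_name = os.environ.get("MODEL_NAME", "complex-cnn-model")
model_version = os.environ.get("MODEL_VERSION", "1")
url = f"http://localhost:8001/v1/models/{model_name}/versions/{model_version}:predict"

payload = {"instances": [{"data": [1.0, 2.0, 3.0, 4.0]}]}  # toy input from the example above
response = requests.post(url, json=payload, timeout=10)
response.raise_for_status()
print(json.dumps(response.json(), indent=2))
```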
# Simple chat
curl -X POST http://localhost:30080/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:1b",
"prompt": "Explain edge computing",
"stream": false
}'
# Custom edge AI assistant
curl -X POST http://localhost:30080/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "wronai_edge-assistant",
"prompt": "How do I monitor Kubernetes pods?",
"stream": false
}'
# Run comprehensive AI functionality test
./scripts/deploy.sh demo
# Test individual components
./scripts/deploy.sh test
Example output:
# Test individual components
./scripts/deploy.sh test
[ERROR] 19:27:54 Unknown command: demo
[INFO] 19:27:54 Run './scripts/deploy.sh help' for usage information
[STEP] 19:27:54 🔍 Testing deployed services...
[INFO] 19:27:54 Testing service endpoints...
[ERROR] 19:27:54 ❌ AI Gateway: FAILED
[WARN] 19:27:54 ⚠️ Ollama: Not ready (may still be starting)
[WARN] 19:27:54 ⚠️ ONNX Runtime: Not ready
[INFO] 19:27:54 ✅ Prometheus: OK
[INFO] 19:27:54 ✅ Grafana: OK
[INFO] 19:27:54 Testing AI functionality...
[WARN] 19:27:54 ⚠️ AI Generation: Model may still be downloading
[WARN] 19:27:54 ⚠️ Some services need more time to start
Run a diagnosis to check your system:
./scripts/deploy.sh diagnose
Example output:
...
- context:
cluster: kind-wronai_edge
user: kind-wronai_edge
[STEP] 19:32:14 🔍 Testing service connectivity...
//localhost:30080/health:AI Gateway: ❌ NOT RESPONDING
//localhost:30090/-/healthy:Prometheus: ❌ NOT RESPONDING
//localhost:30030/api/health:Grafana: ❌ NOT RESPONDING
//localhost:11435/api/tags:Ollama Direct: ❌ NOT RESPONDING
//localhost:8001/v1/models:ONNX Direct: ❌ NOT RESPONDING
[STEP] 19:32:14 🔍 Diagnosis complete!
Fix and deploy the services:
./scripts/deploy.sh fix
Test the services after deployment:
./scripts/deploy.sh test
- URL: http://localhost:30030
- Login: admin/admin
- Features:
- Real-time AI inference metrics
- Resource utilization monitoring
- Request latency distribution
- Error rate tracking
- Pod health status
- URL: http://localhost:30090
- Key Metrics:
  - `http_requests_total` - Request counters
  - `http_request_duration_seconds` - Latency histograms
  - `container_memory_usage_bytes` - Memory consumption
  - `container_cpu_usage_seconds_total` - CPU utilization
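These metrics can also be pulled programmatically through Prometheus's standard HTTP query API. A minimal sketch, assuming Prometheus is reachable on the port listed above:

```python
import requests

# Instant query: total request rate over the last 5 minutes.
resp = requests.get(
    "http://localhost:30090/api/v1/query",
    params={"query": "sum(rate(http_requests_total[5m]))"},
    timeout=5,
)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    print(sample["metric"], sample["value"])
```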
# Comprehensive health check
./scripts/deploy.sh health
# Check specific components
kubectl get pods -A
kubectl top nodes
kubectl top pods -A
# Check deployment status
./scripts/deploy.sh info
# View live logs
kubectl logs -f deployment/ollama-llm -n ai-inference
kubectl logs -f deployment/onnx-inference -n ai-inference
# Scale AI services
kubectl scale deployment onnx-inference --replicas=3 -n ai-inference
# Update configurations
kubectl apply -f k8s/ai-platform.yaml
1. Disk Space Issues. If the deployment fails with eviction errors or the cluster won't start:
# Check disk space
df -h
# Clean up Docker system
docker system prune -a -f --volumes
# Remove unused containers, networks, and images
docker container prune -f
docker image prune -a -f
docker network prune -f
docker volume prune -f
# Clean up old logs and temporary files
sudo journalctl --vacuum-time=3d
sudo find /var/log -type f -name "*.gz" -delete
sudo find /var/log -type f -name "*.1" -delete2. Debugging K3s Cluster
# Check K3s server logs
docker logs k3s-server
# Check cluster status
docker exec k3s-server kubectl get nodes
docker exec k3s-server kubectl get pods -A
3. Port Conflicts. If you see port binding errors, check and free up required ports (80, 443, 6443, 30030, 30090, 30080):
# Check port usage
sudo lsof -i :8080 # Replace with your port number
4. Debugging Pods
# Debug pod issues
kubectl describe pod <pod-name> -n ai-inference
# Check resource usage
kubectl top pods -n ai-inference --sort-by=memory
# View events
kubectl get events -n ai-inference --sort-by='.lastTimestamp'
# Restart services
kubectl rollout restart deployment/ollama-llm -n ai-inference
5. Reset Everything. If you need to start fresh:
# Clean up all resources
./scripts/deploy.sh cleanup
# Remove all Docker resources
docker system prune -a --volumes --force
# Remove K3s data
sudo rm -rf terraform/kubeconfig/*
sudo rm -rf terraform/k3s-data/*
sudo rm -rf terraform/registry-data/*
# Complete cleanup
./scripts/deploy.sh cleanup
# Partial cleanup (keep infrastructure)
kubectl delete -f k8s/monitoring.yaml
kubectl delete -f k8s/ai-platform.yaml
wronai_edge-portfolio/
├── terraform/
│ └── main.tf # Complete infrastructure as code
├── k8s/
│ ├── ai-platform.yaml # AI workloads (ONNX + Ollama + Gateway)
│ └── monitoring.yaml # Observability stack (Prometheus + Grafana)
├── configs/
│ └── Modelfile # Custom LLM configuration
├── scripts/
│ └── deploy.sh # Automation script (8 commands)
└── README.md # This documentation
Total Files: 6 core files + documentation = Minimal complexity, maximum demonstration
- ✅ Infrastructure as Code - Pure Terraform configuration
- ✅ Container Orchestration - Kubernetes/K3s with proper manifests
- ✅ Declarative Automation - YAML-driven deployments
- ✅ Monitoring & Observability - Production-ready metrics
- ✅ Security Best Practices - RBAC, network policies, resource limits
- ✅ Scalability Patterns - HPA, resource management
- ✅ GitOps Ready - Declarative configuration management
- ✅ Model Serving - ONNX Runtime for optimized inference
- ✅ LLM Deployment - Ollama with custom model configuration
- ✅ Edge Computing - Resource-constrained deployment patterns
- ✅ Load Balancing - Intelligent traffic routing for AI services
- ✅ Performance Monitoring - AI-specific metrics and alerting
- ✅ Microservices Architecture - Service mesh ready
- ✅ Cloud Native - CNCF-aligned tools and patterns
- ✅ Edge Computing - Lightweight, distributed deployments
- ✅ Observability - Three pillars (metrics, logs, traces)
- ✅ Automation - Zero-touch deployment and operations
# Add new ONNX model
kubectl create configmap wronai --from-file=model.onnx -n ai-inference
# Update deployment to mount the model
# Create custom Ollama model
kubectl exec -n ai-inference deployment/ollama-llm -- \
ollama create my-custom-model -f /path/to/Modelfile
# Multi-node cluster
# Update terraform/main.tf to add worker nodes
# Persistent storage
# Add PVC configurations for model storage
# External load balancer
# Configure LoadBalancer service type
# TLS termination
# Add cert-manager and ingress controller
# Add custom metrics
# Extend Prometheus configuration
# Custom dashboards
# Add Grafana dashboard JSON files
# Alerting rules
# Configure AlertManager for notifications
- Total Memory: ~4GB (K3s + AI services + monitoring)
- CPU Usage: ~2 cores (under load)
- Storage: ~2GB (container images + models)
- Network: Minimal (edge-optimized)
- Deployment Time: 3-5 minutes (cold start)
- AI Response Time: <2s (LLM inference)
- Monitoring Latency: <100ms (metrics collection)
- Scaling Time: <30s (pod autoscaling)
- Model Quantization: 4x memory reduction with ONNX INT8
- Caching: Redis for frequently accessed inference results
- Batching: Group inference requests for better throughput
- GPU Acceleration: CUDA/ROCm support for faster inference
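As one concrete example of the quantization point above, ONNX Runtime ships dynamic INT8 quantization. A minimal sketch with placeholder paths, assuming onnxruntime is installed; actual memory savings depend on the model:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Rewrite FP32 weights as INT8; activations are quantized dynamically at runtime.
quantize_dynamic(
    model_input="models/complex-cnn-model.onnx",        # placeholder path
    model_output="models/complex-cnn-model.int8.onnx",  # placeholder path
    weight_type=QuantType.QInt8,
)
print("Wrote INT8-quantized model")
```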
- Practical Skills: Real-world DevOps patterns, not toy examples
- Modern Stack: Current best practices and CNCF-aligned tools
- AI Integration: Demonstrates understanding of ML deployment challenges
- Production Ready: Monitoring, scaling, security considerations
- Time Efficient: Complete demo in under 5 minutes
- Minimal Complexity: 6 core files, maximum clarity
- Declarative Approach: Infrastructure and workloads as code
- Extensible Architecture: Easy to add features and scale
- Edge Optimized: Real-world resource constraints considered
- Documentation: Clear instructions and troubleshooting guides
- Fast Deployment: Rapid prototyping and development cycles
- Cost Effective: Efficient resource utilization
- Scalable Design: Grows from demo to production
- Risk Mitigation: Proven patterns and reliable automation
- Innovation Ready: Foundation for AI/ML initiatives
Tom Sapletta - DevOps Engineer & AI Integration Specialist
- 🔧 15+ years enterprise DevOps experience
- 🤖 AI/LLM deployment expertise with edge computing focus
- 🏗️ Infrastructure as Code advocate and practitioner
- 📊 Monitoring & Observability specialist
- 🚀 Kubernetes & Cloud Native architect
Current Focus: Telemonit - Edge AI power supply systems with integrated LLM capabilities
This project demonstrates practical DevOps skills through minimal, production-ready code that showcases Infrastructure as Code, AI integration, and modern container orchestration patterns. Perfect for demonstrating technical competency to potential employers in the DevOps and AI engineering space.
This project is open source and available under the Apache License.
🎯 Ready to deploy? Run ./scripts/deploy.sh and see it in action!