A production-grade platform for monitoring, validating, and maintaining ML models in real-world environments with comprehensive observability and alerting.
- Model Observability: Comprehensive metrics collection for model performance monitoring
- Data Validation: Schema validation and data quality checks for production traffic
- Drift Detection: Statistical methods to detect and quantify data drift in real-time
- Model Registry: Version management and lifecycle tracking for ML models
- Performance Visualization: Pre-built Grafana dashboards for real-time monitoring
- Alerting: Configurable alerts for model degradation and data quality issues
- Testing Framework: End-to-end and integration test suites to validate platform functionality
This platform implements a layered architecture (a minimal code sketch follows the list below):
- API Layer: FastAPI for model serving and monitoring endpoints with built-in observability
- Validation Layer: Automated schema and data drift checks for data quality assurance
- Monitoring Layer: Prometheus for metrics collection and storage with custom metrics
- Model Registry: MLflow for versioning, metadata, and lifecycle management
- Visualization Layer: Grafana dashboards with performance visualization and alerting
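To make the layering concrete, here is a minimal sketch of a serving endpoint that ties the API, validation, and monitoring layers together. The endpoint shape, metric names, and model name are illustrative assumptions, not the platform's actual code.

```python
from fastapi import FastAPI, Response
from pydantic import BaseModel
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = FastAPI(title="model-serving")

# Custom metrics scraped by Prometheus (monitoring layer); names are illustrative
PREDICTIONS = Counter("predictions_total", "Total prediction requests", ["model"])
LATENCY = Histogram("prediction_latency_seconds", "Prediction latency in seconds", ["model"])

class PredictionRequest(BaseModel):
    # Schema enforcement at the edge (validation layer)
    features: dict
    request_id: str

@app.post("/predict")
def predict(request: PredictionRequest):
    with LATENCY.labels(model="fraud_detection").time():
        PREDICTIONS.labels(model="fraud_detection").inc()
        # Placeholder score; a real deployment would call the registered model here
        return {"request_id": request.request_id, "score": 0.5}

@app.get("/metrics")
def metrics():
    # Exposed in Prometheus text format for scraping
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```

Prometheus scrapes the metrics endpoint, and the Grafana dashboards visualize the resulting series.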
The architecture follows best practices for observability with separation of concerns and modular design:
```
mlops-observability/
├── README.md
├── architecture.png
├── docker-compose.yml
├── src/
│   ├── api/
│   ├── monitoring/
│   ├── data_validation/
│   ├── model_registry/
│   └── dashboard/
└── tests/
```
The Model Registry component provides a centralized repository for model versioning, metadata tracking, and lifecycle management. Key features include:
- Model versioning and storage
- Model metadata and lineage tracking
- Performance metrics comparison between versions
- Stage transitions (development → staging → production); see the MLflow sketch below
- Integration with monitoring systems for observability
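Because the registry is backed by MLflow, stage transitions can also be driven through MLflow's own client. The sketch below is illustrative only (the model name and version are examples), not the platform's wrapper:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow-server:5000")

# Promote an example model version to production
client.transition_model_version_stage(
    name="fraud_detection",
    version="2",
    stage="Production"
)
```

The platform's own ModelRegistry client is used as follows: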
```python
from src.model_registry.client import ModelRegistry
from src.model_registry.version import compare_model_versions

# Initialize registry client
registry = ModelRegistry(tracking_uri="http://mlflow-server:5000")

# Register a new model
model_uri = registry.register_model(
    model_path="s3://models/model.pkl",
    name="fraud_detection",
    tags={"algorithm": "xgboost", "owner": "data-science-team"}
)

# Compare different model versions
comparison = compare_model_versions(
    registry,
    model_name="fraud_detection",
    version1="1",
    version2="2",
    metric="auc"
)
```
Prerequisites:
- Docker and Docker Compose
- Python 3.8+
- Git
Quick start:
- Clone the repo:
  ```bash
  git clone https://github.com/hmm29/mlops-observability.git
  cd mlops-observability
  ```
- Start the monitoring stack:
  ```bash
  docker-compose -f docker/docker-compose-grafana.yml up -d
  ```
- Access services:
  - API: http://localhost:8000
  - Grafana: http://localhost:3000 (login: admin/admin)
  - Prometheus: http://localhost:9090
For detailed setup instructions, see SETUP.md.
Key capabilities:
- Comprehensive Metrics Collection:
  - Real-time performance metrics tracking
  - Request volume and latency monitoring
  - Error tracking and categorization
- Data Quality Monitoring:
  - Feature drift detection using KS tests (see the sketch after this list)
  - Input validation and schema enforcement
  - Automated anomaly detection
- Visualization & Alerting:
  - Interactive Grafana dashboards with performance panels
  - Real-time alerting for drift detection
  - Customizable notification channels
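The KS-based drift check amounts to comparing each feature's current distribution against a reference sample. Here is a minimal sketch of that idea using scipy; the helper name and the numeric-column handling are assumptions, while the platform's own DriftDetector is shown in the usage examples later on.

```python
import pandas as pd
from scipy.stats import ks_2samp

def flag_drifted_features(reference: pd.DataFrame, current: pd.DataFrame,
                          threshold: float = 0.05) -> list:
    """Return numeric features whose distributions differ significantly."""
    flagged = []
    for column in reference.select_dtypes(include="number").columns:
        statistic, p_value = ks_2samp(reference[column], current[column])
        # A p-value below the threshold suggests the feature has drifted
        if p_value < threshold:
            flagged.append(column)
    return flagged
```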
✅ Completed Components:
- Model Registry implementation
- Data Validation (Drift Detection and Schema Validation)
- Unit tests for model registry
- Metrics Collection Service
  - Prometheus metrics collector
  - Model performance metrics
  - System health monitoring
- API Endpoints
  - FastAPI model serving
  - Prediction monitoring
  - Swagger UI documentation
- Dashboard implementation
  - Grafana dashboards for model monitoring
  - Performance visualization panels
  - Data drift monitoring
- Alerting configuration
  - Feature drift alerts
  - High error rate detection
  - Latency threshold monitoring
- Comprehensive Testing
  - Integration tests
  - End-to-end tests
  - Unit tests for model registry
- Documentation
  - API documentation
  - Setup instructions
  - Deployment guides
Planned:
- Extended Monitoring Features:
  - A/B testing support
  - Multi-model comparison dashboards
  - Custom user-defined metrics
- Enhanced Testing:
  - Load testing framework
  - Chaos engineering tests
  - Continuous integration pipelines
- Advanced Features:
  - Automated model retraining triggers
  - Custom notification channels
  - Advanced data quality monitoring
Our integration tests validate component interactions (an illustrative sketch follows the list below):
```bash
# Run integration tests
pytest tests/test_integration.py
```
These tests verify:
- API prediction endpoints
- Input validation mechanisms
- Metrics collection
- Drift detection algorithms
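As an illustration only, a check against a locally running stack might look like the sketch below; the real suite lives in tests/test_integration.py, and the /metrics route is an assumption.

```python
import requests

API_URL = "http://localhost:8000"

def test_predict_endpoint_returns_json():
    payload = {
        "features": {"feature1": 0.5, "feature2": 1.0, "feature3": "category_a"},
        "request_id": "integration-test-1"
    }
    response = requests.post(f"{API_URL}/predict", json=payload, timeout=5)
    assert response.status_code == 200
    assert isinstance(response.json(), dict)

def test_metrics_endpoint_is_scrapable():
    # Assumes Prometheus-format metrics are exposed at /metrics
    response = requests.get(f"{API_URL}/metrics", timeout=5)
    assert response.status_code == 200
```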
Complete system validation is done through end-to-end tests (a reachability sketch follows the list below):
```bash
# Run E2E tests
pytest tests/test_e2e.py
```
These tests confirm:
- Full prediction workflow from request to visualization
- Data drift detection and alerting
- Grafana dashboard accessibility
- System resilience under various conditions
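A reachability check across the stack could look like the following sketch, assuming the default ports from the quick start; it is not the project's actual test code.

```python
import requests

SERVICES = {
    "api": "http://localhost:8000/docs",            # Swagger UI
    "grafana": "http://localhost:3000/api/health",  # Grafana health endpoint
    "prometheus": "http://localhost:9090/-/healthy" # Prometheus health endpoint
}

def test_stack_is_reachable():
    for name, url in SERVICES.items():
        response = requests.get(url, timeout=5)
        assert response.status_code == 200, f"{name} is not healthy"
```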
Detailed testing documentation is available in tests/README.md.
Usage example: sending a prediction request to the API.
```python
import requests
import json

# Send prediction request to the API
url = "http://localhost:8000/predict"
payload = {
    "features": {
        "feature1": 0.5,
        "feature2": 1.0,
        "feature3": "category_a"
    },
    "request_id": "test-123"
}

response = requests.post(url, json=payload)
prediction = response.json()
print(prediction)
```
Usage example: tracking model metrics and checking for data drift.
```python
from src.monitoring.metrics import MLMetricsCollector
from src.data_validation.drift import DriftDetector
import pandas as pd
# Initialize metrics collector
metrics = MLMetricsCollector(model_name="fraud_detection", version="1.0")
# Track model performance
with metrics.track_predictions():
    predictions = model.predict(features)
# Check for data drift
detector = DriftDetector(reference_data=training_data)
results = detector.detect_drift(current_data, threshold=0.05)
if results["drift_detected"]:
print(f"Drift detected in features: {results['flagged_features']}")from src.model_registry.client import ModelRegistry
registry = ModelRegistry(tracking_uri="http://mlflow-server:5000")
model_uri = registry.register_model(
model_path="path/to/model.pkl",
name="fraud_detection",
tags={"algorithm": "xgboost", "version": "1.0.0"}
)mlops-observability/
├── docker/                  # Docker configuration files
│   ├── prometheus/          # Prometheus configuration
│   └── grafana/             # Grafana provisioning
├── grafana/                 # Grafana dashboards and alerts
│   ├── dashboards/          # Dashboard templates
│   └── alerts/              # Alert configurations
├── src/                     # Source code
│   ├── api/                 # FastAPI service
│   ├── monitoring/          # Metrics collection
│   ├── data_validation/     # Schema and drift detection
│   └── model_registry/      # Model versioning
└── tests/                   # Test suite
    ├── model_registry/      # Unit tests
    ├── test_integration.py  # Integration tests
    └── test_e2e.py          # End-to-end tests
```
Contributions are welcome:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Implement your changes
- Add tests for your implementation
- Update documentation as needed
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
The MLOps Observability Platform provides a comprehensive solution for monitoring ML models in production environments. With real-time performance tracking, data quality monitoring, and automated alerting, it helps teams maintain reliable AI systems at scale.
Key benefits:
- Early detection of model degradation
- Proactive data quality management
- Streamlined MLOps workflows
- Enhanced model reliability and user trust
For detailed setup and usage instructions, refer to SETUP.md.
This project is licensed under the MIT License - see the LICENSE file for details.
