An intelligent Kubernetes scheduler that combines a Go backend with Python AI components to make advanced pod placement decisions using machine learning and historical data analysis.
This project implements an AI-enhanced Kubernetes scheduler that goes beyond traditional resource-based scheduling by incorporating:
- Machine Learning Predictions: Random Forest model for node selection
- Historical Data Analysis: 7-day pod metrics cache for stability scoring
- Online Learning: Continuous model improvement through feedback
- Advanced Feature Engineering: 13 different features for comprehensive analysis
- Real-time Metrics: Kubernetes Metrics API integration
### Go Backend (Port 8080)
- Kubernetes client integration
- Metrics collection and caching
- Node scoring algorithms
- REST API endpoints

### Python AI (Port 5000)
- Machine learning model (Random Forest)
- Feature engineering and data processing
- Online learning with feedback loop
- Prediction API

### Docker Compose
- Containerized deployment
- Health checks and monitoring
- Volume persistence for models and data
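The two-service layout above might be wired together roughly like this (a hedged sketch, not the project's actual `docker-compose.yml`; the `ai-models` volume name is an assumption, while the `go-backend` and `python-ai` service names match those used elsewhere in this README):

```yaml
# Hypothetical sketch of the two-service deployment
services:
  go-backend:
    build: ./go
    ports: ["8080:8080"]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
    depends_on:
      - python-ai
  python-ai:
    build: ./python
    ports: ["5000:5000"]
    volumes:
      - ai-models:/app/models   # persist trained models across restarts
volumes:
  ai-models:
```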
Kubernetes Cluster → Metrics API → Go Backend → PodMetricsCache
- Collects real CPU/Memory usage from Kubernetes Metrics API
- Caches pod metrics for 7 days with analysis
- Tracks pod restart rates, failure rates, and stability scores
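The stability-scoring idea can be sketched as follows (a simplified illustration; the class name matches the README's `PodMetricsCache`, but the fields and penalty formula are assumptions, not the project's actual code):

```python
import time

SEVEN_DAYS = 7 * 24 * 3600

class PodMetricsCache:
    """Keeps per-pod samples for 7 days and derives a stability score."""

    def __init__(self):
        self.samples = {}  # pod_key -> list of (timestamp, restarts, failed)

    def record(self, pod_key, restarts, failed):
        now = time.time()
        history = self.samples.setdefault(pod_key, [])
        history.append((now, restarts, failed))
        # Evict samples older than the 7-day window
        self.samples[pod_key] = [s for s in history if now - s[0] < SEVEN_DAYS]

    def stability_score(self, pod_key):
        history = self.samples.get(pod_key, [])
        if not history:
            return 1.0  # no evidence of trouble
        failures = sum(1 for _, _, failed in history if failed)
        restarts = history[-1][1] - history[0][1]
        # Penalize failure rate and restart churn; clamp to [0, 1]
        penalty = failures / len(history) + min(restarts / 10.0, 1.0) * 0.5
        return max(0.0, 1.0 - penalty)
```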
PodMetricsCache → DataProcessor → Feature Extraction → ML Model
- Extracts 13 different features:
  - Pod requirements (CPU/Memory requests)
  - Node usage (CPU/Memory utilization)
  - Cluster state (total nodes, ready nodes, averages)
  - Historical data (stability scores, failure rates)
  - Resource pressure and health scores
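The grouping above suggests a flat 13-element feature vector; a hypothetical sketch (the field names are assumptions chosen to mirror the README's terminology):

```python
def build_feature_vector(pod, node, cluster, history):
    """Flatten pod, node, cluster, and historical signals into 13 features."""
    return [
        pod["cpu_request"],          # cores requested by the pod
        pod["memory_request"],       # MiB requested by the pod
        node["cpu_usage"],           # current node CPU utilization (%)
        node["memory_usage"],        # current node memory utilization (%)
        node["ready"],               # 1.0 if the node is Ready, else 0.0
        cluster["total_nodes"],
        cluster["ready_nodes"],
        cluster["avg_cpu_usage"],
        cluster["avg_memory_usage"],
        cluster["resource_pressure"],
        cluster["health_score"],
        history["stability_score"],  # from the 7-day pod metrics cache
        history["failure_rate"],
    ]
```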
ML Model → Prediction → Confidence Score → Node Selection
- Random Forest model trained on historical data
- Provides confidence scores and feature importance
- Fallback to enhanced scoring if ML model unavailable
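One way a Random Forest can yield both a node choice and a confidence score, with a resource-based fallback, is sketched below (an illustrative assumption, not the project's implementation; the feature indices and the "free headroom" fallback are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pick_node(model, node_names, feature_rows, confidence_floor=0.5):
    """Choose a node via the RF model; fall back to resource scoring
    when no model is available or confidence is too low."""
    if model is not None:
        X = np.asarray(feature_rows)
        # Probability that each candidate row is a "good placement"
        proba = model.predict_proba(X)[:, 1]
        best = int(np.argmax(proba))
        if proba[best] >= confidence_floor:
            return node_names[best], float(proba[best]), "ml_prediction"
    # Fallback: prefer the node with the most free CPU+memory headroom
    # (indices 2 and 3 are assumed to be node CPU/memory usage %)
    scores = [200.0 - row[2] - row[3] for row in feature_rows]
    best = int(np.argmax(scores))
    return node_names[best], scores[best] / 200.0, "fallback_scoring"
```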
Prediction → Feedback Collection → Performance Tracking → Model Updates
- Collects feedback on prediction accuracy
- Tracks daily performance metrics
- Updates model when performance degrades
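The feedback loop above can be sketched as a running accuracy check that triggers retraining (the thresholds mirror the sample configuration's `feedback_threshold` and `performance_threshold`; the class itself is a hypothetical illustration):

```python
class FeedbackLoop:
    """Accumulates prediction outcomes and signals when the model
    should be retrained."""

    def __init__(self, feedback_threshold=10, performance_threshold=0.8):
        self.feedback_threshold = feedback_threshold
        self.performance_threshold = performance_threshold
        self.outcomes = []  # list of bools: was the prediction correct?

    def record(self, predicted_node, actual_node, success):
        # A prediction counts as correct only if the pod landed on the
        # predicted node and ran successfully
        self.outcomes.append(predicted_node == actual_node and success)

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def needs_retrain(self):
        return (len(self.outcomes) >= self.feedback_threshold
                and self.accuracy() < self.performance_threshold)
```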
### Traditional Scheduling Checks
- Resource availability check
- Node taints/tolerations
- Pod/node affinity rules
- Simple scoring based on available resources

### AI-Enhanced Scheduling
- 13 Feature Analysis: Comprehensive node evaluation
- Historical Stability: 7-day pod metrics analysis
- ML Predictions: Random Forest model with confidence scores
- Online Learning: Continuous improvement through feedback
- Resource Pressure: Advanced cluster health analysis
- Failure Rate Prediction: Historical pod analysis
- Docker and Docker Compose
- `curl` and `jq` for testing
```bash
# Clone the repository
git clone <repository-url>
cd ai-scheduler

# Start all services
docker-compose up -d

# Check service status
docker-compose ps
```

```bash
# Make test script executable
chmod +x scripts/test_system.sh

# Run comprehensive tests
./scripts/test_system.sh
```

```bash
# Make demo script executable
chmod +x scripts/demo.sh

# Run interactive demo
./scripts/demo.sh
```

```bash
# Health check
curl http://localhost:8080/health

# Get node metrics
curl http://localhost:8080/api/v1/metrics | jq
```

```bash
# Health check
curl http://localhost:5000/health

# Get model info
curl http://localhost:5000/model/info | jq
```
```bash
# Make a prediction
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "pod_name": "test-pod",
    "pod_namespace": "default",
    "pod_spec": {
      "containers": [{
        "name": "app",
        "resources": {
          "requests": {
            "cpu": "500m",
            "memory": "512Mi"
          }
        }
      }]
    }
  }' | jq
```
```bash
# Submit feedback
curl -X POST http://localhost:5000/feedback \
  -H "Content-Type: application/json" \
  -d '{
    "prediction_result": {
      "predicted_node": "minikube",
      "confidence": 0.95,
      "algorithm": "ml_prediction"
    },
    "actual_node": "minikube",
    "success": true,
    "pod_status": "Running"
  }' | jq
```

Example system status:

```json
{
  "go_backend": "healthy",
  "python_ai": "healthy",
  "accuracy": "100%",
  "total_predictions": 5,
  "successful_predictions": 5
}
```

Example prediction response:

```json
{
  "predicted_node": "minikube",
  "confidence": 1.0,
  "algorithm": "ml_prediction",
  "ai_features": {
    "pod_requirements": {"cpu_request": 0.5, "memory_request": 512.0},
    "cluster_state": {
      "avg_cpu_usage": 45.2,
      "avg_memory_usage": 62.8,
      "health_score": 100.0,
      "resource_pressure": 54.0
    }
  },
  "node_predictions": [
    {
      "node_name": "minikube",
      "resource_score": 0.4776,
      "readiness_score": 1.0,
      "stability_score": 1.0,
      "ml_confidence": 1.0
    }
  ]
}
```

Go backend configuration:

```yaml
server:
  host: "0.0.0.0"
  port: 8080
  read_timeout: 30s
  write_timeout: 30s

kubernetes:
  in_cluster: false
  kubeconfig_path: "~/.kube/config"

metrics:
  collection_interval: 30s
  cache_duration: 168h  # 7 days

scheduler:
  ai_api_url: "http://python-ai:5000"

scoring:
  cpu_weight: 30.0
  memory_weight: 30.0
  node_ready_weight: 20.0
  taint_weight: 10.0
  failed_pods_weight: 5.0
  restart_weight: 5.0
```

Python AI configuration:

```yaml
server:
  host: "0.0.0.0"
  port: 5000

model:
  type: "random_forest"
  max_depth: 10
  n_estimators: 100
  random_state: 42

online_learning:
  feedback_threshold: 10
  performance_threshold: 0.8
  update_interval: 24h

data:
  cache_duration: 168h  # 7 days
  feature_count: 13
```

- Accuracy: 100% (5/5 successful predictions)
- Response Time: < 100ms for predictions
- Model Confidence: 1.0 (high confidence predictions)
- Feature Importance: 13 features analyzed
- Online Learning: Active feedback collection
- Stability Scoring: Based on 7-day pod history
- Resource Pressure: Cluster-wide health analysis
- Failure Rate Prediction: Historical pod analysis
- ML Model Performance: Continuous monitoring and updates
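The weights in the Go backend's `scoring` configuration suggest a weighted blend of per-signal scores; the formula below is an assumed interpretation of that section, not the project's exact code:

```python
WEIGHTS = {  # mirrors the sample Go config; the six weights sum to 100
    "cpu": 30.0, "memory": 30.0, "node_ready": 20.0,
    "taint": 10.0, "failed_pods": 5.0, "restart": 5.0,
}

def score_node(metrics):
    """Blend per-signal scores (each in [0, 1]) into a 0-100 node score."""
    return sum(WEIGHTS[k] * metrics[k] for k in WEIGHTS)

# Hypothetical node: plenty of headroom, healthy, slight restart churn
score = score_node({
    "cpu": 0.6,          # 60% CPU headroom
    "memory": 0.5,       # 50% memory headroom
    "node_ready": 1.0,   # node is Ready
    "taint": 1.0,        # no blocking taints
    "failed_pods": 1.0,  # no failed pods recently
    "restart": 0.9,      # low restart churn
})  # → 72.5
```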
```
ai-scheduler/
├── go/                      # Go backend
│   ├── cmd/main.go          # Entry point
│   ├── internal/            # Core logic
│   │   ├── api/             # HTTP routes
│   │   ├── collector/       # Metrics collection
│   │   ├── scheduler/       # AI scheduler logic
│   │   └── types/           # Data structures
│   └── config/              # Configuration
├── python/                  # Python AI
│   ├── api/app.py           # Flask API
│   ├── data/processor.py    # Data processing
│   ├── models/              # ML models
│   └── config/              # Configuration
├── scripts/                 # Test scripts
├── docker-compose.yml       # Container orchestration
└── README.md                # This file
```
```bash
# Build Go backend
cd go
go build -o main cmd/main.go

# Build Python AI
cd ../python
pip install -r requirements.txt
python api/app.py
```
**Port Conflicts**

```bash
# Check if ports are in use
lsof -i :8080
lsof -i :5000

# Kill processes if needed
sudo kill -9 <PID>
```

**Docker Container Issues**

```bash
# Check container logs
docker-compose logs go-backend
docker-compose logs python-ai

# Restart services
docker-compose restart
```

**Kubernetes Connection Issues**

```bash
# Start Minikube if needed
minikube start

# Check Kubernetes connection
kubectl get nodes
```
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Kubernetes client-go library
- Scikit-learn for ML models
- Flask for Python API
- Gin for Go API
- Docker for containerization
AI-Enhanced Kubernetes Scheduler - Making intelligent pod placement decisions with machine learning and historical data analysis.