Features • Installation • Architecture • Documentation • Contributing
- Overview
- Features
- Architecture
- Installation
- Quick Start
- Configuration
- API Reference
- Deployment
- Development
- Testing
- Contributing
- License
AgentOps is an autonomous, safety-first MLOps orchestrator that enables intelligent deployment of AI/ML models to Amazon SageMaker using natural language commands. The system leverages NVIDIA NIMs (NVIDIA Inference Microservices) for LLM reasoning and RAG (Retrieval-Augmented Generation), implementing a three-layer safety framework for production-grade autonomous operations.
- **Agentic AI Architecture**: Multi-agent system with Planner, Executor, Monitor, and Retriever agents
- **Safety-First Design**: Three-layer guardrails (Validation, Human-in-the-Loop, Immutable Audit)
- **Natural Language Interface**: Deploy models using simple commands like "deploy llama-3.1 8B for chatbot-x"
- **Autonomous Orchestration**: AI agents automatically generate deployment configurations with reasoning
- **Real-Time Dashboard**: React UI with live monitoring, workflow visualization, and agent logs
- **Enterprise-Grade**: Comprehensive audit logging, immutable trails, and production-ready infrastructure
- **Natural Language Deployment**
  - Convert human intent into structured deployment plans
  - Support for complex multi-step workflows
  - Intelligent command parsing and execution
- **Agentic AI System**
  - Planner Agent: Generates execution plans with chain-of-thought reasoning
  - Executor Agent: Handles deployment execution with the ReAct pattern
  - Monitor Agent: Tracks deployment health and triggers rollbacks
  - Retriever Agent: RAG-powered policy retrieval for grounded decisions
- **Agent Memory System**
  - Episodic memory for learning from past deployments
  - Semantic memory for pattern recognition
  - DynamoDB persistence with automatic TTL
- **Real-Time Dashboard**
  - Workflow Designer with ReactFlow visualization
  - Agent execution logs with reasoning chains
  - Deployment status tracking and metrics
  - Dark/Light mode support
- **Safety Layers**
  - Layer 1: Guardrails with schema, budget, and policy validation
  - Layer 2: Human-in-the-Loop approvals for production deployments
  - Layer 3: Immutable audit trail via DynamoDB + CloudTrail + S3 Object Lock
- **RAG-Powered Decision Making**
  - Two-stage retrieval (embedding + reranking) using the NeMo Retriever NIMs (see the sketch below)
  - Policy-grounded deployments
  - Context-aware planning
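To make the two-stage retrieval concrete, here is a minimal sketch of the embed-then-rerank flow against SageMaker-hosted NIM endpoints. It is illustrative only: the endpoint names match the configuration section below, but the request/response payload shapes are assumptions, not the actual `retriever_client.py` implementation.

```python
# Illustrative two-stage retrieval sketch (assumed payload shapes, not retriever_client.py).
import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

def invoke(endpoint_name: str, payload: dict) -> dict:
    """Invoke a SageMaker-hosted NIM endpoint with a JSON payload."""
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())

def retrieve_policies(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    # Stage 1: embed the query (hypothetical request shape).
    _query_embedding = invoke(
        "nemo-retriever-embed-endpoint",
        {"input": [query], "input_type": "query"},
    )
    # In a real system the embedding would be matched against a vector index of
    # policy documents to produce `candidates`; here they are simply passed in.

    # Stage 2: rerank the candidate passages against the query.
    reranked = invoke(
        "nemo-retriever-rerank-endpoint",
        {"query": {"text": query}, "passages": [{"text": c} for c in candidates]},
    )
    # Assumed response shape: a list of scored rankings; keep the top_k passages.
    rankings = reranked.get("rankings", [])[:top_k]
    return [candidates[r["index"]] for r in rankings]
```

The embedding stage narrows the policy corpus to candidates; the reranker then orders them so the Planner only sees the most relevant policies.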
graph TB
subgraph "Frontend Layer"
UI[React Dashboard<br/>Workflow Designer]
API_CLIENT[API Client<br/>React Query]
end
subgraph "API Gateway"
FASTAPI[FastAPI Orchestrator<br/>REST API]
end
subgraph "Agent Orchestrator"
ORCHESTRATOR[Agent Orchestrator<br/>Multi-Agent Coordination]
PLANNER[Planner Agent<br/>Chain-of-Thought]
EXECUTOR[Executor Agent<br/>ReAct Pattern]
MONITOR[Monitor Agent<br/>Health Tracking]
RETRIEVER[Retriever Agent<br/>RAG Engine]
end
subgraph "NVIDIA NIM Services"
LLM_NIM[llama-3.1-nemotron-nano-8B-v1<br/>LLM NIM]
EMBED_NIM[NeMo Retriever<br/>Embedding NIM]
RERANK_NIM[NeMo Retriever<br/>Reranking NIM]
end
subgraph "AWS Services"
SAGEMAKER[SageMaker<br/>Model Endpoints]
DYNAMODB[DynamoDB<br/>Plans & Audit Logs]
S3[S3 Bucket<br/>CloudTrail Logs]
CLOUDTRAIL[CloudTrail<br/>Immutable Audit]
end
subgraph "Safety Layers"
GUARDRAILS[Guardrail Service<br/>Validation]
APPROVAL[HITL Approval<br/>Queue]
AUDIT[Audit Logger<br/>Immutable Trail]
end
UI --> API_CLIENT
API_CLIENT --> FASTAPI
FASTAPI --> ORCHESTRATOR
ORCHESTRATOR --> PLANNER
ORCHESTRATOR --> EXECUTOR
ORCHESTRATOR --> MONITOR
ORCHESTRATOR --> RETRIEVER
PLANNER --> LLM_NIM
RETRIEVER --> EMBED_NIM
RETRIEVER --> RERANK_NIM
FASTAPI --> GUARDRAILS
FASTAPI --> APPROVAL
FASTAPI --> AUDIT
EXECUTOR --> SAGEMAKER
AUDIT --> DYNAMODB
AUDIT --> CLOUDTRAIL
CLOUDTRAIL --> S3
GUARDRAILS -.-> SAGEMAKER
MONITOR -.-> SAGEMAKER
style UI fill:#61DAFB
style FASTAPI fill:#005571
style LLM_NIM fill:#76B900
style SAGEMAKER fill:#FF9900
style DYNAMODB fill:#232F3E
style GUARDRAILS fill:#FF6B6B
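The Planner Agent in the diagram above drives plan generation through the LLM NIM. The following is a rough, hedged sketch of that call: the prompt, the OpenAI-style chat payload, and the response parsing are assumptions rather than the project's actual `llm_client.py`.

```python
# Illustrative Planner-style LLM call (assumed payload shape, not llm_client.py).
import json
import os
import boto3

runtime = boto3.client("sagemaker-runtime", region_name=os.getenv("AWS_REGION", "us-east-1"))

PLANNING_PROMPT = (
    "You are a deployment planner. Produce a JSON execution plan with steps, "
    "instance types, and an estimated hourly cost for the following request:\n{intent}"
)

def plan_deployment(intent: str) -> dict:
    """Ask the LLM NIM for a structured execution plan."""
    body = {
        # Many NIM containers expose an OpenAI-compatible chat schema; this is an assumption.
        "messages": [{"role": "user", "content": PLANNING_PROMPT.format(intent=intent)}],
        "max_tokens": 1024,
        "temperature": 0.2,
    }
    response = runtime.invoke_endpoint(
        EndpointName=os.getenv("LLM_ENDPOINT", "llama-3.1-nemotron-nano-8b-v1-endpoint"),
        ContentType="application/json",
        Body=json.dumps(body),
    )
    completion = json.loads(response["Body"].read())
    # Assumed response shape: OpenAI-style choices with a JSON plan in the message content.
    return json.loads(completion["choices"][0]["message"]["content"])
```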
sequenceDiagram
participant User
participant Frontend
participant Orchestrator
participant Retriever as Retriever Agent
participant Planner as Planner Agent
participant Executor as Executor Agent
participant Guardrails
participant Monitor as Monitor Agent
participant SageMaker
User->>Frontend: Submit Command<br/>"deploy llama-3.1 8B"
Frontend->>Orchestrator: POST /api/agent/command
Orchestrator->>Retriever: Retrieve Policies
Retriever->>Retriever: Embed Query
Retriever->>Retriever: Rerank Results
Retriever-->>Orchestrator: RAG Evidence
Orchestrator->>Planner: Generate Execution Plan
Planner->>Planner: Chain-of-Thought Reasoning
Planner->>Planner: Create Task Steps
Planner-->>Orchestrator: Execution Plan
Orchestrator->>Guardrails: Validate Plan
Guardrails->>Guardrails: Schema Validation
Guardrails->>Guardrails: Budget Check
Guardrails->>Guardrails: Policy Validation
Guardrails-->>Orchestrator: Validation Result
alt Validation Passed
alt Requires Approval
Orchestrator->>Frontend: Pending Approval
Frontend->>User: Show Approval UI
User->>Frontend: Approve/Reject
Frontend->>Orchestrator: POST /approve
end
Orchestrator->>Executor: Execute Deployment
Executor->>SageMaker: Create Model
Executor->>SageMaker: Create Endpoint Config
Executor->>SageMaker: Create Endpoint
SageMaker-->>Executor: Endpoint Created
Executor->>Monitor: Configure Monitoring
Monitor->>SageMaker: Setup CloudWatch Alarms
Executor-->>Orchestrator: Deployment Result
Orchestrator-->>Frontend: Success Response
Frontend-->>User: Show Status
else Validation Failed
Guardrails-->>Orchestrator: Validation Errors
Orchestrator-->>Frontend: Error Response
Frontend-->>User: Show Errors
end
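The Executor's three SageMaker calls in the sequence above map onto standard boto3 operations. A minimal sketch, assuming the container image, model artifact, and execution role are already resolved (all names are placeholders, not the project's `sage_tool.py`):

```python
# Illustrative Executor-style deployment (placeholder names, not sage_tool.py).
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")

def deploy_endpoint(name: str, image_uri: str, model_data_url: str, role_arn: str,
                    instance_type: str = "ml.m5.large", instance_count: int = 1) -> None:
    # Step 1: register the model artifact and serving container.
    sagemaker.create_model(
        ModelName=f"{name}-model",
        PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data_url},
        ExecutionRoleArn=role_arn,
    )
    # Step 2: describe how the model should be hosted.
    sagemaker.create_endpoint_config(
        EndpointConfigName=f"{name}-config",
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": f"{name}-model",
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
        }],
    )
    # Step 3: create the endpoint; SageMaker provisions it asynchronously.
    sagemaker.create_endpoint(EndpointName=name, EndpointConfigName=f"{name}-config")
    # The Monitor Agent would then watch DescribeEndpoint / CloudWatch until InService.
```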
graph LR
subgraph "Backend Services"
A[FastAPI Orchestrator] --> B[Agent Orchestrator]
B --> C[Planner Agent]
B --> D[Executor Agent]
B --> E[Monitor Agent]
B --> F[Retriever Agent]
end
subgraph "AI Services"
C --> G[LLM NIM]
F --> H[Embedding NIM]
F --> I[Reranking NIM]
end
subgraph "Data Layer"
A --> J[Plans Storage<br/>DynamoDB]
A --> K[Agent Memory<br/>DynamoDB]
A --> L[Audit Logger<br/>DynamoDB]
end
subgraph "Execution"
D --> M[SageMaker Tool]
M --> N[SageMaker<br/>Endpoints]
end
subgraph "Safety"
A --> O[Guardrail Service]
A --> P[Approval Queue]
L --> Q[CloudTrail]
end
style A fill:#005571
style G fill:#76B900
style N fill:#FF9900
style O fill:#FF6B6B
- Python 3.11+ - Download Python
- Node.js 18+ - Download Node.js
- AWS Account with appropriate permissions
- AWS CLI configured - Install AWS CLI
- Git - Download Git
git clone https://github.com/ashutosh0x/AgentOps-AWS.git
cd AgentOps-AWS

# Linux/macOS
bash scripts/setup_dev.sh

# Windows PowerShell
.\scripts\setup_dev.sh

# Create virtual environment
python -m venv venv
# Activate virtual environment
# Linux/macOS:
source venv/bin/activate
# Windows:
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

cd frontend
# Install dependencies
npm install
# For development
npm run dev
# For production build
npm run build

Create a .env file in the root directory:

cp .env.example .env  # If .env.example exists

Edit .env with your configuration:
# AWS Configuration (Required)
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
# SageMaker Endpoints (Required - Deploy via SageMaker JumpStart)
LLM_ENDPOINT=llama-3.1-nemotron-nano-8b-v1-endpoint
RETRIEVER_EMBED_ENDPOINT=nemo-retriever-embed-endpoint
RETRIEVER_RERANK_ENDPOINT=nemo-retriever-rerank-endpoint
# DynamoDB (Optional - Auto-created if not exists)
DYNAMODB_TABLE_NAME=agentops-audit-log
DYNAMODB_PLANS_TABLE_NAME=agentops-plans
DYNAMODB_MEMORY_TABLE_NAME=agentops-agent-memory
# Execution Mode (Safety)
EXECUTE=false # Set to "true" for actual deployments (default: false for safety)
# Frontend (Optional)
VITE_API_URL=http://localhost:8000
# Agent Memory (Optional)
AGENT_MEMORY_EXPIRATION_DAYS=90

- Open AWS SageMaker Console
- Navigate to JumpStart
- Search for "llama-3.1-nemotron-nano-8B-v1"
- Deploy to endpoint and note the endpoint name
- Repeat for NeMo Retriever Embedding and Reranking NIMs
- Update .env with endpoint names
- Subscribe to NVIDIA NIM microservices
- Deploy via SageMaker
- Note endpoint names
python scripts/upload_docs.py

This ingests sample policies into the retriever for RAG grounding.
# Activate virtual environment first
source venv/bin/activate # Linux/macOS
# or
venv\Scripts\activate # Windows
# Start FastAPI server
uvicorn orchestrator.main:app --reload --host 0.0.0.0 --port 8000
# Or using Make
make dev

The API will be available at http://localhost:8000

- API Documentation: http://localhost:8000/docs
- Health Check: http://localhost:8000/health
cd frontend
npm run dev

The dashboard will be available at http://localhost:5173
# Run demo script
bash demo/demo.sh
# Or using Make
make demo

| Variable | Required | Default | Description |
|---|---|---|---|
| `AWS_REGION` | Yes | - | AWS region for deployments |
| `LLM_ENDPOINT` | Yes* | - | SageMaker endpoint for LLM NIM |
| `RETRIEVER_EMBED_ENDPOINT` | Yes* | - | SageMaker endpoint for NeMo Retriever Embedding |
| `RETRIEVER_RERANK_ENDPOINT` | Yes* | - | SageMaker endpoint for NeMo Retriever Reranking |
| `DYNAMODB_TABLE_NAME` | No | `agentops-audit-log` | DynamoDB table for audit logs |
| `DYNAMODB_PLANS_TABLE_NAME` | No | `agentops-plans` | DynamoDB table for deployment plans |
| `DYNAMODB_MEMORY_TABLE_NAME` | No | `agentops-agent-memory` | DynamoDB table for agent memory |
| `EXECUTE` | No | `false` | Enable actual deployments (set to `true` carefully) |
| `VITE_API_URL` | No | `http://localhost:8000` | Backend API URL for the frontend |
| `AGENT_MEMORY_EXPIRATION_DAYS` | No | `90` | TTL for agent memory entries |
*Mock mode can be used if these endpoints are not configured.
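As a rough illustration of how these variables might be consumed, the sketch below reads the endpoint names from the environment and falls back to mock mode when they are unset. It is not the project's actual settings module.

```python
# Illustrative settings loader with mock fallback (not the project's actual config module).
import os
from dataclasses import dataclass

@dataclass
class Settings:
    aws_region: str = os.getenv("AWS_REGION", "us-east-1")
    llm_endpoint: str | None = os.getenv("LLM_ENDPOINT")
    embed_endpoint: str | None = os.getenv("RETRIEVER_EMBED_ENDPOINT")
    rerank_endpoint: str | None = os.getenv("RETRIEVER_RERANK_ENDPOINT")
    execute: bool = os.getenv("EXECUTE", "false").lower() == "true"
    memory_ttl_days: int = int(os.getenv("AGENT_MEMORY_EXPIRATION_DAYS", "90"))

    @property
    def mock_mode(self) -> bool:
        # Without configured NIM endpoints, agents can run against canned responses.
        return not (self.llm_endpoint and self.embed_endpoint and self.rerank_endpoint)

settings = Settings()
print(f"mock_mode={settings.mock_mode}, execute={settings.execute}")
```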
Minimum required IAM permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"sagemaker:CreateModel",
"sagemaker:CreateEndpointConfig",
"sagemaker:CreateEndpoint",
"sagemaker:DescribeModel",
"sagemaker:DescribeEndpointConfig",
"sagemaker:DescribeEndpoint",
"sagemaker:InvokeEndpoint",
"sagemaker:DeleteModel",
"sagemaker:DeleteEndpointConfig",
"sagemaker:DeleteEndpoint",
"dynamodb:PutItem",
"dynamodb:GetItem",
"dynamodb:Query",
"dynamodb:Scan",
"dynamodb:DeleteItem",
"dynamodb:CreateTable",
"dynamodb:DescribeTable",
"s3:GetObject",
"s3:PutObject",
"cloudwatch:PutMetricData",
"cloudwatch:GetMetricStatistics"
],
"Resource": "*"
}
]
}

`POST /api/agent/command`

Submit a natural language deployment command.
Request:
{
"command": "deploy llama-3.1 8B for chatbot-x",
"user_id": "alice@example.com",
"env": "staging",
"constraints": {
"budget_usd_per_hour": 15.0
}
}

Response:
{
"command_id": "uuid",
"status": "success",
"result": {
"plan_id": "uuid",
"status": "deploying",
"artifact": {
"endpoint_name": "chatbot-x-staging",
"model_name": "llama-3.1-8b",
"instance_type": "ml.m5.large",
"instance_count": 1
}
}
}

List all deployment plans.
Response:
{
"deployments": [
{
"plan_id": "uuid",
"status": "deployed",
"intent": "deploy llama-3.1 8B",
"env": "staging",
"created_at": "2024-01-01T00:00:00Z"
}
],
"count": 1
}

Get detailed deployment plan with reasoning steps.
Response:
{
"plan": {
"plan_id": "uuid",
"status": "deployed",
"reasoning_steps": [
{
"step_id": "uuid-step-1",
"agent_type": "planner",
"action": "generate_config",
"status": "completed",
"reasoning_chain": {
"agent_name": "Planner Agent",
"steps": [
{
"thought": "Planning deployment...",
"reasoning": "Generated configuration",
"confidence": 0.85
}
]
}
}
]
}
}

Pause a running deployment.
Restart a paused or failed deployment.
Delete a deployment (with optional hard delete).
Query Parameters:

- `hard_delete` (boolean): Also delete SageMaker resources and agent memory
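A quick way to exercise the command endpoint from Python, assuming the backend is running locally on the default port; the payload mirrors the request schema shown above:

```python
# Illustrative client call against a locally running orchestrator.
import requests

BASE_URL = "http://localhost:8000"

payload = {
    "command": "deploy llama-3.1 8B for chatbot-x",
    "user_id": "alice@example.com",
    "env": "staging",
    "constraints": {"budget_usd_per_hour": 15.0},
}

response = requests.post(f"{BASE_URL}/api/agent/command", json=payload, timeout=60)
response.raise_for_status()
result = response.json()
print(result["status"], result["result"]["plan_id"])
```

With `EXECUTE=false` (the default), no SageMaker resources should be created; the response still carries the generated plan.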
See Quick Start section.
# Build and deploy
docker build -t agentops-orchestrator .
docker tag agentops-orchestrator:latest <your-ecr-repo>/agentops:latest
docker push <your-ecr-repo>/agentops:latest
# Deploy using App Runner
aws apprunner create-service --cli-input-yaml file://apprunner.yaml

# Package for Lambda
./deploy_lambda.ps1 # Windows
# or
bash lambda_deploy.sh  # Linux/macOS

See deploy/terraform/ for Terraform configurations.
cd frontend
npm run build
# Deploy dist/ folder

cd frontend
npm run build
aws s3 sync dist/ s3://your-bucket-name

AgentOps-AWS/
├── orchestrator/              # Backend orchestrator
│   ├── main.py                # FastAPI application
│   ├── agent_orchestrator.py  # Multi-agent coordination
│   ├── agents/                # Individual agents
│   │   ├── planner_agent.py
│   │   ├── executor_agent.py
│   │   ├── monitoring_agent.py
│   │   └── __init__.py
│   ├── agent_memory.py        # Agent memory system
│   ├── tool_registry.py       # Dynamic tool discovery
│   ├── llm_client.py          # LLM NIM client
│   ├── retriever_client.py    # RAG retriever client
│   ├── guardrail.py           # Safety guardrails
│   ├── sage_tool.py           # SageMaker deployment tool
│   ├── audit.py               # Audit logging
│   ├── plans_storage.py       # DynamoDB persistence
│   └── models.py              # Pydantic schemas
├── frontend/                  # React frontend
│   ├── src/
│   │   ├── components/        # React components
│   │   │   ├── WorkflowDesigner.tsx
│   │   │   ├── ExecutionPanel.tsx
│   │   │   ├── WorkflowGraph.tsx
│   │   │   └── ...
│   │   └── lib/               # Utilities
│   └── package.json
├── tests/                     # Test suite
├── deploy/                    # Deployment scripts
│   └── terraform/             # Infrastructure as code
├── scripts/                   # Utility scripts
├── docs/                      # Documentation
└── requirements.txt           # Python dependencies
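The agent memory module (`agent_memory.py` above) persists entries to DynamoDB with an automatic TTL. A minimal sketch of that pattern, with placeholder attribute names and the table/TTL settings taken from the configuration section:

```python
# Illustrative episodic-memory write with DynamoDB TTL (placeholder attribute names).
import os
import time
import uuid
import boto3

dynamodb = boto3.resource("dynamodb", region_name=os.getenv("AWS_REGION", "us-east-1"))
table = dynamodb.Table(os.getenv("DYNAMODB_MEMORY_TABLE_NAME", "agentops-agent-memory"))

TTL_DAYS = int(os.getenv("AGENT_MEMORY_EXPIRATION_DAYS", "90"))

def remember(agent: str, event: str, detail: str) -> None:
    """Store an episodic memory entry that DynamoDB expires after TTL_DAYS."""
    now = int(time.time())
    table.put_item(Item={
        "memory_id": str(uuid.uuid4()),
        "agent": agent,
        "event": event,
        "detail": detail,
        "created_at": now,
        "expires_at": now + TTL_DAYS * 86400,  # TTL attribute in epoch seconds
    })

remember("planner", "deployment_succeeded", "chatbot-x staging on ml.m5.large")
```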
# Run all tests
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=orchestrator --cov-report=html
# Run specific test file
pytest tests/test_orchestrator_flow.py -v

# Format code (black)
black orchestrator/
# Lint code (flake8)
flake8 orchestrator/
# Type checking (mypy)
mypy orchestrator/

pytest tests/test_schemas.py -v

pytest tests/test_orchestrator_flow.py -v

# Start backend
uvicorn orchestrator.main:app --reload
# In another terminal, run demo
bash demo/demo.sh

- **Guardrails (Layer 1)**
  - Schema validation
  - Budget constraints
  - Policy compliance
  - Cost estimation
- **Human-in-the-Loop (Layer 2)**
  - Production deployments require approval
  - High-cost deployments require approval
  - Timeout and escalation policies
- **Immutable Audit (Layer 3)**
  - DynamoDB application logs
  - CloudTrail data events
  - S3 Object Lock for immutability
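In practice, Layer 1 reduces to deterministic checks that run before any AWS call. The sketch below is illustrative only; the field names and price figures are assumptions, not the project's `guardrail.py`.

```python
# Illustrative Layer 1 guardrail check (assumed fields and prices, not guardrail.py).
from dataclasses import dataclass, field

@dataclass
class ValidationResult:
    passed: bool
    errors: list[str] = field(default_factory=list)

# Rough hourly price assumptions (USD) purely for illustration.
ASSUMED_HOURLY_PRICE = {"ml.m5.large": 0.115, "ml.g5.xlarge": 1.41}

def validate_plan(plan: dict, budget_usd_per_hour: float,
                  allowed_envs=("dev", "staging", "prod")) -> ValidationResult:
    errors = []
    # Schema check: required fields must be present.
    for key in ("endpoint_name", "instance_type", "instance_count", "env"):
        if key not in plan:
            errors.append(f"missing field: {key}")
    # Policy check: only known environments are deployable.
    if plan.get("env") not in allowed_envs:
        errors.append(f"unknown environment: {plan.get('env')}")
    # Budget check: estimated hourly cost must fit the constraint.
    price = ASSUMED_HOURLY_PRICE.get(plan.get("instance_type", ""), float("inf"))
    estimated = price * plan.get("instance_count", 1)
    if estimated > budget_usd_per_hour:
        errors.append(f"estimated ${estimated:.2f}/h exceeds budget ${budget_usd_per_hour:.2f}/h")
    return ValidationResult(passed=not errors, errors=errors)
```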
- IAM roles with least privilege
- Encryption at rest (DynamoDB, S3)
- Encryption in transit (TLS)
- Audit logging for all actions
- No hardcoded credentials
- Environment-based configuration
- Deployment success/failure rates
- Agent execution times
- Cost tracking per deployment
- Approval queue length
- Structured JSON logging
- Agent reasoning chains
- Execution traces
- Error tracking
- Real-time deployment status
- Agent activity monitoring
- Cost analytics
- Audit trail viewer
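A hedged sketch of the kind of instrumentation this implies: one structured JSON log line and a couple of CloudWatch custom metrics per deployment. The namespace and metric names are placeholders, not the project's actual ones.

```python
# Illustrative structured logging + CloudWatch metric emission (placeholder names).
import json
import logging
import boto3

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("agentops")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def record_deployment(plan_id: str, agent: str, status: str, duration_s: float) -> None:
    # Structured JSON log entry describing the agent's outcome.
    logger.info(json.dumps({
        "event": "deployment_finished",
        "plan_id": plan_id,
        "agent": agent,
        "status": status,
        "duration_s": duration_s,
    }))
    # Custom metrics for success/failure rates and execution time.
    cloudwatch.put_metric_data(
        Namespace="AgentOps",  # placeholder namespace
        MetricData=[
            {"MetricName": "DeploymentSuccess",
             "Value": 1.0 if status == "deployed" else 0.0, "Unit": "Count"},
            {"MetricName": "AgentExecutionTime", "Value": duration_s, "Unit": "Seconds"},
        ],
    )
```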
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow PEP 8 for Python code
- Use TypeScript for frontend code
- Write tests for new features
- Update documentation
- Follow conventional commit messages
This project is licensed under the MIT License - see the LICENSE file for details.
- NVIDIA for NIM microservices and AI infrastructure
- AWS for SageMaker, DynamoDB, and cloud services
- Open Source Community for amazing tools and libraries
- Email: ashutoshkumarsingh951@gmail.com
- Issues: GitHub Issues
- Documentation: Wiki
If you find this project useful, please consider giving it a star!