Operations

Garot Conklin edited this page Apr 29, 2025 · 1 revision

CloudOpsAI Operations

Overview

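CloudOpsAI runs as a serverless NOC (network operations center) agent that uses AWS Lambda and Amazon Bedrock for incident management. This guide covers the day-to-day operation of the system: its components, routine tasks, incident response, monitoring, backup, and security.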

Operational Components

1. Lambda Function

  • Function Name: cloudopsai-agent
  • Memory: 1024 MB
  • Timeout: 15 minutes (900 seconds, the Lambda maximum)
  • VPC: Private subnets with VPC endpoints
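
To confirm that the deployed function still matches the settings listed above, the configuration can be read back with the AWS CLI. The following is a minimal sketch: the helper name `check_lambda_config` is hypothetical, and the expected values (1024 MB, 900 seconds) are taken from this page rather than from any deployment template.

```shell
# Hypothetical helper: compare the live Lambda settings with the values on this page.
check_lambda_config() {
  local fn="${1:-cloudopsai-agent}"
  local actual memory timeout
  # MemorySize is reported in MB and Timeout in seconds, tab-separated in text output.
  actual="$(aws lambda get-function-configuration \
    --function-name "$fn" \
    --query '[MemorySize, Timeout]' \
    --output text)" || return 2
  # Split the two whitespace-separated values into positional parameters.
  set -- $actual
  memory="$1"
  timeout="$2"
  if [ "$memory" = "1024" ] && [ "$timeout" = "900" ]; then
    echo "OK: $fn memory=${memory}MB timeout=${timeout}s"
  else
    echo "DRIFT: $fn memory=${memory}MB timeout=${timeout}s (expected 1024MB, 900s)"
    return 1
  fi
}
```

Calling `check_lambda_config` with no arguments checks `cloudopsai-agent`; a non-zero exit status signals either a mismatch (1) or a failed CLI call (2).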

2. CloudWatch Integration

# View Lambda logs
aws logs tail /aws/lambda/cloudopsai-agent --follow

# Check alarm status
aws cloudwatch describe-alarms \
  --alarm-name-prefix "CloudOpsAI"

3. Cost Management

  • Lambda invocations (~$0.20 per million requests, plus duration charges)
  • Bedrock API calls (~$0.01 per 1K tokens; varies by model)
  • CloudWatch Logs ingestion ($0.50/GB)
  • S3 storage (minimal)
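
For a rough monthly figure, the approximate unit prices listed above can be combined in a small calculation. This is only a sketch: the helper name `estimate_monthly_cost` is hypothetical, the rates are the ones quoted on this page, and Lambda duration charges and S3 storage are left out.

```shell
# Hypothetical helper: rough monthly cost from the approximate unit prices above.
# Arguments: Lambda invocations, Bedrock tokens, CloudWatch log volume in GB.
estimate_monthly_cost() {
  awk -v inv="$1" -v tok="$2" -v gb="$3" 'BEGIN {
    lambda  = inv / 1000000 * 0.20   # ~$0.20 per million invocations
    bedrock = tok / 1000 * 0.01      # ~$0.01 per 1K tokens
    logs    = gb * 0.50              # $0.50 per GB of logs
    printf "%.2f\n", lambda + bedrock + logs
  }'
}

# Example: 1M invocations, 2M tokens, 10 GB of logs.
estimate_monthly_cost 1000000 2000000 10   # prints 25.20
```

In this example the Bedrock term ($20.00) outweighs the Lambda ($0.20) and logging ($5.00) terms, which suggests token volume is the first place to look when costs rise.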

Routine Tasks

Daily

  • Check Lambda execution logs
  • Review AI decisions
  • Verify remediation success rates
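
The daily log check can be narrowed to error lines from the last 24 hours instead of tailing the full stream. This is a sketch under assumptions: the helper name `recent_lambda_errors` is hypothetical, and the `ERROR` filter pattern assumes the function writes that word in its error log lines.

```shell
# Hypothetical helper: list ERROR log lines from the last N hours (default 24).
recent_lambda_errors() {
  local hours="${1:-24}"
  local now start_ms
  now="$(date +%s)"
  # filter-log-events expects the start time in milliseconds since the epoch.
  start_ms=$(( (now - hours * 3600) * 1000 ))
  aws logs filter-log-events \
    --log-group-name /aws/lambda/cloudopsai-agent \
    --start-time "$start_ms" \
    --filter-pattern "ERROR" \
    --query 'events[].message' \
    --output text
}
```

Run `recent_lambda_errors` for the default 24-hour window, or `recent_lambda_errors 1` to look at the last hour only.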

Weekly

  • Review cost metrics
  • Update YAML rules if needed
  • Check for configuration drift

Monthly

  • Security review
  • Performance optimization
  • Rule effectiveness analysis

Incident Response

Common Issues

  1. High Lambda Latency

    • Check VPC endpoints
    • Review memory usage
    • Verify Bedrock availability
  2. Failed Remediation

    • Check IAM permissions
    • Verify target resource state
    • Review action logs
  3. AI Decision Quality

    • Review Bedrock prompts
    • Check historical data
    • Adjust thresholds
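
For the High Lambda Latency checklist above, the first two checks can be started from the CLI. The helper names below are hypothetical, and the sketch only gathers data: it pulls the function's `Duration` metric for the last hour and lists the state of each VPC endpoint in the current region, leaving interpretation to the operator.

```shell
# Hypothetical helper: Lambda Duration (ms) over the last hour, in 5-minute buckets.
lambda_duration_last_hour() {
  local end start
  end="$(date +%s)"
  start=$(( end - 3600 ))
  # The AWS CLI accepts Unix epoch seconds for timestamp parameters.
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Lambda \
    --metric-name Duration \
    --dimensions Name=FunctionName,Value=cloudopsai-agent \
    --start-time "$start" \
    --end-time "$end" \
    --period 300 \
    --statistics Average Maximum \
    --query 'sort_by(Datapoints, &Timestamp)[].[Timestamp, Average, Maximum]' \
    --output text
}

# Hypothetical helper: show every VPC endpoint in the region with its current state.
list_vpc_endpoint_states() {
  aws ec2 describe-vpc-endpoints \
    --query 'VpcEndpoints[].[VpcEndpointId, ServiceName, State]' \
    --output text
}
```

A rising `Maximum` with a steady `Average` usually points at occasional slow calls (for example cold starts or a slow downstream service) rather than a general slowdown.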

Monitoring Dashboard

Access the operational dashboard at: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=CloudOpsAI

Key Metrics

  • Lambda success rate
  • AI decision accuracy
  • Remediation effectiveness
  • Cost per incident
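
Of the metrics above, the Lambda success rate can be derived from the standard `AWS/Lambda` `Invocations` and `Errors` metrics. The sketch below uses hypothetical helper names and a 24-hour window; the other three metrics depend on application data that this page does not describe, so they are not covered.

```shell
# Hypothetical helper: success rate (%) from invocation and error counts.
success_rate() {
  awk -v inv="$1" -v err="$2" 'BEGIN {
    if (inv <= 0) { print "n/a"; exit }
    printf "%.2f\n", (inv - err) / inv * 100
  }'
}

# Hypothetical helper: sum of an AWS/Lambda metric over the last 24 hours.
lambda_metric_sum() {
  local end
  end="$(date +%s)"
  # Fetch hourly sums, then add them up locally.
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Lambda \
    --metric-name "$1" \
    --dimensions Name=FunctionName,Value=cloudopsai-agent \
    --start-time "$(( end - 86400 ))" \
    --end-time "$end" \
    --period 3600 \
    --statistics Sum \
    --query 'Datapoints[].Sum' \
    --output text | tr '\t' '\n' | awk '{ total += $1 } END { printf "%d\n", total }'
}
```

The two helpers combine as `success_rate "$(lambda_metric_sum Invocations)" "$(lambda_metric_sum Errors)"`; for example, 200 invocations with 5 errors gives 97.50.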

Backup and Recovery

Configuration Backup

# Backup YAML rules
aws s3 sync s3://cloudopsai-config/ backup/

# Backup DynamoDB
aws dynamodb create-backup \
  --table-name cloudopsai-incidents \
  --backup-name "backup-$(date +%Y%m%d)"
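
The commands above cover backup only. For the recovery side, a minimal sketch follows; the helper name `restore_latest_backup` and the `-restored` table suffix are hypothetical. DynamoDB restores a backup into a new table, so the target name must not already exist.

```shell
# Hypothetical helper: restore the newest backup of a table into a new table.
restore_latest_backup() {
  local table="${1:-cloudopsai-incidents}"
  local target="${2:-${table}-restored}"
  local arn
  # Pick the most recently created backup for the table.
  arn="$(aws dynamodb list-backups \
    --table-name "$table" \
    --query 'sort_by(BackupSummaries, &BackupCreationDateTime)[-1].BackupArn' \
    --output text)" || return 2
  if [ -z "$arn" ] || [ "$arn" = "None" ]; then
    echo "No backups found for $table" >&2
    return 1
  fi
  aws dynamodb restore-table-from-backup \
    --target-table-name "$target" \
    --backup-arn "$arn"
}

# Preview restoring the YAML rules from the local copy; drop --dryrun to apply.
# aws s3 sync backup/ s3://cloudopsai-config/ --dryrun
```

Restoring into a separate table leaves the live `cloudopsai-incidents` table untouched, so the restored data can be inspected before anything is switched over.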

Security Operations

Regular Tasks

  • Rotate IAM access keys
  • Review VPC security groups
  • Check CloudTrail logs
  • Update KMS key policy
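
To decide which IAM access keys are due for rotation, the creation dates can be listed and compared against a cutoff. This is a sketch: the helper name `stale_access_keys` and the user name in the usage note are hypothetical, and the cutoff date is supplied by the operator according to the rotation policy in force.

```shell
# Hypothetical helper: print access keys created before a cutoff date (YYYY-MM-DD).
# ISO 8601 timestamps sort lexicographically, so a string comparison is enough.
stale_access_keys() {
  local user="$1"
  local cutoff="$2"
  aws iam list-access-keys \
    --user-name "$user" \
    --query 'AccessKeyMetadata[].[AccessKeyId, CreateDate]' \
    --output text |
    awk -v cutoff="$cutoff" '($2 "") < (cutoff "") { print $1, $2 }'
}
```

For example, `stale_access_keys some-service-user 2025-01-29` prints each key created before 29 January 2025, one per line, with its creation timestamp.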

Compliance

  • Maintain audit logs
  • Review access patterns
  • Update documentation