ClusterForge is a distributed, multi-cluster Kubernetes system for managing multiple environments. Infrastructure provisioning, application deployment, scaling, and monitoring are fully automated through GitOps workflows, with built-in scalability and observability powered by Terraform, AWS EKS, and ArgoCD.
This project demonstrates how real Dev, Prod, and Control environments can be managed declaratively and reproducibly.
| Real-World Problem | ❌ What Typically Happens in Teams | ✅ ClusterForge Solution |
|---|---|---|
| Dev works, Prod breaks | Manual configuration differences between clusters | Terraform modules create identical, reproducible clusters |
| "Who changed this?" incidents | `kubectl apply` run manually; cluster state diverges from Git | ArgoCD enforces Git as the single source of truth |
| Traffic drops during deployment | Pods are terminated before new ones are ready; users see downtime | Rolling update strategy with readiness & liveness probes |
| Application crashes during traffic spike | Static replica count; no autoscaling; manual intervention required | HPA dynamically adjusts replicas based on CPU metrics |
| Incident debugging takes hours | Teams check only `kubectl logs`; no metrics visibility | Prometheus monitoring stack provides real-time metrics and observability |
| No one knows how infra was created | Click-ops in the AWS console; no documentation; hard to recreate environments | Fully declarative Infrastructure as Code (Terraform) |
| Over-permissioned IAM roles | Static credentials and broad policies increase security risk | IAM roles with least privilege + OIDC provider integration |
| Terraform destroy fails midway | AWS resources have hidden dependencies (e.g., NAT → subnets → VPC); incorrect deletion order causes failures and manual cleanup | Dependencies are explicitly handled and validated, ensuring clean and complete teardown of infrastructure |
| Flat networking causes exposure | All services share the same subnet; poor isolation between workloads | Multi-AZ VPC with public/private subnet isolation |
| Dev accidentally affects Prod | Single cluster used for multiple environments | Dedicated EKS clusters per environment |
| Monitoring added after outage | Metrics and alerting introduced only after a production incident | Monitoring integrated as a core platform layer |
| Cluster management chaos | Multiple clusters manually accessed and configured | Central control cluster managing environments via GitOps |
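The rolling-update row can be made concrete with a Deployment sketch. The image tag, replica count, and probe paths below are illustrative assumptions, not values taken from the actual manifests (which live in the separate GitOps repository):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-app
  namespace: nginx-app
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below desired capacity during a rollout
      maxSurge: 1         # bring one extra pod up before terminating old ones
  selector:
    matchLabels: { app: nginx }
  template:
    metadata:
      labels: { app: nginx }
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          ports: [{ containerPort: 80 }]
          readinessProbe:        # traffic is routed only after this passes
            httpGet: { path: /, port: 80 }
            initialDelaySeconds: 3
          livenessProbe:         # restart the container if it stops responding
            httpGet: { path: /, port: 80 }
            periodSeconds: 10
```

With `maxUnavailable: 0`, Kubernetes only removes an old pod once its replacement reports Ready, which is what prevents the "traffic drops during deployment" failure mode.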
The following represents the folder structure of the ClusterForge Infrastructure Repository, responsible for provisioning networking, security, and multi-environment Kubernetes clusters using Terraform.
clusterforge-infra/
│
├── modules/                  # Reusable Terraform modules
│   │
│   ├── vpc/                  # VPC, subnets, routing, NAT, gateways
│   │   ├── main.tf           # Defines networking resources
│   │   ├── variables.tf      # CIDR, AZs, subnet configs
│   │   └── outputs.tf        # VPC ID, subnet IDs
│   │
│   ├── eks/                  # EKS cluster + node groups
│   │   ├── main.tf           # EKS, node groups, IRSA
│   │   ├── variables.tf      # Cluster config inputs
│   │   └── outputs.tf        # Endpoint, OIDC, node details
│   │
│   └── iam/                  # IAM roles and policies
│       ├── main.tf           # Roles, policies, OIDC trust
│       ├── variables.tf      # Role configs
│       └── outputs.tf        # Role ARNs
│
├── main.tf                   # Root module wiring VPC, IAM, EKS
├── variables.tf              # Global configuration variables
├── outputs.tf                # Exported infrastructure outputs
├── providers.tf              # AWS provider configuration
├── backend.tf                # Remote state (S3 + DynamoDB)
├── README.md                 # Project documentation
├── LICENSE                   # License file
└── .gitignore                # Ignore local/terraform files
- VPC → provides networking
- IAM → provides permissions
- EKS → uses both to create clusters

Modules are separate but connected through inputs/outputs.
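The wiring in the root `main.tf` looks roughly like this. The module sources match the tree above, but the variable and output names are assumptions for illustration; the real interfaces are defined in `modules/{vpc,iam,eks}`:

```hcl
# Root main.tf — illustrative wiring; variable/output names are assumptions.
module "vpc" {
  source             = "./modules/vpc"
  cidr_block         = var.vpc_cidr
  availability_zones = var.azs
}

module "iam" {
  source = "./modules/iam"
}

module "eks" {
  source           = "./modules/eks"
  cluster_name     = var.cluster_name
  subnet_ids       = module.vpc.private_subnet_ids # VPC output feeds EKS input
  cluster_role_arn = module.iam.cluster_role_arn   # IAM output feeds EKS input
}
```

Because EKS consumes the VPC and IAM outputs, Terraform infers the dependency graph and creates (and destroys) the resources in the correct order.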
The application deployment layer of this project β including Kubernetes manifests, ArgoCD configuration, and the multi-environment GitOps workflow β is maintained in a separate repository.
ClusterForge GitOps Repository:
https://github.com/immanas/clusterforge-gitops
- clusterforge-infra → builds infrastructure
- clusterforge-gitops → deploys applications

Infra creates the platform; GitOps manages what runs on it.
| ✅ What This Project IS | ❌ What This Project Is NOT |
|---|---|
| Multi-Environment Kubernetes Platform: Dev, Prod, and Control clusters running on Amazon EKS | Not a single-cluster Kubernetes demo |
| Infrastructure as Code (Terraform): fully provisioned VPC, IAM, and EKS using reusable modules | Not a static YAML-only deployment |
| Centralized GitOps Control Plane: ArgoCD runs in the control cluster and deploys apps to dev/prod clusters | Not a CI/CD-only showcase without real infrastructure |
| Production-Grade Deployment: NGINX with rolling updates, probes, and health checks | Not a local Minikube experiment |
| Auto-Scaling Enabled: Horizontal Pod Autoscaler (HPA) based on CPU metrics | Not a toy monitoring setup without scaling validation |
| Observability Integrated: Metrics Server + Prometheus + Grafana | Not a slide-based architecture without live proof |
| Secure by Design: IAM roles, OIDC (IRSA), private subnets, controlled networking | |
| Modular & Scalable Architecture: designed for real-world extensibility | |
This project demonstrates a real, deployable, multi-cluster cloud-native platform, built and validated end-to-end.
This project combines Infrastructure as Code, Kubernetes orchestration, and GitOps-driven deployment to build a production-style multi-cluster platform.
- AWS (ap-south-1) – Primary cloud provider
- Amazon EKS – Managed Kubernetes control plane
- Amazon VPC – Custom networking (public/private subnets, NAT, IGW)
- Flow: Internet → Public Subnet → NAT → Private Subnet → EKS Nodes → Pods
IAM Roles for Service Accounts (IRSA) allows Kubernetes pods to securely access AWS services. Instead of storing AWS credentials inside containers:
- Each pod is linked to an IAM role
- AWS verifies identity using OIDC (OpenID Connect)

Flow: Pod → OIDC identity → IAM Role → AWS service
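In practice, IRSA is wired through an annotated ServiceAccount. The account ID, role name, and namespace below are placeholders, not values from this project:

```yaml
# Illustrative IRSA wiring — ARN and names are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
  namespace: nginx-app
  annotations:
    # Pods using this ServiceAccount assume the IAM role via the
    # cluster's OIDC provider; no static AWS keys in the container.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/app-role
```

A pod that sets `serviceAccountName: app-sa` receives a projected OIDC token, which AWS STS exchanges for short-lived credentials scoped to that role.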
- KMS – Encryption at rest for cluster secrets
- CloudWatch – Control plane logging
- S3 + DynamoDB – Terraform remote backend & state locking
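The remote backend corresponds to `backend.tf` in the tree above; the bucket, key, and table names here are assumptions, not the project's actual values:

```hcl
# backend.tf — illustrative remote-state config; names are assumptions.
terraform {
  backend "s3" {
    bucket         = "clusterforge-tfstate"       # versioned S3 bucket holding state
    key            = "envs/dev/terraform.tfstate" # one state file per environment
    region         = "ap-south-1"
    dynamodb_table = "clusterforge-tf-locks"      # lock table prevents concurrent applies
    encrypt        = true
  }
}
```

The DynamoDB lock is what makes the backend safe for multi-user workflows: a second `terraform apply` blocks until the first releases the lock.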
- Terraform (>= 1.5) – Modular infrastructure provisioning
- Reusable modules: `vpc`, `eks`, `iam`
- Remote state management for safe multi-user workflows
- Kubernetes (EKS 1.29+)
- Managed Node Groups
- Horizontal Pod Autoscaler (HPA)
- Rolling updates & self-healing deployments
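The HPA mentioned above is declared against the application Deployment. Target name, bounds, and threshold below are illustrative assumptions:

```yaml
# Illustrative HPA — target name and thresholds are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
  namespace: nginx-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # scale out when average CPU exceeds 60%
```

The HPA reads pod CPU usage from the Metrics Server and adjusts `replicas` on the Deployment between the configured bounds.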
ArgoCD acts as the GitOps controller running inside the control cluster.
Flow:
- ArgoCD watches the GitOps repository
- Detects changes in Kubernetes manifests
- Connects to target clusters (dev / prod) using stored cluster credentials
- Applies changes automatically and keeps clusters in sync with Git
This ensures Git is always the single source of truth for deployments.
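The flow above maps onto an ArgoCD `Application` resource. The repo URL and path match this project's GitOps repository, but the destination server URL and sync options are illustrative assumptions:

```yaml
# Illustrative ArgoCD Application — destination URL and sync options are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: nginx-dev
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/immanas/clusterforge-gitops
    targetRevision: main
    path: environments/dev                  # manifests watched for this environment
  destination:
    server: https://dev-cluster.example.com # dev cluster registered with ArgoCD
    namespace: nginx-app
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual cluster drift back to the Git state
```

With `selfHeal` enabled, a manual `kubectl edit` on the target cluster is automatically reverted, which is how Git stays the single source of truth.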
- Docker β Containerized Nginx application
- Kubernetes manifests:
- Deployment
- Service
- HPA
- Namespace
- kubectl
Public & private subnets across multiple AZs with proper routing.
EKS-managed EC2 instances running in private subnets.
Applications synced and healthy across dev & prod clusters.
End-to-End Runtime Request Flow
- User sends request to Kubernetes Service
- Service forwards traffic to one of the running Pods
- Pod processes the request (Nginx container)
- Metrics Server collects CPU usage
- HPA evaluates metrics:
- If CPU > threshold → increase replicas
- If CPU is normal → maintain or reduce replicas

This creates a self-healing, auto-scaling system without manual intervention.
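The HPA's scaling decision follows the standard Kubernetes formula, desired = ceil(current × currentCPU / targetCPU). The numbers below are hypothetical, chosen only to show the arithmetic:

```shell
# Standard HPA scaling rule: desired = ceil(current * currentCPU / targetCPU)
# Hypothetical example: 3 replicas averaging 180% of requested CPU,
# against a 60% utilization target.
current=3
cpu_now=180
cpu_target=60
desired=$(( (current * cpu_now + cpu_target - 1) / cpu_target ))  # integer ceiling
echo "$desired"   # prints 9 -> HPA scales from 3 to 9 replicas
```

Once load drops back under the target, the same formula drives the replica count down again (after the HPA's stabilization window).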
Why This Design?
- Clear separation of infra and app layers.
- Multi-environment isolation.
- Git-driven declarative deployment.
- Production-aligned Kubernetes architecture.
Failure Scenarios
- Node failure → Pods rescheduled automatically.
- Pod crash → Kubernetes self-healing restarts the container.
- High traffic → HPA scales replicas.
- Terraform drift → Reconciliation via `terraform apply`.
Security Considerations
- Private subnets for worker nodes.
- IAM least-privilege roles.
- IRSA for workload identity.
- Encrypted EKS secrets via KMS.
- Remote state locking via DynamoDB.
Scalability & Performance
- Managed node group scaling.
- Horizontal Pod Autoscaler.
- Multi-AZ subnet distribution.
- Stateless application design.
Prerequisites:
- EKS clusters (control, dev, prod) already provisioned via `clusterforge-infra`
- ArgoCD installed on the control cluster
- kubectl configured
- ArgoCD CLI installed
1️⃣ Switch to Control Cluster

kubectl config use-context <control-cluster-context>

2️⃣ Apply ArgoCD Applications
Deploy Dev and Prod applications:
kubectl apply -f environments/dev/app.yaml
kubectl apply -f environments/prod/app.yaml
3️⃣ Verify ArgoCD Sync

argocd app list

You should see:
nginx-dev → Synced & Healthy
nginx-prod → Synced & Healthy
4️⃣ Validate Deployment in Target Cluster
Switch to dev or prod cluster:
kubectl config use-context <dev-cluster-context>
kubectl get pods -n nginx-app
kubectl get hpa -n nginx-app
You should see:
Running NGINX pods
HPA configured and active
Trade-offs & Decisions
- Chose EKS over self-managed Kubernetes
  → Offloads control plane management, upgrades, and HA complexity to AWS.
  → Focus stays on platform design and workload reliability instead of etcd and master node operations.
- Separated infra and GitOps repositories
  → Enforces clear ownership boundaries between platform and application layers.
  → Reduces blast radius and aligns with real-world DevOps team structures.
- Used Managed Node Groups
  → Simplifies lifecycle management (auto-repair, scaling, upgrades).
  → Avoids operational overhead of maintaining custom worker AMIs and autoscaling groups.
- Prioritized Infrastructure as Code over manual console setup
  → Guarantees reproducibility and auditability.
  → Eliminates configuration drift and enables safe teardown/rebuild cycles.
Explicit Limitations
- No production-grade ingress controller (for simplicity).
- No service mesh implemented.
- Monitoring stack optional (not hardened for production).
ClusterForge is an open-source initiative, and we welcome contributions from developers, data scientists, cloud engineers, and DevOps enthusiasts!
- Add production-grade Ingress + ALB.
- Integrate full Prometheus/Grafana monitoring.
- Implement CI validation for Terraform plans.
- Add cost optimization policies.
- Introduce blue/green deployment strategy.
- Fork the repo
- Create a new feature branch: `git checkout -b feature-name`
- Make your changes and test them
- Submit a pull request describing your enhancement

Let's Build This Together! Made with ❤️ by Manas Gantait




