- Motivation & Goals
- Application Overview
- AWS Infrastructure Architecture
- Environment Configuration Differences
- CI/CD Workflows (GitHub Actions)
This project was built as a practical, hands-on learning experience to master Terraform and AWS for cloud infrastructure deployment. The primary goals are:
- To learn Infrastructure as Code (IaC) by defining all cloud resources using Terraform.
- To create reusable Terraform modules to quickly spin up simple yet production-ready AWS infrastructure that ensures high availability.
The application deployed to ECS is a minimal NestJS API.
| Endpoint | Description |
|---|---|
| `/` (GET) | The primary endpoint, which returns "Hello World!". |
| `/health` (GET) | The health check endpoint, which returns `true`. It logs a unique instance ID for diagnostics and is used by the ALB's Target Group to monitor container health. |
The application's Dockerfile uses the official Node.js Alpine image, installs pnpm globally, and runs the compiled application via pnpm start:prod on port 3000.
The deployment creates a highly available architecture using interconnected AWS services. At its core is Amazon Elastic Container Service (ECS), configured with auto-scaling and exposed through a secure HTTPS endpoint.
All resources are consistently tagged with Project = high-availability-app for easy location and management via AWS Resource Explorer.
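The README names only the tag itself, so the exact mechanism is an assumption, but one common way to apply it consistently is the AWS provider's `default_tags` block. A minimal sketch:

```hcl
# Sketch: stamp every resource the provider manages with the shared Project tag.
provider "aws" {
  region = "eu-west-1" # assumed region

  default_tags {
    tags = {
      Project = "high-availability-app"
    }
  }
}
```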
The project utilizes a clear separation between Root Modules (deployment stages) and Child Modules (reusable components) to enforce the Single Responsibility Principle (SRP) and maximize reusability.
- Child Modules (`infra/modules/*`): These define single, reusable components of infrastructure (e.g., `ecr`, `alb`, `ecs_cluster`, `ssl`).
  - Goal: High reusability and Separation of Concerns (SoC). A team can easily reuse the `alb` module in another project without needing to copy VPC or ECS code.
  - Implementation: They rely solely on input variables (like `var.vpc_id`) and return outputs (like `alb_dns_name`).
- Root Modules (`infra/deployment/*`): These define the environment-specific deployment stages (e.g., `prod/vpc`, `prod/ecs_service`).
  - Goal: Orchestration and Configuration. They stitch the child modules together, using `data "terraform_remote_state"` to read outputs from previous stages (like the VPC ID) and pass environment-specific values (like `prod` scaling limits) to the child modules.
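As a rough sketch of this split (only `vpc_id`, `alb_dns_name`, the module paths, and the state bucket name come from this README; the remaining names and the state key layout are assumptions):

```hcl
# Child module (infra/modules/alb): depends only on its input variables.
variable "vpc_id" {
  type = string
}

variable "public_subnet_ids" {
  type = list(string)
}

output "alb_dns_name" {
  value = aws_lb.this.dns_name # the aws_lb.this resource itself is elided here
}

# Root module (infra/deployment/prod/alb): reads the VPC stage's remote state
# and passes its outputs into the reusable child module.
data "terraform_remote_state" "vpc" {
  backend = "s3"
  config = {
    bucket = "josumartinez-terraform-state-bucket"
    key    = "prod/vpc/terraform.tfstate" # assumed key layout
    region = "eu-west-1"                  # assumed region
  }
}

module "alb" {
  source            = "../../../modules/alb"
  vpc_id            = data.terraform_remote_state.vpc.outputs.vpc_id
  public_subnet_ids = data.terraform_remote_state.vpc.outputs.public_subnet_ids
}
```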
| AWS Service | Role in the Architecture |
|---|---|
| Virtual Private Cloud (VPC) | Provides an isolated network, defining public and private subnets across multiple AZs for high availability. NAT Gateways enable private resources to access the internet. |
| Elastic Container Registry (ECR) | A private Docker registry storing application container images. Uses priority rules (Rule 1: untagged, Rule 2: tagged) to aggressively expire images while safely retaining a configurable count of environment-tagged (`dev-`, `prod-`) images; see the lifecycle policy sketch after this table. |
| ECS Cluster | The compute capacity (EC2 instances) running within private subnets. It uses an Auto Scaling Group (ASG) and a Capacity Provider that tells ECS how to manage the ASG's scaling. A critical element in the cluster is the ECS Control Plane, the central component that coordinates containers (i.e., tasks) and keeps the cluster healthy. Each EC2 instance also runs an ECS Agent that reports container health to the Control Plane. |
| ECS Service | The deployment mechanism that defines how many copies of a specific task definition should run on a given ECS cluster, automatically maintaining that desired count and integrating with an Elastic Load Balancer for traffic distribution. |
| ECS Task | The fundamental unit of deployment (the running container). Deployed onto EC2 instances, tasks receive a private IP via awsvpc networking and are registered with the ALB Target Group. |
| Application Load Balancer (ALB) | Distributes incoming traffic. It listens on Port 443 (HTTPS) and redirects all Port 80 (HTTP) traffic to HTTPS (301 Permanent Redirect). The ALB forwards traffic to an ALB Target Group, which acts as the dynamic list of healthy ECS Tasks. |
| Route 53 & ACM | The Route 53 Hosted Zone manages DNS records. AWS Certificate Manager (ACM) provides and validates the SSL certificate, which is attached to the ALB's HTTPS listener to enable secure communication. |
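The two-rule ECR lifecycle policy mentioned above might be expressed roughly as follows; the retention counts come from the environment table further below, while the untagged expiry window, the variable, and the resource names are assumptions:

```hcl
# Sketch of the two-priority lifecycle policy: rule 1 expires untagged images
# quickly; rule 2 caps the number of environment-tagged (dev-/prod-) images.
resource "aws_ecr_lifecycle_policy" "app" {
  repository = aws_ecr_repository.app.name # the repository resource is elided here

  policy = jsonencode({
    rules = [
      {
        rulePriority = 1
        description  = "Expire untagged images aggressively"
        selection = {
          tagStatus   = "untagged"
          countType   = "sinceImagePushed"
          countUnit   = "days"
          countNumber = 1 # assumed expiry window
        }
        action = { type = "expire" }
      },
      {
        rulePriority = 2
        description  = "Retain only the most recent environment-tagged images"
        selection = {
          tagStatus     = "tagged"
          tagPrefixList = ["dev-", "prod-"]
          countType     = "imageCountMoreThan"
          countNumber   = var.tagged_image_retention # 3 in dev, 10 in prod
        }
        action = { type = "expire" }
      }
    ]
  })
}
```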
The infrastructure uses a robust, layered security model based on IAM roles (for access control between AWS services) and Security Groups (for network traffic filtering).
- ECS EC2 Instance Role (`ecs_instance_role`): This role is assumed by the EC2 container instances. Its permissions allow the instance to:
  - Join the Cluster: Register itself as a Container Instance with the ECS Control Plane (`ecs:RegisterContainerInstance`).
  - Pull Images: Get authorization tokens from ECR to pull Docker images (`ecr:GetAuthorizationToken`, `ecr:BatchGetImage`, etc.).
  - Logging: Write operational logs to AWS CloudWatch Logs.
  - SSM Access: Includes the managed policy `AmazonSSMManagedInstanceCore` to enable secure remote access to the EC2 instances via AWS Session Manager (SSM).
- ECS Task Execution Role (`ecs_task_execution_role`): This role is assumed by the ECS service itself. It provides the permissions needed for the ECS agent to perform actions on behalf of your tasks, specifically:
  - Image Pull: Pull the required Docker image from ECR.
  - Log Management: Write container application logs to the designated CloudWatch Log Group (`my-app-lg`).
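A condensed sketch of these two roles follows. The README lists individual API actions; here the AWS managed policies that bundle comparable permissions are shown as one possible implementation, and the resource addresses are assumptions:

```hcl
# Instance role assumed by the EC2 container instances.
resource "aws_iam_role" "ecs_instance_role" {
  name = "ecs_instance_role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# Cluster registration, ECR pulls, and CloudWatch Logs via the AWS managed policy.
resource "aws_iam_role_policy_attachment" "ecs_instance" {
  role       = aws_iam_role.ecs_instance_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}

# Session Manager access for secure remote shells.
resource "aws_iam_role_policy_attachment" "ssm" {
  role       = aws_iam_role.ecs_instance_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

# Task execution role used by the ECS agent to pull images and write logs.
resource "aws_iam_role" "ecs_task_execution_role" {
  name = "ecs_task_execution_role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "task_execution" {
  role       = aws_iam_role.ecs_task_execution_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
```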
The network model is secured by isolating the application tier within the private subnets and restricting access based on the least-privilege principle.
- Application Load Balancer SG (`alb-sg`):
  - Ingress: Allows inbound traffic from `0.0.0.0/0` (the entire internet) on Port 80 (HTTP) and Port 443 (HTTPS). This is the public entry point.
  - Egress: Allows all outbound traffic (`0.0.0.0/0` on all ports/protocols). This is critical as it enables the ALB to initiate the connection to the backend ECS tasks.
- ECS Tasks SG (`ecs-tasks-sg`):
  - Ingress: Highly restricted. Allows incoming traffic only on the application's container port (`3000`) and only when the source is the `alb-sg`. This prevents direct internet access to the containers.
  - Egress: Allows all outbound traffic (`0.0.0.0/0`), enabling tasks to pull dependencies and access other necessary AWS services via the NAT Gateway.
- ECS Cluster SG (`cluster-sg`): This is associated with the underlying EC2 instances. It ensures the instances themselves can communicate and perform necessary management functions.
  - Egress: Allows all outbound traffic (`0.0.0.0/0`) for tasks like instance patching, running the ECS Agent, and pulling images.
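A minimal sketch of the ALB-to-task restriction described above (assuming a `vpc_id` variable; the rule shapes follow the description, not the actual module code):

```hcl
# Public entry point: HTTP and HTTPS from anywhere into the ALB.
resource "aws_security_group" "alb_sg" {
  name   = "alb-sg"
  vpc_id = var.vpc_id

  dynamic "ingress" {
    for_each = [80, 443]
    content {
      from_port   = ingress.value
      to_port     = ingress.value
      protocol    = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    }
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Tasks accept traffic on port 3000 only when it originates from the ALB SG.
resource "aws_security_group" "ecs_tasks_sg" {
  name   = "ecs-tasks-sg"
  vpc_id = var.vpc_id

  ingress {
    from_port       = 3000
    to_port         = 3000
    protocol        = "tcp"
    security_groups = [aws_security_group.alb_sg.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```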
The Terraform code supports configuration differences between dev and prod environments, driven by environment-specific variable lookups.
| Setting | dev Value | prod Value | Motivation |
|---|---|---|---|
| VPC NAT Gateway | `true` (single) | `false` (multiple) | Cost-saving in dev; multiple NAT GWs in prod ensure high availability. |
| ECR Tagged Image Retention | Retains max 3 tagged images. | Retains max 10 tagged images. | Minimizes dev ECR size; keeps a deeper history for production rollbacks. Untagged images are cleaned up aggressively in both environments. |
| Min/Max EC2 Instances (ASG) | 1 / 2 | 2 / 4 | Smaller, cheaper cluster in dev; larger cluster in prod for baseline capacity and scaling safety. |
| ECS Cluster Max Utilisation | 100 | 75 | Allows dev EC2 hosts to run at full capacity; 75% in prod provides a buffer for immediate scaling and stability. |
| EC2 Scale-In Protection (ASG) | `false` | `true` | Disabled in dev to allow quick teardown; enabled in prod to prevent the ASG from terminating instances currently hosting tasks. |
| ECS Task Placement Strategy | `binpack: cpu` | `spread: az`, then `spread: instanceId` | Cost optimization (place tasks on the fewest possible instances) in dev; maximized fault tolerance (against single-instance failures) in prod. |
| ALB Deletion Protection | `false` | `true` | Prevents accidental deletion of the load balancer in prod. |
| Route53 `force_destroy` | `true` | `false` | Allows quick cleanup in dev; protects the production domain from accidental deletion in prod. |
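One common way to write such environment-driven lookups is a `locals` map keyed by the environment name. In the sketch below the values come from the table, while the key and attribute names are illustrative:

```hcl
variable "environment" {
  type = string # "dev" or "prod"
}

# Environment-specific settings keyed by environment name.
locals {
  env_settings = {
    dev = {
      single_nat_gateway      = true
      ecr_tagged_image_count  = 3
      asg_min_size            = 1
      asg_max_size            = 2
      cluster_target_capacity = 100
      protect_from_scale_in   = false
      alb_deletion_protection = false
      route53_force_destroy   = true
    }
    prod = {
      single_nat_gateway      = false
      ecr_tagged_image_count  = 10
      asg_min_size            = 2
      asg_max_size            = 4
      cluster_target_capacity = 75
      protect_from_scale_in   = true
      alb_deletion_protection = true
      route53_force_destroy   = false
    }
  }

  settings = local.env_settings[var.environment]
}

# Example usage: pass the looked-up values into a child module (argument names assumed).
module "ecs_cluster" {
  source                = "../../../modules/ecs_cluster"
  min_size              = local.settings.asg_min_size
  max_size              = local.settings.asg_max_size
  target_capacity       = local.settings.cluster_target_capacity
  protect_from_scale_in = local.settings.protect_from_scale_in
}
```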
All infrastructure changes are managed via GitHub Actions (GHA) workflows. The deployment is split into an initial setup and a main deployment because of dependencies: the Route 53 Hosted Zone and the S3 state bucket must exist before ACM validation and the remote Terraform state can use them.
To deploy this project, you must own a domain name that will be served over HTTPS with an SSL certificate (e.g., https://example.com), and you must be able to perform manual DNS updates at your domain provider.
- Execute `deploy_hosted_zone.yaml` (Manual Trigger): This job calls a reusable workflow to deploy the remote Terraform state S3 bucket (`josumartinez-terraform-state-bucket`) and then creates the Route 53 Hosted Zone.
- Manual Action: After the job succeeds, go to your domain hosting provider and update the DNS name servers to the ones provided by the new Route 53 Hosted Zone.
- Wait: Wait for DNS propagation to complete. The ACM certificate validation depends on this.
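A sketch of the hosted zone that `deploy_hosted_zone.yaml` provisions, including the name servers you then copy to your domain provider (resource and variable names are assumptions; the `force_destroy` toggle mirrors the environment table above):

```hcl
# Hosted zone plus the name servers to configure at the registrar.
resource "aws_route53_zone" "main" {
  name          = var.domain_name          # e.g. "example.com"
  force_destroy = var.environment == "dev" # true in dev, false in prod
}

output "name_servers" {
  description = "Set these as the name servers at your domain hosting provider."
  value       = aws_route53_zone.main.name_servers
}
```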
Once the DNS is propagated, run the main deployment workflow (triggered on push to main):
- `deploy_aws_infra.yaml`: This workflow executes the deployment in a dependency-aware order:
  - `deploy-ecr`: Creates the ECR repository.
  - `retrieve-ssl`: Requests and validates the ACM certificate.
  - `build-and-push-app-docker-image-to-ecr`: Builds the NestJS app and pushes the Docker image (tagged with `${{ env.ENVIRONMENT }}-${{ github.sha }}`) to ECR.
  - `deploy-vpc`: Creates the VPC, subnets, and NAT Gateways.
  - `deploy-ecs-cluster`: Creates the ECS Cluster, IAM roles, and the ASG Launch Template/Capacity Provider.
  - `deploy-alb`: Creates the Application Load Balancer and its listeners (HTTPS + HTTP redirect).
  - `deploy-ecs-service`: Creates the ECS Task Definition and Service, linking to the ALB Target Group and configuring Task Auto Scaling.
  - `deploy-routing`: Creates the Route 53 A records for the root and `www` domains, pointing to the ALB.
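As an example of what one of these stages provisions, here is a hedged sketch of the `deploy-ecs-service` wiring. The variable names, container name, and desired count are assumptions; the port, `awsvpc` networking, and the prod placement strategy come from earlier sections:

```hcl
# The ECS service ties the task definition to the ALB target group.
resource "aws_ecs_service" "app" {
  name            = "my-app-service"      # assumed name
  cluster         = var.cluster_arn       # output of the ecs_cluster stage
  task_definition = var.task_definition_arn
  desired_count   = 2                     # assumed baseline

  network_configuration {                 # tasks use awsvpc networking
    subnets         = var.private_subnet_ids
    security_groups = [var.ecs_tasks_sg_id]
  }

  load_balancer {
    target_group_arn = var.target_group_arn # output of the alb stage
    container_name   = "my-app"             # assumed container name
    container_port   = 3000
  }

  # prod placement per the table above: spread across AZs, then across instances.
  ordered_placement_strategy {
    type  = "spread"
    field = "attribute:ecs.availability-zone"
  }
  ordered_placement_strategy {
    type  = "spread"
    field = "instanceId"
  }
}
```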
Cleanup is performed in order across two workflows (both manually triggered via `workflow_dispatch`):
- `destroy_aws_infra.yaml`: This workflow destroys the application and its core services first:
  - `destroy-ecs-service`: Scales the ECS service down to 0 tasks and waits for stability, then destroys the ECS service and its resources.
  - `destroy-routing`, `destroy-alb`, `destroy-ssl`, `destroy-ecs-cluster`: Destroy resources in the reverse order of deployment.
  - `destroy-ecr`: Crucially, it first runs an AWS CLI command to delete all images from the repository and then destroys the ECR repository resource.
  - `destroy-vpc`: Destroys the VPC and networking components.
- `destroy_hosted_zone.yaml`: This performs the final cleanup:
  - `destroy-hosted-zone`: Destroys the Route 53 Hosted Zone.
  - `destroy-terraform-state-bucket`: It first deletes all objects (`aws s3 rm --recursive`) and then uses Terraform to destroy the empty S3 state bucket.