Skip to content

Conversation

@aknysh
Copy link
Member

@aknysh aknysh commented Nov 2, 2025

what

  • Add ability to configure different numbers of public and private subnets per Availability Zone independently
  • Add controlled NAT Gateway placement by subnet index to reduce costs
  • Add intuitive NAT Gateway placement by subnet name for better usability
  • Fix critical NAT Gateway placement bug causing wrong AZ distribution
  • Fix cross-AZ routing issue where private subnets routed to NATs in different AZs
  • Add comprehensive examples demonstrating cost-optimized and high-availability configurations
  • Add full test coverage with Terratest for all new features
  • Maintain 100% backward compatibility with existing configurations

why

User Pain Points:

  • Users were forced to create equal numbers of public and private subnets, even when workloads didn't require it
  • NAT Gateways were created in every public subnet, resulting in unnecessarily high AWS costs (~$32/month per NAT)
  • No control over which public subnets received NAT Gateways
  • Index-based configuration was not intuitive for users who assigned names to subnets
  • Critical bugs caused NAT Gateways to be placed in wrong AZs and private subnets to route across AZ boundaries

Business Impact:

  • Cost Optimization: Reducing from 6 NATs to 3 NATs saves $96/month (50% reduction)
  • Flexibility: Users can now match subnet configuration to their actual workload requirements
  • Reliability: Fixes ensure NAT Gateways are correctly distributed across AZs and routing stays within same AZ
  • Usability: Name-based placement is more intuitive and maintainable than index-based placement

Key Features:

  1. Separate Public/Private Subnet Counts: New variables public_subnets_per_az_count, public_subnets_per_az_names, private_subnets_per_az_count, private_subnets_per_az_names allow independent control while falling back to original variables for backward compatibility

  2. Controlled NAT Placement by Index: Variable nat_gateway_public_subnet_indices (default [0]) specifies which subnet position(s) in each AZ receive NAT Gateways, enabling cost optimization

  3. Named NAT Placement: Variable nat_gateway_public_subnet_names allows intuitive placement like ["loadbalancer"] instead of remembering indices

  4. Bug Fixes: Corrected NAT Gateway global index calculation and route table mapping to ensure proper AZ distribution and same-AZ routing

Examples Included:

  • examples/separate-public-private-subnets/: Cost-optimized with 1 NAT per AZ (~$110/month)
  • examples/redundant-nat-gateways/: High-availability with 2 NATs per AZ (~$140/month)

Test Coverage:

  • Full Terratest coverage for both examples
  • Tests for name-based and index-based NAT placement
  • Tests for disabled state (no resources created)
  • Verification of all outputs, subnet counts, NAT counts, and route table mappings

references

  • Comprehensive PRD: docs/prd/separate-public-private-subnets-and-nat-placement.md

@aknysh aknysh added the major Breaking changes (or first stable release) label Nov 2, 2025
@aknysh aknysh requested review from a team as code owners November 2, 2025 02:08
@aknysh aknysh requested review from hans-d and nitrocode November 2, 2025 02:08
@aknysh aknysh self-assigned this Nov 2, 2025
@aknysh aknysh removed request for hans-d and nitrocode November 2, 2025 02:10
aknysh and others added 7 commits November 1, 2025 22:25
Incorporate feature from PR #225 to expose NAT Gateway IDs in the subnet stats outputs.

- Add `nat_gateway_id` field to `named_private_subnets_stats_map` (maps to the NAT Gateway that the private subnet routes to for egress)
- Add `nat_gateway_id` field to `named_public_subnets_stats_map` (maps to the NAT Gateway in that public subnet, if any)
- Create helper locals to correctly map subnets to NAT Gateways
- Update output descriptions to reflect the new fourth field

This makes the subnet stats more complete and enables downstream components to reference NAT Gateway IDs when needed (e.g., network firewall routing configurations).

Implementation correctly handles our new NAT placement features:
- Works with index-based NAT placement (`nat_gateway_public_subnet_indices`)
- Works with name-based NAT placement (`nat_gateway_public_subnet_names`)
- Private subnets correctly map to the NAT they route to (using existing `private_route_table_to_nat_map`)
- Public subnets correctly identify which ones contain NAT Gateways

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Update deprecated `vpc = true` to `domain = "vpc"` in aws_eip resource.

The `vpc` argument was deprecated in AWS Provider v5.0 and removed completely.
The new syntax uses `domain = "vpc"` instead.

This fixes test failures where Terraform validation fails with:
"Error: Unsupported argument - An argument named 'vpc' is not expected here"

Fixes the same issue that caused PR #225 tests to fail.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Update cloudposse/vpc/aws module from v2.0.0 to v3.0.0 across all example configurations.

Changes:
- examples/complete/main.tf
- examples/existing-ips/main.tf
- examples/multiple-subnets-per-az/main.tf
- examples/nacls/main.tf
- examples/redundant-nat-gateways/main.tf
- examples/separate-public-private-subnets/main.tf

v3.0.0 includes:
- Updates to Terraform AWS provider v6 compatibility
- Enhanced VPC endpoint support
- Bug fixes and improvements

This ensures all examples use the latest stable VPC module version with AWS Provider v6 support.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Enhanced PRD and README with comprehensive documentation of the NAT Gateway
ID mapping feature and AWS Provider compatibility updates.

Changes:

**PRD Updates (docs/prd/separate-public-private-subnets-and-nat-placement.md):**
- Updated executive summary to include 4th feature
- Added Feature #4: NAT Gateway ID Exposure in Subnet Stats
  - Detailed problem, solution, and implementation
  - Example output structures for both private and public subnet stats
  - Benefits and use cases (network firewall routing)
- Added Bug Fix #3: AWS Provider v5+ Compatibility
  - Documents EIP `vpc = true` → `domain = "vpc"` migration
  - Notes VPC module updates to v3.0.0

**README.yaml Updates:**
- Added new section "NAT Gateway ID References in Outputs"
- Detailed explanation of `named_private_subnets_stats_map` structure (4 fields)
- Detailed explanation of `named_public_subnets_stats_map` structure (4 fields)
- Real-world use case example: network firewall routing configuration
- Shows how to extract NAT Gateway IDs from stats maps

**README.md (Generated):**
- Auto-generated from README.yaml using `atmos readme`
- Includes all new documentation sections
- Updated output descriptions to reflect 4-field structure

The NAT Gateway ID mapping enables downstream components (network firewalls,
routing policies) to reference NAT Gateways associated with subnets without
manual mapping or additional data sources.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Standardize AWS Provider version constraints across the module and all examples
to ensure compatibility with dependencies and features.

Changes:

**Main Module (versions.tf):**
- Updated from `>= 3.71.0` to `>= 5.0`
- Rationale: AWS Provider v5 introduced `domain = "vpc"` for EIP resources
- Module is compatible with v5+ (main code uses neither `vpc` nor `domain` attributes for broad compatibility)
- Documented AWS Provider v5+ compatibility in PRD

**All Examples (examples/*/versions.tf):**
- Updated from `>= 3.71.0` or `>= 4.0` to `>= 6.0`
- Rationale: All examples use cloudposse/vpc/aws v3.0.0 which requires AWS Provider v6
- VPC module v3.0.0 released September 12, 2025 with AWS Provider v6 support
- Examples affected:
  - examples/complete/versions.tf
  - examples/existing-ips/versions.tf (also uses `domain = "vpc"`)
  - examples/multiple-subnets-per-az/versions.tf
  - examples/nacls/versions.tf
  - examples/redundant-nat-gateways/versions.tf
  - examples/separate-public-private-subnets/versions.tf

**Version Matrix:**
- Module core: AWS Provider >= 5.0 (broadest compatibility)
- Examples: AWS Provider >= 6.0 (required by VPC module dependency)

This ensures all examples work correctly with their dependencies and prevents
version conflict errors during terraform init.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Update test dependencies to their latest stable releases for improved
security, performance, and compatibility.

**Go Version:**
- Updated from Go 1.24 to Go 1.25
- Updated toolchain from go1.24.0 to go1.25.0

**Major Dependency Updates:**

**Testing Framework:**
- terratest: v0.41.23 → v0.52.0
- testify: v1.8.2 → v1.11.1

**Kubernetes:**
- k8s.io/apimachinery: v0.20.6 → v0.34.0
- k8s.io/api: v0.20.6 → v0.34.0
- k8s.io/client-go: v0.20.6 → v0.34.0
- k8s.io/klog/v2: v2.4.0 → v2.130.1

**AWS SDK v2 (new):**
- Migrated to AWS SDK v2 packages
- Added support for multiple AWS services (EC2, S3, IAM, RDS, etc.)

**HashiCorp:**
- go-getter: v1.7.1 → v1.8.3 + v2.2.3
- hcl/v2: v2.9.1 → v2.24.0
- terraform-json: v0.13.0 → v0.27.2
- go-version: v1.6.0 → v1.7.0

**Google Cloud:**
- Updated all google.golang.org packages to latest versions
- grpc: v1.56.3 → v1.76.0
- protobuf: v1.30.0 → v1.36.10

**golang.org/x packages:**
- crypto: v0.1.0 → v0.43.0
- net: v0.9.0 → v0.46.0
- oauth2: v0.7.0 → v0.32.0
- sys: v0.7.0 → v0.37.0
- text: v0.9.0 → v0.30.0
- tools: v0.6.0 → v0.38.0

**Other Notable Updates:**
- klauspost/compress: v1.15.11 → v1.18.1
- mattn/go-zglob: v0.0.2 → v0.6
- ulikunitz/xz: v0.5.10 → v0.5.15
- zclconf/go-cty: v1.9.1 → v1.17.0

All dependencies updated using `go get -u ./...` and cleaned with `go mod tidy`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Remove unused local variable `subnet_az_count` that was causing tflint failure.

This variable was a remnant from the refactoring where we separated public
and private subnet counts. It was declared but never used anywhere in the code.

The variable was calculating `max(local.public_subnet_az_count, local.private_subnet_az_count)`
but this value is not needed since we now handle public and private subnets
independently using:
- `local.public_subnet_az_count` - for public subnet operations
- `local.private_subnet_az_count` - for private subnet operations
- `local.vpc_az_count` - for NAT gateways (one per AZ)

Fixes tflint warning:
  main.tf:86:3: warning: local.subnet_az_count is declared but not used

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@aknysh
Copy link
Member Author

aknysh commented Nov 2, 2025

/terratest

…ble to examples

Add the nat_gateway_public_subnet_indices variable to both new example configurations
to fix test failures where the test was trying to pass this variable but it wasn't
declared in the example's variables.tf.

**Test Failure Fixed**:
TestExamplesSeparatePublicPrivateSubnetsWithIndices was failing with:
"Error: Value for undeclared variable - A variable named 'nat_gateway_public_subnet_indices'
was assigned on the command line, but the root module does not declare a variable of that name."

**Changes**:

1. examples/separate-public-private-subnets/:
   - Added nat_gateway_public_subnet_indices variable with default [0]
   - Made nat_gateway_public_subnet_names nullable
   - Passed nat_gateway_public_subnet_indices to module in main.tf

2. examples/redundant-nat-gateways/:
   - Added nat_gateway_public_subnet_indices variable with default [0, 1]
   - Made nat_gateway_public_subnet_names nullable
   - Passed nat_gateway_public_subnet_indices to module in main.tf

Both examples now support passing either nat_gateway_public_subnet_indices OR
nat_gateway_public_subnet_names, enabling the test to override with index-based
placement when needed.

Fixes test at examples_separate_public_private_subnets_test.go:169

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@aknysh
Copy link
Member Author

aknysh commented Nov 2, 2025

/terratest

@aknysh
Copy link
Member Author

aknysh commented Nov 2, 2025

/terratest

@aknysh
Copy link
Member Author

aknysh commented Nov 2, 2025

/terratest

@aknysh aknysh requested a review from osterman November 2, 2025 17:53
@aknysh aknysh merged commit 6f61890 into main Nov 2, 2025
20 checks passed
@aknysh aknysh deleted the separate-private-public-subnets branch November 2, 2025 22:51
@github-actions
Copy link
Contributor

github-actions bot commented Nov 2, 2025

These changes were released in v3.0.0.

aknysh added a commit that referenced this pull request Nov 3, 2025
**Problem:**
When max_nats < number of AZs, NAT Gateways are only created in the
first max_nats AZs, but the route table mapping formulas assumed NATs
exist in all AZs, causing "Invalid index" errors.

**Example scenario that failed:**
- 2 AZs, max_nats=1 → Only 1 NAT created (in AZ0)
- 2 private route tables (1 per AZ)
- Mapping produced [0, 1] but only NAT[0] exists
- Error: aws_nat_gateway.default[1] - invalid index

**Root cause:**
The private_route_table_to_nat_map and public_route_table_to_nat_map
formulas calculated: az_index * nats_per_az + subnet_within_az

This works when NATs exist in all AZs, but fails when max_nats limits
NATs to fewer AZs. The formula could generate indices >= nat_count.

**Fix:**
Added modulo operation to clamp results to available NAT indices:
  (floor(i / subnets_per_az_count) * nats_per_az +
   (i % subnets_per_az_count) % nats_per_az) % nat_count

Now route tables in AZs without NATs will wrap around to use NATs
from other AZs (typically AZ0).

**Why wasn't this caught by tests?**
None of the module's examples use the max_nats feature. All examples
either omit max_nats (defaulting to unlimited) or use max_nats >=
number of AZs, so the bug was never triggered.

**Testing:**
This bug was discovered by the aws-vpc component test suite when
using:
- 2 AZs with max_nats: 1
- public_subnets_enabled: true
- NAT Gateway enabled

Fixes: #226
aknysh added a commit that referenced this pull request Nov 4, 2025
* Fix NAT routing when max_nats limits NATs to fewer AZs

**Problem:**
When max_nats < number of AZs, NAT Gateways are only created in the
first max_nats AZs, but the route table mapping formulas assumed NATs
exist in all AZs, causing "Invalid index" errors.

**Example scenario that failed:**
- 2 AZs, max_nats=1 → Only 1 NAT created (in AZ0)
- 2 private route tables (1 per AZ)
- Mapping produced [0, 1] but only NAT[0] exists
- Error: aws_nat_gateway.default[1] - invalid index

**Root cause:**
The private_route_table_to_nat_map and public_route_table_to_nat_map
formulas calculated: az_index * nats_per_az + subnet_within_az

This works when NATs exist in all AZs, but fails when max_nats limits
NATs to fewer AZs. The formula could generate indices >= nat_count.

**Fix:**
Added modulo operation to clamp results to available NAT indices:
  (floor(i / subnets_per_az_count) * nats_per_az +
   (i % subnets_per_az_count) % nats_per_az) % nat_count

Now route tables in AZs without NATs will wrap around to use NATs
from other AZs (typically AZ0).

**Why wasn't this caught by tests?**
None of the module's examples use the max_nats feature. All examples
either omit max_nats (defaulting to unlimited) or use max_nats >=
number of AZs, so the bug was never triggered.

**Testing:**
This bug was discovered by the aws-vpc component test suite when
using:
- 2 AZs with max_nats: 1
- public_subnets_enabled: true
- NAT Gateway enabled

Fixes: #226

* Add test coverage for max_nats feature

**New Example: limited-nat-gateways**
Tests the critical max_nats < num_azs scenario that was previously
untested and contained the bug fixed in commit 3681299.

**What it tests:**
- 3 AZs with max_nats=1 (only 1 NAT Gateway created)
- 3 AZs with max_nats=2 (only 2 NAT Gateways created)
- Verifies that private subnets in AZs without NATs can still route
  to the internet through NATs in other AZs
- Ensures route table mapping correctly wraps around when NATs are
  limited to fewer AZs

**Why this is important:**
This test would have caught the bug before it reached production.
The bug caused "Invalid index" errors when max_nats < num_azs because
route tables in AZs without NATs tried to reference non-existent NAT
Gateway indices.

**Use case documented:**
The example includes comprehensive README explaining when to use
max_nats for cost optimization:
- Dev/test environments: 1 NAT for 3 AZs saves ~$65/month (67%)
- Trade-off: Lower availability vs. cost savings
- When to use and when to avoid this pattern

**Test coverage:**
- TestExamplesLimitedNatGateways: 3 AZs, 1 NAT
- TestExamplesLimitedNatGatewaysTwoNats: 3 AZs, 2 NATs
- TestExamplesLimitedNatGatewaysDisabled: enabled=false check

This brings total test coverage to:
- 7 examples (was 6)
- 11 test functions (was 8)
- ✅ max_nats feature now tested

Closes test coverage gap identified during bug investigation.

* Add comprehensive documentation for max_nats bug fix

**Documentation Added:**

1. **Test Coverage Analysis** (docs/test-coverage-analysis.md)
   - Complete analysis of all 7 examples and 13 test functions
   - Feature coverage matrix showing what's tested vs untested
   - Identified critical gaps: NAT Instance, IPv6/NAT64, custom route tables
   - Recommendations for future test improvements
   - Test execution strategy and best practices

2. **PRD: Fix max_nats Routing Bug** (docs/prd/fix-max-nats-routing.md)
   - Executive summary of the critical bug
   - Detailed problem statement with error examples
   - Root cause analysis and why tests didn't catch it
   - Technical solution design with before/after code
   - Complete testing strategy with 3 new test functions
   - Cost analysis showing 67% savings with max_nats
   - Rollout plan, success metrics, and risk assessment
   - Lessons learned and process improvements

**Key Findings:**

- Bug existed because max_nats had ZERO test coverage
- Module has good test coverage overall (13 tests, 7 examples)
- Several features still lack tests: NAT Instance, IPv6, custom routes
- New limited-nat-gateways example fills critical gap
- Cost optimization: $65/month savings per environment with max_nats=1

**Impact:**

This documentation ensures:
- Future contributors understand test requirements
- Test gaps are visible and prioritized
- Cost optimization features are documented
- Bug investigation process is recorded
- Test coverage improvements have a roadmap

* Enhance PRD with NAT placement diagrams and decision tree

**Major Additions:**

1. **NAT Gateway Placement Behavior Section**
   - Explains the two configuration dimensions: subnet names and max_nats
   - Formula: Total NATs = min(num_azs, max_nats) × num_subnet_names
   - Clarifies that multiple subnet names = multiple NATs per AZ

2. **Visual ASCII Diagrams for 4 Placement Strategies**
   - Strategy 1: Standard (1 NAT per AZ) - $97.20/month
   - Strategy 2: Redundant (2 NATs per AZ) - $194.40/month
   - Strategy 3: Limited (1 NAT total with max_nats) - $32.40/month
   - Strategy 4: Hybrid (2 NATs in 1 AZ) - $64.80/month

   Each diagram shows:
   - Which AZs get NATs
   - Which subnet types get NATs
   - Routing arrows for cross-AZ traffic
   - Monthly cost calculations

3. **Configuration Calculation Table**
   - 6 common configuration scenarios
   - Shows AZs, subnet names, max_nats, total NATs, cost, and use case
   - Helps users quickly find the right configuration

4. **Decision Tree for Choosing NAT Configuration**
   - Production vs non-production paths
   - Maximum availability vs cost optimization
   - Special requirements (failover, single AZ, no NAT)
   - Recommendations with cost and availability trade-offs

5. **Routing Behavior Explanation**
   - Mathematical formula breakdown with examples
   - Shows how modulo operation prevents invalid indices
   - Demonstrates the bug fix in action

6. **Best Practices by Environment Type**
   - Production: NAT per AZ minimum, consider redundancy
   - Development: max_nats=1 for cost savings
   - Staging: Balanced approach with max_nats=2
   - Cost optimization tips

**Why This Matters:**

The original PRD didn't clearly explain that:
- nat_gateway_public_subnet_names can create MULTIPLE NATs per AZ
- max_nats limits which AZs get NATs, not total NAT count
- These two variables multiply together

This caused confusion about cost implications. The new diagrams and
decision tree make it crystal clear how to configure NATs for different
scenarios and understand the cost/availability trade-offs.

**Examples:**
- 3 AZs + names=["lb","web"] + max_nats=3 = 6 NATs ($194/mo)
- 3 AZs + names=["lb"] + max_nats=1 = 1 NAT ($32/mo)
- Understanding this prevents costly misconfigurations

* updates

* Fix test variable override issue for max_nats

Remove hardcoded `max_nats = 1` from fixtures.us-east-2.tfvars to allow
tests to properly override the value via the Vars parameter.

**Problem:**
TestExamplesLimitedNatGatewaysTwoNats was failing because it tried to
override max_nats=2 via Vars, but the fixtures file had max_nats=1
hardcoded. Depending on Terraform's variable precedence and how terratest
constructs the command line, the -var-file value may take precedence over
the -var flag.

**Solution:**
Remove max_nats from fixtures file. Both tests explicitly pass max_nats in
their Vars map:
- TestExamplesLimitedNatGateways passes max_nats: 1
- TestExamplesLimitedNatGatewaysTwoNats passes max_nats: 2

The example's variables.tf provides default = 1 as a fallback.

Fixes the test failure where expected 2 NAT Gateways but got 1.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* updates

---------

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

major Breaking changes (or first stable release)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants