Description
I encountered issues with node group creation and upgrades using the terraform-aws-eks module version v20.31.6. Below are the details:
- During the initial creation of a node group with the `labels` configuration, the process took 20 minutes and failed with the error: `NodeCreationFailure: Instances failed to join the Kubernetes cluster`. Removing the `labels` configuration allowed the node group to be created successfully. I then re-enabled the labels and ran `terraform apply`, which updated the labels successfully.
- While upgrading a node group with `use_latest_ami_release_version = true`, the process took a long time and failed with the error: `NodeCreationFailure: Couldn't proceed with upgrade process as new nodes are not joining the node group`. I suspect this issue might be related to the `labels` configuration, similar to issue 1.
- ✋ I have searched the open/closed issues and my issue is not listed.
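For quick reference, these are the two node group settings involved in both failures, excerpted from the full reproduction code below:

```hcl
eks_managed_node_groups = {
  default_al2023_ng = {
    # Issue 2: tracks the newest AMI release on every apply
    use_latest_ami_release_version = true
    # Issue 1: initial creation fails while this label is set
    labels = {
      "karpenter.sh/controller" = "true"
    }
  }
}
```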
Versions
- Module version: v20.31.6
- Terraform version: Terraform v1.5.7 on darwin_arm64
- Provider version(s): hashicorp/aws v5.84.0
Reproduction Code [Required]
module "eks" {
source = "terraform-aws-modules/eks/aws"
#Develop on git Tags v20.31.6
version = "~> 20.31"
cluster_name = "eks-cluster-test"
cluster_version = 1.31
vpc_id = data.aws_vpc.main.id
subnet_ids = data.aws_subnet.private_subnets[*].id
control_plane_subnet_ids = data.aws_subnet.private_eks_subnets[*].id
cluster_endpoint_public_access = true
cluster_endpoint_public_access_cidrs = var.eks_public_access_cidrs
cluster_service_ipv4_cidr = var.cluster_service_ipv4_cidr
enable_irsa = true
cluster_addons = {
aws-ebs-csi-driver = {
most_recent = true
}
coredns = {
most_recent = true
resolve_conflicts = "OVERWRITE"
timeouts = {
create = "25m"
delete = "10m"
}
}
kube-proxy = {
most_recent = true
resolve_conflicts = "OVERWRITE"
}
vpc-cni = {
most_recent = true
before_compute = true
resolve_conflicts = "OVERWRITE"
configuration_values = jsonencode({
env = {
AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG = "true"
ENI_CONFIG_LABEL_DEF = "topology.kubernetes.io/zone"
ENABLE_PREFIX_DELEGATION = "true"
WARM_PREFIX_TARGET = "1"
}
})
}
}
cluster_upgrade_policy = {
support_type = "STANDARD"
}
eks_managed_node_groups = {
default_al2023_ng = {
name = "al2023-df-ng"
use_latest_ami_release_version = true
ami_type = "AL2023_x86_64_STANDARD"
instance_types = ["c5.large"]
capacity_type = "ON_DEMAND"
subnet_ids = data.aws_subnet.private_subnets[*].id
disk_size = 100
min_size = 1
max_size = 3
desired_size = 2
block_device_mappings = {
xvda = {
device_name = "/dev/xvda"
ebs = {
volume_size = 100
volume_type = "gp3"
iops = 3000
throughput = 150
delete_on_termination = true
}
}
}
cloudinit_pre_nodeadm = [
{
content_type = "application/node.eks.aws"
content = <<-EOT
---
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
kubelet:
config:
shutdownGracePeriod: 30s
featureGates:
DisableKubeletCloudCredentialProviders: true
EOT
}
]
labels = {
"karpenter.sh/controller" = "true"
}
}
}
enable_cluster_creator_admin_permissions = true
authentication_mode = "API_AND_CONFIG_MAP"
access_entries = {
adminteam = {
principal_arn = iam_role_arn_xxxxx
policy_associations = {
admin_policy = {
policy_arn = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"
access_scope = {
namespaces = []
type = "cluster"
}
}
}
}
}
node_security_group_tags = merge(var.tags, {
"karpenter.sh/discovery" = "eks-cluster-test"
})
tags = var.tags
}
Steps to reproduce the behavior:
- Create a node group with the provided configuration.
- Observe failure during the creation process (`NodeCreationFailure` error).
- Remove the `labels` configuration and reapply Terraform.
- Re-enable `labels` and apply Terraform again (successfully updates the labels); a sketch of this two-step toggle follows this list.
- Enable `use_latest_ami_release_version = true` and attempt to upgrade.
- Observe the upgrade failure with the error: `NodeCreationFailure: Couldn't proceed with upgrade process as new nodes are not joining the node group.`
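A minimal sketch of the remove/re-enable workaround from steps 3–4, assuming a hypothetical `enable_ng_labels` variable to toggle the labels between applies (neither the variable nor the local is part of the module):

```hcl
# Hypothetical toggle, used only to illustrate the two-step workaround
variable "enable_ng_labels" {
  type    = bool
  default = false # first apply: create the node group without labels
}

locals {
  # Empty map on the first apply; the real labels on the second
  ng_labels = var.enable_ng_labels ? { "karpenter.sh/controller" = "true" } : {}
}
```

With `labels = local.ng_labels` in the node group definition, the first `terraform apply` creates the node group label-free, and a second apply with `-var="enable_ng_labels=true"` then adds the labels in place, matching the behavior observed above.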
Expected behavior
- Node group creation should succeed with `labels` configured.
- Node group upgrade should proceed successfully when `use_latest_ami_release_version = true` is set with labels configured.
Actual behavior
- Node group creation fails when `labels` are configured initially.
- Node group upgrade fails with the error: `NodeCreationFailure: Couldn't proceed with upgrade process as new nodes are not joining the node group.`
Additional context
- The issue seems to be resolved temporarily by removing `labels` during the initial creation and reapplying them later. However, the upgrade process remains problematic when `use_latest_ami_release_version = true`.
- I suspect the issue might be related to AMI updates or the interaction between `use_latest_ami_release_version` and `labels`; a pinning sketch follows below.
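If the AMI rollover is the trigger, pinning the release rather than tracking the latest may help isolate it. A minimal sketch, assuming the module's `ami_release_version` input; the version string is a placeholder, not a verified release:

```hcl
eks_managed_node_groups = {
  default_al2023_ng = {
    ami_type = "AL2023_x86_64_STANDARD"
    # Pin a known-good release instead of use_latest_ami_release_version = true;
    # "1.31.4-20250101" is a placeholder, not a verified AMI release version
    ami_release_version = "1.31.4-20250101"
    labels = {
      "karpenter.sh/controller" = "true"
    }
  }
}
```

If the upgrade succeeds with a pinned release but fails when tracking the latest, that would point at the AMI update path rather than the `labels` configuration.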