Remove autoscaler permissions from worker role #34
Potentially breaking changes
Terraform 0.13.3 or later required
This release requires Terraform 0.13.3 or later because it is affected by these bugs that are fixed in 0.13.3:
It may still be affected by hashicorp/terraform#25631, but we hope we have worked around that for now.
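For reference, the requirement can be enforced in your root module with Terraform's standard `required_version` constraint; this is a minimal sketch, not something this module sets for you:

```hcl
terraform {
  # Refuse to run with a Terraform version older than 0.13.3,
  # which contains the bug fixes this release depends on
  required_version = ">= 0.13.3"
}
```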
Securing the Cluster Autoscaler
Previously, setting `enable_cluster_autoscaler = true` turned on tagging sufficient for the Kubernetes Cluster Autoscaler to discover and manage the node group, and also added a policy to the node group worker role that allowed the workers to perform the autoscaling function. Since pods by default use the EC2 instance role, which in EKS node groups is the node group worker role, this allowed the Kubernetes Cluster Autoscaler to work from any node, but it also allowed any rogue pod to perform autoscaling actions.

With this release, `enable_cluster_autoscaler` is deprecated and its functions are replaced with 2 new variables:

- `cluster_autoscaler_enabled`, when `true`, causes this module to perform the labeling and tagging needed for the Kubernetes Cluster Autoscaler to discover and manage the node group
- `worker_role_autoscale_iam_enabled`, when `true`, causes this module to add the IAM policy to the worker IAM role that enables the workers (and, by default, any pods running on the workers) to perform autoscaling operations

Going forward, we recommend not using `enable_cluster_autoscaler` (it will eventually be removed) and leaving `worker_role_autoscale_iam_enabled` at its default value of `false`. If you want to use the Kubernetes Cluster Autoscaler, set `cluster_autoscaler_enabled = true` and use EKS IAM roles for service accounts to give the Cluster Autoscaler service account IAM permissions to perform autoscaling operations. Our Terraform module terraform-aws-eks-iam-role is available to help with this.
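For illustration, a minimal sketch of the recommended settings; the module block name, source, and other arguments are placeholders and will depend on your configuration:

```hcl
module "eks_node_group" {
  source = "..." # this module's source (registry or git reference)

  # Label and tag the node group so the Kubernetes Cluster Autoscaler
  # can discover and manage it
  cluster_autoscaler_enabled = true

  # Do not attach autoscaling permissions to the worker IAM role; grant them
  # to the Cluster Autoscaler's service account via IAM roles for service
  # accounts (e.g. using terraform-aws-eks-iam-role) instead
  worker_role_autoscale_iam_enabled = false

  # ... other node group settings
}
```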
Known issues
There remains a bug in amazon-vpc-cni-k8s (a.k.a. `amazon-k8s-cni:v1.6.3`) where, after deleting a node group, some ENIs for that node group may be left behind. If any are left behind, they will prevent any security group they are attached to (such as the security group created by this module to enable remote SSH access) from being deleted, and Terraform will relay an error when the deletion fails.

There is a feature request that should resolve this issue for our use case. Meanwhile, the good news is that the trigger is deleting a security group, which does not happen often, and even when the security group is deleted, we have been able to reduce the chance that the problem occurs. When it does happen, there are some workarounds:
- Manually delete the leftover ENIs, after which `terraform apply` will succeed in deleting the security group.
- The leftover ENIs are tagged with `Name=node.k8s.amazonaws.com/instance_id,Value=<instance-id>`, where `<instance-id>` is the EC2 instance ID of the instance the ENI is supposed to be associated with. A cleanup script could find ENIs with state = AVAILABLE and tagged as belonging to instances that are terminated or do not exist, and delete them.
- Manually delete the security group; its name ends with `-remoteAccess` so you can easily identify it. If you delete it inappropriately, Terraform will re-create it on the next plan/apply cycle, so this is a relatively safe operation.

Fortunately, this should be a rare occurrence, and we hope it will be definitively fixed in the next few months.
Reminder from 0.11.0: `create_before_destroy`
Starting with 0.11.0, you have the option of enabling `create_before_destroy` behavior for the node groups. We recommend doing so, as destroying a node group before creating its replacement can result in a significant cluster outage, but it is not without its downsides. Read the description and discussion in PR #31 for more details, including what the `random_pet` "keepers" are, how they relate to the `..._enabled` variables, and the error `terraform apply` would otherwise fail with.
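If you opt in, a minimal sketch might look like the following; the input name `create_before_destroy` and the surrounding module block are assumptions here, so check the module's variables and PR #31 for the authoritative usage:

```hcl
module "eks_node_group" {
  source = "..." # this module's source (registry or git reference)

  # Assumed input name: have Terraform create the replacement node group
  # before destroying the old one, avoiding a cluster outage during replacement
  create_before_destroy = true

  # ... other node group settings
}
```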