Remove autoscaler permissions from worker role #34

Merged · 1 commit merged into master on Sep 18, 2020
Conversation

@Nuru (Contributor) commented Sep 18, 2020

Potentially breaking changes

Terraform 0.13.3 or later required

This release requires Terraform 0.13.3 or later because it is affected by bugs in earlier Terraform 0.13 releases that are fixed in 0.13.3.

It may still be affected by hashicorp/terraform#25631, but we believe we have worked around that for now.
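You can enforce this requirement with a version constraint in your root module. A minimal sketch (adjust the constraint to your own upgrade policy):

```hcl
terraform {
  # Terraform 0.13.3 or later is required by this release (see above).
  required_version = ">= 0.13.3"
}
```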

Securing the Cluster Autoscaler

Previously, setting enable_cluster_autoscaler = true turned on tagging sufficient for the Kubernetes Cluster Autoscaler to discover and manage the node group, and also added a policy to the node group worker role that allowed the workers to perform the autoscaling function. Since pods by default use the EC2 instance role, which in EKS node groups is the node group worker role, this allowed the Kubernetes Cluster Autoscaler to work from any node, but also allowed any rogue pod to perform autoscaling actions.

With this release, enable_cluster_autoscaler is deprecated and its functions are replaced with two new variables:

  • cluster_autoscaler_enabled, when true, causes this module to perform the labeling and tagging needed for the Kubernetes Cluster Autoscaler to discover and manage the node group
  • worker_role_autoscale_iam_enabled, when true, causes this module to add the IAM policy to the worker IAM role to enable the workers (and by default, any pods running on the workers) to perform autoscaling operations

Going forward, we recommend not using enable_cluster_autoscaler (it will eventually be removed) and leaving worker_role_autoscale_iam_enabled at its default value of false. If you want to use the Kubernetes Cluster Autoscaler, set cluster_autoscaler_enabled = true and use EKS IAM roles for service accounts to give the Cluster Autoscaler service account IAM permissions to perform autoscaling operations. Our Terraform module terraform-aws-eks-iam-role is available to help with this.
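For illustration, a minimal sketch of the recommended settings. The module source and the surrounding arguments are assumptions for the example, not taken from this PR:

```hcl
module "eks_node_group" {
  source = "cloudposse/eks-node-group/aws" # assumed registry source
  # ... your other node group arguments ...

  # Tag and label the node group so the Kubernetes Cluster Autoscaler
  # can discover and manage it.
  cluster_autoscaler_enabled = true

  # Recommended: do NOT attach autoscaling IAM permissions to the worker
  # role (and thus to every pod that falls back to the instance role).
  worker_role_autoscale_iam_enabled = false
}
```

With this configuration, grant the Cluster Autoscaler's own service account the needed IAM permissions via IAM roles for service accounts, for example with the terraform-aws-eks-iam-role module mentioned above.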

Known issues

There remains a bug in amazon-vpc-cni-k8s (a.k.a. amazon-k8s-cni:v1.6.3) where, after a node group is deleted, some of its ENIs may be left behind. Any leftover ENIs will prevent the security groups they are attached to (such as the security group created by this module to enable remote SSH access) from being deleted, and Terraform will report an error message like

Error deleting security group: DependencyViolation: resource sg-067899abcdef01234 has a dependent object

There is a feature request open that should resolve this issue for our use case. Meanwhile, the good news is that the problem is only triggered by deleting a security group, which does not happen often, and even then we have been able to reduce the chance that it occurs. When it does happen, there are some workarounds:

  1. Since this is a known problem, there are some processes at Amazon that attempt to clean up these abandoned ENIs. We have seen them disappear after 1-2 hours, after which Terraform apply will succeed in deleting the security group.
  2. You can find and delete the dangling ENIs on your own (see the sketch after this list). We have observed the dangling ENIs to have AWS tags of the form Name=node.k8s.amazonaws.com/instance_id,Value=<instance-id>, where <instance-id> is the EC2 instance ID of the instance the ENI is supposed to be associated with. A cleanup script could find ENIs with state = available that are tagged as belonging to instances that are terminated or no longer exist, and delete them.
  3. You can also delete the security group through the AWS Web Console, which will guide you to the other resources that must be deleted before the security group itself can be deleted. The security group created by this module to enable SSH access will have a name ending in -remoteAccess, so you can easily identify it. If you delete it out-of-band, Terraform will re-create it on the next plan/apply cycle, so this is a relatively safe operation.
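As a starting point for workaround 2, here is a hypothetical Terraform sketch that only identifies candidate ENIs. The filter values are assumptions based on the tagging described above; cross-check each instance ID before deleting anything:

```hcl
# Sketch: list ENIs that are detached ("available") but still carry the
# CNI's instance tag. Verify the tagged instance is gone before deleting.
data "aws_network_interfaces" "possibly_dangling" {
  filter {
    name   = "status"
    values = ["available"]
  }
  filter {
    name   = "tag-key"
    values = ["node.k8s.amazonaws.com/instance_id"]
  }
}

output "possibly_dangling_eni_ids" {
  value = data.aws_network_interfaces.possibly_dangling.ids
}
```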

Fortunately, this should be a rare occurrence, and we hope it will be definitively fixed in the next few months.

Reminder from 0.11.0: create_before_destroy

Starting with version 0.11.0, you have the option of enabling create_before_destroy behavior for the node groups. We recommend enabling it, because destroying a node group before creating its replacement can result in a significant cluster outage, but it is not without its downsides. Read the description and discussion in PR #31 for more details.
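For example (a sketch; the variable name is assumed from the 0.11.0 feature described in PR #31):

```hcl
module "eks_node_group" {
  source = "cloudposse/eks-node-group/aws" # assumed registry source
  # ...

  # Create the replacement node group before destroying the old one,
  # avoiding a window in which the cluster has no nodes.
  create_before_destroy = true
}
```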

what

why

Error: Provider produced inconsistent final plan

When expanding the plan for
module.region_node_group["main"].module.node_group["us-west-2b"].module.eks_node_group.random_pet.cbd[0]
to include new values learned so far during apply, provider
"registry.terraform.io/hashicorp/random" produced an invalid new value for
.keepers["source_security_group_ids"]: was cty.StringVal(""), but now
cty.StringVal("sg-0465427f44089a888").

This is a bug in the provider, which should be reported in the provider's own
issue tracker.

@Nuru Nuru added terraform/0.13 Module requires Terraform 0.13 or later bugfix Change that restores intended behavior labels Sep 18, 2020
@Nuru Nuru requested review from a team as code owners September 18, 2020 06:20
@Nuru Nuru requested review from adamcrews, nitrocode, aknysh, danjbh and osterman and removed request for a team, adamcrews and nitrocode September 18, 2020 06:20
@Nuru (Contributor, Author) commented Sep 18, 2020

/test all

@osterman (Member) commented

Btw, warning:
[image]

@Nuru (Contributor, Author) commented Sep 18, 2020

/test all

@Nuru Nuru merged commit 6d012b4 into master Sep 18, 2020
@Nuru Nuru deleted the cycle3 branch September 18, 2020 19:22
@Nuru Nuru added the enhancement New feature or request label Sep 18, 2020