Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows node support #139

Merged
merged 15 commits into from
Mar 22, 2023
Merged

Windows node support #139

merged 15 commits into from
Mar 22, 2023

Conversation

ChrisMcKee
Copy link
Contributor

@ChrisMcKee ChrisMcKee commented Jan 3, 2023

what

why

references

Tested

image
image
image
image

module "eks_windows_node_group" {
  # source  = "cloudposse/eks-node-group/aws"
  # version = "2.6.1"
  source = "github.com/ChrisMcKee/terraform-aws-eks-node-group"

  instance_types     = ["t3.large", "t3a.large", "c5.large", "c6i.large", "m6i.large", "r6i.large"]
  subnet_ids         = [data.terraform_remote_state.network.outputs.private_subnets[1]]
  min_size           = 1
  max_size           = 1
  desired_size       = 1
  cluster_name       = module.eks_cluster.eks_cluster_id
  kubernetes_version = var.kubernetes_version == null || var.kubernetes_version == "" ? [] : [var.kubernetes_version]
  kubernetes_labels  = var.labels

  ami_type = "WINDOWS_CORE_2019_x86_64"

  update_config = [{ max_unavailable = 1 }]

  capacity_type = "SPOT"

  kubernetes_taints = [{
    key    = "OS"
    value  = "Windows"
    effect = "NO_SCHEDULE"
  }]

  node_role_arn                = [aws_iam_role.worker_role_nt.arn]
  node_role_cni_policy_enabled = false #We use the Service Account as per best practice

  associated_security_group_ids = [data.terraform_remote_state.network.outputs.ops_ssh, aws_security_group.workers.id]

  # Enable the Kubernetes cluster auto-scaler to find the auto-scaling group
  cluster_autoscaler_enabled = true

  context = module.windowslabel.context

  # Ensure the cluster is fully created before trying to add the node group
  module_depends_on = [module.eks_cluster.kubernetes_config_map_id]

  # Ensure ordering of resource creation to eliminate the race conditions when applying the Kubernetes Auth ConfigMap.
  # Do not create Node Group before the EKS cluster is created and the `aws-auth` Kubernetes ConfigMap is applied.
  depends_on = [module.eks_cluster, module.eks_cluster.kubernetes_config_map_id]

  create_before_destroy = true

  node_role_policy_arns = ["arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"]

  block_device_mappings = [
    {
      "delete_on_termination" : true,
      "device_name" : "/dev/xvda",
      "encrypted" : true,
      "volume_size" : 80,
      "volume_type" : "gp3"
    }
  ]

  node_group_terraform_timeouts = [{
    create = "40m"
    update = null
    delete = "20m"
  }]

  #Valid types are "instance", "volume", "elastic-gpu", "spot-instances-request", "network-interface".
  resources_to_tag = ["instance", "volume", "spot-instances-request", "network-interface"]
}

related

@ChrisMcKee
Copy link
Contributor Author

@osterman How often are PRs reviewed ooi?

docs/windows.md Outdated Show resolved Hide resolved
iam.tf Outdated Show resolved Hide resolved
variables.tf Outdated Show resolved Hide resolved
Copy link
Member

@aknysh aknysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChrisMcKee looks good, please see comments

@aknysh
Copy link
Member

aknysh commented Feb 24, 2023

/test all

@ChrisMcKee
Copy link
Contributor Author

Just trying to work out why it's stopped working. It all spins up but the nodes don't link to eks..
I'm using the completeexample though modified to test this too. I'll update when it plays ball 🤔

@nitrocode
Copy link
Member

@ChrisMcKee resolved conflicts and merged changes in

@ChrisMcKee
Copy link
Contributor Author

@ChrisMcKee
Copy link
Contributor Author

Probably made git cry but rebased off master and squashed out all the autoformat commits etc.
I've pulled the windows example out of examples.tf
I'll add one in a separate PR as there's a bit of faff required when you use a userscript and an ami is used as it needs to append the node to the aws_auth config map (like usual) but the windows node requires an extra permission for ipam.

@ChrisMcKee
Copy link
Contributor Author

@aknysh all good now; I've pushed out a few windows node groups to our internal clusters using this combined with the cluster pr.

@ChrisMcKee
Copy link
Contributor Author

@aknysh sorry for the repeated changes after the fact; I hit an issue randomly after using my fork since the pr went up and, as you do, I assumed I'd missed a commit or messed up. Turned out to be a flipping bug with kube-proxy 🤦 The module as it stands works.

The only remaining issue I have with it is the validation in the variables file over release; the regex is obviously setup like that for a reason but the Windows AMI's dont follow the same setup
being more like 1.24-2023.02.14

README.md Show resolved Hide resolved
docs/windows.md Outdated Show resolved Hide resolved
examples/complete/main.tf Outdated Show resolved Hide resolved
examples/complete/main.tf Outdated Show resolved Hide resolved
examples/complete/main.tf Outdated Show resolved Hide resolved
examples/complete/main.tf Outdated Show resolved Hide resolved
iam.tf Outdated Show resolved Hide resolved
ami.tf Outdated Show resolved Hide resolved
versions.tf Outdated Show resolved Hide resolved
versions.tf Outdated Show resolved Hide resolved
@ChrisMcKee
Copy link
Contributor Author

ChrisMcKee commented Mar 9, 2023

@aknysh sorry for the repeated changes after the fact; I hit an issue randomly after using my fork since the pr went up and, as you do, I assumed I'd missed a commit or messed up. Turned out to be a flipping bug with kube-proxy 🤦 The module as it stands works.

The only remaining issue I have with it is the validation in the variables file over release; the regex is obviously setup like that for a reason but the Windows AMI's dont follow the same setup being more like 2023.02.14

I've changed the regex to the original plus an or to cover the windows ami format.
I've added the unit tests to the commit https://regex101.com/r/xb7q2f/2

ChrisMcKee and others added 2 commits March 9, 2023 11:03
Co-authored-by: Andriy Knysh <aknysh@users.noreply.github.com>
ChrisMcKee and others added 9 commits March 9, 2023 11:03
* Add the windows taint by default to windows nodes
* Correct ami_kind filter
* Alter userdata nt to match current userscript
* Guard userdata against errors from userscript stopping node-join.
* Remove disk_size
* Add windnows_coredns_service_address because EKS Windows networking isn't fab.
…rrect place and remove extraneous params (based on eksctl code base and aws generated script)
@ChrisMcKee ChrisMcKee requested review from nitrocode and aknysh and removed request for florian0410, RothAndrew and nitrocode March 10, 2023 11:04
versions.tf Show resolved Hide resolved
@nitrocode
Copy link
Member

cc @Nuru @aknysh for one more set of eyes

@ChrisMcKee
Copy link
Contributor Author

ChrisMcKee commented Mar 15, 2023

@nitrocode @aknysh This will also resolve pr #137 & issue #133

@aknysh
Copy link
Member

aknysh commented Mar 22, 2023

/test all

Copy link
Member

@aknysh aknysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @ChrisMcKee

@aknysh aknysh added the minor New features that do not break anything label Mar 22, 2023
@aknysh aknysh merged commit 814a2f4 into cloudposse:master Mar 22, 2023
Copy link
Contributor

@Nuru Nuru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this was already merged, but I made some comments I hope will be addressed in the future.

Comment on lines +48 to +52
windows_taint = [{
key = "OS"
value = "Windows"
effect = "NO_SCHEDULE"
}]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A taint like this, with a non-standard key, should at least be optional. I would argue it should be supplied by the user if desired, because they may already be using a different key for tainting by OS or may not want to taint at all and instead rely on node selectors with the "well-known" kubernetes.io/os label.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd use the nodeselector but the taint stops the mass of ingress/nginx/haproxy/grafana helm default installs from pointlessly trying to install on the nodes.
I did leave it out originally and just had it in the docs. 🤷‍♀️

capacity_type = var.capacity_type
labels = var.kubernetes_labels == null ? {} : var.kubernetes_labels

taints = local.is_windows ? concat(local.windows_taint, var.kubernetes_taints) : var.kubernetes_taints
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This taint should be optional and arguably default to not set. See comment above where the taint is defined.

@@ -4,6 +4,7 @@ Content-Type: multipart/mixed; boundary="/:/+++"
--/:/+++
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash
set -ex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of this in userdata scripts because the echo can reveal secrets (echoed data gets logged, logs get shipped to a log aggregator, people without any access to the machines get to read the logs via the log aggregator) and given that users can add snippets to this script without the benefit of context, this risks blowing up a whole node group by something as simple as grep failing to find the pattern in the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively the userscript could be surrounded in the same way as the powershell userscript surrounds the pre/post.
-e just ensures that an error raises an error code
-x write to standard error / expand commmand before execution.

Witout it the shipped log would be quite useless / we don't ship userdata but I expect ymmv.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave the set -ex up to the user to set. -x can expose secrets the script accesses, -e doesn't just ensure the error raises an error code, is stops the execution of the script if a command returns a non-zero status. Both grep and diff routinely return non-zero status in normal operation.

Nuru added a commit that referenced this pull request Aug 3, 2023
@Nuru Nuru mentioned this pull request Aug 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
minor New features that do not break anything
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support bottle rocket x86 NVIDIA AMI
5 participants