Create links for ephemeral storage devices (NVMe) #1131

Closed
stevehipwell opened this issue Dec 16, 2022 · 9 comments
Labels
enhancement New feature or request

Comments

@stevehipwell
Contributor

What would you like to be added:
I'd like this AMI to mirror the Bottlerocket behaviour and link ephemeral storage devices as part of bootstrap (see bottlerocket-os/bottlerocket#1173).

Why is this needed:
I'd like to be able to easily make use of NVMe drives using sig-storage-local-static-provisioner.

@bwagner5 added the enhancement (New feature or request) label on Jan 24, 2023
@bwagner5
Contributor

bwagner5 commented Feb 2, 2023

Is the use-case here to use NVMe drives as a PV for a pod? Or is it to just use the increased performance and storage that you're paying for on storage-capable instance-type variants (d, i, etc.)?

I have a proposed solution for the latter, which I think could be included in the bootstrap script as an option to RAID-0 the instance storage volumes and mount /var/lib/kubelet and /var/lib/containerd on the array.

WDYT of something like this?

> bootstrap.sh MyFancyCluster ... --nvme-raid0 

This would look for unmounted NVMe instance storage drives, place them in a RAID-0 array with mdadm, and create an XFS file system on the array. Any state in /var/lib/kubelet or /var/lib/containerd would be moved onto the instance storage array, which would then be mounted at those dirs.
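
A rough sketch of what that flow might look like (the device discovery via lsblk model strings and the bind-mount approach are assumptions here, not the final bootstrap.sh implementation):

```bash
#!/usr/bin/env bash
# Sketch only: RAID-0 the unmounted NVMe instance-store disks and move
# /var/lib/kubelet and /var/lib/containerd onto the array.
set -euo pipefail

# EC2 exposes instance-store NVMe disks with an "Instance Storage" model string.
mapfile -t DISKS < <(lsblk -dpno NAME,MODEL | awk '/Instance Storage/ {print $1}')

if [[ ${#DISKS[@]} -eq 0 ]]; then
  echo "no NVMe instance storage found; keeping the EBS volume"
  exit 0
fi

# Build the RAID-0 array and put an XFS file system on it.
mdadm --create /dev/md0 --level=0 --raid-devices="${#DISKS[@]}" "${DISKS[@]}"
mkfs.xfs /dev/md0
mkdir -p /mnt/k8s-disks
mount /dev/md0 /mnt/k8s-disks

# Preserve any existing state, then bind-mount the array over the kubelet/containerd dirs.
for dir in /var/lib/kubelet /var/lib/containerd; do
  dest="/mnt/k8s-disks/$(basename "$dir")"
  mkdir -p "$dest" "$dir"
  cp -a "$dir/." "$dest/"
  mount --bind "$dest" "$dir"
done
```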

@stevehipwell
Contributor Author

  1. Can I also have the NVMe drives mounted individually like Bottlerocket so I can consume them with the local provisioner?
  2. Can I have a separate RAID-0 volume composed of the NVMe disks?
  3. Will these behaviours work for spot instances where only some instances might have NVMe disks?
  4. Can the script add labels to instances with the NVMe disks mounted?

@bwagner5
Contributor

bwagner5 commented Feb 2, 2023

  1. Can I also have the NVMe drives mounted individually like Bottlerocket so I can consume them with the local provisioner?

Yeah, I think that's possible. Would probably just change the arg to accept a parameter rather than a boolean (see the per-disk mount sketch at the end of this comment).

  2. Can I have a separate RAID-0 volume composed of the NVMe disks?

👍 Yep, this is basically the general purpose case that I was solving for.

  3. Will these behaviours work for spot instances where only some instances might have NVMe disks?

In Karpenter's case, yes, since we're able to know how much storage would be available at bin-packing time. I'm not sure if this would work with CAS. I think we'd also want to fail gracefully, so if there are no NVMe instance storage disks and the RAID-0 arg is enabled, it wouldn't do anything and would continue to use the EBS volume.

  4. Can the script add labels to instances with the NVMe disks mounted?

Hmmm... this might be difficult. Do you mean labels on the node or tags on the instance? Karpenter already adds node labels for local NVMe storage. Would that be enough for workloads to select based on whether NVMe storage is available?
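
A minimal sketch of the per-disk variant from question 1 (the mount paths and device-discovery approach are assumptions, not a final interface): each instance-store disk gets its own XFS file system and mount point, which sig-storage-local-static-provisioner can then discover.

```bash
#!/usr/bin/env bash
# Sketch only: mount each NVMe instance-store disk individually instead of RAID-0.
set -euo pipefail

idx=0
for disk in $(lsblk -dpno NAME,MODEL | awk '/Instance Storage/ {print $1}'); do
  mkfs.xfs -f "$disk"                 # one file system per disk
  mkdir -p "/mnt/k8s-disks/${idx}"    # e.g. /mnt/k8s-disks/0, /mnt/k8s-disks/1, ...
  mount -t xfs "$disk" "/mnt/k8s-disks/${idx}"
  idx=$((idx + 1))
done
```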

@stevehipwell
Contributor Author

Re labelling, I was thinking of a node label if the command was successful; if the arg was set but there were no NVMe drives, no label(s) would be added. So effectively I could spin up both types of node, RAID-0 and separate disks, and assign workloads accordingly.
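
A minimal sketch of that label idea (the label key and the KUBELET_EXTRA_ARGS plumbing are purely illustrative assumptions, not an agreed design):

```bash
# Sketch only: add a node label via kubelet's --node-labels flag, but only when
# NVMe instance storage was actually found and set up during bootstrap.
if lsblk -dno MODEL | grep -q 'Instance Storage'; then
  KUBELET_EXTRA_ARGS="${KUBELET_EXTRA_ARGS:-} --node-labels=example.aws/nvme-ephemeral=true"
fi
```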

Should this implementation wait for the AMI to support multiple EBS volumes like Bottlerocket, or even be implemented in tandem?

@bwagner5
Contributor

bwagner5 commented Feb 3, 2023

I suppose we could label the nodes, but it would be weird for systems like Karpenter that precompute the scheduling decision. The existence of the label may cause pods not to schedule even when kube-scheduler and Karpenter think they should. It would recover; it just wouldn't be optimal. Karpenter should still cover this case with the NVMe labels: https://karpenter.sh/v0.23.0/concepts/scheduling/#selecting-nodes:~:text=karpenter.k8s.aws/instance%2Dlocal%2Dnvme

As discussed offline, I think this can happen independently, since the multi-EBS volume setup breaks backwards compatibility while the RAID setup is completely backwards compatible.

@stevehipwell
Contributor Author

@bwagner5 correct me if I'm wrong, but don't labels only have an additive scheduling impact for Karpenter? Pods will still schedule to nodes with unknown labels, but adding a label such as aws.io/nvme-ephemeral (just an example name) when this runs could be used for affinity-based scheduling.

@bwagner5
Contributor

bwagner5 commented Feb 3, 2023

If there's an anti-affinity to the label, that would affect scheduling. The same goes for a node selector on labels that Karpenter doesn't know about because they get applied at startup: Karpenter would never provision a node, since it doesn't know the requirement will be fulfilled.

@stevehipwell
Contributor Author

OK, fair enough. Is it documented for Karpenter that you shouldn't use dynamic labels? I think we've stopped using these now but I'd have to check.

@bwagner5
Contributor

bwagner5 commented Feb 3, 2023

I don't think we explicitly call it out in the Karpenter docs. Maybe we should add that. I don't hear of many people applying dynamic labels at startup, though. I would guess the practice is known to be bad, since any node autoscaler will fall over in some regard when dynamic labels are at play, because the scheduling simulation can't be accurate.
