
LV node migration #314

Open
@pschichtel

Description


Describe the problem/challenge you have

I'm hosting various clustered and stateful applications in kubernetes. Some of these applications, like databases and message queues, require low-latency IO to perform well, which is why I use local PVs for them, and that works great. This way I can put very fast SSDs into these servers and use them without network overhead.

My only pain point with this setup is (unsurprisingly): the pods, once scheduled, are pinned to their node forever. The only way to move a pod is to delete both the PVC and the pod and hope that the scheduler doesn't decide to put it back onto the same node (sure, this can be helped with node selectors, affinities, anti-affinities and taints, but that's even more complexity). An additional issue, possibly more serious depending on the application, is the fact that node failures can't be recovered from automatically. Even if the application is able to restore its state from the remaining peers in its cluster, kubernetes won't run the pod because it's pinned to a node that's unavailable.

Describe the solution you'd like

Currently, at least that's my current understanding, when kubernetes schedules the pod it works like this (simplified):

  • if volumeBindingMode is WaitForFirstConsumer, then k8s places the pod and then requests a PV
  • if volumeBindingMode is Immediate, then k8s places the pod on a node that can access the PV

The former means that lvm-localpv will create an LV on the node that's selected for the pod; the latter means k8s places the pod on the single node that carries the eagerly created LV. Either way, it ends with a pod pinned to a node.
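
For reference, a typical lvm-localpv StorageClass with delayed binding looks roughly like this (the volume group name is just an example):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-lvmpv
provisioner: local.csi.openebs.io
parameters:
  storage: "lvm"
  volgroup: "lvmvg"   # example volume group name
volumeBindingMode: WaitForFirstConsumer
```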

What I would love to see is to make an LV available to all nodes in the cluster independent of where it is physically placed. If the LV is already allocated on a node and kubernetes happens to pick a different node, then just create a new LV on the new node, transfer the LV content over the network and delete the old LV. If the LV does not exist already, then it can simply be created on the node that was picked.

That would obviously delay pod startup significantly depending on the size of the volume, and it might require a dedicated high-bandwidth network for the transfer so as not to disrupt other communication in the kubernetes cluster, but for application clusters that are highly redundant and can cover a failed replica for a prolonged period, this could be perfectly fine.

And actually this could go one step further: Assuming that the application can restore its state from peers in its cluster, a feasible LV migration strategy would be to create a new empty LV without transferring data and let the application do the "transfer".

I could imagine this as a StorageClass option like dataMigrationMode (sketched after the list) with values:

  • Disabled (default): current behavior: pin the application to the node with the LV
  • Application: Just delete the LV on the old node and create a new one on the new node and let the application handle the migration
  • VolumeTransfer: Create a new LV and transfer data to it before mounting it.
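
To illustrate, the option could be exposed as an extra StorageClass parameter. dataMigrationMode does not exist in lvm-localpv today; this is purely a sketch of the proposal:

```yaml
# Hypothetical sketch: dataMigrationMode is NOT an existing lvm-localpv parameter
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-lvmpv-migratable
provisioner: local.csi.openebs.io
parameters:
  storage: "lvm"
  volgroup: "lvmvg"                 # example volume group name
  dataMigrationMode: "Application"  # proposed: Disabled | Application | VolumeTransfer
volumeBindingMode: WaitForFirstConsumer
```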

Anything else you would like to add:

While the VolumeTransfer option would be awesome, it is also probably quite involved. So being able to just get a new LV on a new node would probably be easier. I guess this also requires applications to be well behaved and deployments to be well configured so that a rolling upgrade doesn't accidentally delete all the data.

Metadata

Labels

  • kind/improvement: Categorizes issue or PR as related to improving upon a current feature
  • milestone/needs-tracking: Indicates that an issue or PR needs to be tracked on a milestone
  • to-be-scoped: Need scoping
