KEP-5901: Add API server Checkpoint KEP #5092

Open
adrianreber wants to merge 1 commit into master from 2025-01-27-kubectl-checkpoint

Conversation

adrianreber
Member

@adrianreber adrianreber commented Jan 27, 2025

With "Forensic Container Checkpointing" being Beta and discussions around graduating it to GA, the next step would be API server integration of the container checkpointing functionality.

In addition to the "Forensic Container Checkpointing" use case, this KEP lists multiple use cases for how container checkpointing can be applied.

One of the main motivations for this KEP is to make it easier for users to checkpoint containers, independent of the reason. Having it available via kubectl removes the complexity of connecting to the node and accessing the kubelet checkpoint API endpoint directly (see the sketch below).

  • One-line PR description: adding new KEP
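For context, here is a minimal sketch of the current node-level workflow that this KEP wants to simplify: calling the kubelet checkpoint endpoint (KEP-2008) directly. The node address, port, namespace, pod, and container names are placeholders, and the TLS handling is simplified; a real call needs proper kubelet client authentication.

```go
// Sketch only: how a user reaches the kubelet checkpoint endpoint today,
// without the API server integration proposed by this KEP.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// The kubelet exposes POST /checkpoint/{namespace}/{pod}/{container}
	// (KEP-2008). Node name, namespace, pod, and container are placeholders.
	url := "https://node-1.example.com:10250/checkpoint/default/counters/counter"

	// Placeholder TLS setup; a real call needs kubelet client certificates
	// instead of skipping verification.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	resp, err := client.Post(url, "application/json", nil)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// On success the kubelet answers with the path of the checkpoint
	// archive it wrote on that node.
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```

The point of the kubectl/API server integration is to remove exactly this direct node access: reaching the kubelet port, authenticating against the kubelet, and knowing which node runs the pod.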

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.), kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory), and sig/cli (Categorizes an issue or PR as relevant to SIG CLI.) labels Jan 27, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: adrianreber
Once this PR has been reviewed and has the lgtm label, please assign ardaguclu for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) Jan 27, 2025
@k8s-ci-robot
Contributor

Hi @adrianreber. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL label (Denotes a PR that changes 1000+ lines, ignoring generated files.) Jan 27, 2025
@adrianreber adrianreber mentioned this pull request Jan 27, 2025
Comment on lines +195 to +198
Beta in Kubernetes 1.30, which means that the corresponding feature gate
defaults to the feature being enabled, the next step would be to extend the
existing checkpointing functionality from the *kubelet* to *kubectl* for easier
user consumption. The main motivation is to make it easier by not requiring
Member

I'm not familiar with the criteria followed for kubectl commands, but shouldn't we first wait for the feature to be GA before making it available? Or is the plan to graduate KEP-2008 to GA and add the kubectl command at the same time?

Member Author

I also don't know for sure. But since kubectl has the possibility of plugins and alpha commands, I thought it would be possible to expose non-GA features at the API server level. But I don't know.

Currently the design details are based on the existing pull request: [Add
'checkpoint' command to kubectl][pr120898]

The API server is extended to handle checkpoint requests from *kubectl*:
Member

This is the most important change. Adding a new endpoint to the apiserver is where you need to expand; Jordan also commented along those lines here: kubernetes/kubernetes#120898 (comment).

Member Author

I am sorry, but what exactly is necessary? The comment you linked to was talking about the kubelet KEP, which didn't have any reference to kubectl or to the API server. This KEP was created to address the mentioned comment.

for the initialization to finish. The startup time is reduced to the time
necessary to read back all memory pages to their previous location.

This feature is already used in production to decrease startup time of
Member

claims should have links to references

Member Author

Added

This feature is already used in production to decrease startup time of
containers.

Another similar use case for quicker starting containers has been reported in
Member

reference?

Member Author

Added

#### Optimize Resource Utilization

This use case is motivated by interactive long running containers. One very
common problem with things like Jupyter notebooks or remote development
Member

Naive question: isn't this state stored in a database or some persistent storage and recovered when it reconnects? In the end you'll have to keep all the state stored somewhere.


I am not sure if I understand your concerns, but yes, there is still a need to store some state at the application level (such as the set of suspended/running Jupyter notebooks; C/R does not help directly here). Native checkpoint/restore capability is still needed for the application (the Jupyter notebook) itself, because the notebook does not save its memory/CPU/GPU state. And yes, this state is needed; Jupyter has an internal capability for checkpointing, but not while a Jupyter cell is running. It is also unable to checkpoint a running Python kernel, and this is the primary motivation here. We have use cases where the Python kernel is running for a very long time and we need to checkpoint it at the CRIU level.


#### Container Migration

One of the main use cases for checkpointing and restoring containers is
Member

Migration between nodes? The IPs are most likely to be lost, so the application has to be agnostic of the IP, for example.

Member Author

Added a paragraph concerning migration of TCP connections.

migrate containers or processes. It is a well researched topic especially
in the field of high performance computing (HPC). To avoid loss of work
already done by a container the container is migrated to another node before
the current node crashes. There are many scientific papers describing how
Member

One or two references to these papers would be nice.

Member Author

Added.

case, only useful for stateful containers.

With GPUs becoming a costly commodity, there is an opportunity to help
users save on costs by leveraging container checkpointing to prevent
Member

Workloads are already doing checkpointing. Do you know what the state of the art is for existing checkpointing mechanisms vs. container checkpointing?

Member Author

That has been the hard question for the last 25 years: what is better, application-level checkpointing or system-level checkpointing? Both approaches have their advantages and drawbacks. As it is unlikely that every application will have application-level checkpointing, some workloads can only be migrated with system-level checkpointing.

Currently there are multiple startups and scientific researchers trying to solve how to better use GPU resources. All of the ones I have been following are betting on system-level checkpointing, as application-level checkpointing does not work or does not exist; the main problem is that it has to be re-implemented for every application. But as I am coming from the system-level checkpointing area, I am probably biased.


I believe there are only a few applications that have checkpoint mechanisms implemented. One of them is GROMACS for chemical simulations, which has lightweight checkpoint support. However, it has some bugs and does not always work correctly. These checkpoints are periodic—i.e., GROMACS saves its state from time to time to allow resumption. You can configure the checkpoint period, but if you set it too frequently, it decreases overall performance. Conversely, if the interval is too long, you may lose computational time if something happens and you need to roll back to the last checkpoint.

I have heard that there is also a checkpoint mechanism in the Amber tool (used for force field simulations), but it is complicated to set up.

Some applications, such as JupyterLab, also offer checkpoint-like mechanisms. However, these are not true checkpoints since you cannot restore a running Jupyter cell or IPython kernel from them. Essentially, you can only restore the content of the cell sheet displayed in the web UI.

On the other hand, VM snapshots (often referred to as "checkpoints" but commonly called snapshots) have existed for many years. They are used for live VM migrations, rollback of changes, and other purposes. To my knowledge, with QEMU KVM, migrating a VM with an attached GPU is not possible. However, @rst0git has demonstrated that, on the same GPU architecture, a Pod can actually be migrated even when using a GPU.

The examples above show that checkpointing methods do exist, but as demonstrated, each application requires extra effort to be internally checkpointed compared to the checkpoint/restart (CR) mechanism at the system level. In theory, you could add a checkpoint mechanism to a framework such as PyTorch, but that would not be sufficient—you would still need to handle your application's state, its files, connections, etc. In contrast, significant work has already been done on containerd (@adrianreber) and CRIU (@rst0git); we now need to make these solutions more user-friendly.

Another consideration is the container filesystem. Even with application-level checkpoints, you lose the contents of a running container's filesystem if changes occur at runtime. For example, if temporary files are created in /tmp, they will be lost unless explicitly handled when the Pod is migrated or restarted. This situation is also addressed by checkpoint mechanisms at the system level.

Container migration for load balancing is something where checkpoint/restore
as implemented by CRIU is already used in production today. A prominent example
is Google as presented at the Linux Plumbers conference in 2018:
[Task Migration at Scale Using CRIU][task-migration]
Member

The example says that connections are dropped and the client must reconnect. This is well understood at Google, where there are libraries and applications that handle client-side reconnection, but my observation is that most people expect to auto-magically reconnect, and AFAIK this will not do that.

Member Author

Added a paragraph that talks about TCP connections and checkpoint/restore.

##### Spot Instances

Yet another possible use case where checkpoint/restore is already used today
are spot instances. Spot instances are usually resources that are cheaper but
Member

This will need to take into account the time you have for checkpointing, as that is how spot instances work: eventually you'll get destroyed.

Member Author

Added a reference to an existing solution that handles this.

Comment on lines +487 to +522
Also, *kubectl* is extended to call this new API server interface. The API
server, upon receiving a request, will call the kubelet with the corresponding
parameters passed from *kubectl*. Once the checkpoint has been successfully written
to disk *kubectl* will return the name of the node as well as the location of
the checkpoint archive to the user:
Member
@aojea Feb 19, 2025

As commented above, this is the trickiest part of the KEP. You need to expand on the technical design here; these endpoints are complex to implement, and you also need to deal with version skew between the apiserver, the kubelet, and the container runtimes.

Member Author

Unfortunately I am not sure what is needed here. I described the API as being just like the API provided by the kubelet; it simply forwards everything 1:1 to the kubelet. Concerning different versions of the API server and kubelet, I described in one section that it will probably just return an error. I guess I do not get what is required here.

Any existing examples I can take a look at?
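For illustration only, here is a rough sketch of the 1:1 forwarding described above, assuming the API server resolves the pod to its node and then calls the kubelet checkpoint endpoint. The package, function, and helper names, the pod-to-node lookup, and the response shape are all hypothetical; authentication, authorization, TLS, and version-skew handling are deliberately omitted, and this is not the actual proposed implementation.

```go
// Hypothetical sketch of an API-server-side handler that forwards a
// checkpoint request 1:1 to the kubelet of the node running the pod.
package checkpointproxy

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// checkpointResult is an illustrative response shape: the node name plus
// the archive location(s) reported back by the kubelet.
type checkpointResult struct {
	NodeName string   `json:"nodeName"`
	Items    []string `json:"items"`
}

// ForwardCheckpoint forwards the request to the kubelet checkpoint
// endpoint and relays the answer back to the caller.
func ForwardCheckpoint(w http.ResponseWriter, namespace, pod, container string) {
	node := lookupNodeFor(namespace, pod)

	kubeletURL := fmt.Sprintf("https://%s:10250/checkpoint/%s/%s/%s",
		node, namespace, pod, container)

	resp, err := http.Post(kubeletURL, "application/json", nil)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	// The kubelet reports the checkpoint archive path(s) it wrote.
	var kubeletReply struct {
		Items []string `json:"items"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&kubeletReply); err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}

	// Return the node name together with the archive location, as
	// described in the KEP text quoted above.
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(checkpointResult{
		NodeName: node,
		Items:    kubeletReply.Items,
	})
}

// lookupNodeFor is a placeholder: in reality the node would come from
// the pod's spec.nodeName as known to the API server.
func lookupNodeFor(namespace, pod string) string {
	return "node-1.example.com"
}
```

The sketch is deliberately thin: the API server adds nothing beyond resolving the node and relaying the kubelet's answer, which matches the "forwards everything 1:1" description above.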

With "Forensic Container Checkpointing" being Beta and discussions
around graduating it to GA, the next step would be kubectl integration
of the container checkpointing functionality.

In addition to the "Forensic Container Checkpointing" use case, this KEP
lists multiple use cases for how container checkpointing can be used.

One of the main motivations for this KEP is to make it easier for users
to checkpoint containers, independent of the reason. Having it available
via kubectl reduces the complexity of connecting to the node and
accessing the kubelet checkpoint API endpoint.

Signed-off-by: Adrian Reber <areber@redhat.com>
@adrianreber adrianreber force-pushed the 2025-01-27-kubectl-checkpoint branch from 1b65fed to cccb3cb on March 10, 2025 15:47
@adrianreber
Member Author

@aojea thanks for your review. I added a couple more references.

@adrianreber adrianreber changed the title KEP-5901: Add Kubectl Checkpoint KEP KEP-5901: Add API server Checkpoint KEP May 19, 2025
@adrianreber
Member Author

/sig api-machinery

Changing from SIG CLI to SIG API Machinery after SIG CLI mentioned that new commands are first introduced via a plugin. This moves the responsible SIG to API Machinery, as that is now the goal of this KEP: to extend the kubelet API to the API server so it can be used from a kubectl plugin.

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery label (Categorizes an issue or PR as relevant to SIG API Machinery.) May 19, 2025
@adrianreber
Member Author

/remove-sig cli

@k8s-ci-robot k8s-ci-robot removed the sig/cli label (Categorizes an issue or PR as relevant to SIG CLI.) May 19, 2025
Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory)
needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.)
sig/api-machinery (Categorizes an issue or PR as relevant to SIG API Machinery.)
size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.)
Projects
Status: Needs Triage

4 participants