Add MaxCheckpointsPerContainer to the kubelet #115888
Conversation
Hi @adrianreber. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
/ok-to-test
/remove-sig api-machinery
@adrianreber please fix CI failures, thanks.
force-pushed from fddbb33 to 58e029e
/test pull-kubernetes-conformance-kind-ipv6-parallel
/retest-required
force-pushed from d76d6ad to 32fb688
/test pull-kubernetes-e2e-kind
/test pull-kubernetes-e2e-kind-ipv6
@rphillips @mikebrow ptal.
/lgtm
LGTM label has been added. Git tree hash: 92f3e5c378642a22b3b5f4456dd89b6bb8cb0ccd
See comments..
A general comment: This is for the forensics use case... but I'm having a hard time mapping the KEP to this change. It reads more like a checkpoint manager enhancement, where one could be creating some number of checkpoints at a given rate, like backups, and the kubelet would manage garbage collection. I was thinking we'd open the KEP up for the additional case(s) and add a checkpoint manager. Thoughts? Maybe draft up a use case description for this enhancement in the context of forensic debug..
pkg/kubelet/apis/config/types.go
Outdated
// MaxCheckpointsPerContainer specifies the maximum number of checkpoints
// that Kubernetes will create of one specific container before removing old
// checkpoints. This option exist to ensure the local disk is not filled
Suggested change:
// checkpoints. This option exist to ensure the local disk is not filled
// checkpoints. This option exists to ensure the local disk is not filled
but it will fill it.. 2147483647 checkpoints probably isn't the right maximum limit to the number of checkpoints per container per pod..
Are you saying a hardcoded upper limit should exist? What should it be?
I was thinking a few or two as default, depending on how the user/client is using these. For forensics not sure why it's more than 1 at a time, unless you envision a diff tool to compare a success case vs failure case, which would be 2 or 3, where 3 could allow 3-way diff cases? Still.. the idea of creating a managed set.. implies the discussion of a manager for checkpoints.
// that Kubernetes will create of one specific container before removing old
// checkpoints. This option exist to ensure the local disk is not filled
// with container checkpoints. This option is per container. The field
// value must be greater than 0.
size or total % of disk use or a different drive or some other reasonable mechanism for lowering impact.. perhaps a default of 1 or 2? with the idea the user would offload one before adding another..
// that Kubernetes will create of one specific container before removing old
// checkpoints. This option exist to ensure the local disk is not filled
// with container checkpoints. This option is per container. The field
// value must be greater than 0.
put another way.. why was 10 selected and not 1?
podFullName,
containerName,
time.Now().Format(time.RFC3339),
I was thinking for the forensic case the caller would be responsible for the clean up if need (probably with a listener) and the use would be rare.
I was thinking for the forensic case
Even if the story around checkpoint/restore is the forensic case, this does not mean this is what people use it for. There are many use cases and the forensic use case is just one.
the caller would be responsible for the clean up if need (probably with a listener) and the use would be rare.
This is a strange argument from my point of view, that the user is responsible for the clean up. At this point it feels like the complete PR is being questioned, and this argument comes very late in the lifetime of this PR. This PR has been reworked completely multiple times since the first posting 7 months ago. This is a feature people have been asking for because they see it as a problem and do not want to clean up manually. This is one of the first things people ask for during conference talks.
Not sure what the goal of this comment is, sorry.
The reason for limiting to the forensics case was probably to avoid slow walking the managed checkpoint use cases, until there was a WG or whatnot put in place to design how Kubernetes would manage the checkpoints for the other cases people would use it for.
At this point the user requested the checkpoint, vs kubelet creating the checkpoint based on a pod policy / contract. Asking kubelet to garbage collect that which it did not ask to be created implies kubelet knows why the containers are being checkpointed for these pods.
If the rule is to keep 10, which 10? The last 10? The first 10? What if the last 10 are all created during failure modes? If this is for rolling backups.. you may want one per month, one per week... going back to the last month to drop the per-week ones from that month, etc..
.. today was the first day I heard about this PR... Apologies for the frustration. I agree 100%, overuse of the checkpointing endpoint without having a management design is a problem. I didn't expect overuse to be a problem for the sig-node approved forensics use case. That's all I mean. For logs we've used similar designs to this one, in kubelet, to employ rolling log models for long-lived containers. So I'm trying to understand this rolling checkpoint idea.. to map it to a use case. If someone is wanting to do rolling backups ok.. I would get the use case, but even in that case I would want to have a discussion about how we do it. Checkpoint 1 + delta + delta for example would be orders of magnitude better from a resource consumption perspective.
pkg/kubelet/kubelet.go
Outdated
// name of the checkpoint archive is unique. It already contains a time
// stamp but in case the clock resolution is not good enough this counter
// make it unique.
maxCheckpointsPerContainerCounter int32
from this I take it that you want to use the counter as a node level index?
Yes.
suggest maxCheckpointOnNodeCounter or something similar..
Included OnNode in the variable name.
thx..
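To illustrate the counter discussed in this thread, here is a minimal, hypothetical sketch of a node-wide counter that keeps generated archive names unique. The names checkpointCounterOnNode and nextCheckpointCounter are illustrative, and the actual PR may protect the counter differently (for example under an existing kubelet lock) rather than with sync/atomic.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// checkpointCounterOnNode is a hypothetical node-wide counter appended to
// checkpoint archive names so that two checkpoints created within the same
// clock resolution still get distinct names.
var checkpointCounterOnNode int32

// nextCheckpointCounter atomically increments and returns the counter.
func nextCheckpointCounter() int32 {
	return atomic.AddInt32(&checkpointCounterOnNode, 1)
}

func main() {
	fmt.Println(nextCheckpointCounter()) // 1
	fmt.Println(nextCheckpointCounter()) // 2
}
```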
podFullName,
containerName,
time.Now().Format(time.RFC3339),
time.Now().UnixNano(),
why unix nano? going from human readable format for the forensic case to a large number...
I thought I submitted this comment two days ago. Trying once more:
This PR has gone through multiple iterations. The first iteration used the existing file name with the human readable time stamp, and the number of checkpoints was tracked in a JSON file. That was then changed to work without a JSON file by using stat(). The time resolution of stat() on all file systems was questioned, so the goal was to use a regex and file system sorting. The problem with the human readable file name was that it is really hard to capture via regex if timezones are used in the file name. In combination with possible pod-name / container-name collisions during regex or file system level sorting, the current implementation uses an integer time stamp and a counter.
As the second resolution is too coarse to be unique (without the counter), I switched to something else. I thought about micro or nano seconds, and settled for nano seconds because it will be just as unreadable as micro seconds.
We have written the tool checkpointctl, which will display all information, including the time created and checkpointed, so if wanted it can be displayed with the help of an additional tool.
The current approach gives us file system based sorting by using unique names that combine the podUID, nano seconds and a node-wide counter in the file name. No external JSON file is needed anymore, and there is no need to read the result of stat(), which might be too coarse depending on the file system used.
With the old implementation it would have been possible to create two checkpoints of different pods/containers at the same time and one would replace the other. The current implementation does not have this problem and is easier to sort.
looking up... the dreaded Windows aversion to the colon due to their drive letter : format.. sounds like you had a lot of reasons..
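For readers following the file-name discussion above, the sketch below shows the general shape of a name combining a per-pod directory, the pod and container names, a UnixNano() timestamp and a zero-padded counter, so that plain lexical sorting orders the archives by creation time. It is an assumption-laden illustration, not the PR's exact format; the base directory, the "checkpoint-" prefix and the field order are placeholders.

```go
package main

import (
	"fmt"
	"path/filepath"
	"time"
)

// checkpointArchivePath builds an illustrative archive path; field order and
// the "checkpoint-" prefix are assumptions, not the PR's exact format.
func checkpointArchivePath(checkpointsDir, podUID, podFullName, containerName string, counter int32) string {
	fileName := fmt.Sprintf("checkpoint-%s-%s-%d-%010d.tar",
		podFullName,
		containerName,
		time.Now().UnixNano(), // integer timestamp instead of RFC3339 to avoid ':' on Windows
		counter,               // node-wide counter, zero-padded so lexical sort matches creation order
	)
	return filepath.Join(checkpointsDir, podUID, fileName)
}

func main() {
	fmt.Println(checkpointArchivePath(
		"/var/lib/kubelet/checkpoints",
		"0b1d5f7a-1111-2222-3333-444444444444", // example podUID
		"mypod_default",
		"mycontainer",
		42,
	))
}
```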
pkg/kubelet/kubelet.go
Outdated
podFullName,
containerName,
time.Now().Format(time.RFC3339),
time.Now().UnixNano(),
int(math.Log10(float64(math.MaxInt32)))+1,
? pls add a comment explaining this ..
width of the format will be 10 every time right?
Yes, for sorting the counter should always include all leading zeroes. I can change this to 10. I just didn't want to count it manually and let the computer do the work.
kk.. :-) leading zeros for the nanos too? though I think we're better off with the human readable timestamp.
I hardcoded the number of leading zeros for both fields.
thx
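As a side note on the width expression quoted above, here is a small sketch showing that int(math.Log10(float64(math.MaxInt32)))+1 evaluates to 10 and how a zero-padded counter can be produced with that width; whether the final code computes the width or hardcodes 10, the padded result is the same.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// math.MaxInt32 is 2147483647, which has 10 digits, so the computed
	// width is always 10 for an int32 counter.
	width := int(math.Log10(float64(math.MaxInt32))) + 1
	fmt.Println(width) // 10

	// %0*d pads with leading zeros so lexical sorting of the resulting
	// file names also sorts by counter value.
	counter := int32(42)
	fmt.Printf("%0*d\n", width, counter) // 0000000042
}
```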
checkpointDirectoryPath := filepath.Join(
kl.getCheckpointsDir(),
string(podUID),
I believe podUID can be reused here.. which could create a reuse issue ..
Only if the namespace, pod and container have the same name, or one of the collisions described above occurs. Thanks to the timestamp and the counter it should not result in checkpoints being overwritten, but maybe in older checkpoints, from a podUID collision, being removed first. Not sure if that is a problem or not.
yeah was more a tree/identifier issue... here with poduid.. one will need to tree | grep to find their checkpoints by name, here you can also identify by poduid.. if that was necessary.. and missing, we could add the poduid to the original name as an alternative
The name of the checkpoint is returned when triggering the kubelet checkpoint API endpoint. I am also working on code to see what it could look like if the kubelet checkpoint API endpoint were available at the kubectl level.
What I am trying to say is that searching for a checkpoint archive is not something I would expect to happen a lot, so the path is not super important. Whatever makes the most sense works for me. The main goal is to avoid any ambiguity in the file name. The podUID was suggested somewhere along the review in this PR and makes sense to me.
There was also agreement that we can change the location as long as this is still marked as an alpha feature.
As mentioned in another comment, using https://github.com/checkpoint-restore/checkpointctl is something I would recommend to get all details from the checkpoint archive and rely less on encoding information into the file name.
nod.. typically, in this space, we would create references to image files like these in a meta db.. and the image/layers would be stored by sha.. This change looks/feels like we're inching into managed checkpoint use case scenarios. I would feel more comfortable about this change if we had a KEP open for r.next checkpointing / a WG charting out the rough direction this is going to take, so we could map this change to that direction.
The pain point this PR is addressing is over use of the drive.. which should not be happening in the approved forensics cases? IMO there are other designs that would more appropriately address the desired feature(s).
At min, IMO, we should tell sig-node about this change to begin automatically removing checkpoints past some limit and get a general consensus if that is acceptable in the interim before the next checkpoint KEP update.
The pain point this PR is addressing is over use of the drive.. which should not be happening in the approved forensics cases?
This seems to be one of our main discussion points. I agree that it "should not be happening" but at the same time it is something people are actively talking about. From my point of view it feels unrealistic to expect that people always clean up unused files. That is why I want to help by cleaning them up automatically.
At min, IMO, we should tell sig-node about this change to begin automatically removing checkpoints past some limit and get a general consensus if that is acceptable in the interim before the next checkpoint KEP update.
Okay. I will add it to the sig-node agenda.
I think if we want to introduce this clean up functionality we need to add it to the existing KEP as an additional feature or start a new one. As discussed at the SIG Node meeting today, once we have this setting we can start exploring integrating this into GC and eviction logic, having per-pod policies, etc. This all creates an unnecessary burden on the kubelet.
force-pushed from 32fb688 to 6f93687
New changes are detected. LGTM label has been removed.
/retest-required
This adds the configuration option "MaxCheckpointsPerContainer" to the kubelet. The goal of this change is to provide a mechanism, in combination with container checkpointing, to avoid filling up all existing disk space by creating a large number of checkpoints from a container.
"MaxCheckpointsPerContainer" defaults to 10, which means that once 10 checkpoints of a certain container have been created, the oldest existing container checkpoint archive will be removed from disk. This way only the defined number of checkpoints is kept on disk.
This also moves the location of the checkpoint archives from /var/lib/kubelet/checkpoints to /var/lib/kubelet/checkpoints/POD-ID/. The main reason for this move was to avoid confusion between the checkpoint archives concerning namespace, pod name and container name.
This also changes the time stamp encoded into the file name from RFC3339 to UnixNano(). The reason for this change is that there were questions about whether the ':' in the file name generated by RFC3339 would be problematic on Windows.
This also introduces a counter after the time stamp in the file name to ensure that each checkpoint archive has a unique file name.
Signed-off-by: Adrian Reber <areber@redhat.com>
force-pushed from 6f93687 to e89d14d
I feel like to a big extent this functionality is suggested for the kubelet because it is guaranteed to be present on the node. If we only look at checkpoints as a forensic mechanism today, this functionality does not belong in the kubelet. Maybe a better place would be some separate agent. For example, we may consider updating the NPD to detect and resolve local problems, so NPD could do checkpoint file rotation. Or kubelet health checks. Or similar.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This adds the configuration option "MaxCheckpointsPerContainer" to the kubelet. The goal of this change is to provide a mechanism in combination with container checkpointing to avoid filling up all existing disk space by creating a large number of checkpoints from a container.
"MaxCheckpointsPerContainer" defaults to 10 and this means that once 10 checkpoints of a certain container have been created the oldest existing container checkpoint archive will be removed from disk. This way only the defined number of checkpoints is kept on disk.
This also moves the location of the checkpoint archives from /var/lib/kubelet/checkpoints to /var/lib/kubelet/checkpoints/POD-ID/. The main reason for this move was to avoid confusion between the checkpoint archives concerning namespace, pod name and container name.
This also changes the time stamp encoded into the file name from RFC3339 to UnixNano(). The reason for this change is that there were questions about whether the ':' in the file name generated by RFC3339 would be problematic on Windows.
This also introduces a counter after the time stamp in the file name to ensure that each checkpoint archive has a unique file name.
As this is still an Alpha feature it should be acceptable to change the location of the checkpoint archive.
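To make the described behaviour concrete, here is a rough, hypothetical sketch of pruning the oldest archives of one container down to MaxCheckpointsPerContainer. It is not the PR's implementation; the directory layout, glob pattern and function name are assumptions, and it relies on the file names sorting lexically by creation time, as discussed in the review thread.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// pruneOldCheckpoints removes the oldest checkpoint archives of one container
// so that at most maxCheckpointsPerContainer of them remain on disk.
func pruneOldCheckpoints(podCheckpointDir, containerName string, maxCheckpointsPerContainer int) error {
	// Hypothetical pattern; the real file-name layout may differ.
	pattern := filepath.Join(podCheckpointDir, fmt.Sprintf("checkpoint-*-%s-*.tar", containerName))
	archives, err := filepath.Glob(pattern)
	if err != nil {
		return err
	}
	// Zero-padded timestamp/counter fields make lexical order equal creation order.
	sort.Strings(archives)
	for len(archives) > maxCheckpointsPerContainer {
		if err := os.Remove(archives[0]); err != nil {
			return err
		}
		archives = archives[1:]
	}
	return nil
}

func main() {
	// Example: keep at most 10 archives for "mycontainer" in one pod's checkpoint directory.
	if err := pruneOldCheckpoints("/var/lib/kubelet/checkpoints/0b1d5f7a-1111-2222-3333-444444444444", "mycontainer", 10); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```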
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: