VPA: Fix recommender race conditions for vpa Conditions and Recommendations #8967
base: master
Conversation
Protecting just the ObjectCounter isn't good enough; we've also got vulnerability on Conditions and Recommendations. We've been seeing concurrent map access panics where it looks like two goroutines both try to touch Conditions at the same time. This change protects the VPA fields via locking wrapper functions so that concurrent access can't happen. I did not have time for a more elegant refactor; I just wanted to stop the crash, since it's disruptive for users. Basically I added locking methods, but because of some of our nesting I also had to add private "already locked" methods so the locking methods can safely call each other without trying to recursively grab the lock. Signed-off-by: John Kyros <jkyros@redhat.com>
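The "already locked" pattern described above can be sketched like this. The types and method bodies below are a condensed, hypothetical stand-in for the recommender's VPA model, not the actual code: the key point is that Go's `sync.Mutex` is not reentrant, so a public locking method that needs another guarded operation must call a private `...Locked` variant instead of the public one.

```go
package main

import (
	"fmt"
	"sync"
)

// Vpa is an illustrative stand-in for the recommender's VPA model:
// a condition map and a recommendation, guarded by one mutex.
type Vpa struct {
	mu             sync.Mutex
	conditions     map[string]bool
	recommendation *string
}

// HasRecommendation is the public, locking accessor.
func (v *Vpa) HasRecommendation() bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	return v.hasRecommendationLocked()
}

// hasRecommendationLocked assumes the caller already holds v.mu.
func (v *Vpa) hasRecommendationLocked() bool {
	return v.recommendation != nil
}

// UpdateConditions takes the lock once, then calls only *Locked helpers.
// Calling the public HasRecommendation() from here instead would deadlock,
// because sync.Mutex cannot be acquired twice by the same goroutine.
func (v *Vpa) UpdateConditions(podsMatched bool) {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.conditions == nil {
		v.conditions = map[string]bool{}
	}
	if v.hasRecommendationLocked() {
		v.conditions["RecommendationProvided"] = true
	}
	v.conditions["NoPodsMatched"] = !podsMatched
}

func main() {
	rec := "target: 100m"
	v := &Vpa{recommendation: &rec}
	v.UpdateConditions(true)
	fmt.Println(v.HasRecommendation()) // true
}
```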
There are several race conditions present around updating the VPA concurrently. I added tests specifically for the fixes to the Conditions concurrent map access we were seeing, as well as some other tests that highlight additional issues I didn't have time to address immediately. Those additional tests are set to Skip by default, but if you uncomment them they will detail those additional issues. Signed-off-by: John Kyros <jkyros@redhat.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jkyros

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Hi @jkyros. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with

Once the patch is verified, the new status will be reflected by the

I understand the commands that are listed here.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Since I added a mutex to the VPA model, the checkpoint test can no longer copy it by value and has to use references. This just updates the checkpoint writer test to account for that. Signed-off-by: John Kyros <jkyros@redhat.com>
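The reason the test has to switch to references: a struct that embeds a `sync.Mutex` must never be copied by value, because the copy gets its own independent lock that no longer guards the shared data (and `go vet`'s copylocks check flags it). A minimal sketch, with illustrative names rather than the real checkpoint writer types:

```go
package main

import (
	"fmt"
	"sync"
)

// Vpa embeds a mutex, so it must be handled by pointer from here on.
// Copying a Vpa by value would duplicate the lock and trip `go vet`.
type Vpa struct {
	mu         sync.Mutex
	conditions map[string]bool
}

// newVpa hands out pointers so every caller shares the same mutex.
func newVpa() *Vpa {
	return &Vpa{conditions: map[string]bool{}}
}

// snapshotCount reads the guarded map through the shared lock.
func snapshotCount(v *Vpa) int {
	v.mu.Lock()
	defer v.mu.Unlock()
	return len(v.conditions)
}

func main() {
	// Store *Vpa in maps, not Vpa: map reads return copies of values,
	// and a copied Vpa would carry a copied mutex.
	vpas := map[string]*Vpa{"default/my-vpa": newVpa()}
	// Single-goroutine setup, so a direct write is fine in this sketch.
	vpas["default/my-vpa"].conditions["RecommendationProvided"] = true
	fmt.Println(snapshotCount(vpas["default/my-vpa"])) // 1
}
```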
/ok-to-test
Thanks for doing this! I appreciate it. This does feel like a 1.6 blocker to me, so I'll see if I can also help push this forward.
```diff
@@ -0,0 +1,412 @@
+/*
+Copyright 2024 The Kubernetes Authors.
```
nit: we don't need to write the year anymore (and it's almost 2026)
```diff
 func isFetchingHistory(vpa *model.Vpa) bool {
-	condition, found := vpa.Conditions[vpa_types.FetchingHistory]
-	if !found {
-		return false
-	}
-	return condition.Status == v1.ConditionTrue
+	return vpa.ConditionActive(vpa_types.FetchingHistory)
 }
```
This is so weird, I can't see anything that sets this condition.
I've made #8969 to dig into this more later.
```diff
-	delete(feeder.clusterState.VPAs()[vpaID].Conditions, condition.conditionType)
+	feeder.clusterState.VPAs()[vpaID].DeleteCondition(condition.conditionType)
 } else {
-	feeder.clusterState.VPAs()[vpaID].Conditions.Set(condition.conditionType, true, "", condition.message)
+	feeder.clusterState.VPAs()[vpaID].SetCondition(condition.conditionType, true, "", condition.message)
```
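Accessors like the ones in this hunk might look roughly like the sketch below. This is an assumption about their shape, not the PR's actual implementation: the real model stores richer condition values, but the point is that the raw `delete()` and map write move behind the VPA's mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// condition is a condensed stand-in for the real condition value.
type condition struct {
	status  bool
	reason  string
	message string
}

// Vpa guards its condition map with a mutex.
type Vpa struct {
	mu         sync.Mutex
	conditions map[string]condition
}

// SetCondition replaces direct writes to the Conditions map.
func (v *Vpa) SetCondition(conditionType string, status bool, reason, message string) {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.conditions == nil {
		v.conditions = map[string]condition{}
	}
	v.conditions[conditionType] = condition{status: status, reason: reason, message: message}
}

// DeleteCondition replaces delete() on the raw map.
func (v *Vpa) DeleteCondition(conditionType string) {
	v.mu.Lock()
	defer v.mu.Unlock()
	delete(v.conditions, conditionType)
}

func main() {
	v := &Vpa{}
	v.SetCondition("FetchingHistory", true, "", "loading history")
	v.DeleteCondition("FetchingHistory")
	fmt.Println(len(v.conditions)) // 0
}
```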
I have a PR to change this delete/create logic into setting a condition to true/false.
But your PR takes priority here, so I'll rebase mine once you're done.
In general this looks fine to me; it would be nice to get another set of eyes on it. There is definitely a smell here that needs solving, which is possibly a larger refactor of the recommender.
What type of PR is this?
/kind bug
/kind regression
What this PR does / why we need it:
So back when we did #7992, we made VPA updates concurrent, which was cool, but it maybe introduced some race conditions. So then we did #8320 to try and get some safety around that, but it looks like there are still quite a few possible races in there. 🙈
I think this can be worked around by setting `--update-worker-count=1` (but you will obviously lose the performance you were getting from the concurrency).

Anyway, I'm currently getting reports of folks with big AI clusters hitting fatal concurrent map writes in `UpdateVPA()` in 1.5. It seems like it's happening every few minutes:
I think it (well, at least one of them) is happening here, where if two VPA updates come through for the same VPA and `processVPAUpdate(r, vpa, observedVpa)` writes the conditions while `cnt.Add(vpa)` reads them, we die:

autoscaler/vertical-pod-autoscaler/pkg/recommender/routines/recommender.go
Lines 140 to 145 in e4a0776
So my thinking was that if we got a mutex around the VPA we could protect that too.
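The shape of that race can be reproduced in miniature. The names below are illustrative stand-ins (a writer playing the role of `processVPAUpdate` and a reader playing `cnt.Add`), not the actual recommender code; with the VPA-level mutex in place both sides are safe, and if you strip the locking out, `go run -race` reports the conflict.

```go
package main

import (
	"fmt"
	"sync"
)

// Vpa holds a condition map shared between a writer (like
// processVPAUpdate) and a reader (like cnt.Add).
type Vpa struct {
	mu         sync.Mutex
	conditions map[string]bool
}

// writeConditions plays the processVPAUpdate role: it mutates the map.
func (v *Vpa) writeConditions(n int) {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.conditions == nil {
		v.conditions = map[string]bool{}
	}
	v.conditions[fmt.Sprintf("cond-%d", n%3)] = true
}

// readConditions plays the cnt.Add role: it iterates the map.
func (v *Vpa) readConditions() int {
	v.mu.Lock()
	defer v.mu.Unlock()
	count := 0
	for _, active := range v.conditions {
		if active {
			count++
		}
	}
	return count
}

func main() {
	v := &Vpa{}
	var wg sync.WaitGroup
	// Concurrent writers and readers on the same VPA, like two update
	// workers processing the same object at once.
	for i := 0; i < 50; i++ {
		wg.Add(2)
		go func(i int) { defer wg.Done(); v.writeConditions(i) }(i)
		go func() { defer wg.Done(); _ = v.readConditions() }()
	}
	wg.Wait()
	fmt.Println(v.readConditions()) // 3
}
```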
But there is that spot today where `UpdateConditions()` calls `HasRecommendation()` from inside itself:

autoscaler/vertical-pod-autoscaler/pkg/recommender/model/vpa.go
Lines 256 to 275 in e4a0776

so those private `xxxLocked()` functions I added are there so we don't have to worry about "recursive locking"/deadlocks.

So yeah, this is basically:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: