
Conversation

@jkyros
Contributor

@jkyros jkyros commented Dec 22, 2025

What type of PR is this?

/kind bug
/kind regression

What this PR does / why we need it:

So back when we did #7992, we made VPA updates concurrent, which was cool, but it maybe introduced some race conditions. So then we did #8320 to try to get some safety around that, but it looks like there are still quite a few possible races in there. 🙈

I think this can be worked around by setting --update-worker-count=1 (but you will obviously lose the performance you were getting from the concurrency).

Anyway, I'm currently getting reports of folks with big AI clusters hitting fatal concurrent map writes in UpdateVPAs() in 1.5:

fatal error: concurrent map writes

goroutine 719 [running]:
internal/runtime/maps.fatal({0x217e142?, 0xc00195ada0?})
	/usr/lib/golang/src/runtime/panic.go:1058 +0x18
k8s.io/autoscaler/vertical-pod-autoscaler/pkg/utils/metrics/recommender.(*ObjectCounter).Add(0xc0005882b8, 0xc001f00e40)
	/go/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/utils/metrics/recommender/recommender.go:192 +0x2b2
k8s.io/autoscaler/vertical-pod-autoscaler/pkg/recommender/routines.(*recommender).UpdateVPAs.func1()
	/go/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/recommender/routines/recommender.go:144 +0xa8
created by k8s.io/autoscaler/vertical-pod-autoscaler/pkg/recommender/routines.(*recommender).UpdateVPAs in goroutine 1
	/go/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/recommender/routines/recommender.go:131 +0x14b

goroutine 1 [sync.WaitGroup.Wait]:
sync.runtime_SemacquireWaitGroup(0xc0000638a0?)
	/usr/lib/golang/src/runtime/sema.go:110 +0x25
sync.(*WaitGroup).Wait(0xc0016fea10?)
	/usr/lib/golang/src/sync/waitgroup.go:118 +0x48
k8s.io/autoscaler/vertical-pod-autoscaler/pkg/recommender/routines.(*recommender).UpdateVPAs(0xc001f00a80)
	/go/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/recommender/routines/recommender.go:158 +0x272
k8s.io/autoscaler/vertical-pod-autoscaler/pkg/recommender/routines.(*recommender).RunOnce(0xc001f00a80)
	/go/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/recommender/routines/recommender.go:190 +0x332
main.run({0x2469858, 0x36ff760}, 0xc0001d0090, 0xc00040c500)
	/go/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/recommender/main.go:337 +0x130d
main.main()
	/go/src/k8s.io/autoscaler/vertical-pod-autoscaler/pkg/recommender/main.go:160 +0x685

It seems like it's happening every few minutes:

NAME                                      READY   STATUS    RESTARTS        AGE
vpa-recommender-default-d8f9954d8-ljzxr   1/1     Running   129 (24s ago)   15h

I think it (well, at least one of them) is happening here: if two updates come through for the same VPA, processVPAUpdate(r, vpa, observedVpa) can write the conditions while cnt.Add(vpa) reads them, and we die:

	if !found {
		return
	}
	processVPAUpdate(r, vpa, observedVpa) // writes vpa.Conditions
	cnt.Add(vpa)                          // reads vpa.Conditions
}
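For a sense of why this is fatal rather than just flaky: Go's runtime hard-aborts on unsynchronized map access, and a toy example (plain maps, nothing VPA-specific) dies the same way:

package main

import "sync"

func main() {
	conditions := map[string]string{} // stand-in for vpa.Conditions
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 10000; j++ {
				// Unsynchronized writes from two goroutines; the runtime's
				// best-effort detector usually kills the process with
				// "fatal error: concurrent map writes", which cannot be
				// recovered.
				conditions["RecommendationProvided"] = "True"
			}
		}()
	}
	wg.Wait()
}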

So my thinking was that if we got a mutex around the VPA we could protect that too.

But there is a spot today where UpdateConditions() calls HasRecommendation() from inside itself:

func (vpa *Vpa) UpdateConditions(podsMatched bool) {
	reason := ""
	msg := ""
	if podsMatched {
		delete(vpa.Conditions, vpa_types.NoPodsMatched)
	} else {
		reason = "NoPodsMatched"
		msg = "No pods match this VPA object"
		vpa.Conditions.Set(vpa_types.NoPodsMatched, true, reason, msg)
	}
	if vpa.HasRecommendation() {
		vpa.Conditions.Set(vpa_types.RecommendationProvided, true, "", "")
	} else {
		vpa.Conditions.Set(vpa_types.RecommendationProvided, false, reason, msg)
	}
}

// AsStatus returns this objects equivalent of VPA Status. UpdateConditions
// should be called first.

so the private xxxLocked() functions I added are there so we don't have to worry about "recursive locking"/deadlocks (Go's sync.Mutex isn't reentrant).
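For context, here's a minimal non-VPA illustration of the trap those Locked variants avoid: a locking method that calls another locking method on the same object self-deadlocks.

package main

import "sync"

type box struct {
	mu sync.Mutex
	n  int
}

// Set is an ordinary locking method.
func (b *box) Set(n int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.n = n
}

// Reset takes the lock and then calls Set, which tries to Lock again.
func (b *box) Reset() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.Set(0) // deadlock: this goroutine already holds b.mu
}

func main() {
	b := &box{}
	b.Reset() // fatal error: all goroutines are asleep - deadlock!
}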

So yeah, this is basically:

  • Add a mutex to the VPA
  • Replace direct access to Conditions/Recommendations with locking method access
  • Add private "already locked" functions to prevent recursive locking issues (see the sketch below)
  • (I know races on pointers/slices aren't fatal, but if I don't protect Recommendations too, I fail all the data race tests)
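A rough sketch of the resulting shape (type, field, and method names here are illustrative stand-ins, not the exact code in this PR):

package model

import "sync"

// Vpa is a stand-in for the recommender's model.Vpa; the real type has many
// more fields. The mutex guards the condition map (and the recommendation).
type Vpa struct {
	mu         sync.Mutex
	conditions map[string]bool
}

// SetCondition is the public entry point: take the lock, then delegate.
func (v *Vpa) SetCondition(key string, val bool) {
	v.mu.Lock()
	defer v.mu.Unlock()
	v.setConditionLocked(key, val)
}

// UpdateConditions also needs the lock, and it needs to set conditions while
// already holding it, so it calls the private variant instead of SetCondition.
func (v *Vpa) UpdateConditions(podsMatched bool) {
	v.mu.Lock()
	defer v.mu.Unlock()
	v.setConditionLocked("NoPodsMatched", !podsMatched)
}

// setConditionLocked assumes the caller already holds v.mu.
func (v *Vpa) setConditionLocked(key string, val bool) {
	if v.conditions == nil {
		v.conditions = map[string]bool{}
	}
	v.conditions[key] = val
}

With this split, external callers pay for the lock exactly once, and internal callers that already hold it never try to re-acquire it.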

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

  • I'm not in love with the approach I've taken here, and we probably need to do something more holistic, but I'm on vacation and we were talking about the deadlines for 1.6 so uh...I wanted to get something up to at least frame trying to fix this 😅
  • The test suite is probably excessive, but I wrote those tests while hunting this down, and I figured I'd give you all I had

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


Protecting just the ObjectCounter isn't good enough; we're also
vulnerable on Conditions and Recommendations. We've been seeing
concurrent map access panics where it looks like two goroutines both
try to touch Conditions at the same time.

This better protects the VPA fields via locking wrapper functions so
that concurrent access can't happen. I did not have time for a more
elegant refactor; I just wanted to stop the crash, since it's
disruptive for users.

Basically I'm adding locking methods, but because of some of our
nesting I also had to add private "already locked" methods so the
locking methods can safely call each other internally without trying
to recursively grab the lock.

Signed-off-by: John Kyros <jkyros@redhat.com>
There are several race conditions present around updating the VPA
concurrently. I added tests specifically for the fixes to the
Conditions concurrent map access we were seeing, as well as some other
tests that highlight additional issues I didn't have time to address
immediately. Those additional tests are set to Skip by default, but if
you remove the Skip they will detail those additional issues.

Signed-off-by: John Kyros <jkyros@redhat.com>
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. labels Dec 22, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jkyros
Once this PR has been reviewed and has the lgtm label, please assign omerap12 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added do-not-merge/needs-area cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 22, 2025
@k8s-ci-robot
Contributor

Hi @jkyros. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. area/vertical-pod-autoscaler size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed do-not-merge/needs-area labels Dec 22, 2025
Since I added a mutex to the VPA model, the checkpoint test can no
longer copy it by value and has to use references. This just updates
the checkpoint writer test to account for this.

Signed-off-by: John Kyros <jkyros@redhat.com>
@adrianmoisey
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 22, 2025
@adrianmoisey
Member

  • I'm not in love with the approach I've taken here, and we probably need to do something more holistic, but I'm on vacation and we were talking about the deadlines for 1.6 so uh...I wanted to get something up to at least frame trying to fix this 😅

Thanks for doing this! I appreciate it. This does feel like a 1.6 blocker to me, so I'll see if I can also help push this forward.

@@ -0,0 +1,412 @@
/*
Copyright 2024 The Kubernetes Authors.
Member


nit: we don't need to write the year anymore (and it's almost 2026)

Comment on lines 56 to 58
 func isFetchingHistory(vpa *model.Vpa) bool {
-	condition, found := vpa.Conditions[vpa_types.FetchingHistory]
-	if !found {
-		return false
-	}
-	return condition.Status == v1.ConditionTrue
+	return vpa.ConditionActive(vpa_types.FetchingHistory)
 }
Member


This is so weird; I can't see anything that sets this condition.
I've made #8969 to dig into this more later.

Comment on lines -447 to +449
-			delete(feeder.clusterState.VPAs()[vpaID].Conditions, condition.conditionType)
+			feeder.clusterState.VPAs()[vpaID].DeleteCondition(condition.conditionType)
 		} else {
-			feeder.clusterState.VPAs()[vpaID].Conditions.Set(condition.conditionType, true, "", condition.message)
+			feeder.clusterState.VPAs()[vpaID].SetCondition(condition.conditionType, true, "", condition.message)
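These wrappers presumably just take the VPA's new mutex around the old map operations; a sketch of that shape (the lock field name is hypothetical, and Conditions.Set matches the signature quoted earlier):

// DeleteCondition removes a condition from the VPA under its lock.
func (vpa *Vpa) DeleteCondition(conditionType vpa_types.VerticalPodAutoscalerConditionType) {
	vpa.lock.Lock()
	defer vpa.lock.Unlock()
	delete(vpa.Conditions, conditionType)
}

// SetCondition sets a condition on the VPA under its lock.
func (vpa *Vpa) SetCondition(conditionType vpa_types.VerticalPodAutoscalerConditionType, status bool, reason, message string) {
	vpa.lock.Lock()
	defer vpa.lock.Unlock()
	vpa.Conditions.Set(conditionType, status, reason, message)
}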
Member


I have a PR to change this delete/create logic into setting a condition to true/false.
But your PR takes priority here, so I'll rebase mine once you're done.

@adrianmoisey
Copy link
Member

In general this looks fine to me; it would be nice to get another set of eyes on it.

There is definitely a smell here that needs solving, which is possibly a larger refactor of the recommender.
