✨ Use a buffer to optimize priority queue AddWithOpts performance#3415

Merged
k8s-ci-robot merged 1 commit into kubernetes-sigs:main from zach593:pq-buffer on Jan 16, 2026
Conversation

@zach593
Contributor

@zach593 zach593 commented Jan 11, 2026

Before #3416, in the priority queue implementation, calls to spin() could take a significant amount of time when there were a large number of items and many workers blocked in GetWithPriority(). spin() shares the same lock with AddWithOpts(), which caused AddWithOpts() to spend 10 ms or more contending for the lock during writes.

Although AddWithOpts() supports enqueuing multiple items in a single call, the event handler that calls it passes only one item at a time. As a result, when the enqueue volume is high, a large number of items end up blocked in the processorListener's ring buffer.

After #3416 was merged, the cost of handleReadyItems() (i.e., spin()) dropped significantly, but the lock contention remains, so there is still room to optimize AddWithOpts().
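The buffering idea can be sketched roughly as follows. This is an illustrative sketch only, not the merged code: the names `bufferLock`, `buffer`, and `flushAddBuffer` are taken from identifiers visible in the review discussion, while the rest (the plain slice standing in for the real btree, the `addOpts` fields) is assumed for demonstration. Producers only ever take a cheap dedicated lock, so they never contend with spin() on the main lock:

```go
package main

import (
	"fmt"
	"sync"
)

// addOpts stands in for the per-call options AddWithOpts receives.
type addOpts struct {
	priority int
}

type bufferedItem[T any] struct {
	opts addOpts
	item T
}

// queue is a minimal sketch: adds land in a buffer guarded by a cheap
// dedicated lock, so producers never wait on the main lock that
// spin()/GetWithPriority() hold while draining items.
type queue[T any] struct {
	lock  sync.Mutex // main lock, contended by the consumer side
	items []T        // stands in for the real btree of items

	bufferLock sync.Mutex
	buffer     []bufferedItem[T]
}

// AddWithOpts only takes the buffer lock, which is never held for long.
func (q *queue[T]) AddWithOpts(o addOpts, items ...T) {
	q.bufferLock.Lock()
	for _, it := range items {
		q.buffer = append(q.buffer, bufferedItem[T]{opts: o, item: it})
	}
	q.bufferLock.Unlock()
	// In the real queue this would also notify the spin() goroutine,
	// which then flushes the buffer under the main lock.
}

// flushAddBuffer moves buffered items into the main structure; it is
// called from the consumer side, which already pays for the main lock.
func (q *queue[T]) flushAddBuffer() {
	q.bufferLock.Lock()
	buffered := q.buffer
	q.buffer = nil
	q.bufferLock.Unlock()

	q.lock.Lock()
	for _, b := range buffered {
		q.items = append(q.items, b.item)
	}
	q.lock.Unlock()
}

func main() {
	q := &queue[int]{}
	q.AddWithOpts(addOpts{priority: 1}, 1, 2, 3)
	q.flushAddBuffer()
	fmt.Println(len(q.items)) // 3
}
```

The key point is that AddWithOpts() no longer blocks behind a consumer holding the main lock for milliseconds; it pays only for a short append under `bufferLock`.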

benchmark function:

func BenchmarkAddLockContended(b *testing.B) {
	q := New[int]("")
	defer q.ShutDown()

	for i := range 1000000 {
		q.Add(i)
	}

	for range 1000 {
		go func() {
			for {
				item, _ := q.Get()
				time.Sleep(1 * time.Millisecond)
				q.Done(item)
			}
		}()
	}

	for b.Loop() {
		for i := range 1000 {
			q.Add(i)
		}
	}
}

upstream (5de4c4f) with 1,000,000 preloaded items, 1,000 workers, 1 ms reconcile cost:

$ go test sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue -bench=BenchmarkAddLockContended -run=^$ -v -benchtime=10s -count=10

goos: linux
goarch: amd64
pkg: sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue
cpu: 12th Gen Intel(R) Core(TM) i9-12900K
BenchmarkAddLockContended
BenchmarkAddLockContended-8   	   13590	    860324 ns/op
BenchmarkAddLockContended-8   	   12276	    962917 ns/op
BenchmarkAddLockContended-8   	   13140	    884829 ns/op
BenchmarkAddLockContended-8   	   10000	   1099948 ns/op
BenchmarkAddLockContended-8   	   12951	    957041 ns/op
BenchmarkAddLockContended-8   	   10384	   1097541 ns/op
BenchmarkAddLockContended-8   	   13532	    859918 ns/op
BenchmarkAddLockContended-8   	   12502	    933303 ns/op
BenchmarkAddLockContended-8   	   13098	    893137 ns/op
BenchmarkAddLockContended-8   	   13420	    898457 ns/op
PASS
ok  	sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue	119.886s

optimized:

$ go test sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue -bench=BenchmarkAddLockContended -run=^$ -v -benchtime=10s -count=10

goos: linux
goarch: amd64
pkg: sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue
cpu: 12th Gen Intel(R) Core(TM) i9-12900K
BenchmarkAddLockContended
BenchmarkAddLockContended-8   	  110494	     93792 ns/op
BenchmarkAddLockContended-8   	  118648	     84992 ns/op
BenchmarkAddLockContended-8   	  136893	     87722 ns/op
BenchmarkAddLockContended-8   	  139353	     92412 ns/op
BenchmarkAddLockContended-8   	  149298	     77878 ns/op
BenchmarkAddLockContended-8   	  170970	     81339 ns/op
BenchmarkAddLockContended-8   	  185634	     61435 ns/op
BenchmarkAddLockContended-8   	  163879	     80373 ns/op
BenchmarkAddLockContended-8   	  194013	     83871 ns/op
BenchmarkAddLockContended-8   	  222698	     76021 ns/op
PASS
ok  	sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue	136.549s

With the same benchmark setup, the optimized version brings BenchmarkAddLockContended() down from ~1 ms/op to ~80 µs/op, showing a 10–15× improvement under high contention.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 11, 2026
@k8s-ci-robot
Contributor

Hi @zach593. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 11, 2026
@alvaroaleman
Member

alvaroaleman commented Jan 12, 2026

/ok-to-test

Sorry for the other pings, I didn't see this PR before.

Executing the provided benchmark on main takes more than 100 seconds for me on my M2 Pro, so I agree there is clearly a problem. It is a bit surprising that it takes almost seven times longer on my machine than what you got, but that difference probably doesn't matter much; the bottom line is that it's too slow.

Using the provided benchmark to compare this with #3416, I get these results:

$ benchstat result-buffer.txt result-twotrees.txt
goos: darwin
goarch: arm64
pkg: sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue
cpu: Apple M2 Pro
                    │ result-buffer.txt │            result-twotrees.txt            │
                    │      sec/op       │     sec/op      vs base                   │
AddLockContended-10        37.84µ ± 10%   4976.99µ ± 17%  +13052.54% (p=0.000 n=10)

Going further and comparing #3416 with the upstream workqueue, I get these results (worth noting that the upstream queue has a crazy high variance, so something is weird there, but I didn't manage to figure out what):

$ benchstat result-twotrees.txt ~/upstreamq.txt
goos: darwin
goarch: arm64
pkg: sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue
cpu: Apple M2 Pro
                    │ result-twotrees.txt │    /Users/alvaro/ratelimitingq.txt     │
                    │       sec/op        │    sec/op      vs base                 │
AddLockContended-10          4.977m ± 17%   19.837m ± 97%  +298.58% (p=0.023 n=10)
Upstream diff for the benchmark
diff --git a/staging/src/k8s.io/client-go/util/workqueue/rate_limiting_queue_test.go b/staging/src/k8s.io/client-go/util/workqueue/rate_limiting_queue_test.go
index 9bf22c4b9ab..0cbc928ad57 100644
--- a/staging/src/k8s.io/client-go/util/workqueue/rate_limiting_queue_test.go
+++ b/staging/src/k8s.io/client-go/util/workqueue/rate_limiting_queue_test.go
@@ -23,6 +23,33 @@ import (
 	testingclock "k8s.io/utils/clock/testing"
 )
 
+func BenchmarkAddLockContended(b *testing.B) {
+	q := NewTypedWithConfig(TypedQueueConfig[int]{
+		MetricsProvider: noopMetricsProvider{},
+	})
+	defer q.ShutDown()
+
+	for i := range 1000000 {
+		q.Add(i)
+	}
+
+	for range 1000 {
+		go func() {
+			for {
+				item, _ := q.Get()
+				time.Sleep(1 * time.Millisecond)
+				q.Done(item)
+			}
+		}()
+	}
+
+	for b.Loop() {
+		for i := range 1000 {
+			q.Add(i)
+		}
+	}
+}
+
 func TestRateLimitingQueue(t *testing.T) {
 	limiter := NewItemExponentialFailureRateLimiter(1*time.Millisecond, 1*time.Second)
 	queue := NewRateLimitingQueue(limiter).(*rateLimitingType[any])

So #3416 seems to be slower than this, but still somewhat faster than upstream? I am mostly bringing that up because I'd prefer to get that one in first; we could still do a version of this afterward if you want, since it's effectively just trading a bit of memory for reduced lock contention.

/cc @sbueringer

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 12, 2026
log logr.Logger

bufferLock sync.Mutex
buffer map[addOpts][]T
Member
Why not a slice that combines addOpts and key? That way we'd preserve order and it would simplify the code quite a bit.

Contributor Author

Updated. Using a slice does simplify the code structure, but it comes with additional overhead, since each element needs to carry its AddOpts. Benchmarks show that the slice version increases the overall cost by about 70%. However, given that the baseline latency is very low, this overhead should still be within an acceptable range.

Still with 1,000,000 preloaded items, 1,000 workers, 1ms reconcile cost.

use map:

$ go test sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue -bench=BenchmarkAddLockContended -run=^$ -v -benchtime=10s -count=10
goos: linux
goarch: amd64
pkg: sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue
cpu: 12th Gen Intel(R) Core(TM) i9-12900K
BenchmarkAddLockContended
BenchmarkAddLockContended-8       159739             72763 ns/op
BenchmarkAddLockContended-8       266252             42454 ns/op
BenchmarkAddLockContended-8       187652             74544 ns/op
BenchmarkAddLockContended-8       278227             43991 ns/op
BenchmarkAddLockContended-8       284910             42860 ns/op
BenchmarkAddLockContended-8       290059             43863 ns/op
BenchmarkAddLockContended-8       280312             42673 ns/op
BenchmarkAddLockContended-8       281852             41656 ns/op
BenchmarkAddLockContended-8       280782             46362 ns/op
BenchmarkAddLockContended-8       266382             47360 ns/op
PASS
ok      sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue     125.617s

use slice:

$ go test sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue -bench=BenchmarkAddLockContended -run=^$ -v -benchtime=10s -count=10
goos: linux
goarch: amd64
pkg: sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue
cpu: 12th Gen Intel(R) Core(TM) i9-12900K
BenchmarkAddLockContended
BenchmarkAddLockContended-8       153133             92869 ns/op
BenchmarkAddLockContended-8       163017            100041 ns/op
BenchmarkAddLockContended-8       214800             80015 ns/op
BenchmarkAddLockContended-8       150663             86073 ns/op
BenchmarkAddLockContended-8       213728             65746 ns/op
BenchmarkAddLockContended-8       145710             94048 ns/op
BenchmarkAddLockContended-8       195998             80824 ns/op
BenchmarkAddLockContended-8       132121             87749 ns/op
BenchmarkAddLockContended-8       165680             81779 ns/op
BenchmarkAddLockContended-8       144142             85884 ns/op
PASS
ok      sigs.k8s.io/controller-runtime/pkg/controller/priorityqueue     155.121s
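The two buffer layouts being compared above can be sketched as follows. The types are illustrative (the `addOpts` fields are assumptions), not the actual implementation; the point is the trade-off: the map stores each distinct addOpts once but loses insertion order across different opts values, while the slice preserves the exact order of Add calls at the cost of carrying the opts alongside every element:

```go
package main

import "fmt"

// addOpts stands in for the per-call options; fields are illustrative.
type addOpts struct {
	priority    int
	rateLimited bool
}

// Map layout: one slice per distinct addOpts value. Opts are stored once
// per group, but insertion order across different opts values is lost.
type mapBuffer[T any] struct {
	m map[addOpts][]T
}

func (b *mapBuffer[T]) add(o addOpts, item T) {
	if b.m == nil {
		b.m = map[addOpts][]T{}
	}
	b.m[o] = append(b.m[o], item)
}

// Slice layout: every element carries its own addOpts, so the buffer
// preserves the exact order of Add calls at the cost of the extra
// per-element opts storage (the ~70% overhead measured above).
type sliceBuffer[T any] struct {
	s []struct {
		opts addOpts
		item T
	}
}

func (b *sliceBuffer[T]) add(o addOpts, item T) {
	b.s = append(b.s, struct {
		opts addOpts
		item T
	}{o, item})
}

func main() {
	var mb mapBuffer[string]
	var sb sliceBuffer[string]
	for _, it := range []string{"a", "b", "c"} {
		mb.add(addOpts{priority: 1}, it)
		sb.add(addOpts{priority: 1}, it)
	}
	fmt.Println(len(mb.m[addOpts{priority: 1}]), len(sb.s)) // 3 3
}
```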

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 12, 2026
@zach593
Contributor Author

zach593 commented Jan 12, 2026

@alvaroaleman Indeed, #3416 appears to address the cost of spin, since it no longer needs to traverse every item. However, this optimization still seems to provide a significant speedup on top of #3416, so it may still have some value. Since you’d like to handle #3416 first, please ping me after it’s merged. I’ll update this PR accordingly, and then we can review it again.

@sbueringer
Member

@zach593 It is merged

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 14, 2026
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 15, 2026
@zach593 zach593 force-pushed the pq-buffer branch 2 times, most recently from f4f4ee2 to 2e3ff1a on January 15, 2026 at 17:37
@zach593
Contributor Author

zach593 commented Jan 15, 2026

Updated benchmark results: same setup and numbers as in the updated PR description above. The optimized version brings BenchmarkAddLockContended() down from ~1 ms/op to ~80 µs/op, a 10–15× improvement under high contention.

// primarily for unit testing purposes.
// Successfully adding a ready item may result in an additional call to handleReadyItems(),
// but the cost is negligible.
w.flushAddBuffer()
Member
Should we also do this in handleWaitingItems?

Contributor Author

@zach593 zach593 Jan 15, 2026
I don't think it's necessary. Items won't stay in the buffer for long, because once an item enters the buffer, handleAddBuffer() will be triggered. There might be a tiny delay, but that should be negligible for items carrying a non-zero After.

In fact, in most real usage scenarios, adding a flush in Len() and handleReadyItems() is also unnecessary, since everything runs in an asynchronous environment. The only reason to keep it is to avoid odd behavior in synchronous setups and to make unit tests simpler. Otherwise, we would have to write Eventually(q.addBufferLen).Should(BeZero()) everywhere.
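The flush-on-read behavior described above can be sketched like this (an illustrative sketch, not the merged code: the queue layout and method bodies are assumptions, only the idea of Len() flushing first comes from the discussion). Without the flush, a synchronous caller checking the length right after Add would race against the asynchronous drain:

```go
package main

import (
	"fmt"
	"sync"
)

// Sketch: a queue whose adds go into a side buffer, with Len flushing
// that buffer first so synchronous callers (e.g. unit tests) see items
// immediately after Add returns.
type queue struct {
	lock  sync.Mutex
	items []int

	bufferLock sync.Mutex
	buffer     []int
}

func (q *queue) Add(i int) {
	q.bufferLock.Lock()
	q.buffer = append(q.buffer, i)
	q.bufferLock.Unlock()
}

func (q *queue) flushAddBuffer() {
	q.bufferLock.Lock()
	buffered := q.buffer
	q.buffer = nil
	q.bufferLock.Unlock()

	q.lock.Lock()
	q.items = append(q.items, buffered...)
	q.lock.Unlock()
}

// Len flushes first; otherwise a test would have to poll with something
// like Eventually(q.Len).Should(Equal(n)) until the async flush ran.
func (q *queue) Len() int {
	q.flushAddBuffer()
	q.lock.Lock()
	defer q.lock.Unlock()
	return len(q.items)
}

func main() {
	q := &queue{}
	q.Add(1)
	q.Add(2)
	fmt.Println(q.Len()) // 2
}
```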

Signed-off-by: zach593 <zach_li@outlook.com>
Member

@alvaroaleman alvaroaleman left a comment

/hold

lgtm

@sbueringer any comments?

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 15, 2026
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2026
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 3d6913c3326dac3a935c8948958cf72f49b549d0

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 15, 2026
@sbueringer
Member

/assign

Will review later today

@sbueringer
Member

@zach593 Please update the PR description. I think it's probably slightly outdated after #3416

@sbueringer
Member

All good. Thank you very much!!

/hold cancel
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 16, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, sbueringer, zach593

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [alvaroaleman,sbueringer]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 137b9c0 into kubernetes-sigs:main Jan 16, 2026
11 checks passed
@sbueringer
Member

@zach593 Thank you very much for the feedback and the optimizations!

@zach593
Contributor Author

zach593 commented Jan 16, 2026

@zach593 Please update the PR description. I think it's probably slightly outdated after #3416

@sbueringer I just updated the description.
