dedicated event etcd draft PoC for ShiftWeek by tjungblu · Pull Request #1505 · openshift/cluster-etcd-operator

tjungblu · 2025-10-27T11:55:36Z

/hold

just here for CI runs and cluster bot builds

openshift-ci · 2025-10-27T11:55:40Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2025-10-27T11:55:53Z

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)

do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


openshift-ci · 2025-10-27T11:55:57Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [tjungblu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tjungblu · 2025-10-30T11:53:29Z

Some quick benchmark results using:

etcdctl check perf --load="xl"

This is running against the normal etcd:

export ETCDCTL_ENDPOINTS="https://localhost:2379"
etcdctl check perf --load="xl" 
FAIL: Throughput too low: 8097 writes/s
PASS: Slowest request took 0.077356s
PASS: Stddev is 0.004320s
FAIL

This is running against the in-memory etcd on localhost:

export ETCDCTL_ENDPOINTS="https://localhost:20379"
etcdctl check perf --load="xl"
... 
FAIL: Throughput too low: 11476 writes/s
PASS: Slowest request took 0.068861s
PASS: Stddev is 0.002417s
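The two runs above differ only in the endpoint, so they can be driven from one loop. This is a minimal sketch: the function name `run_bench` and the `ETCDCTL` override variable are made up for illustration, and it assumes the TLS client settings that etcdctl needs are already present in the environment, as they must have been for the runs above.

```shell
#!/bin/sh
# run_bench ENDPOINT...: run the same perf check against each endpoint in turn.
# ETCDCTL is a hypothetical override so the binary can be swapped out
# (for example with `echo` for a dry run); it defaults to plain `etcdctl`.
run_bench() {
  for ep in "$@"; do
    echo "== $ep"
    ETCDCTL_ENDPOINTS="$ep" "${ETCDCTL:-etcdctl}" check perf --load="xl"
  done
}

# Usage, with the two endpoints from this comment:
#   run_bench https://localhost:2379 https://localhost:20379
```

Setting `ETCDCTL=echo` before calling `run_bench` prints the commands instead of executing them, which is handy for checking the loop before putting load on a live cluster.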

Now this is pretty crappy, let's try some tuning:

unsafe-no-fsync=true

FAIL: Throughput too low: 11407 writes/s
PASS: Slowest request took 0.061991s
PASS: Stddev is 0.001785s

This seems to have no effect.

Raising --backend-batch-limit to 50000 and --backend-batch-interval to 1m (up from 10000 and 100ms):

FAIL: Throughput too low: 11438 writes/s
PASS: Slowest request took 0.082044s
PASS: Stddev is 0.002382s

Also no real effect, which leads me to believe this benchmark is not really disk-bound in the first place.
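With several runs to compare, it can help to pull just the throughput figure out of each result. This is a minimal sketch that assumes the output keeps the `Throughput ... N writes/s` shape seen in the results above; the helper name `extract_throughput` is hypothetical.

```shell
#!/bin/sh
# extract_throughput: read `etcdctl check perf` output on stdin and print
# the writes/s number from the throughput line (hypothetical helper).
extract_throughput() {
  sed -n 's/.*Throughput[^0-9]*\([0-9][0-9]*\) writes\/s.*/\1/p'
}

# Example with the output of the run against the normal etcd:
printf '%s\n' \
  'FAIL: Throughput too low: 8097 writes/s' \
  'PASS: Slowest request took 0.077356s' \
  'PASS: Stddev is 0.004320s' | extract_throughput
```

The pattern skips whatever text sits between `Throughput` and the number, so it does not depend on the exact wording of the result line.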

Checking the allocation route with --backend-bbolt-freelist-type=array instead of the hashmap implementation yields:

FAIL: Throughput too low: 11630 writes/s
PASS: Slowest request took 0.066573s
PASS: Stddev is 0.001428s

This performs only slightly better.
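To put the numbers side by side, the relative change between two throughput figures can be computed directly. This is a small sketch using only the writes/s values reported in this comment; the function name `pct_change` is made up for illustration.

```shell
#!/bin/sh
# pct_change BASE NEW: print the relative change from BASE to NEW
# as a percentage with one decimal place.
pct_change() {
  awk -v base="$1" -v new="$2" \
    'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

pct_change 8097 11476    # normal etcd vs. in-memory: prints 41.7
pct_change 11476 11407   # in-memory vs. unsafe-no-fsync: prints -0.6
pct_change 11476 11438   # in-memory vs. larger batch settings: prints -0.3
pct_change 11476 11630   # hashmap vs. array freelist: prints 1.3
```

Apart from the switch to in-memory storage, every tuning step stays within roughly one and a half percent of the in-memory baseline.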

After CPU profiling, some interesting findings:

Most of the time is spent in the netFD.write syscall, flushing etcd's responses to the network.
There is significant contention in the .Put and processInternalRaftRequestOnce call chain.
fsync accounts for just 0.9%, which matches what we saw above.
The apply loop is ~6%, which matches the results from changing the batch settings.

Unfortunately, changing the gRPC buffer size options (read and write are at 32K each) requires recompilation, so I won't be able to max this out any further today.

tjungblu · 2025-10-31T10:23:40Z

pkg/operator/dedicatedetcdcontroller/dedicated_etcd_controller.go

+		  --- OR via SVC, as done below ---
+			 - "/events#https://events-etcd.openshift-etcd.svc:20379"


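This requires dnsPolicy=ClusterFirstWithHostNet on the kube-apiserver static pods, since they run on the host network and would otherwise not resolve the .svc name through cluster DNS.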

https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy
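The quoted line appears to follow the `group/resource#servers` format of the kube-apiserver `--etcd-servers-overrides` flag; that reading is an assumption, since the flag name is not shown in the snippet. Under that assumption, here is a minimal sketch of how the value splits into its parts (the helper name `parse_override` is hypothetical):

```shell
#!/bin/sh
# parse_override VALUE: split one "group/resource#servers" override.
# An empty group is printed as "core" for readability.
parse_override() {
  target=${1%%"#"*}        # text before the first '#', e.g. "/events"
  servers=${1#*"#"}        # text after the first '#'
  group=${target%/*}       # empty for the core API group
  resource=${target##*/}   # e.g. "events"
  printf 'group=%s resource=%s servers=%s\n' \
    "${group:-core}" "$resource" "$servers"
}

parse_override "/events#https://events-etcd.openshift-etcd.svc:20379"
```

The server part is a Service DNS name, which is why name resolution from the host-networked kube-apiserver pods matters here.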

This PR contains a dedicated in-memory etcd deployment that will run on one control plane host and configures the kube-apiserver to send events to it.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>

openshift-bot · 2026-01-30T01:00:47Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2026-03-01T08:30:59Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci bot added the do-not-merge/hold and do-not-merge/work-in-progress labels Oct 27, 2025

openshift-ci bot added the approved label Oct 27, 2025

tjungblu force-pushed the RFE-7051 branch 3 times, most recently from 5920b8e to 689045c October 29, 2025 15:58


RFE-7051: add unsupported dedicated events etcd

91f85d9


tjungblu force-pushed the RFE-7051 branch from 689045c to 91f85d9 October 31, 2025 11:05

openshift-ci bot added the lifecycle/stale label Jan 30, 2026

openshift-ci bot added lifecycle/rotten and removed lifecycle/stale labels Mar 1, 2026

tjungblu wants to merge 1 commit into openshift:main from tjungblu:RFE-7051
