
Conversation

Owner

@zachsmith1 zachsmith1 commented Aug 23, 2025

What

  • Deterministic sharding via HRW (rendezvous) over live peers.
  • Peer leases mcr-peer-* (membership/weights) + shard leases mcr-shard-* (per-cluster fencing, single writer).
  • Clean handoff: watches detach/re-attach on ownership change.
  • Rebalance on scale up/down.
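For intuition, rendezvous (HRW) hashing can be sketched roughly like this; the function and peer names below are illustrative, not this PR's actual identifiers:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickOwner returns the peer with the highest rendezvous (HRW) score for a
// cluster key. With HRW, removing a peer only reassigns the clusters that
// peer owned; every other assignment stays put, which is what makes
// rebalancing on scale up/down cheap and deterministic.
func pickOwner(peers []string, cluster string) string {
	var owner string
	var best uint64
	for _, p := range peers {
		h := fnv.New64a()
		h.Write([]byte(p))
		h.Write([]byte{'/'})
		h.Write([]byte(cluster))
		if s := h.Sum64(); owner == "" || s > best {
			owner, best = p, s
		}
	}
	return owner
}

func main() {
	peers := []string{"peer-a", "peer-b", "peer-c"}
	fmt.Println("cluster-1 owned by", pickOwner(peers, "cluster-1"))
}
```

Every peer runs the same pure function over the same live-peer set, so no coordination is needed to agree on ownership; the leases then fence against stale owners.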

Why

  • Prevent double reconciles & cold-start stampede.
  • Balanced, deterministic ownership with fast failover.

How to try

Follow the README (examples/sharded-namespace) for build/deploy/observe steps.

Changes

  • Manager: HRW + per-cluster Lease fencing, peer/fence prefix split.
  • Controller: per-cluster engagement context, re-engage fix.
  • Source: removable handler, re-register on new ctx.

```go
	c()
}
for _, s := range starts {
	go m.startForCluster(s.ctx, s.name, s.cl)
}
```

@ntnn ntnn Aug 23, 2025


Should this happen in goroutines?

If there is the possibility of runnables blocking so long that this becomes a problem, then they could also cancel the contexts of a newer tick, because they access the .cancel of whatever is currently stored.
Actually, the engaged entry would still have .started=true, so a second tick shouldn't start it again.

Maybe wrap them in a wait group with a timeout, if runnables with long engagement times are a concern?

Owner Author


I think we might be okay because:

  • we make the ownership decision under the lock, set started=true, and capture the specific cancel chosen there; that prevents double-starts and ensures any stop only targets the intended context;
  • starts run in goroutines by design so a slow cluster doesn’t stall the tick; they’re based on the captured ctx, not whatever is current later;
  • on re-engage we swap the engagement token (and cancel the old one), and cleanup only deletes if the token still matches — so a late cancel from the old ctx can’t remove the new engagement;
  • actual work is still gated by the fence lease; even if a delayed start tried to run, it wouldn’t pass TryAcquire.

If we ever see edge cases, we can add a lightweight token check in startForCluster (drop stale starts) or an optional engage timeout, but the current shape should be safe and responsive.
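As a rough sketch of the token-matched cleanup described above (types and names here are hypothetical, not this PR's actual code):

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// engagement pairs a cluster's cancel func with a token. cleanup only
// removes the entry when the token still matches, so a late cancel from a
// stale engagement can't evict a newer one. Hypothetical sketch.
type engagement struct {
	token  uint64
	cancel context.CancelFunc
}

type manager struct {
	mu      sync.Mutex
	nextTok uint64
	engaged map[string]*engagement
}

// engage swaps in a fresh engagement under the lock, cancelling any old
// one, and returns the captured context plus its token.
func (m *manager) engage(name string) (context.Context, uint64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if old, ok := m.engaged[name]; ok {
		old.cancel()
	}
	ctx, cancel := context.WithCancel(context.Background())
	m.nextTok++
	m.engaged[name] = &engagement{token: m.nextTok, cancel: cancel}
	return ctx, m.nextTok
}

// cleanup is a no-op unless tok matches the current engagement.
func (m *manager) cleanup(name string, tok uint64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if e, ok := m.engaged[name]; ok && e.token == tok {
		e.cancel()
		delete(m.engaged, name)
	}
}

func main() {
	m := &manager{engaged: map[string]*engagement{}}
	_, t1 := m.engage("cluster-1")
	_, t2 := m.engage("cluster-1") // re-engage: t1 is now stale
	m.cleanup("cluster-1", t1)     // stale token: must not remove
	_, alive := m.engaged["cluster-1"]
	fmt.Println("new engagement survives stale cleanup:", alive, t1 != t2)
}
```

The key invariant is that both the decision and the capture of cancel/token happen under one lock acquisition, so a delayed goroutine can only ever act on the engagement it was started for.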


@embik embik left a comment


I have some more comments probably later, but these are the first few things I wanted to bring up.

Comment on lines +103 to +104
```go
case err != nil:
	return false
```


Thinking out loud here, should we log errors here? I'm a little uncomfortable completely discarding them, but this would be debug logging at best I suppose.

Owner Author


Errors here are expected (conflicts/transient API issues) and this path is hot, so logging in TryAcquire would be noisy. I’d keep it silent and, if we want visibility, add a low-verbosity, rate-limited log (and/or metric) in the caller when TryAcquire returns false. We already throttle retries in the engine, so a V(2)+ log there would be controlled.

Comment on lines 60 to 62
```go
// PerClusterLease controls whether we create one fence per cluster (true) or a
// single shared fence name (false). Per-cluster is recommended.
PerClusterLease bool
```


I wonder if this makes sense, isn't a single fence basically the same as running a global leader election?

Owner Author


Not sure I'm following. A single fence (PerClusterLease=false) is one lease for the whole cluster fleet; one peer would manage all downstream clusters.


Yes, exactly. Is there a noticeable difference to using the built-in leader election of the manager if it's a "winner takes it all" kind of situation?

Owner Author


Good question—PerClusterLease=false does behave like “winner takes all,” but it’s scoped to the multicluster engine and integrates with our peer registry (IDs/weights) and lease naming, so you can later flip to per-cluster fencing without changing your wiring. Built-in manager leader election is great if you truly want a single active process for everything; the single-fence option is for MC-only exclusivity, peer-aware behavior, and an easy path to sharding later. We default to per-cluster; single-fence remains an advanced, opt-in mode.

Co-authored-by: Marvin Beckers <mail@embik.me>
"sigs.k8s.io/controller-runtime/pkg/client"
)

// leaseGuard manages a single Lease as a *fence* for one shard/cluster.
Copy link


@embik wondering, how does this scale with O(1000) workspaces in the case of kcp? 1000 leases generate a lot of server traffic to renew.

In other words, maybe this should be pluggable? Thinking of doing load-balancing in buckets, for example.
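One shape the bucket idea could take (purely hypothetical naming and prefix, not part of this PR): hash clusters into a fixed number of lease buckets, so renewal traffic scales with the bucket count rather than the cluster count.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucketFor maps a cluster to one of nBuckets fence leases, so O(1000)
// clusters renew only nBuckets leases instead of one lease each. The
// "mcr-shard-bucket-" prefix is hypothetical, chosen here for illustration.
func bucketFor(cluster string, nBuckets uint32) string {
	h := fnv.New32a()
	h.Write([]byte(cluster))
	return fmt.Sprintf("mcr-shard-bucket-%d", h.Sum32()%nBuckets)
}

func main() {
	for _, c := range []string{"ws-1", "ws-2", "ws-999"} {
		fmt.Println(c, "->", bucketFor(c, 16))
	}
}
```

The trade-off is coarser failover granularity: losing a bucket's holder reassigns every cluster in that bucket at once.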



This doesn't have to happen in this MR as long as there is some transition path we can come up with. Is this lease logic now active automatically for every controller?


@mjudeikis mjudeikis left a comment


Just a nit. Will defer to @sttts for the full review.

@zachsmith1 zachsmith1 merged commit 5bd7f9b into main Sep 18, 2025

6 participants