
Conversation

Owner

@zachsmith1 zachsmith1 commented Aug 23, 2025

What

  • Deterministic sharding via HRW (rendezvous) over live peers.
  • Peer leases mcr-peer-* (membership/weights) + shard leases mcr-shard-* (per-cluster fencing, single writer).
  • Clean handoff: watches detach/re-attach on ownership change.
  • Rebalance on scale up/down.
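For intuition, rendezvous (HRW) hashing can be sketched roughly like this; the function and peer names below are illustrative, not this PR's actual identifiers:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickOwner returns the peer with the highest rendezvous (HRW) score for a
// cluster key. With HRW, removing a peer only reassigns the clusters that
// peer owned; every other assignment stays put, which is what makes
// rebalancing on scale up/down cheap and deterministic.
func pickOwner(peers []string, cluster string) string {
	var owner string
	var best uint64
	for _, p := range peers {
		h := fnv.New64a()
		h.Write([]byte(p))
		h.Write([]byte{'/'})
		h.Write([]byte(cluster))
		if s := h.Sum64(); owner == "" || s > best {
			owner, best = p, s
		}
	}
	return owner
}

func main() {
	peers := []string{"peer-a", "peer-b", "peer-c"}
	fmt.Println("cluster-1 owned by", pickOwner(peers, "cluster-1"))
}
```

Every peer runs the same pure function over the same live-peer set, so no coordination is needed to agree on ownership; the leases then fence against stale owners.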

Why

  • Prevent double reconciles & cold-start stampede.
  • Balanced, deterministic ownership with fast failover.

How to try

Follow the README (examples/sharded-namespace) for build/deploy/observe steps.

Changes

  • Manager: HRW + per-cluster Lease fencing, peer/fence prefix split.
  • Controller: per-cluster engagement context, re-engage fix.
  • Source: removable handler, re-register on new ctx.

```go
	c()
}
for _, s := range starts {
	go m.startForCluster(s.ctx, s.name, s.cl)
}
```

@ntnn ntnn Aug 23, 2025


Should this happen in goroutines?

If there is the possibility of runnables blocking so long that this becomes a problem, then they could also cancel the contexts of a newer tick, because they access the .cancel of whatever is currently stored.
Actually, the engaged entry would still have .started=true, so a second tick shouldn't start it again.

Maybe wrap them in a wait group with a timeout, if runnables with long engagement times are a concern?

Owner Author


I think we might be okay because:

  • we make the ownership decision under the lock, set started=true, and capture the specific cancel chosen there; that prevents double-starts and ensures any stop only targets the intended context;
  • starts run in goroutines by design so a slow cluster doesn’t stall the tick; they’re based on the captured ctx, not whatever is current later;
  • on re-engage we swap the engagement token (and cancel the old one), and cleanup only deletes if the token still matches — so a late cancel from the old ctx can’t remove the new engagement;
  • actual work is still gated by the fence lease; even if a delayed start tried to run, it wouldn’t pass TryAcquire.

If we ever see edge cases, we can add a lightweight token check in startForCluster (drop stale starts) or an optional engage timeout, but the current shape should be safe and responsive.
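As a rough sketch of the token-matched cleanup described above (types and names here are hypothetical, not this PR's actual code):

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// engagement pairs a cluster's cancel func with a token. cleanup only
// removes the entry when the token still matches, so a late cancel from a
// stale engagement can't evict a newer one. Hypothetical sketch.
type engagement struct {
	token  uint64
	cancel context.CancelFunc
}

type manager struct {
	mu      sync.Mutex
	nextTok uint64
	engaged map[string]*engagement
}

// engage swaps in a fresh engagement under the lock, cancelling any old
// one, and returns the captured context plus its token.
func (m *manager) engage(name string) (context.Context, uint64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if old, ok := m.engaged[name]; ok {
		old.cancel()
	}
	ctx, cancel := context.WithCancel(context.Background())
	m.nextTok++
	m.engaged[name] = &engagement{token: m.nextTok, cancel: cancel}
	return ctx, m.nextTok
}

// cleanup is a no-op unless tok matches the current engagement.
func (m *manager) cleanup(name string, tok uint64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if e, ok := m.engaged[name]; ok && e.token == tok {
		e.cancel()
		delete(m.engaged, name)
	}
}

func main() {
	m := &manager{engaged: map[string]*engagement{}}
	_, t1 := m.engage("cluster-1")
	_, t2 := m.engage("cluster-1") // re-engage: t1 is now stale
	m.cleanup("cluster-1", t1)     // stale token: must not remove
	_, alive := m.engaged["cluster-1"]
	fmt.Println("new engagement survives stale cleanup:", alive, t1 != t2)
}
```

The key invariant is that both the decision and the capture of cancel/token happen under one lock acquisition, so a delayed goroutine can only ever act on the engagement it was started for.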


@embik embik left a comment


I have some more comments probably later, but these are the first few things I wanted to bring up.

Comment on lines +103 to +104
```go
case err != nil:
	return false
```


Thinking out loud here, should we log errors here? I'm a little uncomfortable completely discarding them, but this would be debug logging at best I suppose.

Owner Author


Errors here are expected (conflicts/transient API issues) and this path is hot, so logging in TryAcquire would be noisy. I’d keep it silent and, if we want visibility, add a low-verbosity, rate-limited log (and/or metric) in the caller when TryAcquire returns false. We already throttle retries in the engine, so a V(2)+ log there would be controlled.

Comment on lines 60 to 62
```go
// PerClusterLease controls whether we create one fence per cluster (true) or a
// single shared fence name (false). Per-cluster is recommended.
PerClusterLease bool
```


I wonder if this makes sense, isn't a single fence basically the same as running a global leader election?

Owner Author


Not sure I'm following. A single fence (PerClusterLease=false) is one lease for the whole cluster fleet; one peer would manage all downstream clusters.


Yes, exactly. Is there a noticeable difference to using the built-in leader election of the manager if it's a "winner takes it all" kind of situation?

Owner Author


Good question—PerClusterLease=false does behave like “winner takes all,” but it’s scoped to the multicluster engine and integrates with our peer registry (IDs/weights) and lease naming, so you can later flip to per-cluster fencing without changing your wiring. Built-in manager leader election is great if you truly want a single active process for everything; the single-fence option is for MC-only exclusivity, peer-aware behavior, and an easy path to sharding later. We default to per-cluster; single-fence remains an advanced, opt-in mode.

Co-authored-by: Marvin Beckers <mail@embik.me>
"sigs.k8s.io/controller-runtime/pkg/client"
)

// leaseGuard manages a single Lease as a *fence* for one shard/cluster.
Copy link


@embik wondering, how does this scale with O(1000) workspaces in the case of kcp? 1000 leases generate a lot of server traffic to renew.

In other words, maybe this should be pluggable? Thinking of doing load-balancing in buckets, for example.
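One shape the bucket idea could take (purely hypothetical naming and prefix, not part of this PR): hash clusters into a fixed number of lease buckets, so renewal traffic scales with the bucket count rather than the cluster count.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucketFor maps a cluster to one of nBuckets fence leases, so O(1000)
// clusters renew only nBuckets leases instead of one lease each. The
// "mcr-shard-bucket-" prefix is hypothetical, chosen here for illustration.
func bucketFor(cluster string, nBuckets uint32) string {
	h := fnv.New32a()
	h.Write([]byte(cluster))
	return fmt.Sprintf("mcr-shard-bucket-%d", h.Sum32()%nBuckets)
}

func main() {
	for _, c := range []string{"ws-1", "ws-2", "ws-999"} {
		fmt.Println(c, "->", bucketFor(c, 16))
	}
}
```

The trade-off is coarser failover granularity: losing a bucket's holder reassigns every cluster in that bucket at once.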



This doesn't have to happen in this MR as long as there is some transition path we can come up with. Is this lease logic now active automatically for every controller?


@mjudeikis mjudeikis left a comment


Just a nit. Will defer to @sttts for the full review.

@zachsmith1 zachsmith1 merged commit 5bd7f9b into main Sep 18, 2025

6 participants