DefaultReconnectPolicy causes thundering herd in large rooms (650+ participants) #1852

## Describe the problem

When a LiveKit server degrades under load in a large room (650+ participants), the `DefaultReconnectPolicy` causes all clients to reconnect simultaneously — a thundering herd that collapses the room instead of allowing recovery.

The root cause is in `DefaultReconnectPolicy.ts`:

```typescript
const DEFAULT_RETRY_DELAYS_IN_MS = [
  0,     // attempt 0: immediate — zero delay
  300,   // attempt 1: 300ms — zero jitter
  1200,  // attempt 2+: jitter added, but only Math.random() * 1_000
  2700, 4800, 7000, 7000, 7000, 7000, 7000,
];

nextRetryDelayInMs(context) {
  if (context.retryCount >= this._retryDelays.length) return null;
  const retryDelay = this._retryDelays[context.retryCount];
  if (context.retryCount <= 1) return retryDelay; // NO jitter on first 2 attempts
  return retryDelay + Math.random() * 1_000;       // only 0-1000ms jitter after that
}
```

**The problems:**

1. **Attempt 0 has 0ms delay** — all clients reconnect instantly and simultaneously
2. **Attempt 1 has 300ms delay with no jitter** — all clients retry again at exactly the same time
3. **Jitter is only 0-1000ms and starts at attempt 2.** For 650 clients, that still averages ~650 reconnects per second (~0.65 per millisecond), which is not a meaningful spread
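
To make these points concrete, here is a rough sketch that replays the current delay rule for a batch of clients and reports how far apart their retries land for each attempt. The delay table and jitter rule are copied from the snippet above; `currentDelayMs`, `spreadForAttempt`, and the injectable `rng` parameter are illustrative names, not SDK APIs. It models only the scheduled delay, not how long each connection attempt takes.

```typescript
// Delay table and jitter rule as quoted above from DefaultReconnectPolicy.ts.
const CURRENT_DELAYS_MS = [0, 300, 1200, 2700, 4800, 7000, 7000, 7000, 7000, 7000];

// Hypothetical helper: the delay one client waits before a given attempt.
function currentDelayMs(retryCount: number, rng: () => number = Math.random): number | null {
  if (retryCount >= CURRENT_DELAYS_MS.length) return null;
  const base = CURRENT_DELAYS_MS[retryCount];
  // No jitter on attempts 0 and 1, fixed 0-1000ms jitter afterwards.
  return retryCount <= 1 ? base : base + rng() * 1_000;
}

// Hypothetical helper: earliest and latest delay across `clients` simulated clients.
function spreadForAttempt(retryCount: number, clients: number, rng: () => number = Math.random) {
  let min = Infinity;
  let max = -Infinity;
  for (let i = 0; i < clients; i++) {
    const delay = currentDelayMs(retryCount, rng);
    if (delay === null) throw new Error(`no retry scheduled for attempt ${retryCount}`);
    min = Math.min(min, delay);
    max = Math.max(max, delay);
  }
  return { min, max, windowMs: max - min };
}

for (const attempt of [0, 1, 2]) {
  const { min, max, windowMs } = spreadForAttempt(attempt, 650);
  console.log(`attempt ${attempt}: ${min.toFixed(0)}-${max.toFixed(0)}ms, window ${windowMs.toFixed(0)}ms`);
}
```

For attempts 0 and 1 the window is always 0ms, regardless of how many clients are simulated. For attempt 2 it can never exceed 1000ms.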

We run self-hosted LiveKit (v1.9.12) with rooms of ~650 participants. We've had **5 production outages in 6 weeks** where this pattern plays out:

1. Server hits internal pressure (e.g., SDP renegotiation backlog, subscription binding timeout — see livekit/livekit#4112)
2. Server starts disconnecting participants
3. **All ~650 clients reconnect at 0ms (attempt 0), so ~650 connection attempts arrive at effectively the same instant**
4. Server buckles under the reconnection load, disconnecting more participants
5. Attempt 1 fires at 300ms — another 650 simultaneous reconnects
6. This positive feedback loop collapses the room from 658 to 6 participants in ~30 seconds
7. Participants that get pushed to the other node via Redis trigger the same cascade there

The reconnection storm is the **amplifier** that turns a degraded-but-recoverable state into a total collapse.

## Describe the proposed solution

Add meaningful jitter starting from the **very first reconnect attempt**. When a server is struggling, spreading reconnections over several seconds instead of firing them all at 0ms gives it time to recover.

Suggested change to `DefaultReconnectPolicy`:

```typescript
const DEFAULT_RETRY_DELAYS_IN_MS = [
  2000,   // attempt 0: 2s base (was 0)
  3000,   // attempt 1: 3s base (was 300)
  5000,   // attempt 2: 5s base (was 1200)
  7000,   // attempt 3+: 7s base (was 2700, then 4800, then 7000)
  7000, 7000, 7000, 7000, 7000, 7000,
];

nextRetryDelayInMs(context) {
  if (context.retryCount >= this._retryDelays.length) return null;
  const retryDelay = this._retryDelays[context.retryCount];
  // Jitter on ALL attempts, proportional to delay (±50%)
  const jitter = retryDelay * (Math.random() - 0.5);
  return Math.max(0, Math.round(retryDelay + jitter));
}
```

This would spread 650 clients' first reconnect attempt over a **2-second window** (1-3 seconds after the disconnect, averaging ~325 clients/sec) instead of all at 0ms, and subsequent retries over proportionally wider windows.
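
The window arithmetic can be checked with a short sketch. With ±50% proportional jitter, each delay falls uniformly between half and one and a half times the base, so the window width equals the base delay. The helper names `jitterWindow` and `averageClientsPerSecond` are illustrative, and the calculation assumes all clients schedule the same attempt at the same moment.

```typescript
// Base delays from the suggested change above.
const PROPOSED_DELAYS_MS = [2000, 3000, 5000, 7000, 7000, 7000, 7000, 7000, 7000, 7000];

// ±50% proportional jitter: delays fall uniformly in [0.5 * base, 1.5 * base).
function jitterWindow(baseMs: number) {
  return { earliestMs: 0.5 * baseMs, latestMs: 1.5 * baseMs, widthMs: baseMs };
}

// Average arrival rate when `clients` retries are spread over one window.
function averageClientsPerSecond(clients: number, baseMs: number): number {
  return clients / (jitterWindow(baseMs).widthMs / 1000);
}

PROPOSED_DELAYS_MS.slice(0, 4).forEach((baseMs, attempt) => {
  const w = jitterWindow(baseMs);
  const rate = averageClientsPerSecond(650, baseMs);
  console.log(
    `attempt ${attempt}: ${w.earliestMs}-${w.latestMs}ms, ~${rate.toFixed(0)} clients/sec on average`,
  );
});
```

For the 2s base this gives a 1000-3000ms window and an average of 325 clients per second. The later attempts spread the same 650 clients over 3, 5, and 7 seconds.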

The exact values are less important than the principles:
- **Every attempt should have jitter**, including the first
- **Jitter should scale with the delay**, not be a fixed 0-1000ms
- **The first attempt should not be 0ms** — even 1-3 seconds of spread prevents the thundering herd

## Alternatives considered

**1. Custom `reconnectPolicy` in application code (our current workaround)**

We can pass a custom `reconnectPolicy` in `RoomOptions` to add jitter ourselves. This works, but:
- The unsafe defaults still affect every other LiveKit deployment
- Most self-hosters won't know they need this until they have their first large-room outage
- The SDK should be safe by default at scale
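
For anyone needing the workaround in the meantime, a minimal sketch of such a policy might look like the following. It is not our production code. The two `...Like` interfaces are local stand-ins that mirror only the `nextRetryDelayInMs(context)` / `context.retryCount` shape shown in the snippets above (in an application the real types would come from `livekit-client`). The class name `JitteredBackoffPolicy` and every numeric default are hypothetical.

```typescript
// Local stand-ins for the SDK's policy types, reduced to what the snippets above use.
interface ReconnectContextLike {
  retryCount: number;
}

interface ReconnectPolicyLike {
  nextRetryDelayInMs(context: ReconnectContextLike): number | null;
}

interface JitteredBackoffOptions {
  baseMs?: number;      // delay before the first attempt, pre-jitter
  maxMs?: number;       // cap on the pre-jitter delay
  maxRetries?: number;  // attempts before giving up
  rng?: () => number;   // injectable for testing; defaults to Math.random
}

// Hypothetical policy: exponential backoff with ±50% jitter on every attempt.
class JitteredBackoffPolicy implements ReconnectPolicyLike {
  private readonly baseMs: number;
  private readonly maxMs: number;
  private readonly maxRetries: number;
  private readonly rng: () => number;

  constructor(options: JitteredBackoffOptions = {}) {
    this.baseMs = options.baseMs ?? 2_000;
    this.maxMs = options.maxMs ?? 7_000;
    this.maxRetries = options.maxRetries ?? 10;
    this.rng = options.rng ?? Math.random;
  }

  nextRetryDelayInMs(context: ReconnectContextLike): number | null {
    if (context.retryCount >= this.maxRetries) return null; // stop retrying
    const backoff = Math.min(this.maxMs, this.baseMs * Math.pow(2, context.retryCount));
    const jitter = backoff * (this.rng() - 0.5); // within [-50%, +50%) of the backoff
    return Math.max(0, Math.round(backoff + jitter));
  }
}

// Intended wiring, assuming the `reconnectPolicy` field of `RoomOptions` named above:
//   const room = new Room({ reconnectPolicy: new JitteredBackoffPolicy() });
```

With these defaults the pre-jitter delays are 2s, 4s, then 7s for every later attempt, so even the first retry is spread across a 2-second window.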

**2. Server-side reconnection backoff (livekit/livekit)**

The server could stagger disconnect signals or add backoff to `could not restart participant` rejections. This would help but doesn't address the client-side thundering herd from network-level disconnects (where the server isn't choosing to disconnect clients).

**3. Larger fixed jitter (e.g., `Math.random() * 10_000`)**

Simpler but less elegant — a fixed 0-10s jitter would work for large rooms but adds unnecessary latency for small rooms or transient network blips where instant reconnection is appropriate.
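
The latency cost of this alternative is easy to quantify. The sketch below assumes the fixed jitter is simply added to the current base delays. `fixedJitterDelayMs` and `expectedFixedJitterDelayMs` are illustrative names, and the 10-second window is the example value from the heading above.

```typescript
// Current base delays for the first six attempts, as quoted earlier.
const BASE_DELAYS_MS = [0, 300, 1200, 2700, 4800, 7000];

// Alternative 3: the same fixed jitter window added to every attempt.
function fixedJitterDelayMs(
  baseMs: number,
  rng: () => number = Math.random,
  windowMs: number = 10_000,
): number {
  return baseMs + rng() * windowMs;
}

// Mean of a uniform jitter in [0, windowMs) is half the window.
function expectedFixedJitterDelayMs(baseMs: number, windowMs: number = 10_000): number {
  return baseMs + windowMs / 2;
}

BASE_DELAYS_MS.forEach((baseMs, attempt) => {
  console.log(
    `attempt ${attempt}: base ${baseMs}ms, expected ${expectedFixedJitterDelayMs(baseMs)}ms with fixed jitter`,
  );
});
```

Every attempt, including the first, waits 5 seconds longer on average regardless of room size, which is the cost that proportional jitter avoids for small rooms.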

## Importance

serious, but I can work around it

## Additional Information

- **SDK version**: livekit-client 2.17.3
- **Server version**: LiveKit 1.9.12 (self-hosted, bare metal, 2x 128-core nodes)
- **Room size**: 500-900 participants
- **Related server issues**: livekit/livekit#4112 (subscription binding timeout), livekit/livekit#3475 (DataChannel abort cascade)
- **Observed failure**: 658 → 6 participants in ~30 seconds across 5 separate incidents (Feb-Mar 2026)
- **Our workaround**: Custom `reconnectPolicy` with exponential backoff + proportional jitter from first attempt
