DefaultReconnectPolicy causes thundering herd in large rooms (650+ participants) #1852

@ray-amjad

Describe the problem

When a LiveKit server degrades under load in a large room (650+ participants), the DefaultReconnectPolicy causes all clients to reconnect simultaneously — a thundering herd that collapses the room instead of allowing recovery.

The root cause is in DefaultReconnectPolicy.ts:

const DEFAULT_RETRY_DELAYS_IN_MS = [
  0,     // attempt 0: immediate — zero delay
  300,   // attempt 1: 300ms — zero jitter
  1200,  // attempt 2+: jitter added, but only Math.random() * 1_000
  2700, 4800, 7000, 7000, 7000, 7000, 7000,
];

nextRetryDelayInMs(context) {
  if (context.retryCount >= this._retryDelays.length) return null;
  const retryDelay = this._retryDelays[context.retryCount];
  if (context.retryCount <= 1) return retryDelay; // NO jitter on first 2 attempts
  return retryDelay + Math.random() * 1_000;       // only 0-1000ms jitter after that
}

The problems:

  1. Attempt 0 has 0ms delay — all clients reconnect instantly and simultaneously
  2. Attempt 1 has 300ms delay with no jitter — all clients retry again at exactly the same time
  3. From attempt 2 onward, jitter is only 0-1000ms: for 650 clients that still averages ~0.65 connection attempts per millisecond (~650 per second), which is not a meaningful spread
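
To make the three points above concrete, here is a minimal sketch that reuses the delay table and jitter rule from the excerpt and measures how wide the scheduling window is when 650 simulated clients compute the same attempt. The helper names (`currentDelayMs`, `spreadMs`) are hypothetical and not part of the SDK, and the sketch assumes all clients are disconnected at the same instant.

```typescript
// Delay table and jitter rule copied from the DefaultReconnectPolicy excerpt above.
const CURRENT_RETRY_DELAYS_MS = [0, 300, 1200, 2700, 4800, 7000, 7000, 7000, 7000, 7000];

// Hypothetical standalone version of nextRetryDelayInMs; `rand` is injectable for testing.
function currentDelayMs(retryCount: number, rand: () => number = Math.random): number | null {
  if (retryCount >= CURRENT_RETRY_DELAYS_MS.length) return null;
  const base = CURRENT_RETRY_DELAYS_MS[retryCount];
  if (retryCount <= 1) return base; // no jitter on the first two attempts
  return base + rand() * 1_000;     // fixed 0-1000ms jitter afterwards
}

// Hypothetical helper: width (ms) of the window in which `clients` independent
// clients schedule the given attempt.
function spreadMs(retryCount: number, clients: number): number {
  const delays: number[] = [];
  for (let i = 0; i < clients; i++) {
    const d = currentDelayMs(retryCount);
    if (d !== null) delays.push(d);
  }
  return Math.max(...delays) - Math.min(...delays);
}

for (const attempt of [0, 1, 2]) {
  console.log(`attempt ${attempt}: spread = ${spreadMs(attempt, 650).toFixed(0)} ms`);
}
```

Attempts 0 and 1 always report a spread of 0 ms because every client computes the same delay. Attempt 2 can never exceed 1000 ms regardless of room size.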

We run self-hosted LiveKit (v1.9.12) with rooms of ~650 participants. We've had 5 production outages in 6 weeks where this pattern plays out:

  1. Server hits internal pressure (e.g., SDP renegotiation backlog, subscription binding timeout; see livekit#4112, "reconnecting when others pull track in the room")
  2. Server starts disconnecting participants
  3. All ~650 clients reconnect at 0ms (attempt 0): ~650 connection attempts arriving at essentially the same instant
  4. Server buckles under the reconnection load, disconnecting more participants
  5. Attempt 1 fires at 300ms — another 650 simultaneous reconnects
  6. Positive feedback loop: the room collapses from 658 to 6 participants in ~30 seconds
  7. Participants that get pushed to the other node via Redis trigger the same cascade there

The reconnection storm is the amplifier that turns a degraded-but-recoverable state into a total collapse.

Describe the proposed solution

Add meaningful jitter starting from the very first reconnect attempt. When a server is struggling, spreading reconnections over 5-15 seconds instead of 0ms gives it time to recover.

Suggested change to DefaultReconnectPolicy:

const DEFAULT_RETRY_DELAYS_IN_MS = [
  2000,   // attempt 0: 2s base (was 0)
  3000,   // attempt 1: 3s base (was 300)
  5000,   // attempt 2: 5s base (was 1200)
  7000,   // attempt 3+: 7s base, the current maximum (attempt 3 was 2700, attempt 4 was 4800)
  7000, 7000, 7000, 7000, 7000, 7000,
];

nextRetryDelayInMs(context) {
  if (context.retryCount >= this._retryDelays.length) return null;
  const retryDelay = this._retryDelays[context.retryCount];
  // Jitter on ALL attempts, proportional to delay (±50%)
  const jitter = retryDelay * (Math.random() - 0.5);
  return Math.max(0, Math.round(retryDelay + jitter));
}

This would spread 650 clients' first reconnect attempt over a 2-second window, between 1s and 3s after the disconnect (~325 clients/sec on average), instead of all at 0ms, and subsequent retries over proportionally wider windows.
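
As a rough check on those windows, the sketch below works out the bounds of the ±50% jitter for each base delay in the suggested table and the resulting average arrival rate. The helper names are hypothetical. It assumes all clients begin the same attempt at the same instant, which only really holds for attempt 0; later attempts also depend on how long each failed attempt took.

```typescript
// Base delays from the suggested DEFAULT_RETRY_DELAYS_IN_MS above.
const PROPOSED_RETRY_DELAYS_MS = [2000, 3000, 5000, 7000, 7000, 7000, 7000, 7000, 7000, 7000];

interface JitterWindow {
  minMs: number;   // earliest possible delay
  maxMs: number;   // upper bound of the delay (exclusive, before rounding)
  widthMs: number; // maxMs - minMs
}

// base + base * (Math.random() - 0.5) is uniform over [0.5 * base, 1.5 * base),
// so the window is exactly `base` milliseconds wide.
function jitterWindow(retryCount: number): JitterWindow {
  const base = PROPOSED_RETRY_DELAYS_MS[retryCount];
  return { minMs: base * 0.5, maxMs: base * 1.5, widthMs: base };
}

// Average arrival rate if `clients` clients all start this attempt together.
function averageClientsPerSecond(retryCount: number, clients: number): number {
  return clients / (jitterWindow(retryCount).widthMs / 1000);
}

for (const attempt of [0, 1, 2, 3]) {
  const w = jitterWindow(attempt);
  const rate = averageClientsPerSecond(attempt, 650).toFixed(0);
  console.log(`attempt ${attempt}: ${w.minMs}-${w.maxMs} ms, ~${rate} clients/sec`);
}
```

For attempt 0 the window runs from 1000 ms to 3000 ms, so 650 clients arrive at an average of 650 / 2 s = 325 clients/sec. By attempt 3 the window is 7 seconds wide.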

The exact values are less important than the principles:

  • Every attempt should have jitter, including the first
  • Jitter should scale with the delay, not be a fixed 0-1000ms
  • The first attempt should not be 0ms — even 1-3 seconds of spread prevents the thundering herd

Alternatives considered

1. Custom reconnectPolicy in application code (our current workaround)

We can pass a custom reconnectPolicy in RoomOptions to add jitter ourselves. This works, but:

  • The unsafe defaults still affect every other LiveKit deployment
  • Most self-hosters won't know they need this until they have their first large-room outage
  • The SDK should be safe by default at scale
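
For reference, a workaround along these lines might look like the sketch below. It is not our exact production code. The `ReconnectContextLike` and `ReconnectPolicyLike` interfaces are local stand-ins that assume the same shape as the excerpt above (a `retryCount` on the context, `null` meaning "stop retrying"), and `JitteredReconnectPolicy` is a hypothetical name.

```typescript
// Local stand-ins so the sketch compiles on its own. Assumption: the SDK's
// policy interface has this shape, as suggested by the excerpt above.
interface ReconnectContextLike {
  retryCount: number;
}
interface ReconnectPolicyLike {
  nextRetryDelayInMs(context: ReconnectContextLike): number | null;
}

const JITTERED_BASE_DELAYS_MS: readonly number[] = [
  2000, 3000, 5000, 7000, 7000, 7000, 7000, 7000, 7000, 7000,
];

// Hypothetical custom policy: proportional (±50%) jitter on every attempt.
class JitteredReconnectPolicy implements ReconnectPolicyLike {
  private readonly baseDelaysMs: readonly number[];
  private readonly rand: () => number;

  constructor(
    baseDelaysMs: readonly number[] = JITTERED_BASE_DELAYS_MS,
    rand: () => number = Math.random, // injectable so the policy can be tested
  ) {
    this.baseDelaysMs = baseDelaysMs;
    this.rand = rand;
  }

  nextRetryDelayInMs(context: ReconnectContextLike): number | null {
    if (context.retryCount >= this.baseDelaysMs.length) return null; // give up
    const base = this.baseDelaysMs[context.retryCount];
    const jitter = base * (this.rand() - 0.5);
    return Math.max(0, Math.round(base + jitter));
  }
}

// Assumed wiring, based on the reconnectPolicy option in RoomOptions:
//   const room = new Room({ reconnectPolicy: new JitteredReconnectPolicy() });
```

Injecting the random source keeps the policy deterministic under test. In application code the default `Math.random` is used.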

2. Server-side reconnection backoff (livekit/livekit)

The server could stagger disconnect signals or add backoff to "could not restart participant" rejections. This would help but doesn't address the client-side thundering herd from network-level disconnects (where the server isn't choosing to disconnect clients).

3. Larger fixed jitter (e.g., Math.random() * 10_000)

Simpler but less elegant — a fixed 0-10s jitter would work for large rooms but adds unnecessary latency for small rooms or transient network blips where instant reconnection is appropriate.

Importance

serious, but I can work around it
