@rokk4 rokk4 commented Dec 2, 2025

No description provided.

rokk4 added 3 commits December 2, 2025 18:32
Fixes a race condition in which tracks that arrived before participant
metadata were permanently dropped from the pending queue after the timeout,
causing 10-60 second delays or complete failures when participants rejoin.

Changes:
1. Retry transient failures: Modified _flushPendingTracks() to differentiate
   between transient (notTrackMetadataFound) and permanent failures. Transient
   failures now keep tracks in queue for retry instead of removing them.

2. Additional flush trigger: Added listener to flush pending tracks when
   SignalParticipantUpdateEvent contains track publications, ensuring tracks
   are subscribed once metadata becomes available.

3. Improved logging: Transient failures logged at fine level to reduce noise,
   permanent failures at severe level for visibility.
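
The retry behaviour in change 1 can be sketched roughly as follows. This is a language-agnostic illustration in TypeScript, not the SDK's Dart code: `PendingTrackQueue`, `SubscribeResult`, and `trySubscribe` are hypothetical names, and dropping entries once the timeout has elapsed is an assumption about how the configured timeout interacts with the retry.

```typescript
// Hypothetical sketch; only the failure name notTrackMetadataFound comes from the commit message.
type SubscribeResult = "ok" | "notTrackMetadataFound" | "unsupportedTrackType";

interface PendingTrack {
  trackSid: string;
  queuedAtMs: number;
}

class PendingTrackQueue {
  private pending: PendingTrack[] = [];
  private readonly timeoutMs: number;

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  enqueue(trackSid: string, nowMs: number): void {
    this.pending.push({ trackSid, queuedAtMs: nowMs });
  }

  get size(): number {
    return this.pending.length;
  }

  // Attempts each queued track once and returns the sids that were subscribed.
  flush(trySubscribe: (sid: string) => SubscribeResult, nowMs: number): string[] {
    const subscribed: string[] = [];
    this.pending = this.pending.filter((entry) => {
      // Assumption: entries older than the timeout are dropped without another attempt.
      if (nowMs - entry.queuedAtMs > this.timeoutMs) return false;
      const result = trySubscribe(entry.trackSid);
      if (result === "ok") {
        subscribed.push(entry.trackSid);
        return false; // done, remove from queue
      }
      // Transient failure: keep for the next flush. Anything else is permanent: remove.
      return result === "notTrackMetadataFound";
    });
    return subscribed;
  }
}

// Usage: the first flush runs before metadata exists, the second after it arrives.
const queue = new PendingTrackQueue(10000);
const known = new Set<string>();
const trySubscribe = (sid: string): SubscribeResult =>
  known.has(sid) ? "ok" : "notTrackMetadataFound";

queue.enqueue("TR_video", 0);
const first = queue.flush(trySubscribe, 100); // nothing subscribed, track stays queued
known.add("TR_video");
const second = queue.flush(trySubscribe, 200); // track subscribed and removed
```

The point of the sketch is that a transient result leaves the entry in place, so a later flush can succeed without the track having to be delivered again.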

The fix maintains the existing timeout configuration from connectOptions
while enabling retry logic that resolves the race condition where:
- WebRTC track arrives first → queued
- ParticipantInfo arrives → participant created → flush fails (no publications)
- TrackPublishedResponse arrives later → second flush succeeds

This reduces track subscription latency after rejoin from 10-60s to <1s
and improves reliability on slower devices where the race condition was
more pronounced.

Related: livekit#928
… logic

Combines preventive and reactive approaches to fix a race condition in which
tracks arriving before participant metadata caused 10-60s delays or failures
on rejoin.

Root Cause:
When a participant rejoins, WebRTC tracks can arrive before signaling metadata.
The previous logic had three critical gaps:
1. Tracks queued but dropped on timeout (no retry)
2. Missing flush triggers when metadata finally arrives
3. Insufficient deferral check (only participant existence, not publication)

Solution - Three-Layer Defense:

1. PREVENTIVE: Enhanced deferral logic (NEW)
   Check not just participant existence, but also publication metadata:
   - connectionState != connected (pre-connection tracks)
   - participant == null (tracks before participant)
   - publication == null (tracks before metadata) ← NEW CHECK

   This prevents premature subscription attempts that would time out.

2. REACTIVE: Retry transient failures
   Modified _flushPendingTracks() to differentiate failure types:
   - notTrackMetadataFound → return false (keep in queue, retry)
   - Other failures → return true (remove from queue)

   Handles micro-timing races where the flush runs before the metadata has
   been processed.

3. AGGRESSIVE: Additional flush trigger
   Added SignalParticipantUpdateEvent listener to flush when track
   publications arrive, ensuring queued tracks are processed promptly.
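
The three deferral conditions from layer 1 can be expressed as a single predicate. The sketch below is illustrative TypeScript, not the SDK's Dart code; `RoomSnapshot`, `ParticipantSnapshot`, and `deferReason` are hypothetical names chosen for this example.

```typescript
// Hypothetical sketch of the deferral check; types are assumptions, not SDK types.
type ConnectionState = "connecting" | "connected" | "reconnecting" | "disconnected";

interface ParticipantSnapshot {
  // Track sids for which publication metadata has already been received.
  publishedTrackSids: Set<string>;
}

interface RoomSnapshot {
  connectionState: ConnectionState;
  participants: Map<string, ParticipantSnapshot>;
}

type DeferReason = "notConnected" | "noParticipant" | "noPublication";

// Returns why an incoming track should be queued, or null if it can be subscribed now.
function deferReason(
  room: RoomSnapshot,
  participantSid: string,
  trackSid: string
): DeferReason | null {
  if (room.connectionState !== "connected") return "notConnected";
  const participant = room.participants.get(participantSid);
  if (participant === undefined) return "noParticipant";
  // The additional check: the participant exists but the publication is not known yet.
  if (!participant.publishedTrackSids.has(trackSid)) return "noPublication";
  return null;
}

// Usage: a participant is known, but its track metadata has not arrived yet.
const room: RoomSnapshot = {
  connectionState: "connected",
  participants: new Map([["PA_1", { publishedTrackSids: new Set<string>() }]]),
};
const reason = deferReason(room, "PA_1", "TR_video"); // "noPublication"
```

Returning a reason rather than a boolean is a design choice for the sketch only; it makes it easy to log which condition caused a track to be queued.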

Impact:
- Reduces rejoin latency from 10-60s to <1s
- Eliminates frozen frames on rejoin
- More robust on slower devices (reduced CPU-dependent timing sensitivity)
- Maintains configurable timeout from connectOptions

The combined approach is superior because:
- Prevention reduces unnecessary timeout waits
- Retry ensures recovery from edge cases
- Aggressive flush ensures timely processing
- Event-driven design scales better than polling
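
A minimal sketch of the event-driven flush trigger, assuming a simple listener registry. It is written in TypeScript for illustration and does not reproduce the SDK's Dart event API; `SignalEvents`, `ParticipantUpdate`, and `flushOnPublications` are hypothetical names.

```typescript
// Hypothetical sketch: flush the pending queue only when an update carries publications.
interface ParticipantUpdate {
  participantSid: string;
  trackSids: string[]; // publications carried by this update (may be empty)
}

type UpdateListener = (update: ParticipantUpdate) => void;

class SignalEvents {
  private listeners: UpdateListener[] = [];

  onParticipantUpdate(listener: UpdateListener): void {
    this.listeners.push(listener);
  }

  emit(update: ParticipantUpdate): void {
    for (const listener of this.listeners) listener(update);
  }
}

// Registers a listener that calls flush whenever an update contains track publications.
function flushOnPublications(events: SignalEvents, flush: () => void): void {
  events.onParticipantUpdate((update) => {
    if (update.trackSids.length > 0) flush();
  });
}

// Usage: only the second update, which carries a publication, triggers a flush.
const events = new SignalEvents();
let flushCount = 0;
flushOnPublications(events, () => {
  flushCount += 1;
});
events.emit({ participantSid: "PA_1", trackSids: [] });
events.emit({ participantSid: "PA_1", trackSids: ["TR_video"] });
```

Because the flush is tied to the arrival of metadata, no timer or polling loop is needed to notice that queued tracks have become subscribable.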

Related: livekit#928
@CLAassistant
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@rokk4 rokk4 marked this pull request as draft December 2, 2025 22:15
@rokk4 rokk4 changed the title from "Rokk4/rejoin track freeze fix" to "WIP: rejoin track freeze fix" Dec 2, 2025