Conversation

@cserspring cserspring commented Jan 22, 2026

Issue description:

We recently hit an issue in our usage of the parallel consumer library. During one period, the size of the consumer group fluctuated repeatedly (12 → 13 → 12 → 13 → 12).
From the logs, the consumer received and completed the record at offset 12650437055. There are no “Received/Completed” entries for 12650437056, yet records at 12650437057, 12650437058, 12650437059, … were all received and completed. During this period, the committed offset for the partition remained at 12650437056 (per metrics), which prevented commits from advancing and created lag on the partition.
After ~14 minutes, we restarted one pod. Immediately afterward, the consumer sought to the committed offset 12650437056, then received and completed that record, and subsequent commits advanced as normal.

This happens rarely in the production environment; it has occurred roughly once a month.

After checking the code, we believe there may be a race condition when adding work containers.

Let me share a linear timeline for this potential race condition; a minimal code sketch of the same interleaving follows the timeline. Please let me know if it makes sense.

Time    Thread              Action
-----   ----------          ----------

T1      Broker Poll         poll() returns batch [12650437055 (K_A), 12650437056 (K_B), ...]
T2      Broker Poll         Creates EpochAndRecordsMap(epoch=N), adds to mailbox

T3      Control             Drains mailbox, starts processing batch (epoch=N)
T4      Control             Epoch check: N == N ✓ (passes!)
T5      Control               └── addWorkContainer(12650437055, K_A, epoch=N)
                                  Shard K_A CREATED, entry added

        ------------ REBALANCE HAPPENS HERE (on Broker Poll thread) -------------

T6      Broker Poll         onPartitionsRevoked()
T7      Broker Poll           └── epoch: N → N+1
T8      Broker Poll           └── removeStaleContainers()
                                  Shard K_A has stale entry (epoch N) → REMOVED ✓
                                  Shard K_B does not exist yet → nothing to remove

T9      Broker Poll         onPartitionsAssigned()
T10     Broker Poll           └── epoch: N+1 → N+2
T11     Broker Poll           └── removeStaleContainers()
                                  Shard K_A is now empty
                                  Shard K_B still does not exist

        ------------- CONTROL THREAD CONTINUES (still processing old batch) ------------

T12     Control               └── addWorkContainer(12650437056, K_B, epoch=N)
                                  Shard K_B CREATED, stale entry added
                                  (This happens AFTER removeStaleContainers!)

T13     Broker Poll         poll() returns new batch [12650437055, 12650437056, 12650437057, ...]
T14     Broker Poll         Creates EpochAndRecordsMap(epoch=N+2), adds to mailbox

T15     Control             Finishes old batch
T16     Control             Drains mailbox, processes new batch (epoch=N+2)
T17     Control             Epoch check: N+2 == N+2 ✓

T18     Control               └── addWorkContainer(12650437055, K_A, epoch=N+2)
                                  Shard K_A is EMPTY (stale removed at T8) → ADDED ✓

T19     Control               └── addWorkContainer(12650437056, K_B, epoch=N+2)
                                  Shard K_B has stale entry from T12 → DROPPED! ✗

T20     Control               └── addWorkContainer(12650437057, K_C, epoch=N+2)
                                  Shard K_C never existed → ADDED ✓
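
To make the interleaving easier to reason about, here is a minimal, single-threaded sketch that replays the steps above in order. The names (ShardManager, addWorkContainer, removeStaleContainers, the offset-to-epoch shard map) and the drop condition are simplified stand-ins for what I suspect is happening, not the actual parallel-consumer internals:

```java
import java.util.HashMap;
import java.util.Map;

// Single-threaded replay of the suspected interleaving above.
// ShardManager, addWorkContainer and removeStaleContainers are simplified
// stand-ins for illustration only, not the real parallel-consumer classes.
public class AddWorkContainerRaceSketch {

    static class ShardManager {
        // shard key -> (offset -> epoch recorded on the work container)
        final Map<String, Map<Long, Integer>> shards = new HashMap<>();
        int currentEpoch = 0;

        // Suspected behaviour: a record is dropped as a "duplicate" if its shard
        // already holds an entry for that offset, regardless of that entry's epoch.
        void addWorkContainer(String key, long offset, int recordEpoch) {
            Map<Long, Integer> shard = shards.computeIfAbsent(key, k -> new HashMap<>());
            if (shard.containsKey(offset)) {
                System.out.printf("DROPPED offset %d (key %s, epoch %d) - shard already has an entry%n",
                        offset, key, recordEpoch);
                return;
            }
            shard.put(offset, recordEpoch);
            System.out.printf("ADDED   offset %d (key %s, epoch %d)%n", offset, key, recordEpoch);
        }

        // Purges entries created under an older epoch (runs on the broker poll
        // thread during onPartitionsRevoked / onPartitionsAssigned).
        void removeStaleContainers() {
            shards.values().forEach(shard -> shard.values().removeIf(epoch -> epoch < currentEpoch));
        }
    }

    public static void main(String[] args) {
        ShardManager sm = new ShardManager();
        final int epochN = sm.currentEpoch;

        // T5 (Control): first record of the old batch is added under epoch N.
        sm.addWorkContainer("K_A", 12650437055L, epochN);

        // T6-T11 (Broker Poll): rebalance bumps the epoch and purges stale entries.
        // Shard K_A is cleaned, but shard K_B does not exist yet, so nothing to purge there.
        sm.currentEpoch = epochN + 2;
        sm.removeStaleContainers();

        // T12 (Control): still draining the old batch - shard K_B now gets a stale
        // entry AFTER the purge has already run.
        sm.addWorkContainer("K_B", 12650437056L, epochN);

        // T18-T19 (Control): the re-polled batch (epoch N+2) passes the epoch check.
        sm.addWorkContainer("K_A", 12650437055L, epochN + 2); // ADDED: K_A was purged at T8
        sm.addWorkContainer("K_B", 12650437056L, epochN + 2); // DROPPED: stale T12 entry blocks it
    }
}
```

Running this prints ADDED for 12650437055 and DROPPED for 12650437056 under epoch N+2, which matches the committed offset getting stuck at 12650437056 until the restart.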

@rkolesnev @johnbyrnejb @sangreal @eddyv could you please review this PR?

Checklist

  • Documentation (if applicable)
  • Changelog

@cserspring cserspring requested a review from a team as a code owner January 22, 2026 23:09
@cserspring cserspring force-pushed the usr/cserspring/addWorkContainer_fix branch from 78813c2 to 57596a5 on January 22, 2026 23:48