Replace the existing stale container when adding work container #909
Issue description:
We recently encountered an issue in our usage of the parallel-consumer library. During one period, the size of the consumer group fluctuated repeatedly (12 → 13 → 12 → 13 → 12).
From the logs, the consumer received and completed the record at offset 12650437055. There are no “Received/Completed” entries for 12650437056, yet records at 12650437057, 12650437058, 12650437059, … were all received and completed. During this period, the committed offset for the partition remained at 12650437056 (per metrics), which prevented commits from advancing and created lag on the partition.
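To illustrate why a single missing record pins the committed offset, here is a minimal sketch under the assumption that the commit can only advance past contiguously completed offsets; the method name `nextOffsetToCommit` is illustrative, not the real library API.

```java
import java.util.Set;
import java.util.TreeSet;

public class CommitGapSketch {

    // Returns the next offset to commit: the lowest offset at or above
    // startOffset that has NOT yet completed.
    static long nextOffsetToCommit(long startOffset, Set<Long> completed) {
        long next = startOffset;
        while (completed.contains(next)) {
            next++;
        }
        return next;
    }

    public static void main(String[] args) {
        Set<Long> completed = new TreeSet<>(Set.of(
                12650437055L,                     // received and completed
                // 12650437056L is missing: it was never added as work
                12650437057L, 12650437058L, 12650437059L));

        // The commit stays pinned at 12650437056 no matter how many later
        // offsets complete, which matches the lag observed on the partition.
        System.out.println(nextOffsetToCommit(12650437055L, completed)); // 12650437056
    }
}
```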
After ~14 minutes, we restarted one pod. Immediately afterward, the consumer sought to the committed offset 12650437056, then received and completed that record, and subsequent commits advanced as normal.
This happens rarely in the production environment: it has occurred roughly once a month.
After reviewing the code, we suspect there is a race condition when adding a work container.
Below is a linear timeline of the potential race. Please let me know if it makes sense.
| Time | Thread | Action |
| --- | --- | --- |
| T1 | Broker Poll | `poll()` returns batch `[12650437055 (K_A), 12650437056 (K_B), ...]` |
| T2 | Broker Poll | Creates `EpochAndRecordsMap(epoch=N)`, adds to mailbox |
| T3 | Control | Drains mailbox, starts processing batch (epoch=N) |
| T4 | Control | Epoch check: N == N ✓ (passes!) |
| T5 | Control | └── `addWorkContainer(12650437055, K_A, epoch=N)`: shard K_A CREATED, entry added |
| | | **REBALANCE HAPPENS HERE (on the Broker Poll thread)** |
| T6 | Broker Poll | `onPartitionsRevoked()` |
| T7 | Broker Poll | └── epoch: N → N+1 |
| T8 | Broker Poll | └── `removeStaleContainers()`: shard K_A has a stale entry (epoch N) → REMOVED ✓; shard K_B does not exist yet → nothing to remove |
| T9 | Broker Poll | `onPartitionsAssigned()` |
| T10 | Broker Poll | └── epoch: N+1 → N+2 |
| T11 | Broker Poll | └── `removeStaleContainers()`: shard K_A is now empty; shard K_B still does not exist |
| | | **CONTROL THREAD CONTINUES (still processing the old batch)** |
| T12 | Control | └── `addWorkContainer(12650437056, K_B, epoch=N)`: shard K_B CREATED, stale entry added (this happens AFTER `removeStaleContainers()`!) |
| T13 | Broker Poll | `poll()` returns new batch `[12650437055, 12650437056, 12650437057, ...]` |
| T14 | Broker Poll | Creates `EpochAndRecordsMap(epoch=N+2)`, adds to mailbox |
| T15 | Control | Finishes the old batch |
| T16 | Control | Drains mailbox, processes the new batch (epoch=N+2) |
| T17 | Control | Epoch check: N+2 == N+2 ✓ |
| T18 | Control | └── `addWorkContainer(12650437055, K_A, epoch=N+2)`: shard K_A is EMPTY (stale entry removed at T8) → ADDED ✓ |
| T19 | Control | └── `addWorkContainer(12650437056, K_B, epoch=N+2)`: shard K_B has the stale entry from T12 → DROPPED! ✗ |
| T20 | Control | └── `addWorkContainer(12650437057, K_C, epoch=N+2)`: shard K_C never existed → ADDED ✓ |

@rkolesnev @johnbyrnejb @sangreal @eddyv could you please review this PR?
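For reference, a minimal, self-contained sketch of the suspected race and of the replace-on-stale direction this PR takes. The class and method names (`ShardSketch`, `addWorkContainerCurrent`, `addWorkContainerProposed`, `onRebalance`) are simplified stand-ins and do not match the real parallel-consumer internals one-to-one.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class ShardSketch {

    // A work container remembers the epoch of the poll that produced it.
    record WorkContainer(long offset, int epoch) {}

    // One shard per key, with entries keyed by offset (simplified).
    private final Map<String, Map<Long, WorkContainer>> shards = new ConcurrentHashMap<>();
    private final AtomicInteger currentEpoch = new AtomicInteger();

    // Broker-poll thread (T6-T11): bump the epoch and drop entries created
    // under older epochs. A shard whose key has not been seen yet (K_B at
    // T8/T11) simply is not there to be cleaned.
    void onRebalance() {
        int newEpoch = currentEpoch.incrementAndGet();
        shards.values().forEach(shard ->
                shard.values().removeIf(wc -> wc.epoch() < newEpoch));
    }

    // Control thread, current behaviour as we read it: if an entry for the
    // offset already exists, the new container is dropped, even when the
    // existing entry is a stale leftover from T12. This is the T19 case.
    void addWorkContainerCurrent(String key, long offset, int epoch) {
        shards.computeIfAbsent(key, k -> new ConcurrentHashMap<>())
              .putIfAbsent(offset, new WorkContainer(offset, epoch));
    }

    // Proposed behaviour: replace the existing entry when it belongs to an
    // older epoch, so the fresh-epoch record is not silently lost.
    void addWorkContainerProposed(String key, long offset, int epoch) {
        shards.computeIfAbsent(key, k -> new ConcurrentHashMap<>())
              .merge(offset, new WorkContainer(offset, epoch),
                      (existing, fresh) -> existing.epoch() < fresh.epoch() ? fresh : existing);
    }
}
```

With the replace-on-stale behaviour, the add at T19 would overwrite the epoch=N leftover instead of being dropped, so offset 12650437056 would be processed and the committed offset could advance without a restart.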
Checklist