Description
There are two related correctness issues in the continuous batching result consumption logic that can lead to unfairness and non-terminating iterators under concurrent workloads.
1. Starvation and incorrect timeout handling in get_result
ContinuousBatchingManager.get_result currently retrieves a single item from the shared output queue and immediately re-queues it if the request_id does not match, returning None afterward. Under concurrent requests, this can lead to:
- Starvation when mismatched outputs are repeatedly re-queued
- Timeout semantics that are not respected, since re-queueing returns early instead of continuing to search within the remaining timeout
- Unfair consumption behavior that depends on queue ordering rather than request progress
This behavior is observable when multiple streaming requests are active and results are interleaved in the output queue.
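The problematic pattern can be sketched as follows. This is a minimal, illustrative reduction (dict-based outputs and a standalone function, not the actual ContinuousBatchingManager method): one item is taken from the shared queue, a mismatched item is re-queued at the tail, and the call returns None without using the remaining timeout.

```python
import queue

def get_result_current(output_queue, request_id, timeout=None):
    # Sketch of the reported behavior: take ONE item, re-queue on
    # mismatch, and give up immediately instead of continuing to
    # search within the remaining timeout.
    try:
        result = output_queue.get(timeout=timeout)
    except queue.Empty:
        return None
    if result["request_id"] != request_id:
        output_queue.put(result)  # re-queued at the tail...
        return None               # ...and we return early
    return result

q = queue.Queue()
q.put({"request_id": "req-A", "text": "hello"})
q.put({"request_id": "req-A", "text": "again"})
q.put({"request_id": "req-B", "text": "world"})

# With req-A outputs ahead of req-B in the queue, the req-B consumer
# sees None on every call until the queue happens to rotate its item
# to the head -- progress depends on queue ordering, not on the request.
print(get_result_current(q, "req-B", timeout=0.1))  # None
print(get_result_current(q, "req-B", timeout=0.1))  # None
```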
2. request_id_iter does not terminate after normal completion
request_id_iter currently exits only when a request is cancelled or the generation thread terminates. For requests that complete normally, the iterator continues polling indefinitely after the final FINISHED output has been yielded.
This can result in:
- Infinite iteration loops for request-scoped consumers
- Unexpected blocking behavior in streaming-style usage
- Reliance on caller-side logic to manually stop iteration
The iterator should terminate once a terminal FINISHED output is observed for the given request.
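A terminating iterator can be sketched like this (simplified and hypothetical: dict outputs with a "status" field stand in for GenerationOutput, and the cancellation/thread-liveness checks of the real iterator are elided):

```python
import queue

FINISHED = "finished"  # illustrative terminal-state marker

def request_id_iter_sketch(output_queue, request_id, timeout=0.1):
    """Yield outputs for one request and stop after its FINISHED output.

    Simplified sketch: the real iterator would also exit on request
    cancellation or when the generation thread terminates.
    """
    while True:
        try:
            out = output_queue.get(timeout=timeout)
        except queue.Empty:
            continue  # real code would also check thread liveness here
        if out["request_id"] != request_id:
            output_queue.put(out)  # not ours: put it back
            continue
        yield out
        if out["status"] == FINISHED:
            return  # terminal state observed: stop iterating

q = queue.Queue()
q.put({"request_id": "r1", "status": "running", "token": "a"})
q.put({"request_id": "r1", "status": FINISHED, "token": "b"})

# The loop ends on its own after the FINISHED output, with no
# caller-side stopping logic required.
tokens = [out["token"] for out in request_id_iter_sketch(q, "r1")]
print(tokens)  # ['a', 'b']
```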
Expected behavior
get_result should fairly search for a matching result within the specified timeout, without starvation or early return. request_id_iter should stop iterating once the request reaches a terminal finished state, in addition to stopping on cancellation or thread termination.
Proposed fix
A minimal, backward-compatible fix (#42942) can:
- Defer re-queuing mismatched outputs until a matching result is found or the timeout expires
- Explicitly terminate request_id_iter when a GenerationOutput reports a finished state
This preserves existing APIs, streaming semantics, and benchmarking behavior while fixing the correctness issues.
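The deferred re-queue strategy for the first fix can be sketched as below. Names and the dict-based output shape are illustrative, not the exact #42942 implementation; the key points are that mismatched outputs are held aside rather than re-queued one by one, and the search continues against a single deadline until a match is found or the timeout expires.

```python
import queue
import time

def get_result_fixed(output_queue, request_id, timeout=None):
    """Search the queue for a matching result within the overall timeout.

    Mismatched outputs are held back and restored afterwards, so other
    consumers still see them and no item is lost on any exit path.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    held_back = []
    match = None
    try:
        while True:
            remaining = None if deadline is None else deadline - time.monotonic()
            if remaining is not None and remaining <= 0:
                break  # overall timeout expired
            try:
                out = output_queue.get(timeout=remaining)
            except queue.Empty:
                break
            if out["request_id"] == request_id:
                match = out
                break
            held_back.append(out)  # defer re-queueing until we are done
    finally:
        for out in held_back:  # restore other requests' outputs
            output_queue.put(out)
    return match

q = queue.Queue()
q.put({"request_id": "req-A", "text": "hello"})
q.put({"request_id": "req-B", "text": "world"})

# req-B's result is found on the first call, even though req-A's
# output was ahead of it; req-A's output goes back on the queue.
print(get_result_fixed(q, "req-B", timeout=0.5)["text"])  # world
```

Note that held-back items are re-queued behind anything enqueued in the meantime, which is acceptable here because consumers match on request_id rather than on queue position.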
Environment
- Transformers version: main
- Feature: continuous batching
- Device: CPU / CUDA (independent of backend)
Additional context
These issues are easiest to reproduce with multiple concurrent streaming requests sharing a single ContinuousBatchingManager.