
[fix][client] PIP-475: make async producer survive regular-to-scalable migration #25882

Open
merlimat wants to merge 1 commit into apache:master from merlimat:pip-475-async-send-fix

@merlimat
Contributor

Motivation

Follow-up to the PIP-475 migration work. #25878 hardened the sync producer send path so a connected V5 producer rides through the synthetic→real-DAG transition during a regular-to-scalable migration: when a per-segment producer is (re)created on a partition that the migration just terminated, the send retries onto an active child once the new layout arrives instead of failing.

The async path (producer.async()…send() → dispatchSendAttempt / appendToDispatchChain) had the same gap and was missed, so an application using the async producer API and sending across a migration could still see sends fail.

Modifications

  • Creation-time failures are now retried. appendToDispatchChain previously failed the user future immediately when the per-segment producer failed to create; it now hands the failure to a callback that applies the same retry decision as a send failure.
  • Unified, robust detection. Both the v4 sendAsync failure and the producer-creation failure now go through the shared isSegmentGoneError (type or message match), instead of the async path's previous type-only check that missed the message-wrapped exception the creation path surfaces.
  • Full retry budget. Both paths now use SEND_RETRY_MAX_ATTEMPTS / SEND_RETRY_MAX_BACKOFF_MS (shared with the sync path) instead of the old 3-attempt (~600 ms) limit.
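
For reviewers, the "type or message match" idea can be sketched roughly as below. This is not the code in this PR: the two exception classes are local stand-ins for the client's typed exceptions, the matched message substrings are assumptions, and walking the cause chain is an assumption about how a wrapped failure would be unwrapped.

```java
final class SegmentGoneDetectionSketch {
    // Hypothetical stand-ins for the typed exceptions the send path surfaces.
    static class TopicTerminatedException extends RuntimeException {
        TopicTerminatedException(String msg) { super(msg); }
    }

    static class AlreadyClosedException extends RuntimeException {
        AlreadyClosedException(String msg) { super(msg); }
    }

    // Returns true if any throwable in the cause chain matches by type,
    // or carries a message naming one of the segment-gone conditions.
    // The substrings below are illustrative assumptions.
    static boolean isSegmentGoneError(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof TopicTerminatedException || c instanceof AlreadyClosedException) {
                return true;
            }
            String msg = c.getMessage();
            if (msg != null && (msg.contains("TopicTerminated") || msg.contains("AlreadyClosed"))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Typed failure, as the send path might surface it.
        System.out.println(isSegmentGoneError(new TopicTerminatedException("segment sealed")));
        // Message-wrapped failure, as the creation path might surface it.
        System.out.println(isSegmentGoneError(
                new IllegalStateException("producer create failed: TopicTerminated")));
        // Unrelated failure: should not be retried as segment-gone.
        System.out.println(isSegmentGoneError(new IllegalStateException("timeout")));
    }
}
```

A type-only check would return false for the second case, which is the gap the second bullet describes.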

No wire/API/behavior change for the happy path; this only affects retry behavior on the segment-gone transition.

Verification

  • New V5MigrationEndToEndTest.testV5AsyncProducerSurvivesMigration: async sends issued immediately after migration must all complete (none fail across the boundary) and every pre-/post-migration message stays consumable. Passes alongside the existing sync + v4-lockout cases (3/3).
  • pulsar-client-v5 checkstyle/compile and pulsar-broker checkstyle clean.

…e migration

The sync send path was hardened in apache#25878 to ride through the
synthetic->real-DAG transition (retry the segment-gone failure that
arises when the per-segment producer is (re)created on a just-terminated
partition). The async path (producer.async()...send() ->
dispatchSendAttempt / appendToDispatchChain) had the same gap and was
missed:
 * creation-time failures were not retried at all — appendToDispatchChain
   failed the user future immediately;
 * the send-failure retry matched only the typed TopicTerminated /
   AlreadyClosed exceptions, not the message-wrapped form the creation
   path surfaces;
 * it allowed only 3 attempts (~600ms), often too short for the layout
   push to arrive.

Route both send- and creation-failures through the shared
isSegmentGoneError detection and the SEND_RETRY_MAX_ATTEMPTS /
SEND_RETRY_MAX_BACKOFF_MS budget already used by the sync path, via a
single handleAsyncSegmentFailure helper. appendToDispatchChain now hands
creation failures to a callback instead of failing the user future
directly.
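
A minimal sketch of the shape described above, under stated assumptions: one attempt covers both producer creation and send, any failure lands in a single handler, and that handler either retries with bounded backoff or fails the user future. The class name, the constant values, the backoff formula, and the injected predicate are all illustrative and not taken from the actual change.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.function.Supplier;

final class AsyncSegmentRetrySketch {
    // Placeholder values; the real constants are shared with the sync path.
    static final int SEND_RETRY_MAX_ATTEMPTS = 5;
    static final long SEND_RETRY_MAX_BACKOFF_MS = 50;

    private final ScheduledExecutorService timer;
    private final Predicate<Throwable> isSegmentGone;

    AsyncSegmentRetrySketch(ScheduledExecutorService timer, Predicate<Throwable> isSegmentGone) {
        this.timer = timer;
        this.isSegmentGone = isSegmentGone;
    }

    // 'attempt' stands for "create the per-segment producer, then send".
    <T> CompletableFuture<T> dispatch(Supplier<CompletableFuture<T>> attempt) {
        CompletableFuture<T> userFuture = new CompletableFuture<>();
        tryOnce(attempt, 1, userFuture);
        return userFuture;
    }

    private <T> void tryOnce(Supplier<CompletableFuture<T>> attempt, int n,
                             CompletableFuture<T> userFuture) {
        CompletableFuture<T> f;
        try {
            f = attempt.get();
        } catch (RuntimeException e) {
            // A synchronous creation failure is routed like an async one.
            f = new CompletableFuture<>();
            f.completeExceptionally(e);
        }
        f.whenComplete((value, err) -> {
            if (err == null) {
                userFuture.complete(value);
            } else {
                handleAsyncSegmentFailure(attempt, n, err, userFuture);
            }
        });
    }

    // Single decision point for both creation and send failures.
    private <T> void handleAsyncSegmentFailure(Supplier<CompletableFuture<T>> attempt, int n,
                                               Throwable err, CompletableFuture<T> userFuture) {
        if (!isSegmentGone.test(err) || n >= SEND_RETRY_MAX_ATTEMPTS) {
            userFuture.completeExceptionally(err);
            return;
        }
        // Exponential backoff capped at the maximum (10, 20, 40, 50, ... ms).
        long backoffMs = Math.min(SEND_RETRY_MAX_BACKOFF_MS, 10L << (n - 1));
        timer.schedule(() -> tryOnce(attempt, n + 1, userFuture), backoffMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger calls = new AtomicInteger();
        AsyncSegmentRetrySketch retry = new AsyncSegmentRetrySketch(
                timer, t -> String.valueOf(t.getMessage()).contains("segment gone"));
        // Fails twice with a retryable error, then succeeds.
        CompletableFuture<String> result = retry.dispatch(() -> {
            CompletableFuture<String> f = new CompletableFuture<>();
            if (calls.incrementAndGet() < 3) {
                f.completeExceptionally(new RuntimeException("segment gone"));
            } else {
                f.complete("ok");
            }
            return f;
        });
        System.out.println(result.join() + " after " + calls.get() + " attempts");
        timer.shutdown();
    }
}
```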

Adds V5MigrationEndToEndTest.testV5AsyncProducerSurvivesMigration:
async sends issued right after migration must all complete (none fail
across the boundary) and every message stays consumable.