kgo-verifier: default metadata-max-age to 15s (was franz-go's 5m)#30654

Open
travisdowns wants to merge 1 commit into redpanda-data:dev from travisdowns:td-CORE-13458-kgo-metadata-max-age

Conversation

travisdowns (Member) commented May 29, 2026

WriteCachingFailureInjectionE2ETest.test_crash_all is a long-standing flake (CORE-13458): it times out in roughly 30% of local runs and recurs daily in CI.

The root cause is in the franz-go Kafka client used by kgo-verifier, not in redpanda. The test crashes all brokers between produce rounds; with write caching the unflushed tail is lost and the log is rewritten at the same offsets under a new leader epoch, so the continuous consumer must detect the rewind via KIP-320 (OffsetForLeaderEpoch). franz-go only runs that validation when it observes a new leader epoch in metadata. During the all-nodes-down recovery window the broker advertises leader_epoch = -1 (no leader yet); franz-go counts each -1 as a leader-epoch rewind and, after maxEpochRewinds=5 (~1-2s of rapid retries), accepts -1 and skips validation. From then on only the next periodic metadata refresh can re-trigger detection, and franz-go's default MetadataMaxAge is 5m, far past the test's 60s wait, so the consumer stalls and the test times out. Upstream franz-go issue: twmb/franz-go#1331.

This change defaults kgo-verifier's kgo.MetadataMaxAge to 15s (via a new --metadata-max-age flag; pass 0 for franz-go's default), wired through WorkerConfig with an optional metadata_max_age_ms override on the ducktape KgoVerifierService. A 15s refresh re-triggers KIP-320 detection well within typical test timeouts, so recovery completes in time for every test that uses the verifier.
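The flag behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's actual code: the `WorkerConfig` field name, the `parseWorkerConfig` helper, and the `overridesClientDefault` method are assumptions made for the example. In the real worker, the positive case is where `kgo.MetadataMaxAge` would be appended to the client options.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// WorkerConfig is a stand-in for kgo-verifier's worker configuration.
// Only the field relevant here is modeled; the field name is assumed.
type WorkerConfig struct {
	MetadataMaxAge time.Duration
}

// parseWorkerConfig (hypothetical helper) parses args with a
// --metadata-max-age flag that defaults to 15s.
func parseWorkerConfig(args []string) (WorkerConfig, error) {
	fs := flag.NewFlagSet("kgo-verifier", flag.ContinueOnError)
	maxAge := fs.Duration("metadata-max-age", 15*time.Second,
		"metadata refresh interval; 0 keeps the client library's default")
	if err := fs.Parse(args); err != nil {
		return WorkerConfig{}, err
	}
	return WorkerConfig{MetadataMaxAge: *maxAge}, nil
}

// overridesClientDefault reports whether the configured value should be
// handed to the client. With franz-go, a true result is where one would
// add kgo.MetadataMaxAge(c.MetadataMaxAge) to the client options.
func (c WorkerConfig) overridesClientDefault() bool {
	return c.MetadataMaxAge > 0
}

func main() {
	cases := [][]string{nil, {"--metadata-max-age=0"}, {"--metadata-max-age=30s"}}
	for _, args := range cases {
		cfg, err := parseWorkerConfig(args)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Println(args, cfg.MetadataMaxAge, cfg.overridesClientDefault())
	}
}
```

Treating 0 as "do not set the option" keeps the library default reachable without adding a separate boolean flag.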

Validation: ~30% baseline failure rate dropped to 0 failures across 65 consecutive local runs (a clean 50/50 repeat run plus 15 earlier confirmation runs) at the 15s setting.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

kgo-verifier clients now default kgo.MetadataMaxAge to 15s instead of
franz-go's 5m. Adds a --metadata-max-age flag (override; pass 0 for franz-go's
default) wired through WorkerConfig; the ducktape KgoVerifierService exposes an
optional metadata_max_age_ms override.

Background (CORE-13458 flake): after an all-nodes crash the partition is
briefly leaderless and the broker advertises leader_epoch=-1 in metadata.
franz-go detects the resulting log rewind (KIP-320, via OffsetForLeaderEpoch
-> ErrDataLoss) only once it observes the new, higher leader epoch in
metadata, and that observation can happen two ways:

  1. During the rapid epoch-rewind retry burst: each -1 update triggers an
     immediate metadata re-fetch, up to maxEpochRewinds=5 (~1-2s total). If
     the new leader is elected and its higher epoch appears within those 5
     retries, validation fires and the consumer resets quickly. Otherwise the
     retries are exhausted while the partition is still leaderless (epoch -1)
     and franz-go falls into path 2.

  2. On the next periodic metadata refresh: with the 5 rewinds exhausted,
     franz-go accepts the -1 epoch "to allow forward progress", which skips
     validation; from then on only the next scheduled metadata refresh
     (governed by MetadataMaxAge) can re-trigger it.
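The two paths can be illustrated with a toy model of the retry burst. This is not franz-go's code: `runBurst` and the `outcome` type are hypothetical, and the exact boundary at which the rewind counter gives up (here, on the fifth lower-epoch observation) is an assumption based on the description above.

```go
package main

import "fmt"

// maxEpochRewinds mirrors the value quoted in the description above.
const maxEpochRewinds = 5

// outcome describes how the retry burst ended in this toy model.
type outcome string

const (
	validated   outcome = "validated"   // path 1: higher epoch seen, KIP-320 check runs
	acceptedNeg outcome = "accepted -1" // path 2: rewinds exhausted, validation skipped
	pending     outcome = "pending"     // burst still in progress
)

// runBurst feeds successive metadata leader epochs to a client that last
// saw lastEpoch. Any epoch below lastEpoch (including -1) counts as a rewind.
func runBurst(lastEpoch int32, observed []int32) outcome {
	rewinds := 0
	for _, e := range observed {
		switch {
		case e > lastEpoch:
			return validated
		case e < lastEpoch:
			rewinds++
			// Assumed boundary: the fifth rewind is accepted.
			if rewinds >= maxEpochRewinds {
				return acceptedNeg
			}
		}
	}
	return pending
}

func main() {
	// New leader (epoch 4) appears on the third metadata response: path 1.
	fmt.Println(runBurst(3, []int32{-1, -1, 4}))
	// Partition stays leaderless for the whole burst: path 2.
	fmt.Println(runBurst(3, []int32{-1, -1, -1, -1, -1}))
}
```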

Path 2 caused the flake: at the 5m default its latency dwarfs typical test
timeouts (e.g. the 60s wait in WriteCachingFailureInjectionE2ETest), so
whenever election outlasts the ~1-2s retry burst the consumer stalls for
minutes. A 15s default caps path 2 well under those timeouts, so both
detection paths recover in time, for every test that uses the verifier.
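The timing argument can be checked with rough arithmetic. The sketch below is a simplified model under stated assumptions: the burst is taken to last 2s (the upper end of the quoted ~1-2s), path 2 is assumed to wait one full `MetadataMaxAge` after the burst, and `worstCaseDetection` is a hypothetical helper, not part of kgo-verifier or franz-go.

```go
package main

import (
	"fmt"
	"time"
)

const (
	burstWindow = 2 * time.Second  // upper end of the "~1-2s" retry burst
	clientMax   = 5 * time.Minute  // franz-go default quoted above
	testWait    = 60 * time.Second // wait in WriteCachingFailureInjectionE2ETest
)

// worstCaseDetection estimates how long after the crash the consumer
// first observes the new leader epoch, given the election time.
func worstCaseDetection(election, metadataMaxAge time.Duration) time.Duration {
	if metadataMaxAge <= 0 {
		metadataMaxAge = clientMax // 0 means "use the client default"
	}
	if election <= burstWindow {
		return election // path 1: seen during the retry burst
	}
	// Path 2: -1 was accepted at the end of the burst, so detection waits
	// for the first periodic refresh that lands after the election.
	t := burstWindow + metadataMaxAge
	for t < election {
		t += metadataMaxAge
	}
	return t
}

func main() {
	election := 10 * time.Second // example: election outlasts the burst
	for _, age := range []time.Duration{clientMax, 15 * time.Second} {
		d := worstCaseDetection(election, age)
		fmt.Printf("max age %v: detection after ~%v, within %v wait: %v\n",
			age, d, testWait, d < testWait)
	}
}
```

Under these assumptions a 10s election gives roughly 5m2s at the 5m default and 17s at the 15s setting, which matches the stall-versus-recover behaviour described above.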

Validation: at the 5m default this test failed roughly 30% of local runs
(~3/10; and it recurs daily in CI per pandatriage). With the 15s setting it
passed 65/65 consecutive local runs (a clean 50/50 repeat run plus 15 earlier
confirmation runs), zero timeouts.
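As a rough sanity check on the validation numbers, one can ask how likely 65 clean runs would be if the failure rate were still about 30%. The sketch below assumes runs are independent and that the baseline rate is exactly 0.30, neither of which the PR establishes; `chanceOfCleanStreak` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// chanceOfCleanStreak returns the probability of n consecutive passes
// when each run independently fails with probability p.
func chanceOfCleanStreak(p float64, n int) float64 {
	return math.Pow(1-p, float64(n))
}

func main() {
	// Probability that an unfixed ~30% flake passes 65 times in a row.
	fmt.Printf("%.2e\n", chanceOfCleanStreak(0.30, 65))
}
```

The result is far below one in a billion, so under these assumptions the 65/65 streak is very unlikely to be luck.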

Upstream franz-go issue (the -1-as-rewind root cause):
twmb/franz-go#1331
Copilot AI review requested due to automatic review settings May 29, 2026 20:28

Copilot AI left a comment

Pull request overview

This PR lowers kgo-verifier’s effective franz-go metadata refresh interval so continuous consumers can re-detect leader epoch changes after all-broker crash recovery within test timeouts.

Changes:

  • Adds a --metadata-max-age CLI flag defaulting to 15s, with 0 preserving franz-go’s default.
  • Wires MetadataMaxAge through WorkerConfig into kgo.MetadataMaxAge.
  • Adds an optional ducktape service override via metadata_max_age_ms.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/go/kgo-verifier/cmd/kgo-verifier/main.go | Adds the CLI flag and passes it into worker configuration. |
| tests/go/kgo-verifier/pkg/worker/worker.go | Stores and applies the metadata max age option to franz-go clients. |
| tests/rptest/services/kgo_verifier_services.py | Allows tests to override the verifier metadata max age when launching the service. |

travisdowns commented Jun 1, 2026

The CI failure is spurious and unrelated to this PR. It is tracked in CORE-16410 (a ducktape debug-shard hang in the post-test estimate_bytes_written() metrics scrape): https://redpandadata.atlassian.net/browse/CORE-16410

travisdowns commented

/ci-repeat 1
debug
