kgo-verifier: default metadata-max-age to 15s (was franz-go's 5m) #30654
Open
travisdowns wants to merge 1 commit into
kgo-verifier clients now default kgo.MetadataMaxAge to 15s instead of
franz-go's 5m. Adds a --metadata-max-age flag (override; pass 0 for franz-go's
default) wired through WorkerConfig; the ducktape KgoVerifierService exposes an
optional metadata_max_age_ms override.
Background (CORE-13458 flake): after an all-nodes crash the partition is
briefly leaderless and the broker advertises leader_epoch=-1 in metadata.
franz-go detects the resulting log rewind (KIP-320, via OffsetForLeaderEpoch
-> ErrDataLoss) only once it observes the new, higher leader epoch in
metadata, and that observation can happen two ways:
1. During the rapid epoch-rewind retry burst: each -1 update triggers an
immediate metadata re-fetch, up to maxEpochRewinds=5 (~1-2s total). If
the new leader is elected and its higher epoch appears within those 5
retries, validation fires and the consumer resets quickly. Otherwise the
retries are exhausted while the partition is still leaderless (epoch -1)
and franz-go falls into path 2.
2. On the next periodic metadata refresh: with the 5 rewinds exhausted,
franz-go accepts the -1 epoch "to allow forward progress", which skips
validation; from then on only the next scheduled metadata refresh
(governed by MetadataMaxAge) can re-trigger it.
Path 2 caused the flake: at the 5m default its latency dwarfs typical test
timeouts (e.g. the 60s wait in WriteCachingFailureInjectionE2ETest), so
whenever election outlasts the ~1-2s retry burst the consumer stalls for
minutes. A 15s default caps path 2 well under those timeouts, so both
detection paths recover in time, for every test that uses the verifier.
Validation: at the 5m default this test failed in roughly 30% of local runs
(~3/10), and it recurs daily in CI per pandatriage. With the 15s setting it
passed 65/65 consecutive local runs (a clean 50/50 repeat run plus 15 earlier
confirmation runs), with zero timeouts.
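
As a rough sanity check on those numbers, the chance of 65 straight passes if the failure rate were still about 30% can be computed directly. This assumes independent runs and treats the ~3/10 baseline as exact, neither of which the data above establishes; `allPassProbability` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// allPassProbability returns the chance of n consecutive passes when each run
// independently fails with probability failRate.
func allPassProbability(failRate float64, n int) float64 {
	return math.Pow(1-failRate, float64(n))
}

func main() {
	// Baseline failure rate (~30%) and run count (65) are taken from the text.
	fmt.Printf("P(65 passes at 30%% failure) = %.2e\n", allPassProbability(0.30, 65))
}
```

Under those assumptions the probability is far below one in a billion, so the clean streak is very unlikely to be luck.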
Upstream franz-go issue (the -1-as-rewind root cause):
twmb/franz-go#1331
Contributor
Pull request overview
This PR lowers kgo-verifier’s effective franz-go metadata refresh interval so continuous consumers can re-detect leader epoch changes after all-broker crash recovery within test timeouts.
Changes:
- Adds a `--metadata-max-age` CLI flag defaulting to 15s, with `0` preserving franz-go’s default.
- Wires `MetadataMaxAge` through `WorkerConfig` into `kgo.MetadataMaxAge`.
- Adds an optional ducktape service override via `metadata_max_age_ms`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `tests/go/kgo-verifier/cmd/kgo-verifier/main.go` | Adds the CLI flag and passes it into worker configuration. |
| `tests/go/kgo-verifier/pkg/worker/worker.go` | Stores and applies the metadata max age option to franz-go clients. |
| `tests/rptest/services/kgo_verifier_services.py` | Allows tests to override the verifier metadata max age when launching the service. |
Member
Author
failure is spurious; tracked in CORE-16410 (ducktape debug-shard hang in the post-test
Member
Author
/ci-repeat 1
`WriteCachingFailureInjectionE2ETest.test_crash_all` is a long-standing flake (CORE-13458): it times out in roughly 30% of local runs and recurs daily in CI.

The root cause is in the franz-go Kafka client used by `kgo-verifier`, not in redpanda. The test crashes all brokers between produce rounds; with write caching the unflushed tail is lost and the log is rewritten at the same offsets under a new leader epoch, so the continuous consumer must detect the rewind via KIP-320 (`OffsetForLeaderEpoch`). franz-go only runs that validation when it observes a new leader epoch in metadata. During the all-nodes-down recovery window the broker advertises `leader_epoch = -1` (no leader yet); franz-go counts each `-1` as a leader-epoch rewind and, after `maxEpochRewinds=5` (~1-2s of rapid retries), accepts `-1` and skips validation. From then on only the next periodic metadata refresh can re-trigger detection, and franz-go's default `MetadataMaxAge` is 5m, far past the test's 60s wait, so the consumer stalls and the test times out. Upstream franz-go issue: twmb/franz-go#1331.

This change defaults `kgo-verifier`'s `kgo.MetadataMaxAge` to 15s (via a new `--metadata-max-age` flag; pass 0 for franz-go's default), wired through `WorkerConfig` with an optional `metadata_max_age_ms` override on the ducktape `KgoVerifierService`. A 15s refresh re-triggers KIP-320 detection well within typical test timeouts, so recovery completes in time for every test that uses the verifier.

Validation: ~30% baseline failure rate dropped to 0 failures across 65 consecutive local runs (a clean 50/50 repeat run plus 15 earlier confirmation runs) at the 15s setting.
Backports Required
Release Notes