test(red-197205): regression test for zero-rate stats with --reconnect-interval#384
Open
fcostaoliveira wants to merge 3 commits into
Open
Conversation
paulorsousa
previously approved these changes
May 17, 2026
fcostaoliveira
added a commit
to filipecosta90/memtier_benchmark
that referenced
this pull request
May 18, 2026
…ncy quantization CI on PR redis#384 surfaced this: with --reconnect-interval=1 and --test-time=5, the trailing per-second bucket can hold very few ops (Count as low as 1). 'Accumulated Latency' is emitted with %lld (1 ms precision), so the reported value can differ from the true sum by up to 1 ms — which means the derived avg = accumulated/count can differ from the double-precision 'Average Latency' by up to (1 / count) ms. The previous tolerance was max(0.01 ms, 2% relative). For Count=1 with a sub-ms avg that's 0.01 ms; observed gap was 0.058 ms (0.391 vs 0.333), well within the quantization bound but outside the static tolerance. Add a third term `1.0 / count` so the quantization error is modeled explicitly. For large-count buckets (the typical case in every existing test) the existing tolerance still dominates, so no behavior change for them. Verified by replaying the exact failing workload locally: -t 4 -c 4 --test-time=5 --ratio=1:1 --reconnect-interval=1 old tolerance: 3 violations across Sets/Gets/Totals bucket 5 new tolerance: 0 violations Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t-interval RED-197205 reported that running with --reconnect-interval=1 produced a final summary where Totals.Ops/sec, KB/sec, Hits/sec and Misses/sec all rendered as 0.00 even though Count and Latency were correct. Root cause (on v2.2.1): client_group::merge_run_stats() iterated *all* clients including those that were prepare()-d but never reached set_start_time(). Their zeroed m_start_time was factorial-averaged into the aggregate, dragging it toward the epoch. With m_end_time correct but m_start_time near zero, test_duration_usec became huge and every `ops / test_duration_usec * 1000000` division collapsed to ~0. Fix landed in 0228bac (PR redis#350): a run_stats::m_started atomic flag plus has_started() guards in merge_run_stats() and aggregate_inst_histogram() that skip never-started clients. RED-197205 was a side-effect fix, not the headline of that PR, so this test pins down the invariant explicitly. Verified end-to-end against the test's exact workload: v2.2.1 + 4t/4c --test-time=5 --ratio=1:1 --reconnect-interval=1 -> Ops/sec=0.00, KB/sec=0.00, Misses/sec=0.00 (bug fires) master + same args -> Ops/sec=52320, KB/sec=3035, Misses/sec=26114 (fixed) The bug only fires reliably with --test-time (not -n); with -n every client runs to completion and reaches set_start_time, so the zero-state merge path is never hit. _build_benchmark gains an optional test_time parameter so the helper can express that distinction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ncy quantization CI on PR redis#384 surfaced this: with --reconnect-interval=1 and --test-time=5, the trailing per-second bucket can hold very few ops (Count as low as 1). 'Accumulated Latency' is emitted with %lld (1 ms precision), so the reported value can differ from the true sum by up to 1 ms — which means the derived avg = accumulated/count can differ from the double-precision 'Average Latency' by up to (1 / count) ms. The previous tolerance was max(0.01 ms, 2% relative). For Count=1 with a sub-ms avg that's 0.01 ms; observed gap was 0.058 ms (0.391 vs 0.333), well within the quantization bound but outside the static tolerance. Add a third term `1.0 / count` so the quantization error is modeled explicitly. For large-count buckets (the typical case in every existing test) the existing tolerance still dominates, so no behavior change for them. Verified by replaying the exact failing workload locally: -t 4 -c 4 --test-time=5 --ratio=1:1 --reconnect-interval=1 old tolerance: 3 violations across Sets/Gets/Totals bucket 5 new tolerance: 0 violations Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
memtier rejects --reconnect-interval together with --cluster-mode
("error: cluster mode dose not support reconnect-interval option"),
so the test exited non-zero on every cluster CI job. The RED-197205
regression is not cluster-specific — guard the test with env.isCluster()
so it only runs in single-endpoint configurations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
930f554 to
b90142c
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Why this matters
Before PR #350, `client_group::merge_run_stats()` iterated all clients including those that were `prepare()`-d but never reached `set_start_time()`. Their zeroed `m_start_time` was factorial-averaged into the aggregate, pulling it toward the epoch. With `m_end_time` correct but `m_start_time` near zero, `test_duration_usec` became huge and every `ops / test_duration_usec * 1000000` division collapsed to ~0. The latency fields were correct because they're computed independently of `test_duration_usec` (`accumulated_latency / count`), which is exactly the asymmetry the bug report observed.
Verification
Locally reproduced + confirmed the fix using the test's exact workload (`-t 4 -c 4 --test-time=5 --ratio=1:1 --reconnect-interval=1`):
The test would have caught the bug on v2.2.1; master passes cleanly.
Test-shape notes
Test plan
🤖 Generated with Claude Code
Note
Low Risk
Low risk: changes are confined to the test suite, adding a new regression test and slightly relaxing latency tolerance for low-count time-series buckets.
Overview
Adds a new RLTest regression case (
test_reconnect_interval_does_not_zero_out_rates) to ensure--reconnect-interval=1does not produce all-zero throughput rate fields (Ops/sec,KB/sec,Misses/sec) in the final JSON summary, while re-validating existing latency consistency checks.Updates the JSON time-series latency consistency check to account for 1ms quantization error in
Accumulated Latencyat low counts, and extends_build_benchmarkwith an optionaltest_timemode (mutually exclusive withrequests) to support time-bounded runs.Reviewed by Cursor Bugbot for commit b90142c. Bugbot is set up for automated code reviews on this repo. Configure here.