Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v24.3.x] Fixed checking if batches are already present in follower log #25338

Conversation

mmaslankaprv
Copy link
Member

@mmaslankaprv mmaslankaprv commented Mar 12, 2025

Backport of PR #25018
Includes #25035

bharathv and others added 7 commits March 12, 2025 17:54
The current logic to wait on `enqueued` stage seems incorrect and is
causing the test to flake. `eneuqued` only guarantees that the request
is enqueued in the raft layer and subsequent replication may still fail
with not_leader error. So the sequence of flakiness is something like
this.

1. wait for enqueued futures to resolve
2. force a step down
3. wait for replicated entry to be truncated.

If (2) happens before the entry is appended to the leader log, the
replication future fails with a not leader resulting in a test retry.
Effectively (1) is not a tight enough condition to wait on to force
a step down.

This commit rewrites the test to replicate with a leader_ack and wait
for the future to be resolved (which guarantees a lleader append) and
then the replication monitor waiter should be resolved with a truncation
error after leadership change.

Additionally the test seems to be missing some cleanup between
retries, which is fixed

Signed-off-by: Michał Maślanka <michal@redpanda.com>
When follower receives an append entries request that `prev_log_index` is
smaller than its own `prev_log_index` it validates if the
batches from the request matches (by checking a batch offset and
corresponding term) its own log. It that is the case the batches are
skipped to prevent truncation of valid batches and avoid the data loss.

The validation of already matching batches was broken if they happened
to be at the beginning of the log. In this case the `prev_log_index` is
not initialized being negative. This case was not correctly handled by
the logic calculating the next offset when checking matching batches.

That lead to a situation in which a range of batches starting with 0 was
never matching.

Fixed the issue by correctly adjusting the `prev_log_index` if it is
uninitialized.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
(cherry picked from commit 0d029cf)
When follower receives an append entries request with the vector of
records that are all present in its own log and their offsets and terms
match it should reply with success and correct `last_dirty_log_index`.
This way a leader instead of moving the follower `next_offset` backward
can start recovery process and deliver batches which the follower is
missing.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
(cherry picked from commit edd55dc)
Signed-off-by: Michał Maślanka <michal@redpanda.com>
(cherry picked from commit bd47cd9)
Assertion triggered in function body are not propagated to the test
itself. Change the method to throw an exception in case of timeout
instead of using an assertion.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
(cherry picked from commit e2ec035)
The reply interceptor allows test creator to modify or drop the reply
that is about to be processed by the RPC requester. This allow tests to
take more control over the Raft protocol behavior and test some rare
edge cases which might be hard to trigger otherwise.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
(cherry picked from commit 9627077)
Signed-off-by: Michał Maślanka <michal@redpanda.com>
(cherry picked from commit 59e43a7)
@mmaslankaprv mmaslankaprv added this to the v24.3.x-next milestone Mar 12, 2025
@mmaslankaprv mmaslankaprv added the kind/backport PRs targeting a stable branch label Mar 12, 2025
@mmaslankaprv mmaslankaprv marked this pull request as ready for review March 12, 2025 16:56
@mmaslankaprv mmaslankaprv requested review from bharathv and bashtanov and removed request for bharathv March 12, 2025 16:56
@vbotbuildovich
Copy link
Collaborator

CI test results

test results on build#62993
test_id test_kind job_url test_status passed
rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage.cleanup_policy=compact.delete ducktape https://buildkite.com/redpanda/redpanda/builds/62993#01958b91-183c-4c2f-8f3a-fa0ce3db1a35 FLAKY 1/2
rptest.tests.datalake.compaction_gaps_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/62993#01958b91-183b-4048-b7aa-6cc1cf66f679 FLAKY 1/4
rptest.tests.idempotency_test.IdempotencyWriteCachingTest.test_idempotent_producers_write_caching ducktape https://buildkite.com/redpanda/redpanda/builds/62993#01958ba3-bc91-4df8-9f2c-86ac9289ea24 FLAKY 1/3

@mmaslankaprv mmaslankaprv requested a review from ztlpn March 12, 2025 20:24
@piyushredpanda piyushredpanda merged commit 71f1506 into redpanda-data:v24.3.x Mar 12, 2025
18 checks passed
@piyushredpanda piyushredpanda modified the milestones: v24.3.x-next, v24.3.7 Mar 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/redpanda kind/backport PRs targeting a stable branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants