@dumbbell dumbbell commented Feb 20, 2025

The fixes come from the following pull requests:

They are backported together to reduce the number of pull requests and the load on CI. Also, CI would likely fail a lot more with one of the fixes missing.

There is still work to do to fix all test flakes, but backporting these will already bring an improvement for the v4.0.x branch.

@dumbbell dumbbell self-assigned this Feb 20, 2025
@dumbbell dumbbell changed the base branch from main to v4.0.x February 20, 2025 13:45
@dumbbell dumbbell force-pushed the backport-test-fixes-from-main branch 2 times, most recently from 63f1d8a to 5e1538b Compare February 20, 2025 14:52
[Why]
The `force_reset` command simply removes local files on disk for the
local node.

In the case of Ra, this can't work because the rest of the cluster does
not know about the forced-reset node. Therefore the leader will continue
to send `append_entry` commands to the reset node.

If that forced-reset node restarts and receives these messages, it will
either join the cluster again (because it's on an older Raft term) or it
will hit an assertion and exit (because it's on the same Raft term).

[How]
Given we can't really support this scenario and it has little value, the
command will now return an error if someone attempts a `force_reset` with
a node running Khepri.

This also deprecates the command: once Mnesia support is removed, the
command will be removed at the same time. This is noted in the
rabbitmqctl.8 manpage.
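
As an illustration only, the guard described above can be sketched in Python. This is not the RabbitMQ code. `MetadataStore`, `ForceResetUnsupported`, the `force_reset` signature and the `wipe_local_files` callback are all hypothetical names chosen for this sketch.

```python
from enum import Enum
from typing import Callable


class MetadataStore(Enum):
    # Hypothetical enumeration of the two metadata stores named in the text.
    MNESIA = "mnesia"
    KHEPRI = "khepri"


class ForceResetUnsupported(Exception):
    """Raised when force_reset is attempted on a Khepri-backed node."""


def force_reset(store: MetadataStore, wipe_local_files: Callable[[], None]) -> None:
    # With Khepri (Ra), wiping local files leaves the rest of the cluster
    # unaware of the reset, so the command refuses instead of proceeding.
    if store is MetadataStore.KHEPRI:
        raise ForceResetUnsupported(
            "force_reset is not supported on a node running Khepri"
        )
    # Mnesia path: the command only removes the local node's files.
    wipe_local_files()
```

The check runs before anything is deleted, so a rejected call leaves the node's data untouched.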

(cherry picked from commit c78aec7)
[Why]
We hit some transient errors with the previous order when doing
mixed-version testing. Swapping the nodes seems to fix the problem.

(cherry picked from commit 5cbda4c)
... are being used at the same time.

[Why]
Depending on which node clusters with which, a node running an older
version of the Khepri Ra machine may not be able to apply Ra commands
and could be stuck.

There is no real solution and this is clearly an unsupported scenario. An
old node won't always be able to join a newer cluster.

[How]
In the testsuites, we skip clustering tests if we detect that multiple
Khepri Ra machine versions are being used.
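
A minimal Python sketch of the skip decision, assuming each node's Khepri Ra machine version has already been collected into a mapping. The function name `skip_reason` and the mapping shape are hypothetical. The actual testsuites are not shown here and may detect versions differently.

```python
from typing import Mapping, Optional


def skip_reason(machine_versions: Mapping[str, int]) -> Optional[str]:
    """Return a skip message if nodes report different machine versions.

    `machine_versions` maps a node name to the Khepri Ra machine version it
    runs (assumed input; how it is collected is outside this sketch).
    """
    distinct = sorted(set(machine_versions.values()))
    if len(distinct) > 1:
        return f"multiple Khepri Ra machine versions in use: {distinct}"
    # All nodes agree (or there are no nodes): the test can run.
    return None
```

A clustering test would call this during setup and skip itself whenever the result is not `None`.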

(cherry picked from commit 1f1a135)
[Why]
During mixed-version testing, the old node might not be able to join or
rejoin a cluster if the other nodes run a newer Khepri machine version.

[How]
The old node is used as the cluster seed node and is never touched
otherwise. Other nodes are restarted or join the cluster later.
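
The ordering can be sketched as follows, assuming the "old" node is the one reporting the lowest Khepri machine version. That selection rule, the function name `plan_cluster` and the input mapping are assumptions made for this sketch, not taken from the testsuites.

```python
from typing import List, Mapping, Tuple


def plan_cluster(machine_versions: Mapping[str, int]) -> Tuple[str, List[str]]:
    """Pick the seed node and the nodes that join (or restart) afterwards.

    The seed is the node with the lowest machine version. Ties are broken
    by node name so the plan is deterministic. Raises ValueError when the
    mapping is empty.
    """
    seed = min(machine_versions, key=lambda node: (machine_versions[node], node))
    # Every other node joins the seed later; the seed itself is left alone.
    joiners = sorted(node for node in machine_versions if node != seed)
    return seed, joiners
```

For example, `plan_cluster({"new1": 2, "old": 1, "new2": 2})` yields `("old", ["new1", "new2"])`, so only the newer nodes are ever restarted or asked to join.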

(cherry picked from commit e76233a)
… with Khepri

[Why]
This test plays with the Mnesia database explicitly.

(cherry picked from commit f088c4f)
[Why]
We see nodes trying to use busy ports in CI from time to time.

(cherry picked from commit e76c227)
... in retry_if_coordinator_unavailable().

(cherry picked from commit ee0b5b5)
This may help debug nodes that try to open busy ports.

(cherry picked from commit a5f30ea)
@dumbbell dumbbell force-pushed the backport-test-fixes-from-main branch from bfa8721 to 02c7b04 Compare February 20, 2025 19:15
@dumbbell dumbbell marked this pull request as ready for review February 20, 2025 21:27
@dumbbell dumbbell merged commit 6955665 into v4.0.x Feb 20, 2025
270 checks passed
@dumbbell dumbbell deleted the backport-test-fixes-from-main branch February 20, 2025 21:27
@dumbbell dumbbell added this to the 4.0.7 milestone Feb 20, 2025