Improve metadata bwc test for Logical Replication #354

jeeminso · 2025-06-01T00:55:36Z

Summary of the changes / Why this is an improvement

Checklist

Link to issue this PR refers to (if applicable): Fixes #???

jeeminso · 2025-06-03T16:12:29Z

tests/bwc/test_rolling_upgrade.py

+            # Set up tables for logical replications
+            if int(path.from_version.split('.')[0]) >= 5 and int(path.from_version.split('.')[1]) >= 10:
+                c.execute("create table doc.x (a int) clustered into 1 shards with (number_of_replicas=0)")
+                expected_active_shards += 1
+                c.execute("create publication p for table doc.x")
+                with connect(replica_cluster.node().http_url, error_trace=True) as replica_conn:
+                    rc = replica_conn.cursor()
+                    rc.execute("create table doc.rx (a int) clustered into 1 shards with (number_of_replicas=0)")
+                    rc.execute("create publication rp for table doc.rx")
+                    rc.execute(f"create subscription rs connection 'crate://localhost:{cluster.node().addresses.transport.port}?user=crate&sslmode=sniff' publication p")
+                    wait_for_active_shards(rc)
+                c.execute(f"create subscription s connection 'crate://localhost:{replica_cluster.node().addresses.transport.port}?user=crate&sslmode=sniff' publication rp")
+                wait_for_active_shards(c)


If I remove the calls to wait_for_active_shards and move onto rolling upgrades immediately, I observe unexpected behaviours like UnavailableShardsException or the number of rows replicated do not add up correctly. But to my knowledge, users are recommended to wait for active shards before upgrading, so this is not an issue?

tests/bwc/test_rolling_upgrade.py

jeeminso · 2025-06-05T02:20:51Z

tests/bwc/test_rolling_upgrade.py

+                        c.execute("insert into doc.x values (1)")
+                        rc.execute("insert into doc.rx values (1)")
+
+                        rc.execute("select count(*) from doc.x")


crate.client.exceptions.ProgrammingError: RelationUnknown[Relation 'doc.x' unknown] io.crate.exceptions.RelationUnknown: Relation 'doc.x' unknown at io.crate.exceptions.RelationUnknown.of(RelationUnknown.java:46)

Guessing it means that the DROP stmt succeeded, looking into it.

jeeminso · 2025-06-05T23:11:20Z

The first commit tests LR during rolling upgrade 5.10 > jeeminso/temp and the second commit tests 5.10 > branch:master where the first passes and the second fails indicating that there is a regression caused by crate/crate#17960.

Logs from the first commit: https://jenkins.crate.io/job/CrateDB/job/qa/job/crate_qa_on_pr/675/execution/node/191/log/
Logs for the second commit: https://jenkins.crate.io/job/CrateDB/job/qa/job/crate_qa_on_pr/676/execution/node/198/log/

Hi @seut could you take a look? BTW, this problem is intermittent especially when tried to reproduce manually.

seut · 2025-06-06T12:21:02Z

@jeeminso
Thanks for this info, I'll have a look into this asap.

seut · 2025-06-11T13:07:16Z

I've looked into this and ran the related tests multiple times locally.

its flaky, it succeeds more often than it fails
if it fails, it mostly fails on the replica_cluster-> drop-replicated-table which runs 5.10.x. (as of this time, 5.10.9). During the restart of the publication cluster, due to the rolling upgrade, at one point the replicated table can be dropped even that the logs indicating that the subscription (and thus the tracker) is still running.

I do not understand yet what the real issue is but it feels very timing related. Such it does not break in general, all manual test I did are working as expected. I can also not see why #17960 would cause such issue, I think this was just coincidence and the same flaky failure can may be seen even without this change, also I did not test this.

I'd followup on this at a later point to debug it more deeply, if no one else figured this out until then.

jeeminso · 2025-07-30T20:55:29Z

This needs to wait for 6.0.1 release including crate/crate#18175.

jeeminso · 2025-10-23T20:46:21Z

Tried to merge since we have 6.0 stable release. Resolved the conflicts and seeing the test failing from 6.0 > 6.1.x.

seut · 2025-10-30T12:52:05Z

tests/bwc/test_rolling_upgrade.py

+    rc.execute("create table doc.rx (a int) clustered into 1 shards with (number_of_replicas=0)")
+    rc.execute("create publication rp for table doc.rx")
+
+    rc.execute(f"create subscription rs connection 'crate://localhost:{local_transport_port}?user=crate&sslmode=sniff' publication p")


This is expected to fail as soon as the nodes are upgraded as we do not support subscribing to cluster with a higher major/minor version, see https://cratedb.com/docs/crate/reference/en/latest/admin/logical-replication.html.

Replicating tables created on a cluster with higher major/minor version to a cluster with lower major/minor version is not supported.

Isn't this only for the tables created in different versions? The test case creates tables on the same version and rolling-upgrades the publisher cluster.

Ah you're right, sorry.

seut · 2025-10-30T15:35:58Z

This should work once crate/crate#18636 is merged and backported.

seut · 2025-11-07T08:53:47Z

retest this please

seut · 2025-11-07T16:10:33Z

@jeeminso
The fixes are now also backported to 6.1 so the new tests should pass.
But there are also other failures, like python linter ones. Could you check please and get this PR into a ready state? Thanks!

matriv · 2025-11-07T18:07:27Z

Fixed the lint warnings.

BaurzhanSakhariev · 2025-11-10T09:47:21Z

Checking crate/crate#18676

Retest this please

UPD: cluster state persistence issue is gone but still have LR failures

 File "/var/lib/jenkins/workspace/CrateDB/qa/crate_qa_on_pr/tests/bwc/test_rolling_upgrade.py", line 431, in test_logical_replication_queries

    assert_busy(lambda: self.assertEqual(num_docs_rx(c), count2 + 1))

jeeminso · 2025-11-10T19:56:43Z

Thanks! I ran this manually to get more details of the failure, https://github.com/crate/crate/pull/18636/files#r2511790461.

jeeminso · 2025-11-10T20:43:22Z

retest this please

jeeminso · 2025-11-12T02:27:42Z

retest this please

jeeminso · 2025-11-12T04:32:46Z

This test passes, https://jenkins.crate.io/job/CrateDB/job/qa/job/crate_qa_on_pr/891/execution/node/208/log/.

jeeminso force-pushed the jeeminso/lr branch from 55ebb8b to 9d00462 Compare June 3, 2025 15:49

jeeminso commented Jun 3, 2025

View reviewed changes

tests/bwc/test_rolling_upgrade.py Outdated Show resolved Hide resolved

jeeminso force-pushed the jeeminso/lr branch 3 times, most recently from a47df1d to e4b1b28 Compare June 4, 2025 21:30

This comment was marked as resolved.

Sign in to view

jeeminso force-pushed the jeeminso/lr branch from e4b1b28 to 79a0919 Compare June 5, 2025 01:20

jeeminso commented Jun 5, 2025

View reviewed changes

jeeminso force-pushed the jeeminso/lr branch from 79a0919 to 028be19 Compare June 5, 2025 20:58

jeeminso mentioned this pull request Jun 11, 2025

Test: verify num_docs from sys.shards accounts for docs replicated via logical replication crate/crate#18005

Closed

4 tasks

jeeminso force-pushed the jeeminso/lr branch 2 times, most recently from 5aa84d9 to 180e243 Compare June 11, 2025 22:20

mfussenegger added this to CrateDB 6.0 Jun 17, 2025

mfussenegger moved this to Must in CrateDB 6.0 Jun 17, 2025

jeeminso mentioned this pull request Jul 16, 2025

Fix intermittent subscription loss in logical replication during rolling upgrades crate/crate#18175

Merged

4 tasks

jeeminso force-pushed the jeeminso/lr branch 3 times, most recently from ac98688 to 297945b Compare July 18, 2025 16:00

mergify bot mentioned this pull request Jul 21, 2025

Fix intermittent subscription loss in logical replication during rolling upgrades (backport #18175) crate/crate#18198

Merged

4 tasks

jeeminso force-pushed the jeeminso/lr branch 3 times, most recently from 4dfc293 to 3f6a583 Compare October 23, 2025 20:05

jeeminso marked this pull request as ready for review October 23, 2025 20:05

jeeminso force-pushed the jeeminso/lr branch from 3f6a583 to 78cf5d4 Compare October 23, 2025 20:44

jeeminso force-pushed the jeeminso/lr branch from 78cf5d4 to cc5380a Compare October 26, 2025 00:20

jeeminso mentioned this pull request Oct 29, 2025

Fix logical replication failing to replicate data during rolling upgrades crate/crate#18630

Closed

4 tasks

seut reviewed Oct 30, 2025

View reviewed changes

seut mentioned this pull request Nov 6, 2025

Fix logical replication with clusters running versions < 6.1.0 crate/crate#18636

Merged

matriv force-pushed the jeeminso/lr branch from cc5380a to f9596f4 Compare November 7, 2025 18:06

Improve metadata bwc test for Logical Replication

e1657b7

matriv force-pushed the jeeminso/lr branch from f9596f4 to e1657b7 Compare November 7, 2025 18:08

Changing 6.1.x > 6.1 makes the test pass

14ba2e3

Improve metadata bwc test for Logical Replication #354

Are you sure you want to change the base?

Improve metadata bwc test for Logical Replication #354

Uh oh!

Conversation

jeeminso commented Jun 1, 2025

Summary of the changes / Why this is an improvement

Checklist

Uh oh!

jeeminso Jun 3, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

jeeminso Jun 5, 2025

Choose a reason for hiding this comment

Uh oh!

jeeminso commented Jun 5, 2025

Uh oh!

seut commented Jun 6, 2025

Uh oh!

seut commented Jun 11, 2025

Uh oh!

jeeminso commented Jul 30, 2025

Uh oh!

jeeminso commented Oct 23, 2025

Uh oh!

seut Oct 30, 2025

Choose a reason for hiding this comment

Uh oh!

jeeminso Oct 30, 2025

Choose a reason for hiding this comment

Uh oh!

seut Oct 30, 2025

Choose a reason for hiding this comment

Uh oh!

seut commented Oct 30, 2025

Uh oh!

seut commented Nov 7, 2025

Uh oh!

seut commented Nov 7, 2025

Uh oh!

matriv commented Nov 7, 2025

Uh oh!

BaurzhanSakhariev commented Nov 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeeminso commented Nov 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeeminso commented Nov 10, 2025

Uh oh!

jeeminso commented Nov 12, 2025

Uh oh!

jeeminso commented Nov 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

BaurzhanSakhariev commented Nov 10, 2025 •

edited

Loading

jeeminso commented Nov 10, 2025 •

edited

Loading