cloud_storage: Improve read-replica creation error handling #8893
Description
Version & Environment
Redpanda version: (use rpk version): v22.2.x
What went wrong?
The original cluster is running Redpanda v22.3.x. The read-replica (RR) cluster is running Redpanda v22.2.x. When the RR topic is created in the v22.2.x cluster using the following rpk command, the command succeeds:
rpk topic create -v xxxxtopicnamexxxx -p 20 -r 3 -c redpanda.remote.readreplica=xxxxbucketnamexxxx
However, the RR topic does not contain any data from the original topic. The log contains the following errors:
2023-02-14T00:12:40.902233918Z stderr F ERROR 2023-02-14 00:12:40,902 [shard 11] cloud_storage - [fiber68~0~0|1|9907ms] - remote.cc:140 - Unexpected error std::runtime_error (partition_manifest.cc:748 - Failed to parse topic manifest {"c0000000/meta/kafka/xxxxtopicnamexxxx/10_63/manifest.json"}: Terminate parsing due to Handler error. at offset 124)
2023-02-14T00:12:40.902417342Z stderr F WARN 2023-02-14 00:12:40,902 [shard 11] cloud_storage - [fiber68~0~0|1|9907ms] - remote.cc:250 - Failed downloading manifest from {xxxxbucketnamexxxx} {failed}, manifest at {"c0000000/meta/kafka/xxxxtopicnamexxxx/10_63/manifest.json"}
2023-02-14T00:12:40.902480536Z stderr F ERROR 2023-02-14 00:12:40,902 [shard 11] archival - [fiber68 kafka/xxxxtopicnamexxxx/10] - ntp_archiver_service.cc:270 - Failed to download partition manifest in read-replica mode
2023-02-14T00:12:40.90249025Z stderr F ERROR 2023-02-14 00:12:40,902 [shard 11] archival - [fiber68 kafka/xxxxtopicnamexxxx/10] - ntp_archiver_service.cc:253 - Failed to download manifest {"c0000000/meta/kafka/xxxxtopicnamexxxx/10_63/manifest.json"}
The first log line says 'topic manifest', but the file that failed to parse is a partition manifest. The message should be corrected.
This happened because the RR cluster was able to download the topic manifest and create the RR topic, so the rpk command returned successfully. The read-replica partitions were then unable to synchronize their state with the bucket because they could not parse the partition manifests. To avoid this problem, we can change how versioning works in the topic manifest.
Currently, the topic manifest has only a version field, which describes its own format. We could instead treat it as the version of all manifests in the cloud. We could also take the same approach as Serde and add a compat_version field to it. Our tx-manifests already use this approach. With this change in place, the problem would not recur: rpk would return an error saying that the topic was created with a newer version of Redpanda and is therefore incompatible.
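To make the proposal concrete, here is a minimal Python sketch of a Serde-style compatibility check applied to a topic manifest at RR topic creation time. It is an illustration only, not Redpanda's implementation. The version and compat_version field names come from the text above. Everything else is an assumption: the JSON layout, the SUPPORTED_MANIFEST_VERSION constant, and the names read_versions, check_readable and IncompatibleManifestError.

```python
import json
from dataclasses import dataclass

# Hypothetical: the highest manifest version this (older) reader understands.
SUPPORTED_MANIFEST_VERSION = 1


class IncompatibleManifestError(Exception):
    """Raised when a manifest needs a newer reader than the one running."""


@dataclass
class ManifestVersions:
    version: int         # version the writer produced
    compat_version: int  # oldest reader version that can still parse it


def read_versions(raw: str) -> ManifestVersions:
    """Extract only the version fields, ignoring the rest of the manifest."""
    doc = json.loads(raw)
    version = int(doc["version"])
    # Assumption: a manifest without compat_version is readable only by
    # readers that understand its own version.
    compat_version = int(doc.get("compat_version", version))
    return ManifestVersions(version, compat_version)


def check_readable(raw: str, supported: int = SUPPORTED_MANIFEST_VERSION) -> ManifestVersions:
    """Fail early, at topic creation, instead of later in each partition."""
    versions = read_versions(raw)
    if versions.compat_version > supported:
        raise IncompatibleManifestError(
            f"topic was created by a newer version (manifest version "
            f"{versions.version}, compat_version {versions.compat_version}); "
            f"this cluster supports manifest versions up to {supported}"
        )
    return versions
```

Under these assumptions, a manifest carrying version 2 and compat_version 1 is still accepted by a reader that supports version 1, while a manifest with compat_version 2 is rejected up front, so the create request can report the incompatibility instead of leaving an empty RR topic.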
What should have happened instead?
How to reproduce the issue?
- Create a topic in a v22.3.x cluster with tiered storage enabled
- Create an RR topic for it in a v22.2.x cluster
- Analyze the logs of the v22.2.x cluster
Additional information
JIRA Link: CORE-1166