
cloud_storage: Improve read-replica creation error handling #8893

Open
@Lazin

Description

Version & Environment

Redpanda version: (use rpk version): v22.2.x

What went wrong?

The original cluster is running Redpanda v22.3.x. The read-replica (RR) cluster is running Redpanda v22.2.x. When the RR topic is created in the v22.2.x cluster using the following rpk command: rpk topic create -v xxxxtopicnamexxxx -p 20 -r 3 -c redpanda.remote.readreplica=xxxxbucketnamexxxx, the command succeeds. But the RR topic does not contain any data from the original topic, and the log shows the following errors:

2023-02-14T00:12:40.902233918Z stderr F ERROR 2023-02-14 00:12:40,902 [shard 11] cloud_storage - [fiber68~0~0|1|9907ms] - remote.cc:140 - Unexpected error std::runtime_error (partition_manifest.cc:748 - Failed to parse topic manifest {"c0000000/meta/kafka/xxxxtopicnamexxxx/10_63/manifest.json"}: Terminate parsing due to Handler error. at offset 124)
2023-02-14T00:12:40.902417342Z stderr F WARN  2023-02-14 00:12:40,902 [shard 11] cloud_storage - [fiber68~0~0|1|9907ms] - remote.cc:250 - Failed downloading manifest from {xxxxbucketnamexxxx} {failed}, manifest at {"c0000000/meta/kafka/xxxxtopicnamexxxx/10_63/manifest.json"}
2023-02-14T00:12:40.902480536Z stderr F ERROR 2023-02-14 00:12:40,902 [shard 11] archival - [fiber68 kafka/xxxxtopicnamexxxx/10] - ntp_archiver_service.cc:270 - Failed to download partition manifest in read-replica mode
2023-02-14T00:12:40.90249025Z stderr F ERROR 2023-02-14 00:12:40,902 [shard 11] archival - [fiber68 kafka/xxxxtopicnamexxxx/10] - ntp_archiver_service.cc:253 - Failed to download manifest {"c0000000/meta/kafka/xxxxtopicnamexxxx/10_63/manifest.json"}

Note that the log line says 'topic manifest' but actually refers to the partition manifest; the message should be corrected.

This happened because the RR cluster was able to download the topic manifest and create the read replica, so the rpk command returned successfully. But the actual read-replica partitions were then unable to synchronize their state with the bucket because they could not parse the partition manifest. To avoid this problem we can change the versioning scheme in the topic manifest.

Currently, the topic manifest has only a version field, which describes its own format. But we can treat that field differently and use it as the version of all manifests in the cloud. We can also take the same approach as Serde by adding a compat_version field; our tx-manifests already use this scheme. With this change in place, the problem won't recur: rpk will return an error stating that the topic was created with a newer version of Redpanda and is therefore incompatible.

What should have happened instead?

How to reproduce the issue?

  1. create topic in v22.3.x cluster with tiered storage enabled
  2. create RR topic in v22.2.x
  3. analyze logs

Additional information

JIRA Link: CORE-1166


Labels

  - area/cloud-storage — Shadow indexing subsystem
  - kind/bug — Something isn't working
  - remediation — Incident follow-ups that also show up in Incident.io when this label is used.
  - sev/medium — Bugs that do not meet criteria for high or critical, but are more severe than low.
