Reads and writes do not recover when a write/read/lookup request is sent to the coordinator server. #2604

@loserwang1024

Description

Search before asking

  • I searched in the issues and found nothing similar.

Fluss version

0.8.0 (latest release)

Please describe the bug 🐞

Currently, if we upgrade (or restart) multiple servers at the same time, their pods may swap IP addresses. As a result, write/read requests can be sent to the wrong server.

If only tablet servers are restarted, this is generally acceptable. When a tablet server receives a misdirected write/read request and cannot find the leader, it returns an InvalidMetadataException, and the client then refreshes its metadata.

However, if both the coordinator and tablet servers are restarted, a request can reach the coordinator server, and the client throws an UnsupportedVersionException. This is not an InvalidMetadataException, so the client does not refresh its metadata. It is also unlike a NetworkException, which can be recovered from by retrying. The job therefore gets stuck and keeps failing indefinitely.

2026-02-04 20:12:28,595 ERROR org.apache.fluss.client.table.scanner.log.LogFetcher         [] - Failed to fetch log from node 49
org.apache.fluss.exception.UnsupportedVersionException: The server does not support FETCH_LOG(1015)
The server does not support LOOKUP(1017);
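
To make the difference between the two cases concrete, here is a minimal, self-contained sketch of the error handling described above. It does not use the real Fluss client code: the three exception classes are hypothetical stand-ins for the types named in this issue, and `Action`, `currentAction` and `proposedAction` are names invented for illustration. `proposedAction` shows only one possible direction (treating the error as a hint that metadata is stale), which is an assumption and not a confirmed fix.

```java
public class StaleMetadataSketch {

    // Hypothetical stand-ins for the exception types named in this issue.
    // They are NOT the real org.apache.fluss.exception classes.
    static class InvalidMetadataException extends RuntimeException {
        InvalidMetadataException(String msg) { super(msg); }
    }

    static class NetworkException extends RuntimeException {
        NetworkException(String msg) { super(msg); }
    }

    static class UnsupportedVersionException extends RuntimeException {
        UnsupportedVersionException(String msg) { super(msg); }
    }

    // What the client does after a failed request (illustrative only).
    enum Action { REFRESH_METADATA_AND_RETRY, RETRY, FAIL }

    // Behaviour as described in this issue: only InvalidMetadataException
    // triggers a metadata refresh, NetworkException is retried, and anything
    // else (including UnsupportedVersionException) is not recovered.
    static Action currentAction(RuntimeException e) {
        if (e instanceof InvalidMetadataException) {
            return Action.REFRESH_METADATA_AND_RETRY;
        }
        if (e instanceof NetworkException) {
            return Action.RETRY;
        }
        return Action.FAIL;
    }

    // Assumption: one possible remedy is to also refresh metadata when the
    // server rejects a data request as unsupported, since the address may
    // now belong to a different kind of server.
    static Action proposedAction(RuntimeException e) {
        if (e instanceof UnsupportedVersionException) {
            return Action.REFRESH_METADATA_AND_RETRY;
        }
        return currentAction(e);
    }

    public static void main(String[] args) {
        RuntimeException e = new UnsupportedVersionException(
                "The server does not support FETCH_LOG(1015)");
        System.out.println("current:  " + currentAction(e));
        System.out.println("proposed: " + proposedAction(e));
    }
}
```

In this sketch the two methods differ only for UnsupportedVersionException, which is the case where the client never leaves the stale address.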

Solution

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
