refactor(inkless): add artificial latency to metadata update #337

jeqo · 2025-06-24T08:39:10Z

When metadata is refreshed, new requests to the next broker may be processed before the last request on the previous broker has completed, causing OutOfOrderSequence errors.
These errors are retried and the data eventually lands in order. However, to avoid the error altogether, this proposal adds artificial latency to the metadata response when it includes inkless topics. This gives the previous broker enough time to finish processing its requests before the new one takes over.

I don't see any major reason why adding latency to this call would be detrimental. Metadata refresh happens between requests, and it should happen once every 1-5 minutes, depending on metadata.max.age.ms.
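
To make the race concrete, here is a toy model of sequence validation. It is only a sketch: `SequenceRaceSketch` and `tryAppend` are hypothetical names, the shared per-producer map is an assumption of this example, and real Kafka sequences are tracked per batch with more state than a single integer.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of per-producer sequence validation; not the Inkless or Kafka implementation. */
public class SequenceRaceSketch {
    /** Last committed sequence per producer id, visible to every broker (assumption of this sketch). */
    private final Map<Long, Integer> lastCommitted = new HashMap<>();

    /** Returns true if the batch is accepted, false if it would be rejected as out of order. */
    public boolean tryAppend(long producerId, int sequence) {
        int expected = lastCommitted.getOrDefault(producerId, -1) + 1;
        if (sequence != expected) {
            return false; // stands in for OutOfOrderSequence; the producer would retry
        }
        lastCommitted.put(producerId, sequence);
        return true;
    }

    public static void main(String[] args) {
        SequenceRaceSketch log = new SequenceRaceSketch();
        long pid = 1L;
        log.tryAppend(pid, 0);                    // previous broker commits sequence 0
        // Sequence 1 is still in flight on the previous broker when the
        // next broker, picked after the metadata refresh, handles sequence 2.
        boolean early = log.tryAppend(pid, 2);    // rejected: expected 1
        boolean inFlight = log.tryAppend(pid, 1); // previous broker finishes
        boolean retried = log.tryAppend(pid, 2);  // the retry is accepted
        System.out.println(early + " " + inFlight + " " + retried); // prints: false true true
    }
}
```

The rejected append is the error described above; the final call is the retry that lets the data land in order.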

Copilot

Pull Request Overview

This PR refactors the Inkless topic metadata update logic to avoid OutOfOrderSequence errors by introducing an artificial latency delay when processing metadata for inkless topics. Key changes include updating the InklessTopicMetadataTransformer constructor to require a Time instance, updating tests accordingly, and modifying the KafkaApis reference to include the new parameter.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| storage/inkless/src/test/java/io/aiven/inkless/metadata/InklessTopicMetadataTransformerTest.java | Updated transformer instantiation and added null tests for the new Time parameter |
| storage/inkless/src/main/java/io/aiven/inkless/metadata/InklessTopicMetadataTransformer.java | Introduced a Time dependency and added a sleep call to add latency for inkless topics |
| core/src/main/scala/kafka/server/KafkaApis.scala | Updated transformer instantiation to pass the Time instance |

ivanyu · 2025-06-24T10:06:51Z

storage/inkless/src/main/java/io/aiven/inkless/metadata/InklessTopicMetadataTransformer.java

+            // Introduce artificial latency to avoid a race condition between the metadata update and the producer
+            // causing OutOfOrderSequenceException.
+            time.sleep(500);


I'm not sure it's a good idea, for several reasons:

1. I don't see how waiting for a fixed delay of 500 ms would solve the problem. As produce and metadata updates happen asynchronously, they may interleave in arbitrary order with or without the sleep.

2. Blocking the handler thread isn't great.

3. I think the problem is theoretically possible on classic Kafka, too; it's just unlikely because of greater leadership stability plus the NOT_LEADER_OR_FOLLOWER error. How bad would it be if we just do nothing and rely on the existing recovery mechanism?
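
As an illustration of the handler-thread concern only, a delay can be scheduled without parking the calling thread. This is a minimal sketch using the JDK's `CompletableFuture.delayedExecutor` (Java 9+); `DelayedResponseSketch` and `respondAfter` are hypothetical names, nothing here is taken from the Inkless or KafkaApis code, and it does not address the ordering concern raised in the first point.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Sketch of a non-blocking delay; hypothetical helper, not part of this PR. */
public class DelayedResponseSketch {
    /** Completes with the supplied response after delayMs without sleeping on the caller's thread. */
    public static <T> CompletableFuture<T> respondAfter(long delayMs, Supplier<T> response) {
        // The delayed executor hands the task to the common pool once the delay has elapsed.
        Executor delayed = CompletableFuture.delayedExecutor(delayMs, TimeUnit.MILLISECONDS);
        return CompletableFuture.supplyAsync(response, delayed);
    }

    public static void main(String[] args) {
        CompletableFuture<String> future = respondAfter(500, () -> "metadata-response");
        // The calling thread is free to do other work here.
        System.out.println("scheduled, caller not blocked");
        // join() blocks only this demo's main thread so that the program waits for the result.
        System.out.println(future.join());
    }
}
```

Whether the request handling path could complete a metadata response asynchronously like this is not something this thread establishes.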

jeqo · 2025-06-24

> As produce and metadata updates happen asynchronously, they may interleave in arbitrary order with or without the sleep.

From what I found, metadata updates are not async (e.g. https://github.com/apache/kafka/blob/3c1f965c60789dcc8ee14ebabcbb4e16ebffc5ee/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L642), hence 500 ms (chosen arbitrarily, but aiming to cover a rotation on the broker side, default 250 ms) would give room to avoid the race condition. That said, I agree that another client implementation could handle this update asynchronously, in which case this solution may not be enough.

> I think the problem is theoretically possible on classic Kafka, too

I don't think it is with partition leaders. The producer state is cached on the leader, and if leadership changes, the new leader starts accepting requests only after updating the state.

> How bad would it be if we just do nothing and rely on the existing recovery mechanism?

OutOfOrderSequence errors are retried: https://github.com/apache/kafka/blob/6f783f85362071f82da3dcef706c7e6b89b86c2a/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L829-L834
We could just document that this error is expected and retried, and that users can safely ignore it in their logs.
Another side effect we have observed is that retries lead to latency spikes (the retry is handled within the producer machinery, so the request simply takes longer end to end), though again this could be documented.
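
For reference, these are the producer settings that govern the retry and metadata refresh behavior discussed here. It is a sketch with plain `java.util.Properties`: the keys are standard Kafka producer configuration names, the values are what I understand to be the defaults in recent client versions (treat them as assumptions), and the bootstrap address and class name are hypothetical.

```java
import java.util.Properties;

/** Producer settings relevant to retries and metadata refresh; values are assumed defaults. */
public class ProducerRetryConfigSketch {
    public static Properties build() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // hypothetical address
        // Idempotence is what makes the broker validate sequence numbers in the first place.
        props.setProperty("enable.idempotence", "true");
        // Retries are effectively unbounded; delivery.timeout.ms is the real upper limit.
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        props.setProperty("delivery.timeout.ms", "120000");
        // Upper bound on the time between metadata refreshes (5 minutes).
        props.setProperty("metadata.max.age.ms", "300000");
        return props;
    }

    public static void main(String[] args) {
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

With these values, a retried batch still succeeds as long as the retry completes within `delivery.timeout.ms`, which is why the error shows up as added latency rather than data loss.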

github-actions · 2025-09-25T03:39:21Z

This PR is being marked as stale since it has not had any activity in 90 days. If you
would like to keep this PR alive, please leave a comment asking for a review. If the PR has
merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

jeqo requested a review from Copilot June 24, 2025 08:39

Copilot AI reviewed Jun 24, 2025

jeqo force-pushed the jeqo/add-latency-metadata-update branch from 4749007 to 9432574 Compare June 24, 2025 08:48

jeqo marked this pull request as ready for review June 24, 2025 09:04

jeqo requested a review from ivanyu June 24, 2025 09:04

ivanyu reviewed Jun 24, 2025

jeqo marked this pull request as draft June 26, 2025 15:00

github-actions bot added the stale label Sep 25, 2025
