refactor(inkless): add artificial latency to metadata update #337
base: main
Conversation
Pull Request Overview
This PR refactors the Inkless topic metadata update logic to avoid OutOfOrderSequence errors by introducing an artificial latency delay when processing metadata for inkless topics. Key changes include updating the InklessTopicMetadataTransformer constructor to require a Time instance, updating tests accordingly, and modifying the KafkaApis reference to include the new parameter.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `storage/inkless/src/test/java/io/aiven/inkless/metadata/InklessTopicMetadataTransformerTest.java` | Updated transformer instantiation and added null tests for the new `Time` parameter |
| `storage/inkless/src/main/java/io/aiven/inkless/metadata/InklessTopicMetadataTransformer.java` | Introduced a `Time` dependency and added a sleep call to add latency for inkless topics |
| `core/src/main/scala/kafka/server/KafkaApis.scala` | Updated transformer instantiation to pass the `Time` instance |
When refreshing metadata, new requests to the next broker may be processed _before_ the last request on the previous broker completes, causing OutOfOrderSequence errors. These errors are retried and the data eventually lands in order. However, to avoid the error in the first place, this proposal adds artificial latency to the metadata response when it includes inkless topics, giving the previous broker enough time to finish processing in-flight requests before the new one takes over.
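The shape of the approach can be sketched as follows. This is an illustrative stand-in, not the actual `InklessTopicMetadataTransformer` API: the `Time` interface here is a minimal stand-in for Kafka's `org.apache.kafka.common.utils.Time`, and the topic handling is simplified to plain strings. The key point is that the `Time` dependency is injected, so tests can use a mock clock instead of really sleeping.

```java
import java.util.List;

public class MetadataDelaySketch {
    // Minimal stand-in for Kafka's org.apache.kafka.common.utils.Time,
    // injected so tests can substitute a mock clock.
    interface Time {
        void sleep(long ms);
    }

    // Illustrative value; the PR uses 500 ms.
    static final long ARTIFICIAL_LATENCY_MS = 500;

    // Hypothetical transform: delay only responses that include at least
    // one inkless topic, then return the metadata unchanged.
    static List<String> transform(List<String> topics, List<String> inklessTopics, Time time) {
        boolean hasInkless = topics.stream().anyMatch(inklessTopics::contains);
        if (hasInkless) {
            time.sleep(ARTIFICIAL_LATENCY_MS);
        }
        return topics;
    }

    public static void main(String[] args) {
        // Mock Time that records the total requested sleep instead of blocking.
        long[] slept = {0};
        Time mockTime = ms -> slept[0] += ms;
        transform(List.of("inkless-a", "classic-b"), List.of("inkless-a"), mockTime);
        System.out.println("slept=" + slept[0]);
    }
}
```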
```java
// Introduce artificial latency to avoid a race condition between the metadata update and the producer
// causing OutOfOrderSequenceException.
time.sleep(500);
```
I'm not sure it's a good idea, for several reasons:
- I don't see how waiting for a fixed delay of 500 ms would solve the problem... As produce and metadata updates happen asynchronously, they may interleave in arbitrary order with or without the sleep.
- Blocking the handler thread isn't great.

I think the problem is theoretically possible on classic Kafka, too; it's just unlikely because of greater leadership stability plus the NOT_LEADER_OR_FOLLOWER error. How bad would it be if we just did nothing and relied on the existing recovery mechanism?
> As produce and metadata updates happen asynchronously, they may interleave in arbitrary order with or without the sleep.
From what I found, metadata updates are not async (e.g. https://github.com/apache/kafka/blob/3c1f965c60789dcc8ee14ebabcbb4e16ebffc5ee/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L642), hence 500 ms (chosen arbitrarily, but aiming to cope with a rotation on the broker side, whose default is 250 ms) would give enough room to avoid the race condition. I'd agree, though, that another implementation could handle this update asynchronously, in which case this solution may not be enough.
> I think the problem is theoretically possible on classic Kafka, too
I don't think it is with partition leaders. The producer state is cached on the leader and if leadership changes, it starts accepting requests after updating the state.
> How bad would it be if we just did nothing and relied on the existing recovery mechanism?
OutOfOrderSequence errors are retried: https://github.com/apache/kafka/blob/6f783f85362071f82da3dcef706c7e6b89b86c2a/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L829-L834
We could just document that this error is expected and retried, and that users can safely ignore it in their logs.
Another side effect we have observed is that retries lead to latency spikes (the retry is handled within the producer machinery, so from the caller's point of view the request just takes longer) -- though again, this could be documented.
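For context, the retry behavior discussed above is governed by a handful of well-known producer configuration keys. The sketch below only builds a `java.util.Properties` with those keys (the values are illustrative defaults, not a recommendation from this PR), to show which knobs control how long a retried OutOfOrderSequence can stretch a send:

```java
import java.util.Properties;

public class ProducerRetrySketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Idempotence assigns sequence numbers, which is what makes
        // OutOfOrderSequenceException possible in the first place.
        props.setProperty("enable.idempotence", "true");
        // Retriable errors, including out-of-order sequence, are retried
        // internally; by default retries is effectively unbounded.
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        // Upper bound on total send + retry time: retries surface as
        // latency spikes rather than failures until this expires.
        props.setProperty("delivery.timeout.ms", "120000");
        System.out.println("retries=" + props.getProperty("retries"));
    }
}
```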
This PR is being marked as stale since it has not had any activity in 90 days. If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact). If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.
> When refreshing metadata, new requests to the next broker may be processed before the last request on the previous broker completes, causing OutOfOrderSequence. These errors are retried and the data eventually lands in order. However, to avoid this error, this proposal adds artificial latency to the metadata response when it includes inkless topics, giving the previous broker enough time to finish processing requests before the new one takes over.
I don't see any major reason why adding latency to this call would be detrimental. Metadata refresh happens in between requests, and it should only happen once every 1-5 minutes, depending on the `metadata.max.age.ms` configuration.