
Only fetch for partitions with initialized offsets #582

Merged
2 commits merged on Dec 9, 2019

Conversation

Nevon (Collaborator) commented Dec 6, 2019

Second attempt at solving the issue with concurrent recovery from "offset out of range" errors, this time without locks.

Fixes #555

Closes #578

tulios (Owner) commented Dec 6, 2019

@JaapRood this is a new approach to the problem described in the Lock PR.

Nevon (Collaborator, Author) commented Dec 6, 2019

To reproduce the issue, I ran the branch replace-fetch-promise-all-with-generator. First I created a topic with the following configuration:

{
  topic,
  numPartitions: 3,
  configEntries: [
    { name: 'delete.retention.ms', value: '100' },
    { name: 'file.delete.delay.ms', value: '200' },
    { name: 'min.cleanable.dirty.ratio', value: '0.01' },
    { name: 'retention.ms', value: '1' },
    { name: 'segment.bytes', value: '50000' },
    { name: 'max.compaction.lag.ms', value: '200' },
  ],
}

I produced a bunch of messages to that topic, and then I started a consumer and consumed a bit. Then I killed my consumer and kept producing until the broker did a cleanup of the old segments (this takes minutes, by the way...). When I started the consumer again, it would break, because the old committed offset was now invalid, as described in #578.
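For reference, the topic-creation step can be scripted with the admin client. This is only a rough sketch of that step; the broker address, clientId and topic name are placeholders:

```js
const { Kafka } = require('kafkajs')

// Placeholders: broker address, clientId and topic name are made up for this sketch
const kafka = new Kafka({ clientId: 'offset-repro', brokers: ['localhost:9092'] })
const admin = kafka.admin()

const createReproTopic = async () => {
  await admin.connect()
  await admin.createTopics({
    topics: [
      {
        topic: 'offset-out-of-range-repro',
        numPartitions: 3,
        configEntries: [
          { name: 'delete.retention.ms', value: '100' },
          { name: 'file.delete.delay.ms', value: '200' },
          { name: 'min.cleanable.dirty.ratio', value: '0.01' },
          { name: 'retention.ms', value: '1' },
          { name: 'segment.bytes', value: '50000' },
          { name: 'max.compaction.lag.ms', value: '200' },
        ],
      },
    ],
  })
  await admin.disconnect()
}

createReproTopic().catch(console.error)
```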

With this fix, what happened instead was that we detected that some partitions didn't have a valid offset and skipped fetching for them. Shortly afterwards, each partition had recovered and fetches continued as usual.
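The gist of the idea, as a simplified sketch (not the actual diff; the committedOffsets shape and the '-1' sentinel are assumptions made for illustration):

```js
// Sketch only: assumes committedOffsets is keyed by topic and then partition,
// and that a partition without a valid offset shows up as '-1' (Kafka's
// convention for "no committed offset").
const partitionsWithInitializedOffsets = (topic, partitions, committedOffsets) =>
  partitions.filter(partition => {
    const offset = (committedOffsets[topic] || {})[partition]
    return offset != null && offset !== '-1'
  })

// Partition 1 has no valid offset yet, so it's skipped for this fetch and
// picked up again on a later iteration, once its offset has been recovered.
const committedOffsets = { 'test-topic': { 0: '42', 1: '-1', 2: '7' } }
partitionsWithInitializedOffsets('test-topic', [0, 1, 2], committedOffsets) // => [0, 2]
```

Because skipped partitions are simply retried on the next pass of the fetch loop, there's no need for a lock around the recovery.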

@@ -368,13 +368,34 @@ module.exports = class ConsumerGroup {
)

const leaders = keys(partitionsPerLeader)
const committedOffsets = this.offsetManager.committedOffsets()
Collaborator:

Is there a reason why we're using committedOffsets here, instead of just resolved? An invalid offset could come from attempting to resume from a committed offset, but also from a consumer.seek. Your great comment touches on how both are cleared, so I'm not sure whether there would be any actual difference in behaviour, but as future changes are made, this subtle difference might become harder to spot while also becoming more consequential.

Collaborator Author:

Initially I was actually using resolvedOffsets, but that didn't work because OffsetManager.resolveOffsets actually just sets the initialized consumer offsets in committedOffsets, not in resolvedOffsets (did someone mention that our naming is confusing...? 😅). When the consumer first boots, it doesn't actually have any resolved offsets, so the only source of offsets is the initialized offsets in committedOffsets.

Regarding the seek behavior, I would expect it to work the same way, no? Seek would commit (potentially invalid) offsets and then clear both committedOffsets and resolvedOffsets using OffsetManager.clearOffsets. In the fetch loop we'd get the consumer offsets from the brokers and from there on it's the same.
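In code terms, the flow I have in mind is roughly the following. This is a purely illustrative stub, not the real OffsetManager; only clearOffsets and the two offset maps come from the discussion above, while the commitOffsets name and the bodies are made up:

```js
// Illustrative stub only; commitOffsets is a made-up name and the bodies
// don't reflect the real implementation.
const offsetManager = {
  committedOffsets: {},
  resolvedOffsets: {},
  async commitOffsets(offsetsByTopicPartition) {
    // the real thing would talk to the group coordinator here
    this.committedOffsets = { ...this.committedOffsets, ...offsetsByTopicPartition }
  },
  clearOffsets() {
    this.committedOffsets = {}
    this.resolvedOffsets = {}
  },
}

const seek = async ({ topic, partition, offset }) => {
  // 1. commit the (potentially invalid) target offset
  await offsetManager.commitOffsets({ [topic]: { [partition]: offset } })
  // 2. clear both committedOffsets and resolvedOffsets
  offsetManager.clearOffsets()
  // 3. on the next fetch loop iteration, the consumer offsets are fetched
  //    from the brokers again, so from there on it's the same as on startup
}
```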

Maybe I'm missing something?

Collaborator:

> Regarding the seek behavior, I would expect it to work the same way, no? Seek would commit (potentially invalid) offsets

Seeking shouldn't commit, only move the "playhead" (see #395), so relying on that behaviour is probably not what we want.

> Initially I was actually using resolvedOffsets, but that didn't work because OffsetManager.resolveOffsets actually just sets the initialized consumer offsets in committedOffsets, not in resolvedOffsets (did someone mention that our naming is confusing...? 😅).

I guess having to use committedOffsets is a symptom of there being an underlying issue, then. Conceptually, it's the resolvedOffsets (which I understand to be the "next offset to consume", or "playhead", for reading the log) that should always exist, and the committed offset that is optional, as using Kafka for committing offsets is / should be totally optional (see #395).
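To illustrate the conceptual model I mean (shapes here are assumed purely for illustration):

```js
// Illustration only: per assigned partition, the "playhead" should always
// exist, while a committed offset is optional (only there when Kafka is
// used for offset storage).
const resolvedOffsets = { 'test-topic': { 0: '43', 1: '101', 2: '8' } } // always present
const committedOffsets = { 'test-topic': { 0: '42', 2: '7' } } // optional; partition 1 never committed
```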

Since that seems like a different issue, maybe it's an idea to create a separate issue for it and tag it in a comment. Expecting our future selves (or others) to spot, outside the context of these changes, that we conceptually want the resolved offsets rather than the committed ones here might be a lot to ask 😅.

Nevon (Collaborator, Author), Dec 6, 2019:

> Since that seems like a different issue, maybe it's an idea to create a separate issue for it and tag it in a comment.

That sounds like a good idea. I would prefer to do that kind of holistic refactoring in a PR that doesn't actually change any behavior, rather than squeezing it into a bugfix. Could you create that issue?

Collaborator:

Created the issue, trying to preserve the context of this conversation properly: #585. To help with the audit suggested there, I'd suggest linking that issue in a comment above the line where committedOffsets() is called.
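Something along these lines, for example (re-using the line from the diff above; the comment wording is just a suggestion):

```js
// Conceptually we want the resolved offsets here rather than the committed
// ones; see #585 for the discussion and the planned OffsetManager audit.
const committedOffsets = this.offsetManager.committedOffsets()
```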

JaapRood (Collaborator) commented Dec 6, 2019

Pragmatic fix, no locks, 🙌.

tulios (Owner) commented Dec 9, 2019

The pre-release version 1.12.0-beta.11 was published with this fix.
