Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix consumer poll stuck error when no available partition #1375

Merged
merged 1 commit into from
Feb 9, 2018

Conversation

ckyoog
Copy link
Contributor

@ckyoog ckyoog commented Feb 9, 2018

The consumer poll is stuck due to long timeout when there is no more partition for it.

Actually, in function commit_offsets_async(), passing 0 as timeout into client.poll() is the very action the official Java source code does.

The consumer poll is stuck due to long timeout when there is no more partition for it.

Actually, in function commit_offsets_async(), passing 0 as timeout into client.poll() is the very action the official Java source code does.
@ckyoog
Copy link
Contributor Author

ckyoog commented Feb 9, 2018

The bug can be reproduced in this way:

  1. create topic 'test' with only one partition
  2. start a consumer with an explicit 'group_id' to consume topic 'test'
  3. start another consumer with same configs to subscribe the same topic

The second consumer will be stuck (sleep almost 300 seconds) at the place where my change is, even the first consumer is kicked out due to expiration (exceeding the max_poll_interval_ms), or even the first consumer exits (close).

The matching place of Java code is
in function commitOffsetsAsync()
in class ConsumerCoordinator
in file kafka-1.0.0-src/clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinator.java

In Java code, commitOffsetsAsync() calls pollNoWakeup(), which passes 0 as timeout to the subsequent calls, up to client.poll(), in which the client is class NetworkClient.

So in this context, the call to self._client.poll() in python code is just equivalent to the call to client.poll() in Java code, but the java one gets 0 as timeout whereas the python one doesn't.

@dpkp dpkp merged commit 990e928 into dpkp:master Feb 9, 2018
@dpkp
Copy link
Owner

dpkp commented Feb 9, 2018

Thanks!

@ckyoog ckyoog deleted the poll_stuck_error branch February 10, 2018 01:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants