Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow rdkafka versions newer than 0.15.0 #388

Merged
merged 4 commits into from
Feb 25, 2025

Conversation

hoangshopify
Copy link
Contributor

@hoangshopify hoangshopify commented Jan 27, 2025

What

This PR completes a housekeeping item:

  • Allow rdkafka versions newer than 0.15.0

@hoangshopify
Copy link
Contributor Author

@Thomascountz @bquorning @razumau i could use 👀 on this please

@mensfeld
Copy link

No breaking changes that would affect racecar in this release (rdkafka maintainer here).

@hoangshopify
Copy link
Contributor Author

No breaking changes that would affect racecar in this release (rdkafka maintainer here).

@mensfeld How might we go about getting this into a release?

@mensfeld
Copy link

@mensfeld How might we go about getting this into a release?

I don't know and really don't care as I maintain aside from rdkafka (the gem you were updating) another kafka framework: https://github.com/karafka/karafka/

I am not a maintainer of racecar and just giving a friendly heads-up that I did not break any APIs in the release you upgraded to

@hoangshopify hoangshopify force-pushed the upgrade-rdkafka-0.18.0 branch from c0ae82f to 9b149ae Compare January 29, 2025 16:35
@hoangshopify hoangshopify changed the title Upgrade rdkafka 0.18.0 Allow rdkafka versions newer than 0.15.0 Jan 29, 2025
@bquorning
Copy link
Member

Some of the tests are failing with unknown keyword: :wait_timeout, probably related to karafka/rdkafka-ruby#484. Could you have a look please @hoangshopify ?

@hoangshopify hoangshopify force-pushed the upgrade-rdkafka-0.18.0 branch from 9b149ae to f73e7d1 Compare January 29, 2025 17:17
@hoangshopify
Copy link
Contributor Author

Some of the tests are failing with unknown keyword: :wait_timeout, probably related to karafka/rdkafka-ruby#484. Could you have a look please @hoangshopify ?

@bquorning I already pushed a fix for this

@hoangshopify
Copy link
Contributor Author

we have a new test failure after latest run, looks like cooperative-sticky assignment during a rebalance allows healthy consumers to keep processing their paritions is having issues. is it related to karafka/rdkafka-ruby#493? @mensfeld @bquorning

@mensfeld
Copy link

@hoangshopify

looks like cooperative-sticky assignment during a rebalance allows healthy consumers to keep processing their paritions is having issues

can you explain it better? This patch was needed to prevent cooperative sticky from hanging all processing endlessly and I had no problems with it. Exactly that code was later on merged to librdkafka.

@mensfeld
Copy link

rdkafka 0.18.0 uses librdkafka 2.5.0.

It was fixed in 2.6.0:

Issues: #4783. A consumer configured with the cooperative-sticky partition assignment strategy could get stuck in an infinite loop, with corresponding spike of main thread CPU usage. That happened with some particular orders of members and potential assignable partitions. Solved by removing the infinite loop cause. Happening since: 1.6.0 (#4800).

Aside from that there were few other fixes. If you show me the repro I can relate to that.

@@ -77,7 +77,7 @@ def deliver!
# The raising can be avoided if max_wait_timeout below is greater than
# config.message_timeout, but config is not available here (without
# changing the interface).
handle.wait(max_wait_timeout: 60, wait_timeout: 0.1)
handle.wait(max_wait_timeout: 60)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment (line 71) is wrong now.

Also, there is another use of wait_time in spec/runner_spec.rb

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted wait_time in spec/runner_spec.rb and also updated the comment in customer.rb. Let me know what the comment should be if the updated one doesn't make sense

@hoangshopify
Copy link
Contributor Author

@hoangshopify

looks like cooperative-sticky assignment during a rebalance allows healthy consumers to keep processing their paritions is having issues

can you explain it better? This patch was needed to prevent cooperative sticky from hanging all processing endlessly and I had no problems with it. Exactly that code was later on merged to librdkafka.

The integration test doesn't work here: cooperative_sticky_assignment_spec.rb and cooperative_sticky_assignment_spec.rb

@mensfeld
Copy link

@hoangshopify it seems to fail because of some segfaults. It seems like mishandling of rdkafka in racecar. Beyond scope of my support sorry 🙏

@razumau
Copy link
Contributor

razumau commented Feb 12, 2025

These segmentation faults turned out to be a red herring: they happen when we shut down consumers and don’t affect test results. This is something we should investigate separately.

The real reason why integration tests are failing is that on newer versions of rdkafka we process messages too fast. Wait what?

The failing spec is this sticky assignment test:

it "allows healthy consumers to keep processing their paritions" do

In this test, we create two consumers, generate 20 messages, wait for a bit, then shut down one of the consumers and check that there are revocation events for this consumer and not for the other one.

That “wait a bit” actually means “start yet another consumer and go through five messages”. The reason why this spec fails is that by the time this bonus consumer goes through its five messages, the main consumers complete all 20 messages, so when we shut down one of them, there are no revocations.

I am a bit confused by why this happens. I bisected rdkafka changes and found a commit after which the test starts to fail. But this was a change in producer setup, and the time it takes to publish these message hasn’t changed (according to my measurements). The time it takes for producers to finish their 20 messages has decreased, and that’s why the test fails.

Regardless of more specific reasons, I think that changing the overall number of messages we send in this spec is a good enough fix. @hoangshopify you can cherry-pick this commit or make the same change yourself.

@hoangshopify hoangshopify force-pushed the upgrade-rdkafka-0.18.0 branch from 60ded23 to 89b48aa Compare February 18, 2025 17:05
@hoangshopify
Copy link
Contributor Author

hoangshopify commented Feb 18, 2025

@razumau I really appreciate your investigation and detailed writeup, also thanks @razumau @bquorning so much to take that long to look at and support this PR!

@mensfeld Thank you for your input for rdkafka

@razumau razumau merged commit 238d634 into zendesk:master Feb 25, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants