-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade dependency on crc32c to non-optional? #2380
Comments
Historically the kafka-python project philosophy has been to be pure-python and not have any required dependencies. I recognize that the python packaging ecosystem has improved substantially in the past several years, and w/ wheels generally available perhaps we should reconsider. But this is a pretty big change from our current philosophy so not something we should take on lightly. Right now crc32c is in the "extra_requires" bucket and would be installed for any user that selects extras. I might be more inclined to create a more aggressive set of 'optimizations' extra_requires or improve the documentation on these packages. |
Thanks @dpkp for your swift answer!
Although I hadn't read this explicitly, I was guessing this was current reasoning, thanks for confirming. Also, to make the point explicit: adding a dependency on a C extension module only breaks the rule of
Agreed. This would be a leap from zero to non-zero dependencies, so I understand it would have to be considered carefully. I wanted to get the ball rolling on having this discussion though, since it would potentially be an easy performance win for many users that might not be aware of the availability of optional requirements.
This would also a be a good first step. I again understand this aligns better with the current package philosophy, and is a less risky move. If, for example, users see clearly in the readme/documentation that you strongly recommend the installation of these extra packages for performance, that'd be a good result too. Finally, and for reference, some numbers: |
I think retaining the current crc32c implementation is wise, but if users install the other crc32c package, it can defer to using that library automatically. Edit: Apparently we already do that according to https://kafka-python.readthedocs.io/en/master/install.html but I need to review the code to make sure that is still true. |
Indeed. The discussion is not about adding the dependency as an optional one (it already is) but they make it non-optional. I understand that this is not trivial though, but OTOH there's potential for performance benefits for the vast majority of users that are not installing this optional dependency. |
quick question regarding this conversation, is there already a resource that shows the difference in performance? |
We have a benchmark against So it's not exactly what you'd like to see to compare These are the results with the cython extension:
And these are with the python implementation:
That's a fairly substantial hit, specially towards bigger data sizes where the difference is ~27x. And that's with all the kafka-related overheads -- if you compared the crc32c implementations on their own the difference would be higher still. |
Forgot to mention: for reference, this is the benchmark we were using: https://gitlab.com/ska-telescope/sdp/ska-sdp-realtime-receive-processors/-/blob/8641b4c518e370a8592c3f2a0c89ab4da2e3ef60/tests/unit/test_pointing_performance.py#L48-70 |
Awesome thank you! |
Hi, thanks for reviving this project, good to see it's getting some renewed maintenance.
Lately our team has been using this package and
aiokafka
, and while doing some profiling I found they both cater for fast crc32c implementations (optional dependency oncrc32c
in this package, Cython extension inaiokafka
), while also providing python fallbacks. Depending on various factors one can end up going through these fallbacks, which hurts performance. We experienced this already, and it would have gone unnoticed unless we weren't doing this profiling.This is more a question rather than an issue (the Discussions section isn't enabled, otherwise I would've posted there): what would be your thoughts on declaring
crc32c
as a non-optional dependency? It's a small package with no dependencies on its own, provides binary wheels via cibuildwheel for all major platforms/archs, so in principle it shouldn't add issues at installation time (and even if compilation is required, only a C compiler is needed). Having it always installed would mean that the chances of getting a slow python-based crc32c impl would be null, and even when no hardware acceleration can be found, the C-based implementation should still beat the python one (I even assume that would be true when running under PyPy, since the hot loop of the checksum doesn't interact with the Python C API).Just looking for quick feedback at the moment. If the idea is well received I'd be happy to put together a pull request for this.
Full disclosure: I actually am the maintainer of
crc32c
too, which is why I took a personal interest on this when I saw it during our profiling. However I didn't know that Kafka used crc32c in general, nor that this package depended optionally oncrc32c
, so seeing this come up was a complete surprise. I also plan to have a similar discussion underaiokafka
.The text was updated successfully, but these errors were encountered: