Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite gzip._GzipReader in C for more performance and less overhead #110283

Closed
rhpvorderman opened this issue Oct 3, 2023 · 3 comments
Closed
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@rhpvorderman
Copy link
Contributor

rhpvorderman commented Oct 3, 2023

Feature or enhancement

Proposal:

gzip._GzipReader is used as a raw IO for a io.BufferedReader in GzipFile. By rewriting it in C the following things can be achieved:

  • Much less pure python overhead and much better performance. Currently it is approximately 10% slower than pigz single-thread for reading. Figures from 1-3% are attainable.
  • The C GzipReader can be used as backend for gzip.decompress. Currently decompress has a different implementation to avoid initializing a huge pure python class with inheritance, which makes the overhead for simply decompressing a block quite great. By using a C gzipreader the code can be reused without this overhead.

Problems with this approach:

  • RawIOBase cannot be inherited as easily, so methods must be implemented manually or skipped.

The performance gains can be quite substantial: see this PR for python-isal: pycompression/python-isal#151 . EDIT: I also made the _GzipReader accept buffer protocol objects and use the internal buffer as read buffer: pycompression/python-isal#152 . As a result the overhead of reading a bytes-like object is now minimal, requiring now io.BytesIO object. This has massively reduced overhead.

The code from python-isal can be copied almost verbatim into CPython is they share the same license. (I gave python-isal the same license to make code exchange easy.) The code can be battle-tested a bit in python-isal before it is released into cpython.

It will also make it easier to implement the gzip header reading implemented in zlib as mentioned in #103477 .
Code for checking the header CRC is already implemented, this would make Python's gzip reader more spec compliant. #89672

Ping @gpshead, is this something that could be integrated into CPython, or would this create too many backwards-compatibility headaches?

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

@rhpvorderman rhpvorderman added the type-feature A feature request or enhancement label Oct 3, 2023
@iritkatriel iritkatriel added the stdlib Python modules in the Lib dir label Nov 27, 2023
@rhpvorderman
Copy link
Contributor Author

I also implemented this on python-zlib-ng. https://github.com/pycompression/python-zlib-ng/pull/26/files

The code can almost be copied verbatim into cpython. If there is interest in this performance improvement.

I think the primary question is not whether this is a good idea, bit whether it is a good idea for CPython given backwards compatibility constraints etc.
Since _GzipReader is a private member of the module, and not part of the interface, I think it is a worthwile idea.

@encukou
Copy link
Member

encukou commented Jan 25, 2024

Per PEP-399, we might need to maintain the pure-Python implementation even if a C one is merged in. That makes the proposal quite heavy from a maintenance POV.
On the other hand, the speedup is quite significant.

It might be best to first add a benchmark to pyperformance, so the faster-cpython team can optimize for it. It seems specialization or a JIT could help with this kind of code.

@AlexWaygood AlexWaygood added the performance Performance or resource usage label Jan 25, 2024
@rhpvorderman
Copy link
Contributor Author

@encokou, ah I see. Well in that case it is really not worth 10% of extra performance. Gzip performance is presumably not a big issue for most python users and 10% is definitely not worth maintaing both a C and Python implementation as is the case with the IO modules.

I will simply keep maintaining the C implementation in python-isal and python-zlib-ng and users who require performance can fall back on those third-party extensions.

Thanks for the feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

5 participants