Rewrite gzip._GzipReader in C for more performance and less overhead #110283

rhpvorderman · 2023-10-03T13:26:50Z

Feature or enhancement

Proposal:

gzip._GzipReader is used as a raw IO for a io.BufferedReader in GzipFile. By rewriting it in C the following things can be achieved:

Much less pure python overhead and much better performance. Currently it is approximately 10% slower than pigz single-thread for reading. Figures from 1-3% are attainable.
The C GzipReader can be used as backend for gzip.decompress. Currently decompress has a different implementation to avoid initializing a huge pure python class with inheritance, which makes the overhead for simply decompressing a block quite great. By using a C gzipreader the code can be reused without this overhead.

Problems with this approach:

RawIOBase cannot be inherited as easily, so methods must be implemented manually or skipped.

The performance gains can be quite substantial: see this PR for python-isal: pycompression/python-isal#151 . EDIT: I also made the _GzipReader accept buffer protocol objects and use the internal buffer as read buffer: pycompression/python-isal#152 . As a result the overhead of reading a bytes-like object is now minimal, requiring now io.BytesIO object. This has massively reduced overhead.

The code from python-isal can be copied almost verbatim into CPython is they share the same license. (I gave python-isal the same license to make code exchange easy.) The code can be battle-tested a bit in python-isal before it is released into cpython.

It will also make it easier to implement the gzip header reading implemented in zlib as mentioned in #103477 .
Code for checking the header CRC is already implemented, this would make Python's gzip reader more spec compliant. #89672

Ping @gpshead, is this something that could be integrated into CPython, or would this create too many backwards-compatibility headaches?

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

rhpvorderman · 2023-12-26T09:42:17Z

I also implemented this on python-zlib-ng. https://github.com/pycompression/python-zlib-ng/pull/26/files

The code can almost be copied verbatim into cpython. If there is interest in this performance improvement.

I think the primary question is not whether this is a good idea, bit whether it is a good idea for CPython given backwards compatibility constraints etc.
Since _GzipReader is a private member of the module, and not part of the interface, I think it is a worthwile idea.

encukou · 2024-01-25T14:45:41Z

Per PEP-399, we might need to maintain the pure-Python implementation even if a C one is merged in. That makes the proposal quite heavy from a maintenance POV.
On the other hand, the speedup is quite significant.

It might be best to first add a benchmark to pyperformance, so the faster-cpython team can optimize for it. It seems specialization or a JIT could help with this kind of code.

rhpvorderman · 2024-01-26T07:53:39Z

@encokou, ah I see. Well in that case it is really not worth 10% of extra performance. Gzip performance is presumably not a big issue for most python users and 10% is definitely not worth maintaing both a C and Python implementation as is the case with the IO modules.

I will simply keep maintaining the C implementation in python-isal and python-zlib-ng and users who require performance can fall back on those third-party extensions.

Thanks for the feedback!

rhpvorderman added the type-feature A feature request or enhancement label Oct 3, 2023

iritkatriel added the stdlib Python modules in the Lib dir label Nov 27, 2023

AlexWaygood added the performance Performance or resource usage label Jan 25, 2024

rhpvorderman closed this as completed Jan 26, 2024

gpshead closed this as not planned Won't fix, can't repro, duplicate, stale Jan 26, 2024

rhpvorderman mentioned this issue Jun 15, 2024

gh-112346: Always set OS byte to 255, simpler gzip.compress function. #120486

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite gzip._GzipReader in C for more performance and less overhead #110283

Rewrite gzip._GzipReader in C for more performance and less overhead #110283

rhpvorderman commented Oct 3, 2023 •

edited

Loading

rhpvorderman commented Dec 26, 2023

encukou commented Jan 25, 2024

rhpvorderman commented Jan 26, 2024

Rewrite gzip._GzipReader in C for more performance and less overhead #110283

Rewrite gzip._GzipReader in C for more performance and less overhead #110283

Comments

rhpvorderman commented Oct 3, 2023 • edited Loading

Feature or enhancement

Proposal:

Has this already been discussed elsewhere?

Links to previous discussion of this feature:

rhpvorderman commented Dec 26, 2023

encukou commented Jan 25, 2024

rhpvorderman commented Jan 26, 2024

rhpvorderman commented Oct 3, 2023 •

edited

Loading