
Faster decompression of gzip files #95534

Closed
@rhpvorderman

Description


Pitch

Decompressing gzip streams is an extremely common operation. Most web browsers support gzip decompression, so virtually all servers return gzip-compressed data when gzip support is advertised via the request headers. Tar.gz files are an extremely common way to archive files, and zip files use the same DEFLATE compression internally.

Speeding this up by a non-trivial amount is therefore very advantageous.

Feature or enhancement

The current gzip reading pipeline can be improved quite a lot. This is the current way of doing things (sketched in code after the list):

  • Read io.DEFAULT_BUFFER_SIZE (8 KB) of data from the _PaddedFile object.
  • Feed it to a zlib.decompressobj() via its decompress(raw_data, size) method.
  • Internally, decompress always starts with a 16 KB buffer, regardless of the requested size. When the output data is 64 KB, it needs to be resized two times.
  • The decompressed data is returned; anything not returned is saved in the unconsumed_tail attribute.
  • The unconsumed_tail is used to rewind the _PaddedFile object to the correct position.
  • The length and crc32 of the decompressed data are taken and used to update the _GzipReader state.
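
A minimal sketch of that read path, in the spirit of gzip._GzipReader (fp stands in for the _PaddedFile, which has a prepend() method; the surrounding EOF and gzip-member handling is left out):

```python
import io
import zlib

def read_current(fp, decompressor, size=io.DEFAULT_BUFFER_SIZE):
    # Step 1: read a fixed-size block of compressed bytes.
    raw = fp.read(io.DEFAULT_BUFFER_SIZE)
    # Steps 2-3: bounded decompress; zlib grows its buffer from 16 KB.
    data = decompressor.decompress(raw, size)
    # Steps 4-5: compressed bytes that were not consumed are pushed back,
    # so the next call re-reads them from the _PaddedFile.
    if decompressor.unconsumed_tail:
        fp.prepend(decompressor.unconsumed_tail)
    # Step 6: the caller updates its length bookkeeping and crc32.
    return data
```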

This has some severe disadvantages when reading large blocks:

  • Gzip compresses between 50 and 90% in most cases, so an 8 KB read from the _PaddedFile returns anywhere between 16 and 80 KB of decompressed data at most. When 128 KB is requested from the read function, the 128 KB is never filled, leading to unnecessary read requests.
  • In the above case, say the typical return size is 37 KB. Because zlib's DEF_BUF_SIZE is 16 KB, there are three calls to resize the memory of the return object: two to grow it (16->32->64) and one more to shrink the end product to fit its contents (64->37).
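
The resize arithmetic can be illustrated with a tiny hypothetical helper that mirrors the grow-then-shrink behaviour described above:

```python
def count_reallocations(output_kb, start_kb=16):
    """Count buffer reallocations for a grow-then-shrink-to-fit strategy."""
    buf, reallocs = start_kb, 0
    while buf < output_kb:      # double the buffer until the output fits
        buf *= 2
        reallocs += 1
    if buf != output_kb:        # the final shrink-to-fit also reallocates
        reallocs += 1
    return reallocs

assert count_reallocations(37) == 3   # 16 -> 32 -> 64 -> 37
assert count_reallocations(64) == 2   # 16 -> 32 -> 64, fits exactly
```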

This also has some severe disadvantages when reading small blocks:

  • When reading individual lines from a file (quite common), the call actually queries an io.BufferedReader instance, which reads from the _GzipReader in io.DEFAULT_BUFFER_SIZE chunks.
  • This means only 8 KB is requested, but 8 KB of compressed data is still read from the _PaddedFile object. With typical compression rates, anywhere from 4 to 7 KB of unconsumed tail comes back. This creates a new object, meaning memory is allocated again, and the same data is re-read in the next iteration. With a 70% compression rate, the same data is memcpy'd around 2 to 3 times.
  • The _PaddedFile object is continuously rewound.
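
For concreteness, this is the layering that a plain line-reading loop goes through (hypothetical file name; the comments restate the behaviour above):

```python
import gzip

# gzip.open() wraps a _GzipReader in an io.BufferedReader, so iterating
# over lines pulls io.DEFAULT_BUFFER_SIZE (8 KB) chunks out of
# _GzipReader.read(). Each refill reads 8 KB of compressed data from the
# _PaddedFile and prepends the unconsumed tail, only to re-read it on
# the next refill.
with gzip.open("data.txt.gz", "rb") as f:
    for line in f:
        pass  # every buffer refill repeats the read/decompress/rewind cycle
```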

How to improve this:

  • Use the structure of _bz2.BZ2Decompressor: submitted data is copied to an internal buffer, and data from that internal buffer is used to answer decompress calls. This has the advantage of not endlessly allocating and destroying unconsumed_tail objects.
  • Read 128 KB at once from the _PaddedFile object, and only read this data when the decompressor's needs_input attribute is True. This prevents querying the _PaddedFile object too often.
  • When decompress is called with a max_length (say 128 KB), allocate max_length bytes immediately, instead of allocating only 16 KB and resizing later.

This prevents a lot of calls to the Python memory allocator.
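
To make the proposal concrete, here is a minimal pure-Python sketch of that structure, modeled on _bz2.BZ2Decompressor's input buffering. All names are illustrative; the real change would live in C in zlibmodule.c:

```python
import zlib

READ_SIZE = 128 * 1024  # read compressed data in 128 KB blocks

class _Decompressor:
    """Keeps leftover compressed bytes internally, like BZ2Decompressor."""

    def __init__(self):
        # -MAX_WBITS: raw deflate; the gzip header is parsed separately.
        self._zobj = zlib.decompressobj(wbits=-zlib.MAX_WBITS)
        self._input = b""

    @property
    def needs_input(self):
        return not self._input

    def decompress(self, data, max_length):
        # New data joins the internal buffer, so no unconsumed_tail
        # objects travel back and forth to the file layer. (At the C
        # level the leftover bytes would stay in one internal buffer.)
        self._input += data
        out = self._zobj.decompress(self._input, max_length)
        self._input = self._zobj.unconsumed_tail
        return out

def read_restructured(fp, decomp, size):
    # Only touch the _PaddedFile when the decompressor needs input.
    data = fp.read(READ_SIZE) if decomp.needs_input else b""
    return decomp.decompress(data, size)
```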

This restructuring has already been implemented in python-isal, a modification of zlibmodule.c that uses the ISA-L optimizations. While ISA-L itself improved speed, I also looked at other ways to improve performance: restructuring the gzip module and the zlib code significantly reduced the Python overhead.
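
For anyone who wants to try the restructured reader today, python-isal mirrors the gzip API (assuming it is installed via pip install isal):

```python
from isal import igzip

# igzip.open mirrors gzip.open, so it works as a drop-in replacement
# when benchmarking the restructured read path (hypothetical file name).
with igzip.open("archive.tar.gz", "rb") as f:
    data = f.read()
```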

Relevant code: see the python-isal repository.

Most of this code can be seamlessly copied back into CPython, which I will do when I have the time. This can best be done after the 3.11 release, I think.

Previous discussion

N/A. This is a performance enhancement, so not necessarily a new feature, but also not a bug.


Labels: 3.12, performance, stdlib, type-feature
