Description
I recently switched from processing .tgz archives with "native" tarfile to tarfile + indexed_gzip, to improve performance, because my application does a lot of seeking back and forth.
During testing, reading inside a deeply nested archive (path like zip.tgz.tar.bz2.tar.gz//home/zip.tgz.tar.bz2//home/zip.tgz//home/home.zip) fails with:
Traceback (most recent call last):
  File "/usr/lib/python3.10/zipfile.py", line 925, in read
    data = self._read1(n)
  File "/usr/lib/python3.10/zipfile.py", line 995, in _read1
    data = self._read2(n)
  File "/usr/lib/python3.10/zipfile.py", line 1025, in _read2
    data = self._fileobj.read(n)
  File "/usr/lib/python3.10/zipfile.py", line 745, in read
    data = self._file.read(n)
  File "/usr/lib/python3.10/tarfile.py", line 696, in readinto
    buf = self.read(len(b))
  File "/usr/lib/python3.10/tarfile.py", line 685, in read
    b = self.fileobj.read(length)
  File "indexed_gzip/indexed_gzip.pyx", line 797, in indexed_gzip.indexed_gzip._IndexedGzipFile.readinto
indexed_gzip.indexed_gzip.ZranError: zran_read returned error: ZRAN_READ_FAIL (file: <ExFileObject name=None>)
The archive is 100% valid and can be opened / read with Python's built-in tarfile and zipfile.
To open the tar.gz members, I'm using tarfile.open(fileobj=IndexedGzipFile(fileobj=file_obj, mode='rb', auto_build=True), mode='r').
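For clarity, here is roughly how that call looks when wrapped in a helper. This is a minimal sketch, not my actual application code: open_tgz and make_tgz are names made up for illustration, and the fallback to gzip.GzipFile exists only so the snippet runs on a machine without indexed_gzip installed.

```python
import gzip
import io
import tarfile

try:
    # Third-party package; the call below mirrors the one quoted above.
    from indexed_gzip import IndexedGzipFile
except ImportError:
    IndexedGzipFile = None


def open_tgz(file_obj):
    """Open a gzip-compressed tar from a seekable binary file object.

    Hypothetical helper: file_obj may be a real file, a BytesIO, or the
    object returned by TarFile.extractfile() for a nested archive.
    """
    if IndexedGzipFile is not None:
        gz = IndexedGzipFile(fileobj=file_obj, mode='rb', auto_build=True)
    else:
        # Fallback so the sketch runs without indexed_gzip installed.
        gz = gzip.GzipFile(fileobj=file_obj, mode='rb')
    return tarfile.open(fileobj=gz, mode='r')


def make_tgz(name, payload):
    """Build a one-member .tgz in memory (illustration data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w:gz') as tar:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


with open_tgz(io.BytesIO(make_tgz('home/hello.txt', b'hello'))) as tar:
    names = tar.getnames()
    data = tar.extractfile('home/hello.txt').read()
```

For the nested levels, the file_obj passed in is the ExFileObject returned by extractfile() on the enclosing archive, which matches the "file: <ExFileObject name=None>" part of the error message.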
I realize that reading nested archives on-the-fly like that is a bit unusual, but this should work, right? Is igzip a drop-in replacement for gzip or are there any gotchas?
I'll try to create a minimal reproducing dataset, but any clues about how to debug this problem would be welcome. Why would zran_read ever return ZRAN_READ_FAIL on a valid archive?
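As a starting point for that dataset, something like the following could generate an archive with the same nesting as the path above, using only the standard library. It is a sketch under assumptions: the helper names, the innermost file name and its content are all made up, and I have not confirmed that a payload this small triggers the error, so the content may need to be much larger.

```python
import io
import tarfile
import zipfile


def wrap_in_tar(member_name, payload, compression):
    """Return tar bytes ('gz' or 'bz2' compressed) holding one member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w:' + compression) as tar:
        info = tarfile.TarInfo(member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def build_nested_archive(file_name='hello.txt', content=b'hello world\n'):
    """Build tar.gz > home/zip.tgz.tar.bz2 > home/zip.tgz > home/home.zip."""
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(file_name, content)  # hypothetical innermost file
    tgz = wrap_in_tar('home/home.zip', zip_buf.getvalue(), 'gz')
    tbz2 = wrap_in_tar('home/zip.tgz', tgz, 'bz2')
    return wrap_in_tar('home/zip.tgz.tar.bz2', tbz2, 'gz')


def read_innermost(archive_bytes, file_name='hello.txt'):
    """Walk every layer on the fly with built-in tarfile and zipfile."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode='r:gz') as t1:
        layer2 = t1.extractfile('home/zip.tgz.tar.bz2')
        with tarfile.open(fileobj=layer2, mode='r:bz2') as t2:
            layer3 = t2.extractfile('home/zip.tgz')
            with tarfile.open(fileobj=layer3, mode='r:gz') as t3:
                with zipfile.ZipFile(t3.extractfile('home/home.zip')) as zf:
                    return zf.read(file_name)


archive = build_nested_archive()
baseline = read_innermost(archive)
```

read_innermost gives the baseline result from the built-in modules; the same walk with IndexedGzipFile substituted at the two gzip layers would be the failing path to compare against.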