Closed
Labels
- generic: The generic submodule is affected
- is-robustness-issue: From a users perspective, this is about robustness
Description
DictionaryObject.read_from_stream contains this code:
    if length is None:  # if the PDF is damaged
        length = -1
    pstart = stream.tell()
    if length > 0:
        data["__streamdata__"] = stream.read(length)
    else:
        data["__streamdata__"] = read_until_regex(
            stream, re.compile(b"endstream")
        )

Since `read_until_regex` doesn't strip the trailing newline, this will read almost all length-0 streams as `b"\n"` or `b"\r\n"` instead of `b""`.
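A minimal sketch of the behaviour described above. It does not call pypdf; `read_until` is a hypothetical stand-in for `read_until_regex` that is assumed only to return everything before the first match, and `read_stream_data` mirrors the branch logic quoted from `read_from_stream`.

```python
import io
import re


def read_until(stream, pattern):
    """Hypothetical stand-in: return the bytes before the first match."""
    rest = stream.read()
    match = pattern.search(rest)
    return rest if match is None else rest[: match.start()]


def read_stream_data(stream, length):
    """Mirror of the quoted branch: only length > 0 is read exactly."""
    if length is None:  # damaged PDF without a usable /Length
        length = -1
    if length > 0:
        return stream.read(length)
    # length == 0 falls through to the regex scan as well
    return read_until(stream, re.compile(b"endstream"))


# The parser is positioned just after "stream" + EOL. For an empty stream,
# what follows is the EOL that precedes "endstream".
lf_result = read_stream_data(io.BytesIO(b"\nendstream"), 0)
crlf_result = read_stream_data(io.BytesIO(b"\r\nendstream"), 0)
print(lf_result, crlf_result)
```

With `/Length 0`, both calls return the end-of-line bytes rather than an empty byte string, which is the corruption reported here.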
I have some PDFs with creator PFU ScanSnap Manager 5.1.30 #S1500 that contain JBIG2-encoded pages with /JBIG2Globals pointing to an empty stream object. After loading and saving them with pypdf, the /JBIG2Globals stream is invalid, and some (not all) PDF viewers fail to render the pages.
Suggested fix:
- If there exist broken PDFs in the wild with `/Length 0` followed by a stream of nonzero length that pypdf needs to support, check for `stream\r?\n\r?\n?endstream` as a special case first before falling back to `read_until_regex`, to ensure that valid PDFs with length-0 streams are always read correctly.
- Or, if there are no such PDFs, and `length > 0` was just meant to catch the `-1` case, change the test to `length >= 0`.
- In the `read_until_regex` case, if `endstream` is preceded by `\r` then strip it, or if it's preceded by `\r\n` then strip the `\n`, and strip the `\r` also iff `stream` was followed by `\r`. That isn't guaranteed to work, but it's probably the best one can do.
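The suggestions above can be sketched roughly as follows. This is one possible reading, not pypdf's actual code: `strip_eol_before_endstream`, `read_stream_data`, and the `crlf_after_stream` flag are hypothetical names. A lone trailing `\n` is also stripped, which is an assumption beyond what the list states.

```python
import io
import re

# First suggestion: recognise an empty stream before any fallback scan.
EMPTY_STREAM = re.compile(rb"stream\r?\n\r?\n?endstream")


def strip_eol_before_endstream(data, crlf_after_stream):
    """Third suggestion: drop the end-of-line that precedes endstream."""
    if data.endswith(b"\r\n"):
        # Strip the \n; strip the \r too only if the file used CRLF
        # after the stream keyword.
        return data[:-2] if crlf_after_stream else data[:-1]
    if data.endswith(b"\r") or data.endswith(b"\n"):
        return data[:-1]
    return data


def read_stream_data(stream, length, crlf_after_stream=False):
    """Second suggestion: trust /Length whenever it is >= 0."""
    if length is not None and length >= 0:
        return stream.read(length)
    # Damaged PDF: scan for endstream, then trim the trailing EOL.
    rest = stream.read()
    match = re.search(b"endstream", rest)
    data = rest if match is None else rest[: match.start()]
    return strip_eol_before_endstream(data, crlf_after_stream)


empty = read_stream_data(io.BytesIO(b"\nendstream"), 0)
recovered = read_stream_data(io.BytesIO(b"abc\r\nendstream"), None, True)
print(empty, recovered)
```

With `length >= 0`, a valid length-0 stream is read as `b""` without consulting the regex at all. The trimming step only matters for files whose `/Length` is missing, and as the issue notes it remains a heuristic, since the stream data itself may legitimately end in `\r` or `\n`.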