doc.xref_get_key for javascript cannot find object #929

juviwhale · 2021-03-02T06:51:57Z

Describe the bug

Some documents that contain JavaScript throw a "cannot find object in xref" exception when the .scrub method is called.

To Reproduce

import fitz  # PyMuPDF

# file_data holds the raw bytes of the PDF
pdf_doc = fitz.Document(stream=file_data, filetype="application/pdf")
pdf_doc.scrub(
    attached_files=True,
    clean_pages=True,
    embedded_files=True,
    javascript=True,
    metadata=True,
    xml_metadata=True,
    remove_links=True,
    # TODO: OCR'd text?
    hidden_text=True,
    # Don't remove images
    redact_images=0,
    # Don't apply redaction annotations at this point
    redactions=False,
    # Don't remove form responses
    reset_fields=False,
    reset_responses=False,
)

Unfortunately, I cannot provide a document that reproduces this, because the documents contain PII (I attached a screenshot of the exception trace; I hope that is enough). The issue is not specific to a single document: we have seen the same exception with 468 different PDF documents.

Exception Screenshot (from Sentry)

Your configuration

3.8.2 (default, Jul 17 2020, 15:47:13) 
[Clang 11.0.3 (clang-1103.0.32.62)] 
 darwin 
 
PyMuPDF 1.18.9: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-02-26 13:46:32.
Built for Python 3.8 on darwin (64-bit).

JorjMcKie · 2021-03-02T08:40:31Z

Confusing. doc.xref_get_key(xref, ...) always succeeds if xref points to a valid object in doc.
Indeed, the error says that "235 0 R" contains no valid PDF object definition. Is this a newly created xref number that has not yet been filled with any object source?

JorjMcKie · 2021-03-02T09:12:38Z

Another weird observation:
doc is a new PDF - do you need to scrub new documents?

juviwhale · 2021-03-02T09:19:41Z

I am not sure I understand. To give more context: we 'sanitize' PDFs at the beginning of our pipeline by loading each PDF, then scrub()ing it, then saving it for use in the rest of the pipeline.

JorjMcKie · 2021-03-02T09:20:10Z

Anyway, previously that part of the method used the output of doc.xref_object(xref) and iterated through its lines to find the relevant PDF keys (/JavaScript, /Metadata). That method returns the empty string "" if xref is an invalid object, even when it lies within the valid xref range.
So this type of error was simply ignored.
I am about to enforce similar behaviour for doc.xref_get_key(xref, key): for such an xref, it will assume that key is not part of the object definition.
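
Until a fixed version is installed, the behaviour described above could be approximated on the caller's side. This is only a minimal sketch, not the library's actual fix: safe_xref_get_key is a hypothetical helper name, and it assumes that the "cannot find object in xref" error surfaces as a RuntimeError and that ("null", "null") is what doc.xref_get_key returns for a missing key.

```python
def safe_xref_get_key(doc, xref, key):
    """Look up `key` in object `xref`, treating an unreadable object as 'key not present'.

    `doc` is expected to be an open PyMuPDF document (anything with an
    xref_get_key(xref, key) method works for this sketch).
    """
    try:
        return doc.xref_get_key(xref, key)
    except RuntimeError:
        # Assumption: the "cannot find object in xref" error is raised as RuntimeError.
        # Mirror the (type, value) pair reported for a key that does not exist.
        return ("null", "null")
```

With an open document, a call would look like safe_xref_get_key(pdf_doc, 235, "JavaScript"); the key name here is only an example.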

JorjMcKie · 2021-03-02T09:21:49Z

> I am not sure I understand. To give more context: we 'sanitize' PDFs at the beginning of our pipeline by loading each PDF, then scrub()ing it, then saving it for use in the rest of the pipeline.

Your error can only occur if the xref I am investigating does exist, but contains no object definition. As I wrote in my previous post, this situation was not detected previously.
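
To check whether a document has such entries, something along these lines might help. It is a rough sketch under the assumption that doc.xref_object(xref) returns "" for an xref that is in range but has no object definition; find_empty_xrefs is a hypothetical helper name, not part of PyMuPDF.

```python
def find_empty_xrefs(doc):
    """Return the xref numbers in valid range whose object source is empty.

    `doc` is expected to be an open PyMuPDF document (anything providing
    xref_length() and xref_object(xref) works for this sketch).
    """
    # xref 0 is reserved, so scanning starts at 1; xref_length() is the table size.
    return [
        xref
        for xref in range(1, doc.xref_length())
        if doc.xref_object(xref) == ""
    ]
```

If 235 shows up in find_empty_xrefs(pdf_doc), the document matches the situation described here.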

JorjMcKie · 2021-03-02T09:26:00Z

Could you send me a printout of doc.xref_object(235) of that document?

JorjMcKie · 2021-03-02T09:28:52Z

No, wait, that would be "" right?

JorjMcKie · 2021-03-02T09:31:13Z

What I can do:

make a hotfix pre-version, so you can download the wheel in an hour or so
send you an updated utils.py which wraps the relevant parts in a try-except clause

JorjMcKie · 2021-03-02T09:34:52Z

> Another weird observation:
> doc is a new PDF - do you need to scrub new documents?

What I was referring to: in your trace output, doc is shown as Document("", <memory, doc# 2>).

JorjMcKie · 2021-03-02T13:07:44Z

Sorry - I was wrong, forget my last post please.

JorjMcKie · 2021-03-02T21:22:20Z

There is a pre-version 1.18.10 here that should resolve the problem.

JorjMcKie · 2021-03-22T14:14:53Z

Fixed in the just-published v1.18.10.

juviwhale added the bug label Mar 2, 2021

juviwhale assigned JorjMcKie Mar 2, 2021

JorjMcKie closed this as completed Mar 22, 2021
