doc.xref_get_key for javascript cannot find object #929

Closed
juviwhale opened this issue Mar 2, 2021 · 12 comments

@juviwhale

Describe the bug

Some documents that contain JavaScript throw a "cannot find object in xref" exception when the .scrub method is called on them.

To Reproduce

import fitz  # PyMuPDF

# file_data holds the raw bytes of the PDF
pdf_doc = fitz.Document(stream=file_data, filetype="application/pdf")
pdf_doc.scrub(
    attached_files=True,
    clean_pages=True,
    embedded_files=True,
    javascript=True,
    metadata=True,
    xml_metadata=True,
    remove_links=True,
    # TODO: OCR'd text?
    hidden_text=True,
    # Don't remove images
    redact_images=0,
    # Don't apply redaction annotations at this point
    redactions=False,
    # Don't remove form responses
    reset_fields=False,
    reset_responses=False,
)

Unfortunately, I cannot provide a document that reproduces this because the affected documents contain PII (I attached a screenshot of the exception trace; I hope that is enough). The issue is not specific to a single document: we have seen the same exception with 468 different PDF documents.

Exception Screenshot (from Sentry)

sentry_screen_of_issue

Your configuration

3.8.2 (default, Jul 17 2020, 15:47:13) 
[Clang 11.0.3 (clang-1103.0.32.62)] 
 darwin 
 
PyMuPDF 1.18.9: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-02-26 13:46:32.
Built for Python 3.8 on darwin (64-bit).
@JorjMcKie
Collaborator

Confusing. doc.xref_get_key(xref, ...) always succeeds if xref points to a valid object in doc.
Indeed, the error says that "235 0 R" contains no valid PDF object definition. Is this a newly created xref number that has not yet been filled with any object source?

@JorjMcKie
Collaborator

Another weird observation:
doc is a new PDF - do you need to scrub new documents?

@juviwhale
Author

I am not sure I understand. To give more context: we 'sanitize' PDFs at the beginning of our pipeline by loading each PDF, scrub()ing it, and then saving it for use in the rest of the pipeline.

@JorjMcKie
Collaborator

Anyway, previously that part of the method used the output of doc.xref_object(xref) and iterated through its lines to find the relevant PDF keys (/JavaScript, /Metadata). That method returns the empty string "" if xref is an invalid object, even when it lies within the valid xref range.
So this type of error was simply ignored.
I am about to enforce similar behaviour for doc.xref_get_key(xref, key): for such an xref, it will assume that key is not part of the object definition.
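
For anyone who has to stay on 1.18.9 for now, the behaviour described above can be approximated in user code. This is only a minimal sketch, not the actual fix: the helper name `safe_xref_get_key` is hypothetical, catching `RuntimeError` is an assumption about how the binding reports the "cannot find object in xref" error, and `("null", "null")` is the tuple that `xref_get_key` is documented to return for a key that does not exist.

```python
def safe_xref_get_key(doc, xref, key):
    """Look up `key` in the object at `xref`, treating a lookup failure as "key absent".

    `doc` is expected to be a PyMuPDF Document (or anything with a
    compatible `xref_get_key` method).
    """
    try:
        return doc.xref_get_key(xref, key)
    except RuntimeError:
        # Assumption: an xref that is in range but has no object definition
        # surfaces as RuntimeError. Report the key as missing instead.
        return ("null", "null")
```

Code that walks the xref table looking for /JavaScript or /Metadata could call this helper instead of `doc.xref_get_key` directly and skip any result whose type is "null".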

@JorjMcKie
Collaborator

> I am not sure I understand. To give more context we 'sanitize' PDFs at the beginning of our pipeline by loading each pdf, then scrub()ing, then saving the PDF for the use in the rest of the pipeline.

Your error can only occur if the xref I am investigating exists but contains no object definition. As I wrote in my previous post, this situation was not detected previously.

@JorjMcKie
Collaborator

Could you send me a printout of doc.xref_object(235) of that document?

@JorjMcKie
Collaborator

No, wait, that would be "" right?
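
If it helps to confirm this without sharing the file, one option is to scan the xref table and list the entries whose object source comes back empty. A rough sketch, assuming `doc.xref_length()` and `doc.xref_object()` behave as described in this thread (empty string for an entry without a definition); `empty_xrefs` is a hypothetical helper name, and the `RuntimeError` branch is an assumption to cover versions that raise instead of returning "".

```python
def empty_xrefs(doc):
    """Return the xref numbers that are in valid range but have no object definition."""
    missing = []
    # Entry 0 of the xref table is reserved, so real objects start at 1.
    for xref in range(1, doc.xref_length()):
        try:
            source = doc.xref_object(xref)
        except RuntimeError:
            # Assumption: some versions raise here instead of returning "".
            source = ""
        if not source.strip():
            missing.append(xref)
    return missing
```

For the document in the trace, 235 would presumably show up in the returned list; the output contains only object numbers, so it should be safe to post even for files with PII.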

@JorjMcKie
Collaborator

What I can do:

  • make a hotfix pre-version, so you can download the wheel in an hour or so
  • send you an updated utils.py which wraps the relevant parts in a try-except clause

@JorjMcKie
Collaborator

> Another weird observation:
> doc is a new PDF - do you need to scrub new documents?

What I was referring to: in your trace output, doc is shown as Document("", <memory, doc# 2>).

@JorjMcKie
Collaborator

Sorry, I was wrong. Please forget my last post.

@JorjMcKie
Collaborator

There is a pre-version 1.18.10 here that should resolve the problem.

@JorjMcKie
Collaborator

Fixed in the just-published v1.18.10.
