Question: Jupyter Kernal Dies after reading some pages in pyMuPdf #651

Closed
baleris opened this issue Sep 10, 2020 · 12 comments
Labels
upstream bug bug outside this package

Comments

@baleris

baleris commented Sep 10, 2020

I am trying to compare some PDFs with many pages. Some PDFs with 80 pages pass successfully with my current logic, but others (even with fewer than 50 pages) get stuck at a particular page and the kernel dies without producing any error message.

In my code I read the PDF page by page and send each page to remove_txts(). In remove_txts() I want to remove the text inside every text block bbox of a page. It stops at some page while doing page.apply_redactions() and the kernel dies.

My partial code looks like this:

import fitz

def remove_txts(page):
    try:
        blocks = page.getTextBlocks()  # get blocks of text
        for block in blocks:
            bbox = list(block[0:4])  # first four items are the block rectangle
            rect = fitz.Rect(bbox)
            page.addRedactAnnot(rect, text=" ")
        page.apply_redactions()  # the kernel dies here on some pages
    except Exception as e:
        print("Exception occurred in remove texts method: " + str(e))

I have attached one of the pages here, which reproduces the issue.
new-doc-linear-32-33edited.pdf
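
One detail worth noting about the snippet above: a crash inside native library code ends the whole Python process, so the `except Exception` branch never runs and no message is printed. A minimal sketch of one way to find the failing page without losing the notebook kernel is to run the risky work in a child interpreter and inspect its exit code. The helper name `run_isolated` and the stand-in snippets are hypothetical; in practice the child would open the PDF and process a single page.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 60.0) -> int:
    """Run a Python snippet in a child interpreter and return its exit code.

    A non-zero code means the child failed; a hard crash in native code
    shows up here instead of killing the calling process.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,  # keep the child's output out of the notebook
        timeout=timeout,
    )
    return result.returncode


# Stand-ins for "process one page": one succeeds, one aborts the process.
ok = run_isolated("print('page processed')")
crashed = run_isolated("import os; os.abort()")
print("normal exit code:", ok)
print("crash detected:", crashed != 0)
```

Starting one interpreter per page is slow, so this is better suited to locating a bad page than to production use.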

@JorjMcKie
Collaborator

Thanks for submitting this - and especially with usable reproduction data!
Confirming an upstream bug which happens on the first of the two pages only - next page works.
It looks like a bug which I reported to MuPDF a few weeks ago, and for which they provided a fix. Your bug still happens even with that fix built in.
So, would you agree to send the file to Artifex / MuPDF for them to investigate?

JorjMcKie added the "upstream bug" (bug outside this package) label and removed the "question" label on Sep 10, 2020
@baleris
Author

baleris commented Sep 10, 2020

Yes, I agree to send this file. Thanks.

@JorjMcKie
Collaborator

Thanks! Just saw, they have even more recent updates to the C file in question. I'll re-build my local MuPDF with it and try again before bothering them with stuff already fixed in their development version.

@JorjMcKie
Collaborator

This was a worthwhile try! The new C file does fix the bug.
How urgent is your situation? I can build a pre-version wheel for your config ...

@baleris
Author

baleris commented Sep 10, 2020

It's not that urgent; I have 2 days to complete this task. It would be great if you could provide me with an updated patch ASAP. Thanks once again :)

@JorjMcKie
Collaborator

Well, that is somewhat urgent then, isn't it? I need your config: please show me the output of print(fitz.__doc__)

@baleris
Author

baleris commented Sep 10, 2020

Here is my config:
PyMuPDF 1.17.4: Python bindings for the MuPDF 1.17.0 library.
Version date: 2020-07-20 18:09:40.
Built for Python 3.7 on win32 (64-bit).
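
The configuration text above is what `fitz.__doc__` prints. As a rough sketch, the same information can be gathered together with the interpreter details in one call, which is handy when a notebook and a server use different environments. The function name `describe_environment` is made up for this example, and the fallback messages are assumptions about how a missing or undocumented install should be reported.

```python
import platform
import sys


def describe_environment() -> str:
    """Collect version details that are useful in a bug report."""
    lines = [
        f"Python {platform.python_version()} on {sys.platform} "
        f"({platform.architecture()[0]})"
    ]
    try:
        import fitz  # PyMuPDF; may not be installed in this environment

        # The module docstring carries the PyMuPDF / MuPDF version banner.
        lines.append((fitz.__doc__ or "PyMuPDF imported, no version docstring").strip())
    except ImportError:
        lines.append("PyMuPDF is not installed in this environment.")
    return "\n".join(lines)


print(describe_environment())
```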

@JorjMcKie
Collaborator

PyMuPDF-1.17.7-cp37-cp37m-win_amd64.zip
Rename the file extension from .zip to .whl, then execute python -m pip install -U PyMuPDF-1.17.7-cp37-cp37m-win_amd64.whl.
==> Ready to try your notebook again!
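
For anyone who prefers to script the rename-and-install step, here is a minimal sketch. The helper names `rename_to_wheel` and `pip_install_command` are hypothetical, and the sketch assumes the downloaded ZIP sits in the current working directory. Nothing is installed when the block itself runs; the install call is shown only as a comment.

```python
import sys
from pathlib import Path


def rename_to_wheel(zip_path: str) -> Path:
    """Rename a wheel that was uploaded with a .zip extension back to .whl."""
    src = Path(zip_path)
    if src.suffix.lower() != ".zip":
        raise ValueError(f"expected a .zip file, got: {src.name}")
    dst = src.with_suffix(".whl")
    src.rename(dst)
    return dst


def pip_install_command(wheel: Path) -> list:
    """Build the pip command for the interpreter that is running this code."""
    return [sys.executable, "-m", "pip", "install", "-U", str(wheel)]


# Usage (not executed here):
#   import subprocess
#   wheel = rename_to_wheel("PyMuPDF-1.17.7-cp37-cp37m-win_amd64.zip")
#   subprocess.run(pip_install_command(wheel), check=True)
```

Using `sys.executable` rather than a bare `python` targets the same interpreter that runs the notebook kernel, which avoids installing the wheel into a different environment by accident.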

@baleris
Author

baleris commented Sep 14, 2020

@JorjMcKie Thanks a lot for the quick solution and the new patched .whl. I would also like to install this in another Python environment: I have Python 3.6.8 installed on a server. Will PyMuPDF 1.17.7 be made generally available for compatible Python versions such as 3.6.8 and 3.7?

@JorjMcKie
Collaborator

I will publish v1.17.7 some time this week. I have never worked with a Linux server, but the generated Linux wheels should work.

@JorjMcKie
Collaborator

JorjMcKie commented Sep 14, 2020

Just had another idea:
If all you want is to remove all the text, then why not use this little script? It does not depend on redactions, but directly does a "search and destroy" of the PDF text objects:

import fitz

doc = fitz.open("...")
for page in doc:
    page.cleanContents()  # clean page description syntax
    xref = page.getContents()[0]  # the remaining /Contents object
    cont = bytearray(doc.xrefStream(xref))  # read as modifiable
    i1 = 0  # all text objects are wrapped in string pairs b"BT" ... b"ET"
    while i1 < len(cont):
        i1 = cont.find(b"BT")
        if i1 < 0:
            break
        i2 = cont.find(b"ET", i1)
        if i2 < 0:
            break
        cont[i1 : i2 + 2] = b""  # remove text object
    doc.updateStream(xref, cont)  # replace the /Contents
    page.cleanContents()  # remove fonts no longer used

doc.save("no-text.pdf", garbage=3, deflate=True)
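
The byte-level step of the script above can be tried without any PDF at hand. The following sketch isolates it as a pure function, using the same naive search for `BT` and `ET`. The function name `strip_text_objects` and the sample stream are made up for illustration. Like the script, it assumes a cleaned content stream in which these two byte pairs only occur as text object operators; they could in principle also appear inside string literals or inline image data.

```python
def strip_text_objects(content: bytes) -> bytes:
    """Remove every BT ... ET span from a PDF content stream (naive byte search)."""
    buf = bytearray(content)
    start = 0
    while True:
        i1 = buf.find(b"BT", start)  # start of the next text object
        if i1 < 0:
            break
        i2 = buf.find(b"ET", i1 + 2)  # matching end of that text object
        if i2 < 0:
            break  # unterminated object: leave the rest untouched
        del buf[i1 : i2 + 2]
        start = i1  # continue from where the removed object began
    return bytes(buf)


# A tiny hand-written stream: one text object followed by a filled rectangle.
sample = b"q\nBT\n/F1 12 Tf\n(Hello) Tj\nET\n0 0 10 10 re\nf\nQ\n"
print(strip_text_objects(sample))
```

Only the text object disappears; the drawing operators for the rectangle are kept as they were.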

@JorjMcKie
Collaborator

New version 1.17.7 is being uploaded right now.
