Question: Jupyter Kernel Dies after reading some pages in pyMuPdf #651

baleris · 2020-09-10T09:38:14Z

I am trying to compare some PDFs with many pages. Some PDFs with 80 pages are processed successfully with the current logic, but others (even with fewer than 50 pages) get stuck at a particular page and the kernel dies without producing any error message.

In my code I read the PDF page by page and pass each page to remove_txts(). In remove_txts() I want to remove the bbox of every text block on a page. It stops at some page while executing page.apply_redactions() and the kernel dies.

My partial code looks like this:

import fitz

def remove_txts(page):
    try:
        blocks = page.getTextBlocks()  # get blocks of text
        for block in blocks:
            bbox = list(block[0:4])  # first four items are the block's bbox
            rect = fitz.Rect(bbox)
            page.addRedactAnnot(rect, text=" ")  # mark the block for redaction
        page.apply_redactions()  # the kernel dies in this call
    except Exception as e:
        print("Exception occurred in remove_txts: " + str(e))

I have attached one of the pages that reproduces the issue:
new-doc-linear-32-33edited.pdf
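A kernel that dies without a traceback usually means the process crashed in native code, which a Python try/except cannot intercept. The following is a minimal sketch, assuming that is what happens here, of how the failing page could be located by running the per-page work in a child process. The names run_isolated and find_failing_items are hypothetical helpers, not part of PyMuPDF.

```python
import subprocess
import sys


def run_isolated(code, *args, timeout=60):
    """Run a Python snippet in a child process and return its exit code.

    A crash in native code kills only the child, so the caller survives.
    Returns None if the child does not finish within `timeout` seconds.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code, *map(str, args)],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # child hung; callers treat this as a failure
    return proc.returncode


def find_failing_items(code, items):
    """Return the items for which the child did not exit cleanly."""
    return [item for item in items if run_isolated(code, item) != 0]
```

The snippet passed as code would open the PDF, load the page whose number arrives in sys.argv[1], and call remove_txts() on it. Calling find_failing_items(code, range(page_count)) then lists the pages that crash, hang, or raise.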

JorjMcKie · 2020-09-10T11:05:15Z

Thanks for submitting this - and especially with usable reproduction data!
Confirming an upstream bug which happens on the first of the two pages only - next page works.
It looks like a bug I reported to MuPDF a few weeks ago, for which they had a fix. Your bug still occurs even with that fix built in.
So, would you agree to send the file to Artifex / MuPDF for them to investigate?

baleris · 2020-09-10T11:11:12Z

yes i agree to send this file. Thanks

JorjMcKie · 2020-09-10T11:28:15Z

Thanks! Just saw, they have even more recent updates to the C file in question. I'll re-build my local MuPDF with it and try again before bothering them with stuff already fixed in their development version.

JorjMcKie · 2020-09-10T11:58:45Z

This was a worthwhile try! The new C file does fix the bug.
How urgent is your situation? I can build a pre-version wheel for your config ...

baleris · 2020-09-10T12:11:49Z

It's not that urgent; I have 2 days to complete this task. It would be great if you could provide an updated patch as soon as possible. Thanks once again :)

JorjMcKie · 2020-09-10T12:29:53Z

Well, that is somewhat urgent then, isn't it? I need your config: please show me the output of print(fitz.__doc__)

baleris · 2020-09-10T12:49:06Z

Here is my config:
PyMuPDF 1.17.4: Python bindings for the MuPDF 1.17.0 library.
Version date: 2020-07-20 18:09:40.
Built for Python 3.7 on win32 (64-bit).

JorjMcKie · 2020-09-10T13:26:04Z

PyMuPDF-1.17.7-cp37-cp37m-win_amd64.zip
Rename the file extension from zip to whl, then execute python -m pip install -U PyMuPDF-1.17.7-cp37-cp37m-win_amd64.whl.
==> Ready to try your notebook again!
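The rename step can also be scripted. This is a small sketch using only the standard library; zip_to_whl is a hypothetical helper name, and the install command itself stays the one given above.

```python
from pathlib import Path


def zip_to_whl(path):
    """Rename 'name.zip' to 'name.whl' and return the new path."""
    src = Path(path)
    if src.suffix.lower() != ".zip":
        raise ValueError(f"expected a .zip file, got {src.name!r}")
    dst = src.with_suffix(".whl")  # only the final extension is replaced
    src.rename(dst)
    return dst
```

For example, zip_to_whl("PyMuPDF-1.17.7-cp37-cp37m-win_amd64.zip") leaves a file ending in .whl in the same folder, ready for pip.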

baleris · 2020-09-14T05:02:53Z

@JorjMcKie Thanks a lot for the quick solution and the new .whl patch. I would also like to install this in another Python environment; for example, I have Python 3.6.8 installed on a server. Will PyMuPDF 1.17.7 be generally available for compatible Python versions such as 3.6.8 and 3.7?

JorjMcKie · 2020-09-14T07:10:54Z

I will publish v1.17.7 some time this week. I have never worked with a Linux server, but the generated Linux wheels should work.

JorjMcKie · 2020-09-14T09:41:10Z

Just had another idea:
If all you want is to remove all the text, then why not use this little script? It does not depend on redactions, but directly does a "search and destroy" of the PDF text objects:

import fitz

doc = fitz.open("...")
for page in doc:
    page.cleanContents()  # clean page description syntax
    xref = page.getContents()[0]  # the remaining /Contents object
    cont = bytearray(doc.xrefStream(xref))  # read as modifiable
    i1 = 0  # all text objects are wrapped in string pairs b"BT" ... b"ET"
    while i1 < len(cont):
        i1 = cont.find(b"BT")
        if i1 < 0:
            break
        i2 = cont.find(b"ET", i1)
        if i2 < 0:
            break
        cont[i1 : i2 + 2] = b""  # remove text object
    doc.updateStream(xref, cont)  # replace the /Contents
    page.cleanContents()  # remove fonts no longer used

doc.save("no-text.pdf", garbage=3, deflate=True)
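The core of the script above is a plain byte search, so it can be tried without a PDF at hand. Below is a sketch of the same idea as a standalone function; strip_text_objects is a hypothetical name, and like the script it is naive: it does not tokenize the stream, so a "BT" or "ET" inside a string literal or inline image data would also match.

```python
def strip_text_objects(stream):
    """Return a copy of a content stream with every BT ... ET span removed."""
    cont = bytearray(stream)
    start = 0
    while True:
        i1 = cont.find(b"BT", start)  # next text object start
        if i1 < 0:
            break
        i2 = cont.find(b"ET", i1 + 2)  # matching text object end
        if i2 < 0:
            break  # unterminated text object: leave the rest untouched
        del cont[i1 : i2 + 2]  # cut the span including both markers
        start = i1  # continue scanning from where the span was cut
    return bytes(cont)
```

Inside the page loop, the result could be passed to doc.updateStream(xref, ...) in place of the inline while loop.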

JorjMcKie · 2020-09-14T11:52:44Z

New version 1.17.7 is being uploaded right now.
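To confirm that an environment has picked up the fixed release, the installed version can be compared against 1.17.7. This is a minimal sketch, assuming plain dotted numeric version strings and that the fix ships in 1.17.7 as this thread suggests; version_tuple and has_fix are hypothetical helper names.

```python
def version_tuple(version):
    """Convert a dotted version string such as '1.17.7' to a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def has_fix(installed, fixed="1.17.7"):
    """Return True if `installed` is at least the `fixed` version."""
    return version_tuple(installed) >= version_tuple(fixed)
```

With PyMuPDF imported, has_fix(fitz.VersionBind) would report whether the bindings are new enough. Comparing tuples of ints avoids the string comparison pitfall where "1.17.10" sorts before "1.17.7".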

baleris added the question label Sep 10, 2020

baleris assigned JorjMcKie Sep 10, 2020

JorjMcKie added the "upstream bug" (bug outside this package) label and removed the question label Sep 10, 2020

JorjMcKie closed this as completed Sep 14, 2020
