mupdf 1.19.0 builds #1313

MerlijnWajer · 2021-10-06T13:46:24Z

No particular rush, but it would be great to get (test) builds of MuPDF 1.19.0. I was going to ask for a build of the latest RC (to search for potential issues), but I see they just released 1.19.0.

I'm sure this is already in planning, but please let me know if you need any testers. I'd be happy to test.

Thanks!

JorjMcKie · 2021-10-06T19:00:43Z

👍😎
I have been testing it on Windows for a few days already.
Thank you very much for your offer - this is more than welcome!
As I remember, you have a Linux system with Python 3.8.
I should be able to prepare a wheel for you by tomorrow.

MerlijnWajer · 2021-10-06T19:01:54Z

Sounds great, thanks!

caerulescens · 2021-10-07T13:40:14Z

I'm interested in testing as well; running debian 10.

JorjMcKie · 2021-10-07T14:31:35Z

@caerulescens - Python version?

leclerce · 2021-10-07T16:34:04Z

I can test this as well with respect to #1311. Python 3.8 Linux as well.

Thanks!

caerulescens · 2021-10-07T17:49:34Z

@JorjMcKie Python 3.8; thank you. I'm set up to build any version from source.

JorjMcKie · 2021-10-08T07:55:00Z

Guys, I am trying to compile on Linux.
The build itself succeeds, both on GitHub Actions and locally. But when I import the module I see this:

Python 3.8.12 (default, Sep 10 2021, 00:16:05)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.27.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import fitz
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-5ebe19763ce1> in <module>
----> 1 import fitz

/usr/local/lib/python3.8/dist-packages/fitz/__init__.py in <module>
      8 # ------------------------------------------------------------------------
      9 import sys
---> 10 from fitz.fitz import *
     11
     12 # define the supported colorspaces for convenience

/usr/local/lib/python3.8/dist-packages/fitz/fitz.py in <module>
     13 # Import the low-level C/C++ module
     14 if __package__ or "." in __name__:
---> 15     from . import _fitz
     16 else:
     17     import _fitz

ImportError: /usr/local/lib/python3.8/dist-packages/fitz/_fitz.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZTVN10__cxxabiv117__class_type_infoE
In [2]:

Does this ring any bells?
It seems to be a symbol connected to C++ libraries. C++ code is part of MuPDF this time because of integrated support of Tesseract, which is written in C++.
Any idea what I can do?

MerlijnWajer · 2021-10-08T09:16:04Z

I think this might be the solution: https://stackoverflow.com/questions/47890484/undefined-symbol-during-dlopen

MerlijnWajer · 2021-10-08T09:17:28Z

This seems to allow you to tell setuptools which compiler to use: https://shwina.github.io/custom-compiler-linker-extensions/

Alternatively setting the "CC" environment might also work: https://stackoverflow.com/questions/16737260/how-to-tell-distutils-to-use-gcc

I am not sure whether, in general, it is as simple as using g++ instead of gcc, but it's worth a shot I guess.

JorjMcKie · 2021-10-08T09:32:16Z

Thanks @MerlijnWajer - it was simple enough: just inserted the argument language="c++" in the Extension definition.
This made fitz importable. Will now make a few shakedown tests before re-building on Github.
Will drop you a note.
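
For readers hitting the same error: the missing symbol demangles to the vtable for __cxxabiv1::__class_type_info, which lives in the C++ runtime (libstdc++), so it usually means the extension was linked without that runtime. Below is a minimal sketch of what the change described above might look like in a setup.py. The source file and library names are placeholders chosen for illustration, not taken from the actual PyMuPDF build script; only the language="c++" argument comes from this thread.

```python
from setuptools import Extension

# Sketch only: the source path and library names are illustrative placeholders.
fitz_ext = Extension(
    "fitz._fitz",
    sources=["fitz/fitz_wrap.c"],        # placeholder: the generated wrapper source
    libraries=["mupdf", "mupdf-third"],  # placeholder: MuPDF libraries to link against
    language="c++",                      # ask the build to link as C++, pulling in the C++ runtime
)

# In a real setup.py this object would be passed on as:
#   setup(..., ext_modules=[fitz_ext])
```

As far as I understand, on Unix-like systems the language argument mainly affects which compiler driver is used for the final link step, which is why it can replace the CC / g++ workarounds suggested earlier.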

JorjMcKie · 2021-10-08T10:01:15Z

For those who want to do preliminary Linux Python3.8 tests of pymupdf 1.19.0:
This is the wheel wrapped in an extra zip file: "linux-wheel3x".

There are a lot of changes to "geometry" objects Rect / IRect. Also a new feature "journalling" of PDF changes. I started documenting them, which is still incomplete.
But you may find this documentation helpful already:
help-pymupdf.zip

JorjMcKie · 2021-10-08T10:06:22Z

If you want to execute tests using pytest, please use the modified scripts / files here.

MerlijnWajer · 2021-10-09T09:10:06Z

The build works well for me, I did not have to change any code. I tested https://github.com/internetarchive/archive-pdf-tools which makes heavy use of PyMuPDF (I think), and the results for 1.18 and 1.19 are the same:

$ recode_pdf --from-imagestack 'sim_english-illustrated-magazine_1884-12_2_15_jp2/*' --hocr-file sim_english-illustrated-magazine_1884-12_2_15_hocr.html --scandata sim_english-illustrated-magazine_1884-12_2_15_scandata.xml -v --dpi 400 --bg-downsample 3 -m 2 -t 5 -o /tmp/out-1.19.pdf
	 MMX
	 SSE
	 SSE2
	 SSE3
	 SSSE3
	 SSE41
	 POPCNT
	 SSE42
	 AVX
	 F16C
	 FMA3
	 AVX2
Creating text only PDF
Starting page generation at 2021-10-09T09:00:48.921357
Finished page generation at 2021-10-09T09:00:49.035034
Creating text pages took 0.1137 seconds
Inserting (and compressing) images
Converting with image mode: 2
MRC time breakdown: {'image_load': 52, 'hocr_mask_gen': 409, 'est_1': 216, 'threshold': 105, 'mask_jbig2': 21, 'fg_partial_blur': 67, 'fg_jp2': 75, 'bg_partial_blur': 60, 'bg_downsample': 31, 'bg_jp2': 30, 'page_image_insertion': 98}
Saving PDF now
Processed 4 pages at 1.36 seconds/page
Compression ratio: 6.124523

$ recode_pdf --from-imagestack 'sim_english-illustrated-magazine_1884-12_2_15_jp2/*' --hocr-file sim_english-illustrated-magazine_1884-12_2_15_hocr.html --scandata sim_english-illustrated-magazine_1884-12_2_15_scandata.xml -v --dpi 400 --bg-downsample 3 -m 2 -t 5 -o /tmp/out-1.18.pdf
	 MMX
	 SSE
	 SSE2
	 SSE3
	 SSSE3
	 SSE41
	 POPCNT
	 SSE42
	 AVX
	 F16C
	 FMA3
	 AVX2
Creating text only PDF
Starting page generation at 2021-10-09T09:02:31.586354
Finished page generation at 2021-10-09T09:02:31.703482
Creating text pages took 0.1172 seconds
Inserting (and compressing) images
Converting with image mode: 2
MRC time breakdown: {'image_load': 43, 'hocr_mask_gen': 371, 'est_1': 224, 'threshold': 102, 'mask_jbig2': 20, 'fg_partial_blur': 65, 'fg_jp2': 79, 'bg_partial_blur': 58, 'bg_downsample': 31, 'bg_jp2': 26, 'page_image_insertion': 112}
Saving PDF now
Processed 4 pages at 1.34 seconds/page
Compression ratio: 6.124523

$ diff -a /tmp/out-1.18.pdf /tmp/out-1.19.pdf
404c404
<   /CreationDate (D:20211009090236Z)
---
>   /CreationDate (D:20211009090054Z)
408c408
<   /ModDate (D:20211009090236Z)
---
>   /ModDate (D:20211009090054Z)
3100,3102c3100,3102
<               <xmp:CreateDate>2021-10-09T09:02:36Z</xmp:CreateDate>
<               <xmp:MetadataDate>2021-10-09T09:02:36Z</xmp:MetadataDate>
<               <xmp:ModifyDate>2021-10-09T09:02:36Z</xmp:ModifyDate>
---
>               <xmp:CreateDate>2021-10-09T09:00:54Z</xmp:CreateDate>
>               <xmp:MetadataDate>2021-10-09T09:00:54Z</xmp:MetadataDate>
>               <xmp:ModifyDate>2021-10-09T09:00:54Z</xmp:ModifyDate>
3184c3184
<   /ID [ <FB00149CB6AA8BF2786301372183BB99> <71A032009751ABA1580D74B0D831BC32> ]
---
>   /ID [ <0EE0AAB1701209CE5AD4E85BAB7C8A86> <23A16835F2D9C99DDA9DFAF403492B3E> ]
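
The diff above shows that only timestamps and the document ID differ. For repeated comparisons, the same check could be scripted roughly as follows. This is a sketch, not part of archive-pdf-tools or PyMuPDF: the function names are made up here, and it assumes the metadata is stored uncompressed in the file, as it is in the outputs diffed above.

```python
import re

# Fields that legitimately change between two otherwise identical runs.
VOLATILE_PATTERNS = [
    re.compile(rb"/(CreationDate|ModDate)\s*\(D:[^)]*\)"),
    re.compile(rb"<xmp:(CreateDate|MetadataDate|ModifyDate)>[^<]*</xmp:\1>"),
    re.compile(rb"/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]"),
]


def strip_volatile(data: bytes) -> bytes:
    """Remove timestamps and the document ID from raw PDF bytes."""
    for pattern in VOLATILE_PATTERNS:
        data = pattern.sub(b"", data)
    return data


def same_apart_from_metadata(a: bytes, b: bytes) -> bool:
    """True if two PDFs are byte-identical once volatile metadata is removed."""
    return strip_volatile(a) == strip_volatile(b)
```

Usage would be along the lines of reading both files in binary mode and passing their contents to same_apart_from_metadata().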

As an added bonus, the new MuPDF (finally) ships with proper JBIG2 support, which significantly increases compression ratios (from 6.12 to 8.05 in the run below):

$ recode_pdf --from-imagestack 'sim_english-illustrated-magazine_1884-12_2_15_jp2/*' --hocr-file sim_english-illustrated-magazine_1884-12_2_15_hocr.html --scandata sim_english-illustrated-magazine_1884-12_2_15_scandata.xml -v --dpi 400 --bg-downsample 3 -m 2 -t 5 -o /tmp/out-1.19-jbig2.pdf --jbig2
	 MMX
	 SSE
	 SSE2
	 SSE3
	 SSSE3
	 SSE41
	 POPCNT
	 SSE42
	 AVX
	 F16C
	 FMA3
	 AVX2
Creating text only PDF
Starting page generation at 2021-10-09T09:01:06.757711
Finished page generation at 2021-10-09T09:01:06.878081
Creating text pages took 0.1204 seconds
Inserting (and compressing) images
Converting with image mode: 2
MRC time breakdown: {'image_load': 45, 'hocr_mask_gen': 422, 'est_1': 212, 'threshold': 112, 'mask_jbig2': 97, 'fg_partial_blur': 77, 'fg_jp2': 79, 'bg_partial_blur': 61, 'bg_downsample': 31, 'bg_jp2': 33, 'page_image_insertion': 118}
Saving PDF now
Processed 4 pages at 1.46 seconds/page
Compression ratio: 8.051572

https://archive.org/~merlijn/pymupdf1.19/

Many thanks!

JorjMcKie · 2021-10-09T10:11:29Z

Thank you very much for your effort and feedback!
I am continuing my own tests, and I will also see which of the new features make most sense to integrate before I release this version.
Candidates are integrated OCR and conversion to DOCX and ODT documents (for LibreOffice and MS Word).
Journalling / logging already looks good.

caerulescens · 2021-10-14T13:05:32Z

@JorjMcKie I'm just now getting around to testing some specific errors; the link to the wheels build is broken I think. How should I go about getting the release?

JorjMcKie · 2021-10-14T13:56:34Z

Try this one, it's newer anyway: https://github.com/JorjMcKie/py-mupdf/actions/runs/1339456713

caerulescens · 2021-10-14T14:17:33Z

Thanks! Will do.

caerulescens · 2021-10-14T19:47:00Z

All of the below are fixed with upgrading to v1.19.0:

In addition to that, an image redact annotation glitch that has been noticed in the past has been fixed. For a document containing an image mask, if the image mask is redacted, then the color is inverted, swapping white and black. I reproduced the bug and verified the fix with a document that I cannot post here. See info and screenshots below,

mutool info: 1 (1 0 R): [ CCITTFax ] 2544x3227 1bpc ImageMask (5 0 R)
page.get_image_info()

'number' = {int} 0
'bbox' = {tuple: 4} (0.0, 0.0, 595.0, 842.0)
'transform' = {tuple: 6} (595.0, 0.0, -0.0, 842.0, 0.0, 0.0)
'width' = {int} 2544
'height' = {int} 3227
'colorspace' = {int} 0
'cs-name' = {str} 'None'
'xres' = {int} 96
'yres' = {int} 96
'bpc' = {int} 1
'size' = {int} 81190

(v1.18.x) Portion of glitched page after placing a redact annotation anywhere on the ImageMask:
(v1.19.0) Portion of correct page after placing a redact annotation anywhere on the ImageMask:

caerulescens · 2021-10-14T19:53:34Z

@JorjMcKie When do you expect v1.19.0 to be released to PyPI?

JorjMcKie · 2021-10-14T19:55:09Z

@caerulescens - That looks great! Thank you very much for that thorough analysis!

> @JorjMcKie When do you expect v1.19.0 to be released to PyPI?

I am hoping to do this over this weekend.

caerulescens · 2021-10-14T19:55:36Z

That's great; thank you!

JorjMcKie · 2021-10-17T11:18:38Z

The new version 1.19.0 has just been uploaded to PyPI for Windows and Linux on desktop systems. Linux ARM and Mac OSX wheels are currently being generated and should be available too in the next few minutes.

mjg · 2021-10-17T14:51:49Z

Since you've been testing: Do you have examples for the new OCR feature? (I'm the Fedora packager for mupdf and need to make sure mupdf has what PyMuPDF needs.)

saetlan · 2021-10-17T22:28:25Z

Hey, thanks for your work and quick updates!
Just hijacking this issue to let you know that I see unexpected behavior in 1.19 compared with 1.18: the first two coordinates are 0 when using:
page.get_text("words", flags=flag)
1.18.12
(213.79994201660156, 21.91998291015625, 220.99993896484375, 21.931982040405273, 'text', 0, 1, 0)
1.19
(0.0, 0.0, 220.99993896484375, 21.931982040405273, 'text', 0, 1, 0)
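
A quick way to scan a whole document for this symptom might look like the sketch below. It is only a heuristic, and the helper name is made up for illustration: it flags word tuples whose box starts exactly at the origin but ends elsewhere, which a legitimate word will rarely do. It assumes the tuple layout shown above, with the four coordinates first.

```python
def suspicious_words(words):
    """Return word tuples whose box starts at (0, 0) but ends elsewhere.

    `words` is a list of tuples shaped like the output shown above:
    (x0, y0, x1, y1, word, block_no, line_no, word_no).
    """
    flagged = []
    for w in words:
        x0, y0, x1, y1 = w[:4]
        if x0 == 0.0 and y0 == 0.0 and (x1 > 0.0 or y1 > 0.0):
            flagged.append(w)
    return flagged
```

It could be fed with the result of page.get_text("words") for each page and compared between the two versions.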

JorjMcKie · 2021-10-18T04:01:18Z

@saetlan - can you let me have the example document, please? It looks more like the word is now shifted to the left by 213.8 ...

JorjMcKie · 2021-10-18T07:52:36Z

@mjg - I will shortly publish a sub-version: it turns out that the OCR resolution for document pages, 72 dpi, is just too unsatisfactory.
With the way this is currently implemented, there seems to be no way to influence it.

What already works really well with OCR is making an OCR-ed PDF page from images (pixmaps).
So I suggest waiting a few days with your packaging and doing it with v1.19.1, which is already being tested.

The API however will remain unchanged and goes like this:

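import fitz  # PyMuPDF

# take a pixmap of some arbitrary image (it may also be a rendered document page)
pix = fitz.Pixmap("image.png")  # placeholder file name
# then save it as a 1-page PDF with an OCR text layer
pix.pdfocr_save("ocr-ed.pdf", language="eng", compress=True)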

The result is a 1-page PDF showing the image and having an OCR textlayer. The quality is comparable to ocrmypdf and depends on the image itself.

Version 1.19.1 will improve the following:

import fitz  # PyMuPDF

doc = fitz.open("input.pdf")  # placeholder file name
page = doc.load_page(0)  # some page of a document to be OCR-ed
tp = page.get_textpage_ocr(language="eng", dpi=72)  # adjust resolution as desired
# now all text extraction and text search methods will work by reusing that textpage:
rectangles = page.search_for("needle", textpage=tp)
text = page.get_text("text", textpage=tp)
# etc.

The time-consuming step is creating the textpage, because this is where the OCR happens. A page full of text at 300 dpi may need about 2 seconds, which again compares well with other approaches / packages.
The good thing here is that we have full integration in PyMuPDF scripts, including text processing functions that are as speedy as before.

OCR only works if an environment variable has been set that names the tessdata folder of the Tesseract installation. In Unix-like systems, this works with export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata.

Setting this must happen outside / before Python scripts can run.
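
Because a missing variable only shows up once OCR is attempted, a script could check for it up front. The helper below is a sketch using only the standard library. Its name is made up for illustration and it is not part of PyMuPDF; it only verifies that TESSDATA_PREFIX is set and points at an existing folder.

```python
import os


def tessdata_dir(environ=None):
    """Return the folder named by TESSDATA_PREFIX, or raise with a clear message."""
    if environ is None:
        environ = os.environ
    path = environ.get("TESSDATA_PREFIX")
    if not path:
        raise RuntimeError(
            "TESSDATA_PREFIX is not set; export it before starting Python"
        )
    if not os.path.isdir(path):
        raise RuntimeError(f"TESSDATA_PREFIX points to a missing folder: {path}")
    return path
```

Calling tessdata_dir() at the top of a script then fails early, before any pages are rendered.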

JorjMcKie · 2021-10-18T10:53:20Z

> @saetlan - can you let me have the example document, please? It looks more like the word is now shifted to the left by 213.8 ...

Never mind - found the error.
Thanks for submitting this. Would have been worth a real bug item 😊,
I will enter one now.

JorjMcKie · 2021-10-24T11:09:35Z

@mjg - the new v1.19.1 is currently being uploaded (Linux and Mac OSX are still underway).
You may want to look at this and this Jupyter notebook to see OCR processing "at work".

MerlijnWajer added the enhancement label Oct 6, 2021

MerlijnWajer assigned JorjMcKie Oct 6, 2021

JorjMcKie closed this as completed Oct 24, 2021
