mupdf 1.19.0 builds #1313

Closed
MerlijnWajer opened this issue Oct 6, 2021 · 28 comments

@MerlijnWajer

No particular rush, but it would be great to get (test) builds of MuPDF 1.19.0. I was going to ask for a build of the latest RC (to search for potential issues), but I see they just released 1.19.0.

I'm sure this is already in planning, but please let me know if you need any testers. I'd be happy to test.

Thanks!

@JorjMcKie
Collaborator

👍😎
I have already been testing it on Windows for a few days.
Thank you very much for your offer - this is more than welcome!
As I remember, you have a Linux system with Python 3.8.
I should be able to prepare a wheel for you by tomorrow.

@MerlijnWajer
Author

Sounds great, thanks!

@caerulescens

I'm interested in testing as well; running debian 10.

@JorjMcKie
Collaborator

@caerulescens - Python version?

@leclerce

leclerce commented Oct 7, 2021

I can test this as well with respect to #1311. Python 3.8 Linux as well.

Thanks!

@caerulescens

caerulescens commented Oct 7, 2021

@JorjMcKie Python 3.8; thank you. I'm set up to build any version from source.

@JorjMcKie
Collaborator

Guys, I am trying to compile on Linux.
The build itself succeeds, both in the GitHub Action and locally. But when I import the module, I see this:

Python 3.8.12 (default, Sep 10 2021, 00:16:05)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.27.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import fitz
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-5ebe19763ce1> in <module>
----> 1 import fitz

/usr/local/lib/python3.8/dist-packages/fitz/__init__.py in <module>
      8 # ------------------------------------------------------------------------
      9 import sys
---> 10 from fitz.fitz import *
     11
     12 # define the supported colorspaces for convenience

/usr/local/lib/python3.8/dist-packages/fitz/fitz.py in <module>
     13 # Import the low-level C/C++ module
     14 if __package__ or "." in __name__:
---> 15     from . import _fitz
     16 else:
     17     import _fitz

ImportError: /usr/local/lib/python3.8/dist-packages/fitz/_fitz.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZTVN10__cxxabiv117__class_type_infoE
In [2]:

Does this ring any bells?
It seems to be a symbol connected to the C++ libraries. C++ code is part of MuPDF this time because of the integrated support for Tesseract, which is written in C++.
Any idea what I can do?
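
For readers who hit the same error: the missing symbol can be decoded by hand to see where it comes from. The following is only a minimal sketch that handles a tiny subset of the Itanium C++ name mangling (a `_ZTV`, `_ZTI` or `_ZTS` prefix followed by a nested name); the function name `demangle_nested` is made up for this illustration and it is no replacement for a tool like `c++filt`.

```python
def demangle_nested(symbol: str) -> str:
    """Decode a small subset of Itanium C++ ABI mangled names.

    Supports an optional special-name prefix (_ZTV, _ZTI, _ZTS) followed by
    a nested name of the form N<len><id><len><id>...E. Raises ValueError
    for anything else.
    """
    prefixes = {
        "_ZTV": "vtable for ",
        "_ZTI": "typeinfo for ",
        "_ZTS": "typeinfo name for ",
    }
    kind = ""
    for prefix, label in prefixes.items():
        if symbol.startswith(prefix):
            kind, rest = label, symbol[len(prefix):]
            break
    else:
        if not symbol.startswith("_Z"):
            raise ValueError("not an Itanium-mangled symbol")
        rest = symbol[2:]
    if not (rest.startswith("N") and rest.endswith("E")):
        raise ValueError("only plain nested names are supported")

    body, parts, pos = rest[1:-1], [], 0
    while pos < len(body):
        start = pos
        # Each component is a decimal length followed by that many characters.
        while pos < len(body) and body[pos].isdigit():
            pos += 1
        if pos == start:
            raise ValueError("expected a length prefix")
        length = int(body[start:pos])
        parts.append(body[pos:pos + length])
        pos += length
    return kind + "::".join(parts)


print(demangle_nested("_ZTVN10__cxxabiv117__class_type_infoE"))
```

The symbol from the traceback decodes to the vtable of `__cxxabiv1::__class_type_info`, which belongs to the C++ runtime. That fits the suspicion above that the shared object contains C++ code but was linked without the C++ runtime library.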

@MerlijnWajer
Author

I think this might be the solution: https://stackoverflow.com/questions/47890484/undefined-symbol-during-dlopen

@MerlijnWajer
Author

This seems to allow you to tell it which compiler to use: https://shwina.github.io/custom-compiler-linker-extensions/

Alternatively, setting the "CC" environment variable might also work: https://stackoverflow.com/questions/16737260/how-to-tell-distutils-to-use-gcc

I am not sure whether it is as simple as using g++ instead of gcc, but it's worth a shot I guess.

@JorjMcKie
Collaborator

Thanks @MerlijnWajer - it was simple enough: just inserted the argument language="c++" in the Extension definition.
This made fitz importable. Will now make a few shakedown tests before re-building on Github.
Will drop you a note.
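
For anyone with a similar build problem, the fix described above might look roughly like this in a setup.py. This is a hedged sketch only: the source path and library name are placeholders, not PyMuPDF's actual build configuration, and `FITZ_EXT_KWARGS` / `make_extension` are names invented for the example. The one piece taken from this thread is the `language="c++"` argument of `Extension`.

```python
# Keyword arguments for the extension module. The module name matches the
# traceback above; sources and libraries are placeholders for illustration.
FITZ_EXT_KWARGS = {
    "name": "fitz._fitz",
    "sources": ["fitz/fitz_wrap.c"],  # hypothetical path
    "libraries": ["mupdf"],           # hypothetical library list
    # Link with the C++ compiler driver so the C++ runtime gets pulled in.
    "language": "c++",
}


def make_extension():
    """Build the Extension object; setuptools is imported only when needed."""
    from setuptools import Extension

    return Extension(**FITZ_EXT_KWARGS)


# In setup.py one would then pass it on, for example:
#   setup(name="...", ext_modules=[make_extension()])
```

With `language="c++"`, the link step should use the C++ driver instead of the C one, which is what makes symbols like the one in the traceback resolvable.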

@JorjMcKie
Collaborator

For those who want to do preliminary Linux Python3.8 tests of pymupdf 1.19.0:
This is the wheel wrapped in an extra zip file: "linux-wheel3x".

There are a lot of changes to the "geometry" objects Rect / IRect, and there is a new feature: "journalling" of PDF changes. I have started documenting them, but this is still incomplete.
You may find this documentation helpful already:
help-pymupdf.zip

@JorjMcKie
Collaborator

If you want to execute tests using pytest, please use the modified scripts / files here.

@MerlijnWajer
Author

The build works well for me, I did not have to change any code. I tested https://github.com/internetarchive/archive-pdf-tools which makes heavy use of PyMuPDF (I think), and the results for 1.18 and 1.19 are the same:

$ recode_pdf --from-imagestack 'sim_english-illustrated-magazine_1884-12_2_15_jp2/*' --hocr-file sim_english-illustrated-magazine_1884-12_2_15_hocr.html --scandata sim_english-illustrated-magazine_1884-12_2_15_scandata.xml -v --dpi 400 --bg-downsample 3 -m 2 -t 5 -o /tmp/out-1.19.pdf
	 MMX
	 SSE
	 SSE2
	 SSE3
	 SSSE3
	 SSE41
	 POPCNT
	 SSE42
	 AVX
	 F16C
	 FMA3
	 AVX2
Creating text only PDF
Starting page generation at 2021-10-09T09:00:48.921357
Finished page generation at 2021-10-09T09:00:49.035034
Creating text pages took 0.1137 seconds
Inserting (and compressing) images
Converting with image mode: 2
MRC time breakdown: {'image_load': 52, 'hocr_mask_gen': 409, 'est_1': 216, 'threshold': 105, 'mask_jbig2': 21, 'fg_partial_blur': 67, 'fg_jp2': 75, 'bg_partial_blur': 60, 'bg_downsample': 31, 'bg_jp2': 30, 'page_image_insertion': 98}
Saving PDF now
Processed 4 pages at 1.36 seconds/page
Compression ratio: 6.124523
$ recode_pdf --from-imagestack 'sim_english-illustrated-magazine_1884-12_2_15_jp2/*' --hocr-file sim_english-illustrated-magazine_1884-12_2_15_hocr.html --scandata sim_english-illustrated-magazine_1884-12_2_15_scandata.xml -v --dpi 400 --bg-downsample 3 -m 2 -t 5 -o /tmp/out-1.18.pdf
	 MMX
	 SSE
	 SSE2
	 SSE3
	 SSSE3
	 SSE41
	 POPCNT
	 SSE42
	 AVX
	 F16C
	 FMA3
	 AVX2
Creating text only PDF
Starting page generation at 2021-10-09T09:02:31.586354
Finished page generation at 2021-10-09T09:02:31.703482
Creating text pages took 0.1172 seconds
Inserting (and compressing) images
Converting with image mode: 2
MRC time breakdown: {'image_load': 43, 'hocr_mask_gen': 371, 'est_1': 224, 'threshold': 102, 'mask_jbig2': 20, 'fg_partial_blur': 65, 'fg_jp2': 79, 'bg_partial_blur': 58, 'bg_downsample': 31, 'bg_jp2': 26, 'page_image_insertion': 112}
Saving PDF now
Processed 4 pages at 1.34 seconds/page
Compression ratio: 6.124523
$ diff -a /tmp/out-1.18.pdf /tmp/out-1.19.pdf
404c404
<   /CreationDate (D:20211009090236Z)
---
>   /CreationDate (D:20211009090054Z)
408c408
<   /ModDate (D:20211009090236Z)
---
>   /ModDate (D:20211009090054Z)
3100,3102c3100,3102
<               <xmp:CreateDate>2021-10-09T09:02:36Z</xmp:CreateDate>
<               <xmp:MetadataDate>2021-10-09T09:02:36Z</xmp:MetadataDate>
<               <xmp:ModifyDate>2021-10-09T09:02:36Z</xmp:ModifyDate>
---
>               <xmp:CreateDate>2021-10-09T09:00:54Z</xmp:CreateDate>
>               <xmp:MetadataDate>2021-10-09T09:00:54Z</xmp:MetadataDate>
>               <xmp:ModifyDate>2021-10-09T09:00:54Z</xmp:ModifyDate>
3184c3184
<   /ID [ <FB00149CB6AA8BF2786301372183BB99> <71A032009751ABA1580D74B0D831BC32> ]
---
>   /ID [ <0EE0AAB1701209CE5AD4E85BAB7C8A86> <23A16835F2D9C99DDA9DFAF403492B3E> ]

With the added bonus that the new MuPDF (finally) ships with proper JBIG2 support, which significantly increases the compression ratio (from 6.12 to 8.05 for this sample):

$ recode_pdf --from-imagestack 'sim_english-illustrated-magazine_1884-12_2_15_jp2/*' --hocr-file sim_english-illustrated-magazine_1884-12_2_15_hocr.html --scandata sim_english-illustrated-magazine_1884-12_2_15_scandata.xml -v --dpi 400 --bg-downsample 3 -m 2 -t 5 -o /tmp/out-1.19-jbig2.pdf --jbig2
	 MMX
	 SSE
	 SSE2
	 SSE3
	 SSSE3
	 SSE41
	 POPCNT
	 SSE42
	 AVX
	 F16C
	 FMA3
	 AVX2
Creating text only PDF
Starting page generation at 2021-10-09T09:01:06.757711
Finished page generation at 2021-10-09T09:01:06.878081
Creating text pages took 0.1204 seconds
Inserting (and compressing) images
Converting with image mode: 2
MRC time breakdown: {'image_load': 45, 'hocr_mask_gen': 422, 'est_1': 212, 'threshold': 112, 'mask_jbig2': 97, 'fg_partial_blur': 77, 'fg_jp2': 79, 'bg_partial_blur': 61, 'bg_downsample': 31, 'bg_jp2': 33, 'page_image_insertion': 118}
Saving PDF now
Processed 4 pages at 1.46 seconds/page
Compression ratio: 8.051572

https://archive.org/~merlijn/pymupdf1.19/

Many thanks!
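
The diff between the 1.18 and 1.19 outputs earlier in this comment differs only in timestamps and the /ID entry. A comparison like that can be automated with a small script; this is a rough sketch assuming the metadata is stored uncompressed in the file (as it is in the diff shown), and the helper names are made up for the example.

```python
import re

# Lines that legitimately change between two runs: timestamps and the file ID.
VOLATILE = re.compile(
    rb"/CreationDate|/ModDate|/ID\s*\[|<xmp:(CreateDate|MetadataDate|ModifyDate)>"
)


def stable_lines(data: bytes) -> list:
    """Return the lines of a PDF byte string without the volatile ones."""
    return [line for line in data.splitlines() if not VOLATILE.search(line)]


def same_apart_from_metadata(pdf_a: bytes, pdf_b: bytes) -> bool:
    """True if both byte strings match once volatile lines are dropped."""
    return stable_lines(pdf_a) == stable_lines(pdf_b)
```

Usage would be along the lines of `same_apart_from_metadata(open("/tmp/out-1.18.pdf", "rb").read(), open("/tmp/out-1.19.pdf", "rb").read())`.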

@JorjMcKie
Collaborator

Thank you very much for your effort and feedback!
I am continuing my own tests, and I will also see which of the new features make the most sense to integrate before I release this version.
Candidates are integrated OCR and conversion to DOCX and ODT documents (for LibreOffice and MS Word).
Journalling / logging already looks good.

@caerulescens

@JorjMcKie I'm just now getting around to testing some specific errors; the link to the wheels build is broken I think. How should I go about getting the release?

@JorjMcKie
Collaborator

Try this one, it's newer anyway: https://github.com/JorjMcKie/py-mupdf/actions/runs/1339456713

@caerulescens

Thanks! Will do.

@caerulescens

caerulescens commented Oct 14, 2021

All of the below are fixed with upgrading to v1.19.0:


In addition, an image redact annotation glitch that was noticed in the past has been fixed. In v1.18.x, for a document containing an image mask, redacting the image mask inverted its colors, swapping white and black. I reproduced the bug and verified the fix with a document that I cannot post here. See the info and screenshots below:

  • mutool info: 1 (1 0 R): [ CCITTFax ] 2544x3227 1bpc ImageMask (5 0 R)
  • page.get_image_info()
'number' = {int} 0
'bbox' = {tuple: 4} (0.0, 0.0, 595.0, 842.0)
'transform' = {tuple: 6} (595.0, 0.0, -0.0, 842.0, 0.0, 0.0)
'width' = {int} 2544
'height' = {int} 3227
'colorspace' = {int} 0
'cs-name' = {str} 'None'
'xres' = {int} 96
'yres' = {int} 96
'bpc' = {int} 1
'size' = {int} 81190
  • (v1.18.x) Portion of glitched page after placing a redact annotation anywhere on the ImageMask:
    Screenshot from 2021-10-14 15-08-55

  • (v1.19.0) Portion of correct page after placing a redact annotation anywhere on the ImageMask:
    Screenshot from 2021-10-14 15-12-44
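
To find pages worth re-testing for this glitch, one could filter the `page.get_image_info()` results for entries that resemble the one quoted above. This is a heuristic sketch only: it assumes that image masks show up with 1 bit per component and colorspace 0, which is all the output in this comment tells us, and the function name is invented for the example.

```python
def image_mask_candidates(image_infos):
    """Return entries that resemble the ImageMask entry quoted above.

    Heuristic: 'bpc' equal to 1 and 'colorspace' equal to 0. Other kinds
    of images may match too, so treat the result as a list of candidates.
    """
    return [
        info
        for info in image_infos
        if info.get("bpc") == 1 and info.get("colorspace") == 0
    ]
```

In a script this would be called as `image_mask_candidates(page.get_image_info())` for each page.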

@caerulescens

@JorjMcKie When do you expect v1.19.0 to be released to PyPI?

@JorjMcKie
Collaborator

@caerulescens - That looks great! Thank you very much for that thorough analysis!

> @JorjMcKie When do you expect v1.19.0 to be released to PyPI?

I am hoping to do this over this weekend.

@caerulescens

That's great; thank you!

@JorjMcKie
Collaborator

The new version 1.19.0 has just been uploaded to PyPI for Windows and Linux on desktop systems. Linux ARM and Mac OSX wheels are currently being generated and should be available too in the next few minutes.

@mjg
Contributor

mjg commented Oct 17, 2021

Since you've been testing: Do you have examples for the new OCR feature? (I'm the Fedora packager for mupdf and need to make sure mupdf has what PyMuPDF needs.)

@saetlan

saetlan commented Oct 17, 2021

Hey, thanks for your work and quick updates!
Just hijacking this issue to let you know that I see a difference in behavior between 1.18 and 1.19: the first 2 coordinates are 0 when using:
page.get_text("words", flags=flag)
1.18.12
(213.79994201660156, 21.91998291015625, 220.99993896484375, 21.931982040405273, 'text', 0, 1, 0)
1.19
(0.0, 0.0, 220.99993896484375, 21.931982040405273, 'text', 0, 1, 0)
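
A quick way to check one's own documents for this symptom is to look for word tuples whose bounding box starts at exactly (0, 0). This is just a sketch: the helper name is made up, and a word can of course legitimately sit in the top-left corner, so hits are only hints.

```python
def zero_origin_words(words):
    """Return word tuples whose bbox starts at exactly (0, 0).

    Each item is expected in the page.get_text("words") layout:
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    return [w for w in words if w[0] == 0.0 and w[1] == 0.0]
```

It would be fed with the list returned by `page.get_text("words", flags=flag)`.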

@JorjMcKie
Collaborator

@saetlan - can you let me have the example document please? It looks more like the word is now shifted to the left by 213.8 ...

@JorjMcKie
Collaborator

@mjg - I will shortly publish a sub-version: it turns out that the OCR resolution for document pages, 72 dpi, is just too unsatisfactory.
With the way this is currently implemented, there seems to be no way to influence this.

What already works really well with OCR is making an OCR-ed PDF page from images (pixmaps).
So I suggest waiting a few days with your packaging and doing it with v1.19.1, which is already being tested.

The API however will remain unchanged and goes like this:

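import fitz  # PyMuPDF

# take a pixmap of some arbitrary image (it may also be made from a document page)
pix = fitz.Pixmap("image.png")  # placeholder file name
# then save it as a PDF with an OCR text layer
pix.pdfocr_save("ocr-ed.pdf", language="eng", compress=True)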

The result is a 1-page PDF showing the image and having an OCR textlayer. The quality is comparable to ocrmypdf and depends on the image itself.

Version 1.19.1 will improve the following:

import fitz  # PyMuPDF

doc = fitz.open("input.pdf")  # placeholder file name
page = doc.load_page(0)  # some page of a document to be OCR-ed
tp = page.get_textpage_ocr(language="eng", dpi=72)  # adjust resolution as desired
# now all text extraction and text search methods will work by reusing that textpage:
rectangles = page.search_for("needle", textpage=tp)
text = page.get_text("text", textpage=tp)
# etc.

The time-consuming part is creating the textpage, because this is where the OCR happens. A page full of text at 300 dpi may need 2 seconds, which again compares well with other approaches / packages.
The good thing here is that we have full integration in PyMuPDF scripts, which includes text processing functions that are as speedy as before.
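
To get a feeling for what the dpi value means: PDF page units are points (72 per inch), so the pixel dimensions scale with dpi / 72. The sketch below assumes the page is rasterized at the requested dpi before OCR; the function names are made up for this example, and the A4-like size 595 x 842 is the bbox that appears earlier in this thread.

```python
def pixel_size(width_pt: float, height_pt: float, dpi: int) -> tuple:
    """Pixel dimensions of a page rendered at the given resolution.

    Page units are points, 72 per inch, so the scale factor is dpi / 72.
    """
    scale = dpi / 72
    return round(width_pt * scale), round(height_pt * scale)


def pixel_count_ratio(dpi: int, base_dpi: int = 72) -> float:
    """How many times more pixels a rendering at dpi has than at base_dpi."""
    return (dpi / base_dpi) ** 2


print(pixel_size(595, 842, 72))   # page size in pixels at 72 dpi
print(pixel_size(595, 842, 300))  # page size in pixels at 300 dpi
print(pixel_count_ratio(300))     # pixel count factor between 72 and 300 dpi
```

At 300 dpi the same page has roughly 17 times as many pixels as at 72 dpi, which is why OCR quality and OCR time both depend so much on this parameter.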

OCR only works if an environment variable has been set that names the tessdata folder of the Tesseract installation. In Unix-like systems, this works with export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata.

This variable must be set outside of Python, before the script is started.
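
A script can at least check the variable up front and fail with a clear message. This is a minimal sketch using only the standard library; the function name is invented, and it relies on Tesseract's convention of naming language files `<language>.traineddata`. It only reads the environment and does not try to set it.

```python
import os


def tessdata_problem(language="eng", environ=None):
    """Return a description of the problem, or None if everything looks fine."""
    environ = os.environ if environ is None else environ
    folder = environ.get("TESSDATA_PREFIX")
    if not folder:
        return "TESSDATA_PREFIX is not set"
    if not os.path.isdir(folder):
        return f"TESSDATA_PREFIX points to a missing folder: {folder}"
    model = os.path.join(folder, f"{language}.traineddata")
    if not os.path.isfile(model):
        return f"no {language}.traineddata in {folder}"
    return None
```

A script could call `tessdata_problem()` before any OCR call and exit early if it returns a message.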

@JorjMcKie
Collaborator

> @saetlan - can you let me have the example document please? It looks more like the word is now shifted to the left by 213.8 ...

Never mind - found the error.
Thanks for submitting this. Would have been worth a real bug item 😊,
I will enter one now.

@JorjMcKie
Collaborator

@mjg - the new v1.19.1 is currently being uploaded (Linux and Mac OSX are still underway).
You may want to look at this and this Jupyter notebook to see OCR processing "at work".
