Closed
Description of the bug
When upgrading PyMuPDF from 1.24.14 to 1.25.0, the reported text color codes have changed.
I tested this with this code:
import pymupdf
print(pymupdf.__version__)
flags = pymupdf.TEXT_PRESERVE_IMAGES | pymupdf.TEXT_PRESERVE_WHITESPACE | pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE
doc = pymupdf.open("0d4cb925de9d383e.pdf")
page = doc[0]
dicts = page.get_text('dict', flags=flags, sort=True)
seen = set()
for b_ctr, b in enumerate(dicts['blocks']):
    for l_ctr, l in enumerate(b.get('lines', [])):
        for s_ctr, s in enumerate(l['spans']):
            color = s.get('color')
            if color is not None and color not in seen:
                seen.add(color)
                print(f"B{b_ctr}.L{l_ctr}.S{s_ctr}: {color:8}, hex {hex(color):6}")
With output for PyMuPDF version 1.24.14 having positive colour numbers:
1.24.14
B0.L0.S0: 44526, hex 0xadee
B2.L0.S0: 0, hex 0x0
B6.L1.S0: 16777215, hex 0xffffff
and output for PyMuPDF version 1.25.0 having negative colour numbers:
1.25.0
B0.L0.S0: -16732433, hex -0xff5111
B2.L0.S0: -16777216, hex -0x1000000
B6.L0.S0: -1, hex -0x1
This will break any code using PyMuPDF to find text based on predetermined color codes!
As to what caused this, the MuPDF release notes (https://mupdf.com/releases/history) for 1.25.0 RC2 do say that color changed to rgba with the addition of an alpha channel, but I can't find anything that seems related in the PyMuPDF GitHub repository.
(Note also that grouping of text into blocks/lines/spans changed between 1.24.14 and 1.25.0.)
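For what it's worth, the 1.25.0 numbers look like what you would get if a 32-bit 0xAARRGGBB value with alpha 0xFF were read as a signed integer. That is my assumption from the arithmetic, not something I found documented. The sketch below (plain Python, no PyMuPDF needed; the helper names are made up for illustration) reinterprets the three values printed above as unsigned and masks off the top byte:

```python
def as_unsigned32(color: int) -> int:
    """Reinterpret a possibly negative value as an unsigned 32-bit integer."""
    return color & 0xFFFFFFFF


def low_rgb(color: int) -> int:
    """Keep only the low 24 bits (0xRRGGBB), dropping the assumed alpha byte."""
    return color & 0xFFFFFF


# Colour values reported by PyMuPDF 1.25.0 in the output above.
reported_1_25_0 = [-16732433, -16777216, -1]

for c in reported_1_25_0:
    print(f"{c:10} -> unsigned {as_unsigned32(c):#010x}, rgb {low_rgb(c):#08x}")
```

Masking recovers 0x000000 and 0xffffff exactly, but the first value comes out as 0x00aeef rather than the 0x00adee reported by 1.24.14. So the green and blue channels also differ by one step each, and masking alone would not restore an exact match against colour codes collected with 1.24.14.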
How to reproduce the bug
See description above.
PyMuPDF version
1.25.0
Operating system
Linux
Python version
3.11