Text color numbers change between 1.24.14 and 1.25.0 #4139

Closed
@stevesimmons

Description of the bug

When upgrading PyMuPDF from 1.24.14 to 1.25.0, the reported text color codes have changed.

I tested this with this code:

import pymupdf

print(pymupdf.__version__)
flags = (
    pymupdf.TEXT_PRESERVE_IMAGES
    | pymupdf.TEXT_PRESERVE_WHITESPACE
    | pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE
)
doc = pymupdf.open("0d4cb925de9d383e.pdf")
page = doc[0]
dicts = page.get_text("dict", flags=flags, sort=True)

# Print each distinct span color once, with the block/line/span where it first appears.
seen = set()
for b_ctr, block in enumerate(dicts["blocks"]):
    # Image blocks have no "lines" key, so default to an empty list.
    for l_ctr, line in enumerate(block.get("lines", [])):
        for s_ctr, span in enumerate(line["spans"]):
            color = span.get("color")
            if color is not None and color not in seen:
                seen.add(color)
                print(f"B{b_ctr}.L{l_ctr}.S{s_ctr}: {color:8}, hex {hex(color):6}")

With PyMuPDF 1.24.14, the output shows positive color numbers:

1.24.14
B0.L0.S0:    44526, hex 0xadee
B2.L0.S0:        0, hex 0x0   
B6.L1.S0: 16777215, hex 0xffffff

With PyMuPDF 1.25.0, the output shows negative color numbers:

1.25.0
B0.L0.S0: -16732433, hex -0xff5111
B2.L0.S0: -16777216, hex -0x1000000
B6.L0.S0:       -1, hex -0x1  

This will break any code using PyMuPDF to find text based on predetermined color codes!

As to what caused this, the MuPDF release notes (https://mupdf.com/releases/history) for 1.25.0 RC2 do say that colors changed to RGBA with the addition of an alpha channel. I can't find anything that seems related in the PyMuPDF GitHub repository.
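To illustrate the alpha-channel guess, here is a minimal sketch in plain Python (no PyMuPDF calls). It assumes, without confirmation from the PyMuPDF source, that the 1.25.0 numbers are a signed 32-bit reading of a value with alpha in the top byte and RGB in the lower three bytes. The helper names `split_color` and `rgb_only` are hypothetical and only serve this example.

```python
def split_color(value: int) -> tuple[int, int, int, int]:
    """Split a color integer into (alpha, red, green, blue) bytes.

    Assumption: the value is a 32-bit quantity with alpha in the top byte.
    """
    unsigned = value & 0xFFFFFFFF  # reinterpret a signed 32-bit value as unsigned
    return (
        (unsigned >> 24) & 0xFF,  # alpha (assumed position)
        (unsigned >> 16) & 0xFF,  # red
        (unsigned >> 8) & 0xFF,   # green
        unsigned & 0xFF,          # blue
    )


def rgb_only(value: int) -> int:
    """Drop the assumed alpha byte and keep the lower 24 bits."""
    return value & 0xFFFFFF


# The three values reported by 1.25.0 above.
for value in (-16732433, -16777216, -1):
    alpha, red, green, blue = split_color(value)
    print(f"{value:10} -> alpha=0x{alpha:02x}, rgb=0x{rgb_only(value):06x}")
```

Under this reading, all three 1.25.0 values have a top byte of 0xff, and -16777216 and -1 reduce to 0x000000 and 0xffffff, matching the 1.24.14 output. The first value, -16732433, reduces to 0x00aeef rather than 0x00adee (44526), so masking alone would not restore the exact old number for that span.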

(Note also that grouping of text into blocks/lines/spans changed between 1.24.14 and 1.25.0.)

How to reproduce the bug

See description above.

PyMuPDF version

1.25.0

Operating system

Linux

Python version

3.11
