get_texttrace returned an incorrect character bbox #2533

Closed
@little-polka-dot

Describe the bug (mandatory)

get_texttrace returned an incorrect character bbox.

To Reproduce (mandatory)

盐城高新区投资集团有限公司2023年度第四期超短期融资券募集说明书.pdf

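import fitz  # PyMuPDF

d = fitz.Document('temp/盐城高新区投资集团有限公司2023年度第四期超短期融资券募集说明书.pdf')
page = d[250]
# Print the texttrace entry of every '民' character on the page.
for span in page.get_texttrace():
    for char in span['chars']:
        # char is (unicode, glyph id, origin, bbox)
        if chr(char[0]) == '民':
            print(char)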

The above code will print this message:

(27665, 8775, (114.0, 634.72998046875), (114.0, 633.2604370117188, 124.44999694824219, 643.71044921875))

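The origin y (634.72998046875) is only about 1.47 points below the bbox y0 (633.2604370117188), while the bbox extends about 8.98 points below the origin to y1 (643.71044921875). The origin sits on the baseline, so most of the glyph should lie above it; the bbox appears to be shifted downward.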
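
To make the mismatch concrete, here is a minimal sketch that splits a character's bbox height into the part above and the part below the origin. It assumes the char tuple layout (unicode, glyph id, origin, bbox) seen in the output above and page coordinates where y grows downward. The helper name baseline_split is hypothetical and not part of PyMuPDF.

```python
def baseline_split(char):
    """Return (above, below): bbox extent above and below the origin's y.

    Assumes char is (unicode, glyph id, (ox, oy), (x0, y0, x1, y1))
    and that y increases downward on the page.
    """
    _ucs, _gid, origin, bbox = char
    _x0, y0, _x1, y1 = bbox
    baseline_y = origin[1]
    return baseline_y - y0, y1 - baseline_y


# The tuple printed for '民' in this report.
char = (27665, 8775, (114.0, 634.72998046875),
        (114.0, 633.2604370117188, 124.44999694824219, 643.71044921875))

above, below = baseline_split(char)
print(f"above baseline: {above:.2f}, below baseline: {below:.2f}")
# prints: above baseline: 1.47, below baseline: 8.98
```

For a CJK glyph resting on its baseline one would expect the larger share above the origin, so a result where below is much larger than above hints that the bbox is shifted.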

Screenshot of the incorrect character bbox (red).

Screenshot code:

clip = fitz.Rect(114.0, 633.2604370117188, 124.44999694824219, 643.71044921875)
page.get_pixmap(matrix=fitz.Matrix(2, 2), alpha=False, clip=clip).save('test.png')

Your configuration (mandatory)

print(sys.version, "\n", sys.platform, "\n", fitz.__doc__)
3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)]
win32

PyMuPDF 1.22.5: Python bindings for the MuPDF 1.22.2 library.
Version date: 2023-06-21 00:00:01.
Built for Python 3.7 on win32 (64-bit).
