Text coordinate extraction error #4182

Closed
@Number18-tong

Description of the bug

Thanks for your great work; PDF parsing has become simpler and more convenient.
I have been using PyMuPDF for a while, and I have found that text coordinates are extracted incorrectly for some PDFs.

I am really hoping to find a way to solve this problem. Again, thanks for your great work!

How to reproduce the bug

The code:

import pymupdf


def getblock_lines_dict(fitz_dict):
    """Collect [block number, span bbox, span text] for each non-empty text span."""
    linelist = []
    # Collect the text of every line on the page
    for block in fitz_dict["blocks"]:
        if block['type'] == 0:  # block type 0 is a text block
            paranum = block['number']
            if 'lines' in block:   # the text block has content
                for line in block['lines']:   # treat each line as one row of text
                    for span in line['spans']:
                        if span['text'].strip():
                            linelist.append([paranum, span['bbox'], span['text']])
    return linelist


if __name__ == "__main__":
    doc = pymupdf.open("test.pdf")  # open a document
    for page in doc:  # iterate the document pages
        page_dict = page.get_text("dict")  # avoid shadowing the built-in dict
        linelist = getblock_lines_dict(page_dict)
        print(linelist)  # print per page, not only the last page's result

I drew a picture of the results; basically, the position coordinates of all the numbers are wrong.
企业微信截图_17352721963259

Here is the test PDF:
number_bbox_error.pdf
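
To narrow down which spans are affected without inspecting a picture by eye, one possible check is to compare each span's bbox against the bbox of the line that contains it. The sketch below assumes the documented layout of `page.get_text("dict")` (blocks, lines and spans each carrying a `bbox`). The function name `find_inconsistent_spans`, the `tol` parameter and the sample data are hypothetical and not part of PyMuPDF. It only flags spans that disagree with their own line, so it would not detect a line that is shifted as a whole.

```python
def find_inconsistent_spans(page_dict, tol=1.0):
    """Return (block number, text, bbox) for spans lying outside their line's bbox.

    page_dict is assumed to have the shape returned by page.get_text("dict").
    tol is a hypothetical tolerance in points for rounding differences.
    """
    suspects = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # only text blocks carry lines and spans
            continue
        for line in block.get("lines", []):
            lx0, ly0, lx1, ly1 = line["bbox"]
            for span in line["spans"]:
                if not span["text"].strip():
                    continue  # skip whitespace-only spans, as in the report's code
                sx0, sy0, sx1, sy1 = span["bbox"]
                inside = (
                    sx0 >= lx0 - tol
                    and sy0 >= ly0 - tol
                    and sx1 <= lx1 + tol
                    and sy1 <= ly1 + tol
                )
                if not inside:
                    suspects.append((block["number"], span["text"], tuple(span["bbox"])))
    return suspects


# Hand-made sample in the same shape as the "dict" output; all values are invented.
sample_page = {
    "blocks": [
        {
            "type": 0,
            "number": 0,
            "bbox": (50.0, 100.0, 200.0, 112.0),
            "lines": [
                {
                    "bbox": (50.0, 100.0, 200.0, 112.0),
                    "spans": [
                        {"text": "Total", "bbox": (50.0, 100.0, 90.0, 112.0)},
                        {"text": "123", "bbox": (300.0, 100.0, 320.0, 112.0)},
                    ],
                }
            ],
        }
    ]
}

suspects = find_inconsistent_spans(sample_page)
print(suspects)
```

With a real document, the same function could be called on `page.get_text("dict")` for each page of `test.pdf`.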

PyMuPDF version

1.25.1

Operating system

Linux

Python version

3.10
