Closed
Description of the bug
Thanks for your great work. PDF parsing has become simpler and more convenient.
I have been using PyMuPDF for a while, and I have found that text coordinates are extracted incorrectly for some PDFs.
I am really hoping to find a way to solve this problem. Again, thanks for your great work!
How to reproduce the bug
The code:
import pymupdf


def getblock_lines_dict(fitz_dict):
    """Collect [block number, bbox, text] for every non-empty span on a page."""
    linelist = []
    # Get each line of text on the page
    for block in fitz_dict["blocks"]:
        if block['type'] == 0:  # block type 0 is a text block
            paranum = block['number']
            if 'lines' in block:  # the text block has content
                for line in block['lines']:  # treat each line as one line of text
                    for span in line['spans']:
                        if span['text'].strip():
                            linelist.append([paranum, span['bbox'], span['text']])
    return linelist


if __name__ == "__main__":
    doc = pymupdf.open("test.pdf")  # open a document
    for page in doc:  # iterate the document pages
        page_dict = page.get_text("dict")  # named page_dict to avoid shadowing the built-in dict
        linelist = getblock_lines_dict(page_dict)
        print(linelist)
I drew a picture of the results. Basically, the position coordinates of all the numbers are wrong.
Here is the test PDF:
number_bbox_error.pdf
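To narrow down where the wrong coordinates come from, one possible cross-check is to compare each span's bbox with the union of its per-character bboxes, which `page.get_text("rawdict")` exposes under the `chars` key. The sketch below is only a diagnostic idea, not a fix: the helper names `union_bbox`, `find_suspicious_spans`, and `check_pdf` are made up for illustration, the 1-point tolerance is an arbitrary assumption, and it is not confirmed that the two boxes actually disagree for this particular file.

```python
def union_bbox(boxes):
    """Return the smallest (x0, y0, x1, y1) rectangle enclosing all boxes."""
    xs0, ys0, xs1, ys1 = zip(*boxes)
    return (min(xs0), min(ys0), max(xs1), max(ys1))


def find_suspicious_spans(raw_dict, tol=1.0):
    """List spans whose bbox differs from the union of their character bboxes.

    raw_dict is assumed to have the shape returned by page.get_text("rawdict").
    tol (in points) is an arbitrary threshold chosen for illustration.
    """
    suspicious = []
    for block in raw_dict["blocks"]:
        if block["type"] != 0:  # skip non-text (image) blocks
            continue
        for line in block.get("lines", []):
            for span in line["spans"]:
                chars = span.get("chars", [])
                text = "".join(ch["c"] for ch in chars)
                if not text.strip():  # ignore empty or whitespace-only spans
                    continue
                char_box = union_bbox([ch["bbox"] for ch in chars])
                # Largest coordinate-wise difference between the two boxes
                diff = max(abs(a - b) for a, b in zip(span["bbox"], char_box))
                if diff > tol:
                    suspicious.append((text, tuple(span["bbox"]), char_box))
    return suspicious


def check_pdf(path):
    """Print every suspicious span in the PDF at path (requires PyMuPDF)."""
    import pymupdf  # imported lazily so the helpers above have no dependencies

    doc = pymupdf.open(path)
    for page in doc:
        for text, span_box, char_box in find_suspicious_spans(page.get_text("rawdict")):
            print(page.number, repr(text), span_box, char_box)
```

Calling `check_pdf("test.pdf")` would list the affected spans, if any. If nothing is reported, the span and character boxes agree with each other, and the offset more likely comes from how the text is positioned in the PDF itself than from the way spans are aggregated.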
PyMuPDF version
1.25.1
Operating system
Linux
Python version
3.10