I am not sure if this is a bug. #3797
Replies: 3 comments
-
The attached PDF is different from the attached image! This clearly is not an error, and I also see no basis for any "enhancement".
-
I am talking about text extraction. If you look at RED line 2 of the reference image, you will find that 'A194/C194 Cu Alloy' and 'Sample Name' are not extracted on the same line.
-
That, too, is not a bug but a technical peculiarity of MuPDF. You need your own code to recover lines that look roughly like the ones visible. But there is example code that can be used for this:
import pymupdf
# import a helper method from sister package
from pymupdf4llm.helpers.get_text_lines import get_text_lines
doc = pymupdf.open("test.pdf")
page = doc[0]
text = get_text_lines(page)
print(text)
This produces the following output:
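For anyone who prefers to write their own line recovery instead of using the helper, here is a minimal sketch of the general idea. It is not the code inside `get_text_lines`; it only assumes that the input has the tuple shape returned by PyMuPDF's `page.get_text("words")`, i.e. `(x0, y0, x1, y1, text, block_no, line_no, word_no)`. The function name `group_words_into_lines` and the `y_tolerance` default are made up for this illustration and will likely need tuning per document.

```python
def group_words_into_lines(words, y_tolerance=3.0):
    """Join words whose bottom edges are vertically close into one text line.

    `words` is assumed to look like the result of page.get_text("words"):
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    `y_tolerance` (in points) is a hypothetical default, not a library value.
    """
    lines = []  # each entry: [reference bottom y, list of word tuples]
    # Visit words top-to-bottom by bottom coordinate, then left-to-right.
    for w in sorted(words, key=lambda w: (w[3], w[0])):
        if lines and abs(w[3] - lines[-1][0]) <= y_tolerance:
            # Close enough to the current line's reference y: same line.
            lines[-1][1].append(w)
        else:
            # Too far below the current line: start a new one.
            lines.append([w[3], [w]])
    # Within each line, order the words by their left edge and join them.
    return [
        " ".join(w[4] for w in sorted(ws, key=lambda w: w[0]))
        for _, ws in lines
    ]


# Synthetic words standing in for real extraction output:
sample_words = [
    (200, 100.5, 300, 112.0, "A194/C194", 1, 0, 0),
    (10, 100.0, 80, 111.5, "Sample", 0, 0, 0),
    (85, 100.0, 120, 111.5, "Name", 0, 0, 1),
    (10, 130.0, 60, 142.0, "Next", 2, 0, 0),
]
recovered = group_words_into_lines(sample_words)
print(recovered)
```

With a real page, the input would come from `words = page.get_text("words")`. Words that MuPDF reports in different blocks end up on one line here as long as their bottom edges differ by no more than the tolerance.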
-
I have a sample PDF. I hope that these 5 lines of interest can be extracted and displayed correctly
(please refer to the lines underlined in RED in the attached PNG file).
The sample PDF file can be found here.
https://www.nxp.com/testreports/360000002263_CDA_194_ZHM_A_HLGN.pdf
(update sample PDF)