The get_text() function results in abnormal spaces #3894
Labels
not a bug
not a bug / user error / unable to reproduce
Comments
This is the outcome for PyMuPDF v1.24.10, which looks pretty good:
import pymupdf
doc = pymupdf.open("error.pdf")
page = doc[0]  # first page of the document
print(page.get_text())
3158 SHI MO- OKUBO TOYAMA CI TY
076- 467- 2246
TOYAMA PREF. 939- 2292 J APAN
076- 467- 2249
WL0029
AUG, 27. 2024
TEL:
FAX:
NO. :
DATE:
SOLD TO MESSRS: DELTA ELECTRONICS INT' L(SINGAPORE)PTE. LTD.
COMPANY NO. 201016894N
17 Kal l ang Juncti on, #01- 01, Tri on, Si ngapore 339274
Tel : (65)6747- 5155 Fax: (65)6744- 9228
CONSIGNEE: DELTA ELECTRONICS INT' L(SINGAPORE)PTE. LTD.
C/ O TIMAX CARGO SERVICES CO. , LTD.
LOT 840, DD125, PING HA ROAD, YUEN LONG, NEW TERRITORIES,
HONG KONG ZIP CODE: 00852
ATTN: MS. BONNIE HO/ MAY LEE
TEL: 852- 2616 9116 FAX: 852- 2616 0131
COUNTRY: PEOPLE' S REPUBLIC OF CHINA
TO BE SHIPPED FROM: JAPAN TO: HONG KONG
VIA: SAILING ON OR ABOUT: AUG, 28. 2024
PER: "DHL"
LETTER OF CREDIT NO
DATED
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
P. O. NO. DESCRIPTION OF GOODS COUNTRY OF ORIGIN QUANTITY UNIT PRICE AMOUNT
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
DDU HONG KONG
( IN US$ )
IN PCS (PER 1000)
HUMIDITY SENSOR ELEMENT
CPLB45006N 0939007500 JAPAN 10, 000 @$132. 000 $1, 320. 00
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
TOTAL 10, 000 PCS US$1, 320. 00
PACKING: 1 CARTONS
GROSS WEIGHT: 4. 700 KGS
CASE MARK:
DELTA/ HDK
HONG KONG
C/ NO. 5454
MADE IN JAPAN
REMARKS:
PAYMENT ; O/ A 90DAYS
COUNTRY OF ORIGIN: JAPAN
HOKURIKU ELECTRIC INDUSTRY CO. , LTD.
SIGNED BY
―――――――――――――――――――――
MANAGER
____
INVOICE
Hokuri ku El ectri c Industry Co. , Ltd. |
025E795545draft.pdf |
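If residual gaps inside words (for example "Juncti on") still matter, one possible workaround is to rebuild each line from the per-character data that `page.get_text("rawdict")` returns and decide on spaces from the horizontal gaps between glyphs. The following is only a minimal sketch, not a PyMuPDF feature: the helper names `rebuild_line` and `page_text` and the `gap_ratio` threshold are hypothetical, and it assumes horizontal left-to-right text with characters delivered in reading order. Whether it helps for this file depends on the glyph widths the font reports.

```python
from statistics import median


def rebuild_line(chars, gap_ratio=0.3):
    """Join the characters of one line, inserting a space only where the
    horizontal gap between neighbours exceeds gap_ratio * median glyph width.

    `chars` is a list of dicts with keys "c" (the character) and
    "bbox" (x0, y0, x1, y1), as found in rawdict spans.
    """
    # Ignore spaces emitted by the extractor; spacing is decided from geometry.
    glyphs = [ch for ch in chars if not ch["c"].isspace()]
    if not glyphs:
        return ""
    widths = [ch["bbox"][2] - ch["bbox"][0] for ch in glyphs]
    threshold = gap_ratio * median(widths)  # hypothetical heuristic
    out = [glyphs[0]["c"]]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur["bbox"][0] - prev["bbox"][2]
        if gap > threshold:
            out.append(" ")
        out.append(cur["c"])
    return "".join(out)


def page_text(page, gap_ratio=0.3):
    """Rebuild the text of a PyMuPDF page from its "rawdict" output."""
    lines = []
    for block in page.get_text("rawdict")["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            chars = [ch for span in line["spans"] for ch in span["chars"]]
            lines.append(rebuild_line(chars, gap_ratio))
    return "\n".join(lines)
```

With a page opened as above, `print(page_text(page))` would print the rebuilt text. The `gap_ratio` value usually needs tuning per document, because a threshold that is too low keeps the spurious spaces and one that is too high merges separate words.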
Description of the bug
The text extracted from this file with the get_text() method contains abnormal spaces. I reported a similar issue before; for that file, calling get_text(flags=fitz.TEXT_INHIBIT_SPACES) produced correct text, but that flag does not help for this file. Are there any other ways to extract the correct text?
How to reproduce the bug
error.pdf
import fitz
file_path = 'error.pdf'
doc = fitz.open(file_path)
page = doc[0]  # get_text() is a Page method, so select a page first
text_1 = page.get_text()
text_2 = page.get_text(flags=fitz.TEXT_INHIBIT_SPACES)
PyMuPDF version
1.24.5
Operating system
Windows
Python version
3.8