The get_text() function results in abnormal spaces #3894

Closed
1339503169 opened this issue Sep 26, 2024 · 2 comments
Labels
not a bug (not a bug / user error / unable to reproduce)

Comments

@1339503169

Description of the bug

The text extracted from this file with get_text() contains abnormal spaces. I reported a similar issue before, and for that file get_text(flags=fitz.TEXT_INHIBIT_SPACES) produced correct text, but it does not help for this file. Is there another way to extract the text correctly?

How to reproduce the bug

error.pdf

import fitz

file_path = 'error.pdf'
doc = fitz.open(file_path)
page = doc[0]  # get_text() is a Page method, so pick a page first
text_1 = page.get_text()
text_2 = page.get_text(flags=fitz.TEXT_INHIBIT_SPACES)

PyMuPDF version

1.24.5

Operating system

Windows

Python version

3.8

JorjMcKie added the "not a bug" label (not a bug / user error / unable to reproduce) on Sep 26, 2024
@JorjMcKie
Collaborator

This is the outcome for PyMuPDF v1.24.10, which looks pretty good:

import pymupdf

doc = pymupdf.open("error.pdf")
page = doc[0]  # first page of the uploaded file
print(page.get_text())
3158 SHI MO- OKUBO TOYAMA CI TY                                
076- 467- 2246   
TOYAMA PREF.  939- 2292 J APAN                                 
076- 467- 2249   
WL0029                  
AUG,  27.  2024  
TEL:
FAX:
NO.  :
DATE:
SOLD TO MESSRS: DELTA ELECTRONICS INT' L(SINGAPORE)PTE. LTD.                                                                
               COMPANY NO. 201016894N                                                                                    
               17 Kal l ang Juncti on,  #01- 01, Tri on, Si ngapore 339274                                                       
               Tel :  (65)6747- 5155 Fax: (65)6744- 9228                                                                     
     CONSIGNEE: DELTA ELECTRONICS INT' L(SINGAPORE)PTE. LTD.                                                                
               C/ O TIMAX CARGO SERVICES CO. , LTD.                                                                         
               LOT 840, DD125, PING HA ROAD, YUEN LONG, NEW TERRITORIES,                                                     
               HONG KONG   ZIP CODE: 00852                                                                               
               ATTN: MS. BONNIE HO/ MAY LEE                                                                                
               TEL: 852- 2616 9116  FAX: 852- 2616 0131                                                                     
       COUNTRY: PEOPLE' S REPUBLIC OF CHINA                                                                               
TO BE SHIPPED FROM: JAPAN                                                    TO: HONG KONG                                
               VIA:                                         SAILING ON OR ABOUT: AUG,  28.  2024                            
               PER: "DHL"                                                                                                
LETTER OF CREDIT NO                                                                                                     
              DATED                                                                                                     
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――    
    P. O. NO.                 DESCRIPTION OF GOODS     COUNTRY OF ORIGIN        QUANTITY     UNIT PRICE          AMOUNT    
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――    
                                                                           DDU HONG KONG                                
                                                                                             ( IN US$    )              
                                                                     IN PCS     (PER 1000)                              
HUMIDITY SENSOR ELEMENT                                                                                                 
    CPLB45006N             0939007500                            JAPAN         10, 000      @$132. 000       $1, 320. 00    
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――    
    TOTAL                                                                      10, 000 PCS                US$1, 320. 00    
                                                                PACKING:        1   CARTONS                              
                                                           GROSS WEIGHT:      4. 700 KGS                                  
CASE MARK:                                                                                                               
    DELTA/ HDK                                                                                                           
    HONG KONG                                                                                                           
    C/ NO. 5454                                                                                                           
    MADE IN JAPAN                                                                                                       
REMARKS:                                                                                                                 
    PAYMENT ;  O/ A 90DAYS                                                                                                
    COUNTRY OF ORIGIN: JAPAN                                                                                             
                                                               HOKURIKU ELECTRIC INDUSTRY CO. , LTD.                       
                                                               SIGNED BY                                                
                                                               ―――――――――――――――――――――               
                                                               MANAGER                                                  
      ____      
      INVOICE       
Hokuri ku El ectri c Industry Co. , Ltd.

@1339503169
Author

025E795545draft.pdf
After upgrading to version 1.24.11, the text extracted from the file I uploaded still contains abnormal spaces.
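
One possible direction, which nobody in this thread proposed or confirmed: read the per-character geometry from page.get_text("rawdict") and decide from the horizontal gaps between glyphs where a space belongs, instead of trusting the spaces in the extracted text. The sketch below assumes horizontal, left-to-right text. The helper names join_chars and extract_lines and the min_gap_ratio threshold of 0.25 are made up for illustration and would need tuning. It may not help for this particular file if the reported character boxes already contain the extra width.

```python
def join_chars(chars, size, min_gap_ratio=0.25):
    """Rebuild one line of text from (char, x0, x1) tuples in reading order.

    Stored whitespace characters are ignored. A space is emitted only when
    the horizontal gap between two consecutive glyphs exceeds
    min_gap_ratio * size (size = font size in points).
    """
    out = []
    prev_x1 = None
    for c, x0, x1 in chars:
        if c.isspace():
            continue  # decide about spaces from geometry, not from the text
        if prev_x1 is not None and x0 - prev_x1 > min_gap_ratio * size:
            out.append(" ")
        out.append(c)
        prev_x1 = x1
    return "".join(out)


def extract_lines(page, min_gap_ratio=0.25):
    """Return the text lines of a PyMuPDF page, re-spaced by join_chars."""
    lines = []
    for block in page.get_text("rawdict")["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            chars, size = [], 0.0
            for span in line["spans"]:
                size = max(size, span["size"])
                chars += [(ch["c"], ch["bbox"][0], ch["bbox"][2])
                          for ch in span["chars"]]
            lines.append(join_chars(chars, size, min_gap_ratio))
    return lines
```

With PyMuPDF installed, it could be tried as extract_lines(pymupdf.open("error.pdf")[0]) and the result compared line by line against page.get_text().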
