Random whitespaces are inserted when using page.extract_text() #1507

@einelson

Description

I am trying to extract text from various PDF documents to use in an NLP project. When I call page.extract_text(), random whitespace appears inside the extracted words, even though the PDF document has no spaces at those positions.

Environment

Using VS Code and running via the command prompt.

$ python -m platform
Windows-10-10.0.22621-SP0

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.12.1

Code + PDF

This is a minimal, complete example that shows the issue:

test_doc.pdf
(The PDF was generated using the default settings in Microsoft Word.) It looks like this:

[image: test_doc.pdf as rendered]

The code is:

import os

from PyPDF2 import PdfReader, __version__

pdf = PdfReader(os.path.join(os.getcwd(), "test_doc.pdf"))

print(f"PyPDF2=={__version__}")

text = ""
for page in pdf.pages:
    page_content = page.extract_text()
    text = text + page_content
print(text)

Output

PyPDF2==2.12.1
This is a test document by Ethan Nelson.  
 
Tuesday was a good time to call ( 000) 000-0000 . This is his ph one mu mber . This is a random address for 
testing purposes : 341 Maple st Paytonville Maine 45681.  
Anyway, there are random whitespaces here . 
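
To make the spurious spaces easier to spot, here is a rough sketch that flags spaces sitting directly before punctuation or directly after an opening parenthesis. It is not part of PyPDF2: the helper name `find_suspicious_spaces` and the regular expression are assumptions made up for illustration, and the sample string is copied from the output above. It cannot detect splits inside words such as "ph one" or "mu mber".

```python
import re

# Hypothetical helper for illustration; not part of PyPDF2.
# Matches a run of spaces directly before . , : ; ! ? )
# or directly after an opening parenthesis.
SUSPICIOUS = re.compile(r" +(?=[.,:;!?)])|(?<=\() +")


def find_suspicious_spaces(text):
    """Return (offset, context) pairs for each suspicious run of spaces."""
    hits = []
    for match in SUSPICIOUS.finditer(text):
        # Keep up to 10 characters on each side so the hit is readable.
        start = max(match.start() - 10, 0)
        end = min(match.end() + 10, len(text))
        hits.append((match.start(), text[start:end]))
    return hits


# Fragment of the extracted text shown in the output above.
sample = "call ( 000) 000-0000 . This is his ph one mu mber ."
for offset, context in find_suspicious_spaces(sample):
    print(offset, repr(context))
```

This only helps count and locate the most obvious cases when comparing PyPDF2 versions or extraction settings. It does not repair the text.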

Metadata

    Labels

    whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
    workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow.
