New line character missing and URLs adding periods and space #1974

@AlexNguyen124

Description

I have two issues to report. I am not sure whether these are bugs or features.

First, the last word of a line is often concatenated with the first word of the next line.
For example, I used pypdf on the following PDFs (the same occurs in other PDFs):
https://www.aircanada.com/content/dam/aircanada/portal/documents/PDF/en/corporate-sustainability/2021-cs-report.pdf
https://s2.q4cdn.com/470004039/files/doc_downloads/2022/08/2022_Apple_ESG_Report.pdf

In the first few lines of the output we see:

Citizens of the World
2021 Corporate Sustainability ReportCitizens of the World 2021
Corporate Sustainability Report 2Contents
INTRODUCTION  3
 —About our report 3
• Reporting framework 4
• Third-party assurance 4
 —Corporate sustainability at Air Canada 5

Immediately, there are a few inaccuracies:

  • 2nd line: "Report" and "Citizens" should be separated
  • 3rd line: "2" and "Contents" should be separated

The page we are trying to convert has many columns, and I suspect a newline character is missing.
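To gauge how often this happens in a given output file, a rough diagnostic like the one below can flag tokens that look like two words fused together. This is only a sketch and not part of pypdf: the helper name `find_fused_words` is made up for illustration, and the pattern (a lowercase letter or digit directly followed by a capitalized word) is a heuristic that will also flag legitimate camel-case names such as "iPhone".

```python
import re

# Heuristic: a token is suspicious when a lowercase letter or digit is
# directly followed by an uppercase letter and then a lowercase letter,
# e.g. "ReportCitizens" or "2Contents".
_FUSED = re.compile(r"\S*[a-z0-9][A-Z][a-z]\S*")


def find_fused_words(text: str) -> list[str]:
    """Return whitespace-delimited tokens that look like two fused words."""
    return _FUSED.findall(text)


sample = (
    "2021 Corporate Sustainability ReportCitizens of the World 2021\n"
    "Corporate Sustainability Report 2Contents\n"
    "INTRODUCTION  3\n"
)
print(find_fused_words(sample))  # ['ReportCitizens', '2Contents']
```

Running it over the full extracted text gives a quick count of likely missing line breaks, which may help when comparing pypdf versions or extraction settings.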

Second, spaces are added to URLs. Consider what I found in the output:
"www. aircanada. com/ citizensoftheworld"

I hope this helps.
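Until the URL spacing is addressed in the extraction itself, the extra spaces can be stripped in post-processing. The sketch below is a workaround under stated assumptions, not pypdf functionality: the helper name `join_split_urls` is hypothetical, it only looks at text starting with `www.` or `http(s)://`, and it assumes that a space following `.` or `/` inside such a run is spurious when the next character is a lowercase letter or digit.

```python
import re

# A URL start ("www." or "http(s)://"), optionally followed by a stray space,
# then one or more chunks. A further chunk is absorbed only when the previous
# one ended in "." or "/" and the next one begins with a lowercase letter or
# digit (assumption: that space was inserted by the extraction).
_URL_WITH_GAPS = re.compile(
    r"(?:https?://|www\.)\s?\S+(?:(?<=[./])\s(?=[a-z0-9])\S+)*"
)


def join_split_urls(text: str) -> str:
    """Remove whitespace inside URL-like runs; leave other text untouched."""
    return _URL_WITH_GAPS.sub(lambda m: re.sub(r"\s+", "", m.group(0)), text)


broken = "Visit www. aircanada. com/ citizensoftheworld for more."
print(join_split_urls(broken))
# Visit www.aircanada.com/citizensoftheworld for more.
```

One known limitation: a URL that ends a sentence and is followed by a lowercase word (for example "www.example.com. then") would be joined with that word, so the result should be spot-checked.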

Environment

Google Colab

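import os

from pypdf import PdfReader

# path_to_pdf, txt_path, and fname are set elsewhere (not shown here)
doc = PdfReader(path_to_pdf)
path_to_txt = os.path.join(txt_path, "pypdf", fname) + ".txt"
print(path_to_txt)
# Concatenate the extracted text of every page
text = ""
for page in doc.pages:
    text += page.extract_text()
# Write the combined text to the output file
with open(path_to_txt, "w") as out:
    out.write(text)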


    Labels

    whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
    workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow.
