Description
I have two issues to report. I am not sure whether these are bugs or features.
First, words at the end of a line are often concatenated with the first word of the next line.
For example:
I used pypdf on the following PDFs (the same occurs in other PDFs):
https://www.aircanada.com/content/dam/aircanada/portal/documents/PDF/en/corporate-sustainability/2021-cs-report.pdf
https://s2.q4cdn.com/470004039/files/doc_downloads/2022/08/2022_Apple_ESG_Report.pdf
In the first few lines of the output we see:
Citizens of the World
2021 Corporate Sustainability ReportCitizens of the World 2021
Corporate Sustainability Report 2Contents
INTRODUCTION 3
—About our report 3
• Reporting framework 4
• Third-party assurance 4
—Corporate sustainability at Air Canada 5
Immediately, there are a few inaccuracies:
- 2nd line: "Report" and "Citizens" should be separated
- 3rd line: "2" and "Contents" should be separated
The page we are trying to convert has many columns, and I suspect a newline character is missing.
Second, spaces are added to URLs. Consider what I found in the output:
"www. aircanada. com/ citizensoftheworld"
I hope this helps.
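As a stopgap for the URL spacing, a post-processing heuristic along these lines might close the gaps. This is only a sketch and not part of pypdf: the function name `repair_spaced_urls` and the regular expression are my own assumptions. It removes a single space only when the space directly follows a "." or "/" inside a run that starts with "www." or "http(s)://" and is followed by a lowercase letter or digit. It can wrongly merge text when a sentence ends with a URL and the next sentence starts in lowercase.

```python
import re

# Assumption: a "URL with gaps" starts with www. or http(s):// and continues
# through non-space characters, allowing a single space only right after
# "." or "/" and only if a lowercase letter or digit follows.
_URL_WITH_GAPS = re.compile(
    r"(?:https?://|www\.)(?:[^\s]|(?<=[./]) (?=[a-z0-9]))+"
)


def repair_spaced_urls(text: str) -> str:
    """Remove spaces that appear after '.' or '/' inside URL-like runs."""
    return _URL_WITH_GAPS.sub(lambda m: m.group(0).replace(" ", ""), text)


if __name__ == "__main__":
    sample = 'See "www. aircanada. com/ citizensoftheworld" for details.'
    print(repair_spaced_urls(sample))
```

Text that does not contain a "www." or "http(s)://" prefix is left unchanged, so ordinary sentences with a space after a period are not affected.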
Environment
Google Colab
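import os
from pypdf import PdfReader

# path_to_pdf, txt_path, and fname are defined earlier in the notebook
doc = PdfReader(path_to_pdf)
text = ""
path_to_txt = os.path.join(txt_path, "pypdf", fname) + ".txt"
print(path_to_txt)
for page in doc.pages:
    text += page.extract_text()
# Write the extracted text to the output file
with open(path_to_txt, "w") as out:
    out.write(text)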
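One thing worth ruling out, as a guess rather than a confirmed cause: the loop above appends each page's text with no separator, so if a page's text does not end in a newline, its last word is fused to the first word of the next page. That could explain "ReportCitizens" if it falls on a page boundary, though not necessarily "2Contents". A minimal sketch of joining pages with an explicit separator follows. The helper name `join_page_texts` is hypothetical and not part of pypdf.

```python
def join_page_texts(page_texts, separator="\n"):
    """Join per-page strings so that pages are always separated.

    Trailing newlines are stripped first to avoid doubled separators,
    and empty pages are skipped.
    """
    cleaned = [t.rstrip("\n") for t in page_texts if t]
    return separator.join(cleaned)


# Hypothetical usage with the reader from the snippet above:
# text = join_page_texts(page.extract_text() for page in doc.pages)
```

If the fused words still appear after this change, the concatenation happens inside a single page and the separator is not the cause.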