New line character missing and URLs adding periods and space

Two issues to report. I'm not sure whether these are bugs or features.

First, the last word of a line is often concatenated with the first word of the next line.
For example, I used pypdf on the following PDFs (the same occurs with other PDFs):
https://www.aircanada.com/content/dam/aircanada/portal/documents/PDF/en/corporate-sustainability/2021-cs-report.pdf
https://s2.q4cdn.com/470004039/files/doc_downloads/2022/08/2022_Apple_ESG_Report.pdf

In the first few lines of the output we see:
```
Citizens of the World
2021 Corporate Sustainability ReportCitizens of the World 2021
Corporate Sustainability Report 2Contents
INTRODUCTION  3
 —About our report 3
• Reporting framework 4
• Third-party assurance 4
 —Corporate sustainability at Air Canada 5
```

Immediately, there are a few inaccuracies:
* 2nd line: "Report" and "Citizens" should be separated
* 3rd line: "2" and "Contents" should be separated

The page we are trying to convert has many columns, and I suspect a newline character is missing between these words.
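As a quick check, it may help to rule out joins that happen at page boundaries rather than inside a page. The sketch below joins per-page text with an explicit separator. `join_pages` is a hypothetical helper (not part of pypdf), and the two sample strings only stand in for the text of two consecutive pages. It would not explain joins that occur within a single page.

```python
from typing import Iterable


def join_pages(page_texts: Iterable[str], separator: str = "\n") -> str:
    """Join per-page text so that consecutive pages are split by `separator`."""
    # Strip trailing newlines first so pages are not separated by blank lines.
    return separator.join(text.rstrip("\n") for text in page_texts)


# Stand-ins for the extracted text of two consecutive pages (assumed values).
pages = ["2021 Corporate Sustainability Report", "Citizens of the World 2021"]
print(join_pages(pages))
```

With pypdf, this could be called as `join_pages(page.extract_text() for page in doc.pages)`.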

Second, spaces are added to URLs. Consider what I found in the output:
"www.  aircanada.  com/  citizensoftheworld"
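Until the cause is known, a post-processing workaround might look like the sketch below. It is only an assumption about what a cleanup could do, not pypdf behavior: `collapse_url_spaces` is a hypothetical helper that finds runs starting with `www.` or `http(s)://` and removes whitespace that directly follows a `.` or `/`. A known limitation: if a URL ends with a period at the end of a sentence, the next word would be glued onto it.

```python
import re

# A URL-like run: starts with "www." or "http(s)://", then keeps extending
# across whitespace as long as the previous character is "." or "/".
_URL_RUN = re.compile(r"(?:https?://|www\.)\S*(?:(?<=[./])[ \t]+\S+)*")


def collapse_url_spaces(text: str) -> str:
    """Remove spaces/tabs inside URL-like runs; leave other text untouched."""
    return _URL_RUN.sub(lambda m: re.sub(r"[ \t]+", "", m.group(0)), text)


broken = "www.  aircanada.  com/  citizensoftheworld"
print(collapse_url_spaces(broken))
```

Text that contains no `www.` or `http(s)://` prefix is returned unchanged, so ordinary sentence spacing is not affected.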

I hope this helps.


## Environment

Google Colab

```python
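import os

from pypdf import PdfReader

# path_to_pdf, txt_path and fname are assumed to be defined elsewhere (not shown)
doc = PdfReader(path_to_pdf)
text = ""
path_to_txt = os.path.join(txt_path, "pypdf", fname) + ".txt"
print(path_to_txt)
for page in doc.pages:
    # page text is appended as-is, with no separator between pages
    text += page.extract_text()
# write the extracted text to the output file
with open(path_to_txt, "w") as out:
    out.write(text)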
```


New line character missing and URLs adding periods and space #1974