ENH: Extract superscripts (x² instead of x2) #2045

Open
MartinThoma opened this issue Jul 30, 2023 · 7 comments
Labels
is-feature A feature request workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow

Comments

@MartinThoma
Member

MartinThoma commented Jul 30, 2023

Explanation

Superscripts are common in math, especially squares (e.g. x²) and cubes (e.g. x³).

Code Example

How would your feature be used? (Remove this if it is not applicable.)

from pypdf import PdfReader

reader = PdfReader("example.pdf")
print(reader.pages[0].extract_text())

Examples with the expected output:

Filename               | Currently extracted     | Expected
---------------------- | ----------------------- | -----------------------
pdflatex-x-square.pdf  | x2= 9 means x∈{3,−3}.   | x²= 9 means x∈{3,−3}.
LibreOffice-Writer.pdf | The square of x is denoted by x², the cube by x³. | Already as expected 🎉
@MartinThoma MartinThoma self-assigned this Jul 30, 2023
@MartinThoma MartinThoma added the is-feature and workflow-text-extraction labels and removed the whitespace label Jul 30, 2023
@miriam-z

miriam-z commented Jul 30, 2023

@MartinThoma:

Given the two examples above, why is the extraction perfect for the LibreOffice file :)

LibreOffice-Writer.pdf -> The square of x is denoted by x², the cube by x³.

but not for the pdflatex file?

pdflatex-x-square.pdf -> x2= 9 means x∈{3,−3}.

@pubpub-zz
Collaborator

A wonderful tool for this kind of analysis is PDFBox in its debug view.
For the LibreOffice file, when you look at the font used, you will see:
image

For the pdflatex file, however, the font size and position are changed:
image

@MartinThoma
Member Author

I haven't analyzed it yet, but I guess that LibreOffice uses the Unicode superscript characters. In contrast, LaTeX changes the font size and position of a normal "2".
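To make that guess concrete, here is a minimal sketch of how raised, smaller digits could be rewritten as Unicode superscripts. It is not pypdf's implementation. The helper `apply_superscripts`, the `(text, baseline_y, font_size)` fragment format, and the `size_ratio` threshold are all assumptions made for illustration, and the sample coordinates are invented rather than taken from pdflatex-x-square.pdf.

```python
# Hypothetical helper, not part of pypdf: map raised, smaller characters
# to their Unicode superscript equivalents.
SUPERSCRIPTS = str.maketrans("0123456789+-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻")


def apply_superscripts(fragments, size_ratio=0.8):
    """Join text fragments, converting superscript-looking ones.

    fragments: list of (text, baseline_y, font_size) in reading order.
    A fragment counts as a superscript if its font is clearly smaller than
    the last normal fragment and its baseline is higher (PDF y grows upward).
    size_ratio is an assumed threshold, not a value taken from any library.
    """
    out = []
    prev_y = prev_size = None
    for text, y, size in fragments:
        raised = (
            prev_size is not None
            and size <= prev_size * size_ratio
            and y > prev_y
        )
        if raised:
            out.append(text.translate(SUPERSCRIPTS))
        else:
            out.append(text)
            # Only normal text updates the reference baseline and size.
            prev_y, prev_size = y, size
    return "".join(out)


# Invented sample data resembling "x" followed by a raised, smaller "2".
fragments = [("x", 100.0, 10.0), ("2", 103.6, 7.0), ("= 9", 100.0, 10.0)]
result = apply_superscripts(fragments)
print(result)
```

One way to obtain such fragments might be the `visitor_text` callback of `extract_text`, which receives the text, the transformation matrices, and the font size. Whether the vertical offset is visible there for a given PDF would need to be checked against the actual file.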

@miriam-z

miriam-z commented Jul 30, 2023

This is my example, but the extracted text is empty. Why?

Screenshot 2023-07-30 at 20.07.16.pdf

Is the text too large, or is it a pixelation issue?

@MartinThoma
Member Author

@miriam-z

So since I took a screenshot, I thought I might need Tesseract OCR instead? Does it have a Python interface?

@MartinThoma
Member Author

@miriam-z Please ask your questions in https://github.com/py-pdf/pypdf/discussions/categories/q-a

3 participants