Handling Graphical Images & Superscripts #116

Open
SBhat2615 opened this issue Aug 26, 2024 · 7 comments

Comments

@SBhat2615

Embedded images are extracted to a dedicated folder, which I observed for some of the documents.

Some graphical images in the PDF below are not extracted to a separate folder.

The PDF also contains superscripts, which are not referenced in the output.

sample_document.pdf

@JorjMcKie
Contributor

Please provide the script you used.

@SBhat2615
Author

SBhat2615 commented Aug 26, 2024

Please provide the script you used.

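import pymupdf4llm

# input_path is the PDF to convert, output_path the Markdown file to write
md_text = pymupdf4llm.to_markdown(input_path, write_images=True)

# the context manager closes the file even if the write fails
with open(output_path, "w", encoding="utf-8") as output:
    output.write(md_text)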

@JorjMcKie
Contributor

Don't let me guess please:
On which page are you missing what?

@SBhat2615
Author

Don't let me guess please: On which page are you missing what?

  1. Figures 1 and 2 are not extracted as images.
  2. Tables 3, 5 and 6 are not extracted as images.

sample_document.md

@SBhat2615
Author

For superscripts, output similar to this would be good as well.

Screenshot 2024-08-27 at 11 14 29 AM

@CedricLor

CedricLor commented Sep 7, 2024

As regards the superscript handling improvement request, I guess what you're looking for is a feature handling footnotes and footnote references.

This would obviously be useful but it would imply a major refactoring.

A naive approach would first detect superscript text within the body text (this is already in place), save it in some data structure for further processing, then detect and separate the footnotes from the body text on the page, and finally match the footnotes with the references.

Footnotes are usually located at the bottom of the page and footnote references inside the body text, and pymupdf4llm generates the string linearly. The script would therefore need to use the saved references to try to match the beginnings of the lines at the bottom of the page. So far, not that difficult.

However, once a footnote has been matched, we would have to go back into the string to create the reference.
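To make the naive approach a bit more concrete, here is a minimal post-processing sketch, not anything pymupdf4llm offers. It assumes that a superscript reference shows up in the page text as a bracketed number such as `[1]`, that each footnote is a single line at the bottom of the page starting with its number, and that the target is the `[^1]` footnote syntax supported by some Markdown flavours. The function name `link_footnotes` and both patterns are made up for illustration.

```python
import re

# Assumed shapes, not guaranteed by any library:
REF = re.compile(r"\[(\d+)\]")         # reference in the body text, e.g. "[1]"
NOTE = re.compile(r"^(\d+)\s+(.*)$")   # footnote line, e.g. "1 At sea level."


def link_footnotes(page_text: str) -> str:
    """Rewrite one page so references and footnotes use the [^n] syntax."""
    lines = page_text.splitlines()
    referenced = set(REF.findall(page_text))

    # Walk up from the bottom of the page and collect the trailing block of
    # lines that look like footnotes for a label referenced on this page.
    notes = {}  # line index -> (label, footnote text)
    for i in range(len(lines) - 1, -1, -1):
        stripped = lines[i].strip()
        if not stripped:
            continue
        match = NOTE.match(stripped)
        if match is None or match.group(1) not in referenced:
            break
        notes[i] = (match.group(1), match.group(2))

    matched = {label for label, _ in notes.values()}

    def to_reference(match: re.Match) -> str:
        # Only rewrite references for which a footnote was actually found.
        label = match.group(1)
        return f"[^{label}]" if label in matched else match.group(0)

    out = []
    for i, line in enumerate(lines):
        if i in notes:
            label, text = notes[i]
            out.append(f"[^{label}]: {text}")
        else:
            # This is the "go back into the string" step.
            out.append(REF.sub(to_reference, line))
    return "\n".join(out)
```

References without a matching footnote are left untouched, so an ordinary bracketed number in the body is not turned into a dangling footnote link. Multi-line footnotes and body lines that happen to start with a number are not handled.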

Moreover, footnote references are sometimes numbered per page, with the index reset on each page. In a single md string for a multi-page document, footnotes and footnote references would then be ambiguous, so the script would also need to handle possible renumbering.
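The renumbering step could look roughly like the sketch below. It assumes the pages are available as separate strings and that references and footnotes on each page already use the `[^n]` form; `renumber_pages` is a hypothetical helper, not part of pymupdf4llm.

```python
import re

# Matches both a reference "[^1]" and the start of a definition "[^1]: ..."
FOOTNOTE_LABEL = re.compile(r"\[\^(\d+)\]")


def renumber_pages(pages: list[str]) -> str:
    """Join per-page texts, giving every footnote a document-wide number."""
    next_id = 1
    merged = []
    for page in pages:
        mapping = {}  # label on this page -> new document-wide label

        def replace(match: re.Match) -> str:
            nonlocal next_id
            old = match.group(1)
            if old not in mapping:
                mapping[old] = str(next_id)
                next_id += 1
            return f"[^{mapping[old]}]"

        merged.append(FOOTNOTE_LABEL.sub(replace, page))
    return "\n\n".join(merged)
```

Because the mapping is rebuilt for each page, a `[^1]` on page 2 gets a different number than the `[^1]` on page 1, while a reference and its definition on the same page stay in sync.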

Some documents also use several kinds of symbols for footnote references at the same time (e.g. Arabic and Roman numerals, for instance to differentiate the author's footnotes from the publisher's or the translator's), and these would also need to be differentiated and tracked in the data structure.

Finally, superscript text might also refer to endnotes or mark other information (e.g. "tm", the copyright symbol, the "o" in the number abbreviation "no", and so on).
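As a rough illustration of that tracking, each superscript could be tagged with a kind before any matching is attempted. The categories and the list of non-footnote markers below are assumptions chosen for the example, and `classify_superscript` is a hypothetical name.

```python
import re

ARABIC = re.compile(r"\d+")
# Lower- or upper-case Roman numeral; the lookahead rejects the empty string.
ROMAN = re.compile(
    r"(?=[ivxlcdm])m*(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})",
    re.IGNORECASE,
)
SYMBOLS = {"*", "†", "‡", "§"}
# Superscripts that usually carry other information (illustrative list only).
NOT_A_NOTE = {"tm", "sm", "©", "®", "o", "st", "nd", "rd", "th"}


def classify_superscript(text: str) -> str:
    """Return 'arabic', 'roman', 'symbol' or 'other' for a superscript."""
    token = text.strip().lower()
    if token in NOT_A_NOTE:
        return "other"
    if ARABIC.fullmatch(token):
        return "arabic"
    if ROMAN.fullmatch(token):
        return "roman"
    if token in SYMBOLS:
        return "symbol"
    return "other"
```

Only the first three kinds would be candidates for footnote matching, each kind with its own counter. An exponent such as the "2" in "m2" would still be classified as 'arabic', so the surrounding text would have to be taken into account as well.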

All this processing would probably have some performance impact.

So while the feature would obviously be welcome, this makes it almost a package of its own. I personally think it would be better handled in a dedicated post-processing script that does only this and does it well, rather than directly in pymupdf4llm.

@JorjMcKie
Contributor

@CedricLor - thank you for your thoughtful assessment of footnotes.
I totally agree with you:
This is something we will probably never support, for all the reasons you mentioned: simply out of scope.
