Layout extraction without PyMuPDFLLM #4380
Unanswered
mmarusiak asked this question in Looking for help
Replies: 0 comments
The module generally works fine: it extracts my PDF into blocks, lines and spans.
The problem comes when my PDF contains text set in a different font (e.g. math equations). That text tends to be moved into a new block, especially superscripts and subscripts, but also integrals and most other common math notation.
My plan seems a bit silly and overwhelming to me, and it does not feel like the best approach:
And somewhere in between, check whether a block is math notation etc., but that's not the biggest problem.
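For the "is it math notation" check, a minimal sketch could look at the font name of each span. It assumes the block has the shape of an entry in the "blocks" list of the "dict" output (lines containing spans, each span with a "font" key). The hint list and the helper names `is_math_span` and `math_ratio` are hypothetical and would need tuning for the fonts in a given paper.

```python
# Assumed font-name fragments that often indicate math fonts (illustrative only).
MATH_FONT_HINTS = ("cmmi", "cmsy", "cmex", "math", "symbol")


def is_math_span(span, hints=MATH_FONT_HINTS):
    """Return True if the span's font name contains one of the hint fragments."""
    font = span.get("font", "").lower()
    return any(hint in font for hint in hints)


def math_ratio(block):
    """Return the fraction of spans in a text block that look like math."""
    spans = [s for line in block.get("lines", []) for s in line.get("spans", [])]
    if not spans:
        return 0.0
    return sum(is_math_span(s) for s in spans) / len(spans)


# Hand-written sample block, shaped like one entry of the "blocks" list.
sample_block = {
    "type": 0,
    "lines": [
        {"spans": [{"text": "where ", "font": "CMR10"},
                   {"text": "x", "font": "ABCDEF+CMMI10"}]},
    ],
}
sample_ratio = math_ratio(sample_block)
```

A block with a high ratio could then be treated as an equation fragment rather than as a paragraph of its own.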
So is this the right approach, or am I missing something, e.g. some arguments that would handle PDF layout extraction better for me (I'm using
get_text("dict", flags=0)
)? I want to use PyMuPDF to read scientific papers with the correct layout. As you can see, my approach is pretty messy, which is why I'm asking! Anyway, thanks for the module, it's the best way to extract layout from a PDF without OCR or LLMs!
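To make the question concrete, here is a rough sketch of the kind of post-processing I mean: gluing small fragment blocks (indices, integral signs) back onto a neighbouring text block when their bounding boxes touch or nearly touch. It works on plain dicts shaped like the "blocks" list of the "dict" output, so it runs without a PDF. The helper names and the tolerance `tol` (in points) are my own assumptions, not anything provided by the module.

```python
def union(a, b):
    """Smallest rectangle (x0, y0, x1, y1) covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def near(a, b, tol):
    """True if rectangle a, grown by tol on every side, intersects b."""
    return (a[0] - tol <= b[2] and b[0] <= a[2] + tol
            and a[1] - tol <= b[3] and b[1] <= a[3] + tol)


def merge_fragment_blocks(blocks, tol=3.0):
    """Single pass: attach each text block to the first earlier block near it."""
    merged = []
    for block in blocks:
        if block.get("type", 0) != 0:  # leave non-text (image) blocks untouched
            merged.append(block)
            continue
        target = None
        for cand in merged:
            if cand.get("type", 0) == 0 and near(cand["bbox"], block["bbox"], tol):
                target = cand
                break
        if target is None:
            # Copy so the input blocks are not modified by later merges.
            merged.append({**block, "bbox": tuple(block["bbox"]),
                           "lines": list(block.get("lines", []))})
        else:
            target["bbox"] = union(target["bbox"], block["bbox"])
            target["lines"].extend(block.get("lines", []))
    return merged


# Hand-written sample: a formula, its detached superscript, and a far paragraph.
sample_blocks = [
    {"type": 0, "bbox": (50, 100, 300, 112), "lines": [{"spans": [{"text": "E = mc"}]}]},
    {"type": 0, "bbox": (301, 96, 306, 103), "lines": [{"spans": [{"text": "2"}]}]},
    {"type": 0, "bbox": (50, 200, 300, 212), "lines": [{"spans": [{"text": "Next paragraph"}]}]},
]
result = merge_fragment_blocks(sample_blocks)
```

With a real page, the input would be the "blocks" entry of the dict returned for that page. This is only a single pass, so chains of fragments that become neighbours after an earlier merge are not handled, and merged lines are appended rather than re-sorted into reading order.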
Cheers!