Layout extraction without PyMuPDFLLM #4380

mmarusiak · 2025-03-15T20:23:46Z

The module generally works fine: it extracts my PDF into blocks, lines and spans.
The problem arises when the PDF contains text set in a different font (e.g. math equations). Such text tends to be moved into a new block, especially superscripts and subscripts, but also integrals and most common math notation.
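
To see which spans are affected, one option is to compare each span's font with the font that covers most of the page. The following is only a minimal sketch, not part of PyMuPDF: it works on a plain dictionary shaped like the "dict" output (blocks, lines, spans, each span with "text" and "font"), and the function names, the sample data and the font names in it are made up for illustration.

```python
from collections import Counter


def dominant_font(page_dict):
    """Return the font name that covers the most characters, or None."""
    counts = Counter()
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line["spans"]:
                counts[span["font"]] += len(span["text"].strip())
    return counts.most_common(1)[0][0] if counts else None


def odd_font_spans(page_dict):
    """Yield (text, font) for non-empty spans not set in the dominant font."""
    body_font = dominant_font(page_dict)
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):
            for span in line["spans"]:
                if span["text"].strip() and span["font"] != body_font:
                    yield span["text"], span["font"]


# Hand-made stand-in for the "dict" output; the font names are hypothetical.
sample = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [
            {"text": "The area is given by ", "font": "Times-Roman"},
        ]}]},
        {"type": 0, "lines": [{"spans": [
            {"text": "∫", "font": "CMEX10"},
            {"text": " f(x) dx", "font": "Times-Roman"},
        ]}]},
        {"type": 1},  # image block without lines
    ]
}

flagged = list(odd_font_spans(sample))
```

A font mismatch is only a hint: italic or bold body text also uses a different font name, so this would need a second check before treating a span as math.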

My plan seems a bit silly and overwhelming to me, and it is probably not the best approach:

Check whether the new block is at the "same" y position as the block before it (by "same" I mean that it sits on the same text line).
Check whether the new block's first line has the same indentation as the last line of the previous block.
If not, check whether the next line in the new block has that indentation (to handle lists with line breaks).

Somewhere in between, also check whether the text is math notation etc., but that's not the biggest problem.
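
The three checks above could be sketched roughly as follows. This is one possible reading of the plan, not an established method: the helper names and the tolerances `min_overlap`, `x_tol` and `max_gap` are assumptions, and the vertical-gap test is an addition meant to stop unrelated paragraphs with the same left margin from being merged. The code only relies on each block having "bbox" and "lines", and each line having a "bbox" of the form (x0, y0, x1, y1), as in the "dict" output.

```python
def same_row(a, b, min_overlap=0.5):
    """True if bboxes a and b overlap vertically by at least
    min_overlap of the shorter box's height."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return shorter > 0 and overlap / shorter >= min_overlap


def should_merge(prev, new, x_tol=2.0, max_gap=6.0):
    """Decide whether block `new` continues block `prev`."""
    prev_last = prev["lines"][-1]["bbox"]
    new_lines = new["lines"]
    first = new_lines[0]["bbox"]
    # Check 1: the new block starts on the same text line.
    if same_row(prev_last, first):
        return True
    # Assumed extra guard: blocks far apart vertically stay separate.
    if first[1] - prev_last[3] > max_gap:
        return False
    # Check 2: same indentation as the last line of the previous block.
    if abs(first[0] - prev_last[0]) <= x_tol:
        return True
    # Check 3: otherwise, see if the next line has that indentation.
    return len(new_lines) > 1 and abs(new_lines[1]["bbox"][0] - prev_last[0]) <= x_tol


def merge_blocks(blocks):
    """Merge consecutive text blocks that should_merge() accepts."""
    merged = []
    for block in blocks:
        if not block.get("lines"):
            continue  # skip blocks without text lines (e.g. images)
        if merged and should_merge(merged[-1], block):
            prev = merged[-1]
            prev["lines"] = prev["lines"] + block["lines"]
            pb, nb = prev["bbox"], block["bbox"]
            prev["bbox"] = (min(pb[0], nb[0]), min(pb[1], nb[1]),
                            max(pb[2], nb[2]), max(pb[3], nb[3]))
        else:
            merged.append({"bbox": tuple(block["bbox"]),
                           "lines": list(block["lines"])})
    return merged


def make_block(x0, y0, x1, y1):
    """Build a one-line block for the example below."""
    return {"bbox": (x0, y0, x1, y1), "lines": [{"bbox": (x0, y0, x1, y1)}]}


sample_blocks = [
    make_block(72, 100, 300, 112),   # body text
    make_block(305, 98, 330, 110),   # formula fragment on the same row
    make_block(72, 200, 300, 212),   # new paragraph further down
    make_block(72, 214, 300, 226),   # its continuation line
]
result = merge_blocks(sample_blocks)
```

Here the formula fragment joins the first block and the last two blocks join each other. After a same-row merge, the last line of the merged block is the fragment itself, so a stricter version would probably compare indentation against the row's leftmost line instead.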

So is this the right approach, or am I missing something, e.g. some arguments that would handle PDF layout extraction better for me (I'm using get_text("dict", flags=0))? I want to use PyMuPDF to read scientific papers with the correct layout. As you can see, my approach is pretty messy, which is why I'm asking!

Anyway, thanks for the module, it's the best way to extract layout from a PDF without OCR or LLMs!
Cheers!

Replies: 0 comments