Problem with multiple columns in simple text #135

pascucg · 2024-09-12T11:25:56Z

Hello, I see that in some cases multi-column text is not processed correctly.
The output jumps from one column to another, so the extracted text comes out in the wrong order.

Here is an example PDF where it occurs:
ejemplo.pdf

I also show you the problem below:

If I process the PDF with pymupdf, the text is extracted correctly:

JorjMcKie · 2024-09-12T14:26:32Z

I found the problem: the joining of the original text blocks happens too aggressively. The page number block at the bottom gets joined to a neighboring block, and this recursively causes all the text on the page to be joined into one big block.
The result is text in a nonsensical order.
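
The cascade described above can be pictured with a small toy model. This is only a sketch under assumptions, not the actual pymupdf4llm code: the `near`, `union` and `join_blocks` helpers, the tolerance of 10 points and all coordinates are made up for illustration. It shows how a small block sitting between two columns can bridge them once merging is repeated until nothing changes.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in points


def near(a: Rect, b: Rect, tol: float) -> bool:
    """True if the gap between a and b is at most `tol` in both x and y."""
    return (a[0] - b[2] <= tol and b[0] - a[2] <= tol
            and a[1] - b[3] <= tol and b[1] - a[3] <= tol)


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def join_blocks(blocks: List[Rect], tol: float) -> List[Rect]:
    """Merge any two blocks closer than `tol`, repeating until stable."""
    rects = list(blocks)
    merged = True
    while merged:
        merged = False
        for i in range(len(rects)):
            for j in range(i + 1, len(rects)):
                if near(rects[i], rects[j], tol):
                    rects[i] = union(rects[i], rects[j])
                    del rects[j]
                    merged = True
                    break
            if merged:
                break
    return rects


# Hypothetical layout: two columns and a centered page number below them.
left_column = (72, 72, 290, 700)
right_column = (322, 72, 540, 700)
page_number = (296, 705, 316, 720)

# The columns alone stay separate (horizontal gap of 32 > tolerance of 10).
print(len(join_blocks([left_column, right_column], tol=10)))
# With the page number, the left column grows and then reaches the right one.
print(join_blocks([left_column, right_column, page_number], tol=10))
```

In this toy layout the page number is within tolerance of the left column, and the enlarged block is then within tolerance of the right column, so everything ends up in one rectangle and the column order is lost.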

As a quick fix, you can use margins=(0, 0, 0, 72) to ignore the page number block.
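
A minimal sketch of that workaround follows. The helper names `footer_skipping_margins` and `to_markdown_without_footer` are hypothetical, and the reading of the four values as (left, top, right, bottom) in points is an assumption based on the pymupdf4llm documentation; 72 points is one inch.

```python
from typing import Tuple


def footer_skipping_margins(bottom: float = 72.0) -> Tuple[float, float, float, float]:
    """Margins assumed to be ordered (left, top, right, bottom), in points."""
    return (0.0, 0.0, 0.0, bottom)


def to_markdown_without_footer(pdf_path: str, bottom: float = 72.0) -> str:
    """Convert a PDF to Markdown while ignoring a strip at the page bottom."""
    # Imported lazily so the margin helper works without the package installed.
    import pymupdf4llm

    return pymupdf4llm.to_markdown(pdf_path, margins=footer_skipping_margins(bottom))
```

Usage would be `to_markdown_without_footer("ejemplo.pdf")`. If the page number sits higher or lower on the page, the `bottom` value has to be adjusted accordingly.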

JorjMcKie · 2024-09-16T10:00:20Z

Fixed in version 0.0.15.

pascucg · 2024-09-17T07:14:58Z

Hello @JorjMcKie,

Thank you for the fix. I have tested it in version 0.0.16 and it works correctly.

While running more tests I found another problem, described below.

In some documents the first line of the page is omitted.

Here is an example:
ejemplo.pdf

JorjMcKie · 2024-09-17T07:40:50Z

No, this works for margins=0:

pascucg · 2024-09-17T10:28:58Z

You are right. Using:
import pymupdf4llm
doc = pymupdf4llm.to_markdown('ejemplo.pdf', page_chunks=True, margins=0)

the first line is extracted correctly.

Thank you
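
Why margins=0 brings the first line back can be illustrated with a toy filter. This is a sketch under assumptions, not the library's implementation: it supposes that text outside the margin-reduced page area is ignored (as with the bottom margin used earlier to skip the page number), and the `lines_inside_margins` helper, the margin value of 50 and all coordinates are hypothetical.

```python
from typing import List, Tuple

Line = Tuple[float, str]  # (y-coordinate of the line in points, text)


def lines_inside_margins(lines: List[Line], page_height: float,
                         top: float, bottom: float) -> List[str]:
    """Keep only lines whose y lies between the top and bottom margins."""
    return [text for y, text in lines if top <= y <= page_height - bottom]


# Hypothetical page of height 792 points with a first line close to the top edge.
page_lines = [(30.0, "First line"), (80.0, "Body text"), (760.0, "7")]

# A non-zero top margin drops the first line (and the bottom one drops the page number).
print(lines_inside_margins(page_lines, 792.0, top=50.0, bottom=50.0))
# With zero margins every line is kept.
print(lines_inside_margins(page_lines, 792.0, top=0.0, bottom=0.0))
```

The trade-off is that zero margins also keep headers, footers and page numbers, so the margin values are worth choosing per document.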

JorjMcKie added the "bug" and "fix developed" labels on Sep 12, 2024

JorjMcKie closed this as completed Sep 16, 2024
