Problem with multiple columns in simple text #135

Closed
pascucg opened this issue Sep 12, 2024 · 5 comments
Labels
bug Something isn't working fix developed

Comments

@pascucg

pascucg commented Sep 12, 2024

Hello, in some cases multi-column text is not processed correctly.
The extraction jumps from one column to another, so the resulting text is out of order and incorrect.

I provide you with an example pdf where it occurs:
ejemplo.pdf

I also show you the problem below:
column_error_example

If I process the pdf with pymupdf, it does it correctly:
column_correct_example

JorjMcKie added the "bug" and "fix developed" labels Sep 12, 2024
@JorjMcKie
Contributor

I found the problem: the original text blocks are joined too aggressively. The page number at the bottom gets joined with its neighbors, which recursively causes all the text on the page to be merged into one single big block.
The output then comes out as nonsense.

As a quick fix, you can use margins=(0, 0, 0, 72) to ignore the page number block.
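
A minimal sketch of this workaround, assuming the `margins` 4-tuple is ordered (left, top, right, bottom) in points, as the pymupdf4llm documentation describes. The helpers `usable_area` and `block_is_kept` are hypothetical: they only illustrate the idea with made-up coordinates on an A4-sized page and are not the library's actual filtering code.

```python
def to_markdown_ignoring_footer(pdf_path, bottom=72):
    """Convert a PDF while ignoring a strip at the bottom of every page."""
    import pymupdf4llm  # imported lazily; requires the pymupdf4llm package

    # margins=(left, top, right, bottom) in points; 72 pt is one inch.
    return pymupdf4llm.to_markdown(pdf_path, margins=(0, 0, 0, bottom))


def usable_area(page_width, page_height, margins):
    """Hypothetical helper: rectangle left over after removing the margins."""
    left, top, right, bottom = margins
    # PDF page coordinates as used by PyMuPDF: origin top-left, y grows downward.
    return (left, top, page_width - right, page_height - bottom)


def block_is_kept(bbox, area):
    """Hypothetical rule: keep a block only if it lies fully inside the area."""
    x0, y0, x1, y1 = bbox
    ax0, ay0, ax1, ay1 = area
    return x0 >= ax0 and y0 >= ay0 and x1 <= ax1 and y1 <= ay1


area = usable_area(595, 842, (0, 0, 0, 72))       # (0, 0, 595, 770)
print(block_is_kept((72, 100, 290, 700), area))   # body column: True
print(block_is_kept((280, 800, 315, 812), area))  # page number: False
```

With the page number block outside the usable area, it can no longer be joined with the column text.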

@JorjMcKie
Contributor

Fixed in version 0.0.15.

@pascucg
Author

pascucg commented Sep 17, 2024

Hello @JorjMcKie,

Thank you for the fix. I have tested it with version 0.0.16 and it works correctly.

While running tests I found another problem: in some documents the first line of the page is omitted.

Here is an example:
ejemplo.pdf

ejemplo

@JorjMcKie
Contributor

No, this works for margins=0:
image
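
A rough illustration of why `margins=0` can make a difference, assuming the documented convention that a single number applies to all four sides, two numbers mean (top, bottom), and four numbers mean (left, top, right, bottom). The function names and the 50 pt top margin below are hypothetical values chosen for the example, not taken from the library's source.

```python
def normalize_margins(margins):
    """Expand a margins value to a (left, top, right, bottom) tuple."""
    if isinstance(margins, (int, float)):
        return (margins,) * 4
    margins = tuple(margins)
    if len(margins) == 2:
        top, bottom = margins
        return (0, top, 0, bottom)
    if len(margins) == 4:
        return margins
    raise ValueError("margins must be a number or a sequence of 2 or 4 numbers")


def first_line_is_kept(line_top, margins):
    """Hypothetical check: a line survives if it starts below the top margin."""
    _, top, _, _ = normalize_margins(margins)
    return line_top >= top


# A first line starting 36 pt below the top edge of the page (made-up value).
print(first_line_is_kept(36, 0))               # True
print(first_line_is_kept(36, (0, 50, 0, 50)))  # False
```

Under these assumptions, any non-zero top margin larger than the position of the first line would drop that line, while `margins=0` keeps the whole page.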

@pascucg
Author

pascucg commented Sep 17, 2024

You are right, using:
doc = pymupdf4llm.to_markdown('ejemplo.pdf', page_chunks=True, margins=0)

It works correctly.

Thank you
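
For anyone who wants to check the same thing on their own files, here is a small sketch. It assumes that `to_markdown(..., page_chunks=True)` returns a list of dictionaries, one per page, each with a `"text"` key, as described in the pymupdf4llm documentation. The helper names `first_lines` and `check_pdf` are made up for this example.

```python
def first_lines(chunks):
    """Return the first non-empty text line of each page chunk."""
    result = []
    for chunk in chunks:
        lines = [ln.strip() for ln in chunk["text"].splitlines() if ln.strip()]
        result.append(lines[0] if lines else "")
    return result


def check_pdf(pdf_path):
    """Convert with margins=0 and report the first line found on each page."""
    import pymupdf4llm  # imported lazily; requires the pymupdf4llm package

    chunks = pymupdf4llm.to_markdown(pdf_path, page_chunks=True, margins=0)
    return first_lines(chunks)


# Stand-in data shaped like page chunks, so the helper runs without a PDF.
sample = [{"text": "\n# Title line\n\nBody text\n"}, {"text": ""}]
print(first_lines(sample))  # ['# Title line', '']
```

Comparing the reported lines against the PDF page by page shows quickly whether the top of a page was cut off.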
