
Multi-page document to single page document. #266


Draft
wants to merge 1 commit into
base: main
from


jamie-lemon
Collaborator

Adds a concept for converting the supplied document into a long single-page PDF during `to_markdown()` processing.
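To make the concept concrete, here is a minimal sketch of the geometry involved: it computes where each original page would sit on one tall page. The function name `stack_pages` is hypothetical and this is not necessarily how the PR implements the conversion. With PyMuPDF, each source page could then be drawn into its rectangle, for example with `Page.show_pdf_page()`.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def stack_pages(page_sizes: List[Tuple[float, float]]) -> Tuple[float, float, List[Rect]]:
    """Return (width, height, rects) for one tall page holding all pages.

    page_sizes is a list of (width, height) per source page, in points.
    Pages are stacked top to bottom and left-aligned (an assumption of
    this sketch, not something stated in the PR).
    """
    if not page_sizes:
        raise ValueError("need at least one page")
    width = float(max(w for w, _ in page_sizes))
    rects: List[Rect] = []
    y = 0.0
    for w, h in page_sizes:
        # Target rectangle of this source page on the tall page.
        rects.append((0.0, y, float(w), y + float(h)))
        y += float(h)
    return width, y, rects


# Two A4 pages (595 x 842 pt) become one 595 x 1684 pt page.
width, height, rects = stack_pages([(595, 842), (595, 842)])
print(width, height)
print(rects)
```

The second page starts at `y = 842`, which is why everything that used to be on separate pages ends up in one coordinate system.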

@jamie-lemon jamie-lemon marked this pull request as draft May 6, 2025 21:02
@jamie-lemon
Collaborator Author

@JorjMcKie @outstanding1301 The idea here is that for some document types, such as invoices with tables spanning multiple pages, converting to a single page may help with the "understanding" of continuous tables. That is the basic use case I had in mind. I was also thinking about the comment that a Markdown version of a document doesn't care about pages. What do you think?

@outstanding1301
Collaborator

@jamie-lemon
Combining the pages may not necessarily lead to a better Markdown representation of the table, as it could result in two separate tables (like below):
image

But some documents have two pages that are paired together. In this case, it would make sense to combine them if you are using a `pages_per_row` parameter.

Simply concatenating the pages may not significantly improve the result; it's similar to converting each page to Markdown and then merging the results.

It would be a better representation of the content if we could combine the separated elements.
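As an illustration of what "combining the separated elements" could mean at the Markdown level, here is a minimal post-processing sketch. It drops a `-----` page break that sits between two table rows with the same column count, and drops the extra `|---|...|` separator that follows. The function `merge_split_tables` is hypothetical and not part of PyMuPDF4LLM. It counts pipes naively, so escaped pipes inside cells and blank lines between fragments are not handled.

```python
import re

# A Markdown header separator row such as |---|---| or |:--|--:|
SEP_ROW = re.compile(r"^\|(\s*:?-+:?\s*\|)+$")
PAGE_BREAK = "-----"


def is_row(line: str) -> bool:
    s = line.strip()
    return len(s) > 1 and s.startswith("|") and s.endswith("|")


def n_cols(line: str) -> int:
    # Naive column count: number of pipes minus one.
    return line.strip().count("|") - 1


def merge_split_tables(markdown: str) -> str:
    lines = markdown.splitlines()
    out = []
    seen_separator = False  # True once the current table has its separator row
    for i, line in enumerate(lines):
        stripped = line.strip()
        prev_is_row = bool(out) and is_row(out[-1])
        if stripped == PAGE_BREAK and prev_is_row:
            # Look at the next non-empty line after the page break.
            nxt = next((l for l in lines[i + 1:] if l.strip()), "")
            if is_row(nxt) and n_cols(nxt) == n_cols(out[-1]):
                continue  # page break inside a table: drop it
        if is_row(line):
            if SEP_ROW.match(stripped):
                if seen_separator:
                    continue  # second separator in the same table: drop it
                seen_separator = True
        else:
            seen_separator = False  # left the table
        out.append(line)
    return "\n".join(out)


sample = "\n".join([
    "|Product|SKU|Notes|",
    "|---|---|---|",
    "|H page 1|8|Hello world H|",
    "-----",
    "|I page 2|9|Hello world I|",
    "|---|---|---|",
    "|J page 2|10|Hello world J|",
    "-----",
])
print(merge_split_tables(sample))
```

This only repairs the text output. It does not change what the table finder detects, and it would wrongly merge two unrelated tables that happen to have the same number of columns.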

@jamie-lemon
Collaborator Author

Yes, actually, after trying this again with a sample document I couldn't find any real difference in the table detection, so I think I was wrong there. The last time I tried, I compared:
`pymupdf4llm.to_markdown("test.pdf", parse_single_long_page=True, table_strategy='lines')`
and:
`pymupdf4llm.to_markdown("test.pdf", parse_single_long_page=False, table_strategy='lines')`
and thought I was getting better results with the single long page, but it was just wishful thinking (i.e. I am unable to prove this).

Probably there is no real benefit in this functionality.
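One way to make such a comparison objective is to diff the two Markdown outputs line by line. The sketch below uses only the standard library. The helper `markdown_diff` is hypothetical, and the two input strings stand in for the results of the two `to_markdown()` calls above (`parse_single_long_page` being the parameter proposed in this PR).

```python
import difflib
from typing import List


def markdown_diff(per_page: str, single_page: str) -> List[str]:
    """Return a unified diff between two Markdown outputs (empty if identical)."""
    return list(
        difflib.unified_diff(
            per_page.splitlines(),
            single_page.splitlines(),
            fromfile="per_page",
            tofile="single_long_page",
            lineterm="",
        )
    )


# Stand-in strings; in practice these would be the two to_markdown() results.
per_page = "|A|1|\n-----\n|B|2|"
single_page = "|A|1|\n|B|2|"
for line in markdown_diff(per_page, single_page):
    print(line)
```

An empty diff means the two settings produced identical Markdown for that document.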

@jamie-lemon
Collaborator Author

jamie-lemon commented May 7, 2025

Okay, still investigating this concept!
Using the attached file
one-table-2-pages.pdf

I get the following results:

parse_single_long_page=False, table_strategy='lines'

|Product|SKU|Price|Notes|
|---|---|---|---|
|A page 1|1|£1|Hello world A|
|B page 1|2|£2|Hello world B|
|C page 1|3|£3|Hello world C|
|D page 1|4|£4|Hello world D|
|E page 1|5|£5|Hello world E|
|F page 1|6|£6|Hello world F|
|G page 1|7|£7|Hello world G|
|H page 1|8|£8|Hello world H|
-----
|I page 2|9|£9|Hello world I|
|---|---|---|---|
|J page 2|10|£10|Hello world J|
|K page 2|11|£11|Hello world K|
-----

parse_single_long_page=True, table_strategy='lines'

|Product|SKU|Price|Notes|
|---|---|---|---|
|A page 1|1|£1|Hello world A|
|B page 1|2|£2|Hello world B|
|C page 1|3|£3|Hello world C|
|D page 1|4|£4|Hello world D|
|E page 1|5|£5|Hello world E|
|F page 1|6|£6|Hello world F|
|G page 1|7|£7|Hello world G|
|H page 1|8|£8|Hello world H|
|I page 2|9|£9|Hello world I|
|---|---|---|---|
|J page 2|10|£10|Hello world J|
|K page 2|11|£11|Hello world K|
-----

Disregarding the weird blank row after row I (I can't explain that!), you can see that the output inserts a page break (`-----`) in the middle of the table when we don't convert to one page.

Thoughts?

@jamie-lemon
Collaborator Author

Perhaps it is this line (1089) that we don't actually require (or could make optional with a `show_page_breaks` param)?
Screenshot 2025-05-07 at 13 57 09
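A minimal sketch of what an optional page-break marker could look like when per-page Markdown chunks are joined. The function `join_page_chunks` and the `show_page_breaks` flag are hypothetical illustrations of the suggestion above. This is not the code at line 1089 and not PyMuPDF4LLM's actual API.

```python
from typing import List


def join_page_chunks(
    chunks: List[str],
    show_page_breaks: bool = True,
    page_break: str = "-----",
) -> str:
    """Join per-page Markdown chunks, optionally ending each page with a marker."""
    parts = []
    for chunk in chunks:
        parts.append(chunk.rstrip("\n"))
        if show_page_breaks:
            parts.append(page_break)
    return "\n".join(parts) + "\n"


pages = ["|H page 1|8|", "|I page 2|9|"]
print(join_page_chunks(pages, show_page_breaks=True))
print(join_page_chunks(pages, show_page_breaks=False))
```

With the flag off, rows from consecutive pages become adjacent lines, although any header separator created for the second fragment would still be present.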

@JorjMcKie
Collaborator

JorjMcKie commented May 7, 2025 via email

@JorjMcKie
Collaborator

JorjMcKie commented May 7, 2025 via email

@JorjMcKie
Collaborator

> Disregarding the weird blank row after row I (I can't explain that!) ...

As mentioned, the page-finishing string `-----` will be removed in the next version.
Otherwise, your question is actually easy to answer:
The table finder perceives those two table segments as separate tables: there is no connection between them, no gridlines, etc.

Because GitHub Markdown tables must have a header, PyMuPDF establishes one by selecting the first row.
And a table header row in Markdown is marked by a following line of the pattern `|---|---...|`.

If, as would normally be done, the first row of the second segment were a repetition of `|Product|SKU|Price|Notes|`, things would be more obvious ... and probably even welcomed.
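The effect described here can be reproduced with a small sketch that renders a list of rows as a GitHub Markdown table and always promotes the first row to the header. The function `rows_to_markdown` is a hypothetical stand-in for illustration, not PyMuPDF's actual table writer.

```python
from typing import List


def rows_to_markdown(rows: List[List[str]]) -> str:
    """Render rows as a GitHub Markdown table; the first row becomes the header."""
    if not rows:
        return ""
    header, *body = rows
    lines = [
        "|" + "|".join(header) + "|",
        "|" + "|".join("---" for _ in header) + "|",  # marks the row above as header
    ]
    lines += ["|" + "|".join(row) + "|" for row in body]
    return "\n".join(lines)


# The table segment found on page 2 has no header row of its own,
# so its first data row ("I page 2") is treated as the header.
segment = [
    ["I page 2", "9", "Hello world I"],
    ["J page 2", "10", "Hello world J"],
    ["K page 2", "11", "Hello world K"],
]
print(rows_to_markdown(segment))
```

The separator line lands directly below row I, which matches the "blank row" seen in the outputs earlier in the thread.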

@jamie-lemon
Collaborator Author

@JorjMcKie Your explanation makes sense, but in the PDF the "I" row is at the top of page 2, so I would have expected the header to be inserted above that row, not below it?

@jamie-lemon
Collaborator Author

@JorjMcKie ^ please scratch that last comment - I obviously wasn't thinking right when I wrote it! :)

@JorjMcKie
Collaborator

Another (potential) problem came to my mind:
The text block / multi-column detector also tries to detect the reading sequence among these blocks.
It does this by imposing the priority "down first, then right".

For the "infinitely long" page (after joining), this could mean that for a 2-column document, the left columns of all original pages are extracted first, before turning to the right columns.
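A simplified model of this ordering problem, assuming "down first, then right" can be approximated by sorting blocks by column position and then by vertical position. The `Block` type and both ordering functions are hypothetical. The real detector is more involved.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    label: str
    page: int   # 0-based index of the original page
    x0: float   # left edge of the block
    y0: float   # top edge of the block, relative to its original page


PAGE_HEIGHT = 842.0  # assumed height of every original page, in points


def order_per_page(blocks: List[Block]) -> List[str]:
    # Each page is processed on its own: down the left column, then the right.
    return [b.label for b in sorted(blocks, key=lambda b: (b.page, b.x0, b.y0))]


def order_joined(blocks: List[Block], page_height: float = PAGE_HEIGHT) -> List[str]:
    # On the joined page, y positions are offset by the page index,
    # so a whole column spans all original pages.
    return [
        b.label
        for b in sorted(blocks, key=lambda b: (b.x0, b.page * page_height + b.y0))
    ]


blocks = [
    Block("p1-left", 0, 50.0, 100.0),
    Block("p1-right", 0, 320.0, 100.0),
    Block("p2-left", 1, 50.0, 100.0),
    Block("p2-right", 1, 320.0, 100.0),
]
print(order_per_page(blocks))  # ['p1-left', 'p1-right', 'p2-left', 'p2-right']
print(order_joined(blocks))    # ['p1-left', 'p2-left', 'p1-right', 'p2-right']
```

In the joined ordering, the left column of page 2 is read before the right column of page 1, which breaks the original reading sequence.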

@JorjMcKie
Collaborator

JorjMcKie commented May 10, 2025

Here is an argument that casts more doubt on the usefulness of the approach.
image

I executed tests to confirm the behavior as shown for the joined pages.


3 participants