Multi-page document to single page document. #266

jamie-lemon · 2025-05-06T17:01:37Z

Adds a concept for converting the supplied document into a long single page PDF during the `to_markdown()` processing.
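As a rough illustration of the concept (not the code in this PR), the geometry of the joined page can be sketched in plain Python: stack the source pages top to bottom and record where each one lands. The function name `stack_pages` and the tuple-based rectangles are assumptions made for this sketch.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) with the origin at the top-left corner of the joined page
Rect = Tuple[float, float, float, float]


def stack_pages(sizes: List[Tuple[float, float]]) -> Tuple[Tuple[float, float], List[Rect]]:
    """Return the size of one tall page plus the target rectangle of each source page."""
    if not sizes:
        raise ValueError("need at least one page")
    width = max(w for w, _ in sizes)  # the widest source page sets the width
    rects: List[Rect] = []
    y = 0.0
    for w, h in sizes:
        rects.append((0.0, y, w, y + h))  # place the page below the previous one
        y += h
    return (width, y), rects


# Two A4-sized pages (595 x 842 pt) become one 595 x 1684 pt page.
page_size, targets = stack_pages([(595.0, 842.0), (595.0, 842.0)])
print(page_size)   # (595.0, 1684.0)
print(targets[1])  # (0.0, 842.0, 595.0, 1684.0)
```

With PyMuPDF, each source page could then be drawn into its rectangle on a new page, for example via `Page.show_pdf_page()`; whether the PR does it this way is not stated here.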

jamie-lemon · 2025-05-06T21:07:58Z

@JorjMcKie @outstanding1301 The idea here is that for some document types, such as invoices with tables spanning multiple pages, converting to a single page may help with the "understanding" of continuous tables. That is the basic use case I was imagining. I was also thinking about the comment that a Markdown version of a document doesn't care about pages. What do you think?

outstanding1301 · 2025-05-07T02:20:34Z

@jamie-lemon
Combining the pages may not necessarily lead to a better Markdown representation of the table, as it could still result in two separate tables (like below).

But some documents have two pages that are paired together. In this case, it would make sense to combine them if you are using the pages_per_row parameter.

Simply concatenating the pages may not significantly improve the result; it's similar to converting each page to Markdown and then merging the results.

It would be a better representation of the content if we could combine the separated elements.

jamie-lemon · 2025-05-07T07:49:05Z

Yes, actually after trying this again with a sample document I couldn't find any real difference in the table detection. So I think I was wrong there. The last time I tried with:
`pymupdf4llm.to_markdown("test.pdf", parse_single_long_page=True, table_strategy='lines')`
and:
`pymupdf4llm.to_markdown("test.pdf", parse_single_long_page=False, table_strategy='lines')`
I thought I was getting better results with the single long page, but it was just wishful thinking! (i.e. I am unable to prove this)

Probably there is no real benefit in this functionality.

jamie-lemon · 2025-05-07T12:45:42Z

Okay, still investigating this concept!
Using the attached file
one-table-2-pages.pdf

I get the following results:

parse_single_long_page=False, table_strategy='lines'

|Product|SKU|Price|Notes|
|---|---|---|---|
|A page 1|1|£1|Hello world A|
|B page 1|2|£2|Hello world B|
|C page 1|3|£3|Hello world C|
|D page 1|4|£4|Hello world D|
|E page 1|5|£5|Hello world E|
|F page 1|6|£6|Hello world F|
|G page 1|7|£7|Hello world G|
|H page 1|8|£8|Hello world H|
-----
|I page 2|9|£9|Hello world I|
|---|---|---|---|
|J page 2|10|£10|Hello world J|
|K page 2|11|£11|Hello world K|
-----

parse_single_long_page=True, table_strategy='lines'

|Product|SKU|Price|Notes|
|---|---|---|---|
|A page 1|1|£1|Hello world A|
|B page 1|2|£2|Hello world B|
|C page 1|3|£3|Hello world C|
|D page 1|4|£4|Hello world D|
|E page 1|5|£5|Hello world E|
|F page 1|6|£6|Hello world F|
|G page 1|7|£7|Hello world G|
|H page 1|8|£8|Hello world H|
|I page 2|9|£9|Hello world I|
|---|---|---|---|
|J page 2|10|£10|Hello world J|
|K page 2|11|£11|Hello world K|
-----

Disregarding the weird blank row after row I (I can't explain that!), you can see that the output inserts a page break (`-----`) when we don't convert to one page.

Thoughts?

jamie-lemon · 2025-05-07T12:58:55Z

Perhaps it is this line (1089) that we don't actually require? (Or we could make it optional with a "show_page_breaks" param.)
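Until such an option exists, the same effect can be approximated after the fact. The sketch below is a hypothetical post-processing helper (`strip_page_breaks` is not part of pymupdf4llm) that drops lines consisting only of the page-break marker. It assumes that `-----` never appears as a genuine horizontal rule in the document, which will not hold for every file.

```python
def strip_page_breaks(markdown: str, marker: str = "-----") -> str:
    """Drop lines that consist solely of the page-break marker."""
    kept = [line for line in markdown.splitlines() if line.strip() != marker]
    return "\n".join(kept)


# Shortened version of the output shown above (page-by-page processing).
sample = "\n".join([
    "|Product|SKU|Price|Notes|",
    "|---|---|---|---|",
    "|H page 1|8|£8|Hello world H|",
    "-----",
    "|I page 2|9|£9|Hello world I|",
    "|---|---|---|---|",
    "|J page 2|10|£10|Hello world J|",
    "-----",
])
print(strip_page_breaks(sample))
```

The table separator rows (`|---|---|---|---|`) are untouched because they are not an exact match for the marker, so the second header separator after row I remains.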

JorjMcKie · 2025-05-07T14:17:14Z

exactly - will do that!

JorjMcKie · 2025-05-07T14:21:33Z

done - will be active with next version.

JorjMcKie · 2025-05-07T14:56:08Z

> Disregarding the weird blank row after row I (I can't explain that!) ...

As mentioned, the page-finishing string "-----" will be removed in the next version.
Otherwise, your question is actually easy to answer:
Those two table segments are perceived as separate tables by the table finder: there is no connection between them, no gridlines, etc.

And because GitHub Markdown tables must have a header, PyMuPDF establishes one by selecting the first row.
A table header row in Markdown is marked by a following line with the string pattern "|---|---...|".
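The effect can be reproduced with a small stand-alone sketch. It is not PyMuPDF's table code; `rows_to_markdown` is a hypothetical helper that simply promotes the first row of whatever segment it receives to the header, which is why row I ends up followed by a separator line.

```python
from typing import List


def rows_to_markdown(rows: List[List[str]]) -> str:
    """Render rows as a GitHub Markdown table, promoting the first row to header."""
    if not rows:
        return ""
    header, *body = rows
    lines = ["|" + "|".join(header) + "|"]
    # The separator line is what marks the row above it as the header.
    lines.append("|" + "|".join("---" for _ in header) + "|")
    lines.extend("|" + "|".join(row) + "|" for row in body)
    return "\n".join(lines)


# The table segment found on page 2 has no header of its own.
page2_segment = [
    ["I page 2", "9", "£9", "Hello world I"],
    ["J page 2", "10", "£10", "Hello world J"],
    ["K page 2", "11", "£11", "Hello world K"],
]
print(rows_to_markdown(page2_segment))
```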

jamie-lemon · 2025-05-07T16:27:59Z

@JorjMcKie Your explanation makes sense, but in the PDF the "I" row is at the top of page 2, so I would have expected the header to be inserted above that row, not below it?

jamie-lemon · 2025-05-07T19:54:27Z

@JorjMcKie ^ please scratch that last comment - I obviously wasn't thinking right when I wrote it! :)

JorjMcKie · 2025-05-09T09:28:01Z

Another (potential) problem came to my mind:
The text block / multi-column detector also tries to detect the reading sequence among these blocks.
It does this by imposing the priority "down first, then right".

For the "infinitely long" page (after joining), this could mean that for a 2-column document, the left columns of all original pages are extracted first, before turning to the right columns.
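A toy model makes the concern concrete. The `Block` type, the sort key, and the fixed page height below are assumptions for illustration only; the real detector is more involved than a single sort.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    label: str
    x: float  # left edge of the block
    y: float  # top edge of the block


PAGE_HEIGHT = 842.0  # assumed height of every source page, in points


def reading_order(blocks: List[Block]) -> List[str]:
    """Order blocks 'down first, then right': by column (x), then by y."""
    return [b.label for b in sorted(blocks, key=lambda b: (b.x, b.y))]


# Two pages, each with a left (x=50) and a right (x=320) column.
pages = [
    [Block("p1-left", 50, 100), Block("p1-right", 320, 100)],
    [Block("p2-left", 50, 100), Block("p2-right", 320, 100)],
]

# Page by page: each page is ordered on its own.
per_page = [label for page in pages for label in reading_order(page)]

# Joined: page n is shifted down by n * PAGE_HEIGHT, then everything is ordered at once.
joined = reading_order([
    Block(b.label, b.x, b.y + n * PAGE_HEIGHT)
    for n, page in enumerate(pages)
    for b in page
])

print(per_page)  # ['p1-left', 'p1-right', 'p2-left', 'p2-right']
print(joined)    # ['p1-left', 'p2-left', 'p1-right', 'p2-right']
```

In the joined case the left column of page 2 is read before the right column of page 1, which breaks the natural text flow of a 2-column document.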

JorjMcKie · 2025-05-10T14:26:54Z

Here is an argument that casts more doubt on the usefulness of the approach.

I executed tests to confirm the behavior as shown for the joined pages.

Commit 5219ebb: Multi-page document to single page document.

jamie-lemon marked this pull request as draft on May 6, 2025, 21:02
