Multi-page document to single page document. #266
Adds a concept for converting the supplied document into a long single page PDF during the `to_markdown()` processing.
@JorjMcKie @outstanding1301 The idea here is that for some document types, such as invoices with tables spanning multiple pages, converting to a single page may help with the "understanding" of continuous tables. That is the basic use case I had in mind. I was also thinking about the comment that a Markdown version of a document doesn't care about pages. What do you think?
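To make the "one long page" idea concrete, here is a minimal sketch of the geometry only. It assumes the pages are simply stacked top to bottom and left-aligned; `stack_pages` is a hypothetical helper, not part of PyMuPDF or of this PR. The resulting rectangles are the kind of targets one could presumably hand to PyMuPDF's `Page.show_pdf_page()` when drawing each source page onto the tall page.

```python
from typing import List, Tuple

Size = Tuple[float, float]                # (width, height) in points
Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def stack_pages(sizes: List[Size]) -> Tuple[Size, List[Rect]]:
    """Hypothetical helper: compute the size of one tall page and the
    target rectangle of every source page when stacked vertically."""
    if not sizes:
        raise ValueError("need at least one page")
    # The tall page must be as wide as the widest source page.
    width = max(w for w, _ in sizes)
    rects: List[Rect] = []
    y = 0.0
    for w, h in sizes:
        # Each page starts where the previous one ended.
        rects.append((0.0, y, w, y + h))
        y += h
    return (width, y), rects


if __name__ == "__main__":
    long_size, targets = stack_pages([(595, 842), (595, 842)])
    print(long_size)
    print(targets)
```

The sketch deliberately ignores page rotation, crop boxes, and any upper limit on page height, all of which a real implementation would have to consider.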
@jamie-lemon But some documents have two pages that are paired together. In this case, it would make sense to combine them if you are using … Simply concatenating the pages may not significantly improve the result, though: it is similar to converting each page to Markdown and then merging the outputs. It would be a better representation of the content if we could combine the separated elements.
Yes, actually, after trying this again with a sample document I couldn't find any real difference in the table detection, so I think I was wrong there. I thought the last time I tried with: Probably there is no real benefit in this functionality.
Okay, still investigating this concept! I get the following results:
`parse_single_long_page=False, table_strategy='lines'`
`parse_single_long_page=True, table_strategy='lines'`
Disregarding the weird blank row after row I (I can't explain that!), you can see the output inserts a page break with Thoughts?
exactly - will do that!
> jamie-lemon commented (Wednesday, 7 May 2025, pymupdf/RAG PR #266): Perhaps it is this line (1089) which we don't actually require? (or make optional with a "show_page_breaks" param)
done - will be active with next version.
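Until a version without the page-finishing string is available, the separator could be removed in a post-processing step. The following is only a sketch under stated assumptions: the separator is a line consisting of exactly five dashes (as quoted in this thread), and `strip_page_breaks` is a hypothetical helper, not a PyMuPDF function. Note that a line of five dashes can also be a legitimate Markdown horizontal rule, so this would remove those as well.

```python
import re

# Assumption: a page break is a line made of exactly five dashes,
# optionally surrounded by whitespace. Table delimiter rows such as
# "|---|---|" contain pipes and therefore do not match.
PAGE_BREAK = re.compile(r"^\s*-{5}\s*$")


def strip_page_breaks(markdown: str) -> str:
    """Hypothetical helper: drop page-break separator lines from Markdown."""
    kept = [line for line in markdown.splitlines() if not PAGE_BREAK.match(line)]
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "|A|B|\n|---|---|\n|1|2|\n\n-----\n\n|3|4|"
    print(strip_page_breaks(sample))
```

Removing the separator alone does not rejoin a table that was split across pages; the blank lines around it remain and the second fragment is still a separate block.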
As mentioned, the page-finishing string "-----" will be removed in the next version. And because GitHub Markdown tables are required to have headers, PyMuPDF will establish one by selecting the first row. If - as would be normally done - the first row would be a repetition of
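The header behaviour described above suggests one possible way to rejoin a table fragment from a following page. This is a sketch of a post-processing idea, not what PyMuPDF does: `merge_continuation` is a hypothetical helper that assumes both inputs are simple pipe-delimited Markdown tables with the same column count. If the fragment's promoted header repeats the first table's header it is dropped; otherwise it is treated as an ordinary data row.

```python
from typing import List


def _cells(row: str) -> List[str]:
    """Split a pipe-delimited Markdown row into stripped cell texts."""
    return [c.strip() for c in row.strip().strip("|").split("|")]


def _is_delimiter(row: str) -> bool:
    """True for rows like |---|:--:| that separate header and body."""
    return all(c and "-" in c and set(c) <= set("-:") for c in _cells(row))


def merge_continuation(first: str, second: str) -> str:
    """Hypothetical helper: append a continuation table to the first table."""
    a = [ln for ln in first.splitlines() if ln.strip()]
    b = [ln for ln in second.splitlines() if ln.strip()]
    if len(a) < 2 or len(b) < 2 or not _is_delimiter(a[1]) or not _is_delimiter(b[1]):
        raise ValueError("both inputs must be Markdown tables with a header")
    if len(_cells(a[0])) != len(_cells(b[0])):
        raise ValueError("column counts differ")
    promoted, rest = b[0], b[2:]
    if _cells(promoted) == _cells(a[0]):
        extra = rest               # repeated header: drop it
    else:
        extra = [promoted] + rest  # promoted data row: keep it as data
    return "\n".join(a + extra)


if __name__ == "__main__":
    page1 = "|Row|Value|\n|---|---|\n|H|8|"
    page2 = "|I|9|\n|---|---|\n|J|10|"
    print(merge_continuation(page1, page2))
```

The sample values are made up for illustration; only the row label "I" echoes the example discussed in this thread.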
@JorjMcKie Your explanation makes sense, but in the PDF the "I" row is at the top of page 2, so I would have expected the header to be inserted above that row, not below it?
@JorjMcKie ^ please scratch that last comment - I obviously wasn't thinking right when I wrote it! :)
Another (potential) problem came to my mind: for the "infinitely long" page (after joining), a 2-column document could have the left columns of all original pages extracted first, before the extraction turns to the right columns.
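The reading-order concern can be illustrated with a toy model. This sketch assumes a simplified column-first ordering (sort by column, then by vertical position); it does not reproduce PyMuPDF's actual column detection, and `Block`, `per_page_order`, and `joined_page_order` are hypothetical names used only for this illustration.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    page: int    # original page number, 0-based
    column: int  # 0 = left column, 1 = right column
    y: float     # vertical position on the original page
    text: str


def per_page_order(blocks: List[Block]) -> List[str]:
    """Column-first reading order applied to each page separately."""
    ordered = sorted(blocks, key=lambda b: (b.page, b.column, b.y))
    return [b.text for b in ordered]


def joined_page_order(blocks: List[Block], page_height: float) -> List[str]:
    """Column-first reading order applied to the single joined page,
    where each block's y is shifted by its page offset."""
    ordered = sorted(blocks, key=lambda b: (b.column, b.page * page_height + b.y))
    return [b.text for b in ordered]


if __name__ == "__main__":
    blocks = [
        Block(0, 0, 100.0, "p1-left"),
        Block(0, 1, 100.0, "p1-right"),
        Block(1, 0, 100.0, "p2-left"),
        Block(1, 1, 100.0, "p2-right"),
    ]
    print(per_page_order(blocks))
    print(joined_page_order(blocks, page_height=842.0))
```

In this model the per-page order alternates left and right columns page by page, while the joined page yields both left columns before either right column, which is the scrambling described above.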