Add manual coordinate constraints to partition_pdf(). #3072

Open
ChiNoel-osu opened this issue May 22, 2024 · 9 comments
Labels
enhancement New feature or request pdf

Comments

@ChiNoel-osu

Is your feature request related to a problem? Please describe.
I'm using the hi_res strategy for my PDF files because I need to extract all images, tables, etc., and the PDF files have headers and footers that I want to remove.
Normally this wouldn't be a problem, because I could just use element.metadata.coordinates.points to filter them out. But in my documents the headers and footers contain logos, so they are detected as Image elements and partition_pdf() runs OCR on them. And OCR is slow.
If start/end coordinates could be specified, elements that fall outside them would not go through OCR, which would save time.

Describe the solution you'd like
The ability to specify start/end coordinates in the partition_pdf() function. Maybe it could be added to the other partition functions as well.

Describe alternatives you've considered
I came across #2455, but it's for the fast strategy and it doesn't allow manual constraints either.
Right now I need to wait for partition_pdf() to finish and then filter the headers and footers out using element.metadata.coordinates.points.

cleaned_elements = [
    element
    for element in elements
    # Nuke the element even if 1 point is outside the cutcoord.
    if all(
        cutcoord_top < coord[1] < cutcoord_bottom
        for coord in element.metadata.coordinates.points
    )
    and (start_page_number < element.metadata.page_number < stop_page_number)
]

Additional context
Nope.

ChiNoel-osu added the enhancement label May 22, 2024
@MthwRobinson
Contributor

Hi @ChiNoel-osu ! Thanks for submitting this. Would filtering on the coordinates after partitioning solve your use case?

@ChiNoel-osu
Author

@MthwRobinson Yes, and that's what I'm doing right now. But I think filtering out the elements early, before OCR takes place, could significantly speed up the partitioning process.

@huangpan2507

Hi @ChiNoel-osu,
where do "cutcoord_top" and "cutcoord_bottom" come from? And how do I get the value of stop_page_number? Could you help me?

@ChiNoel-osu
Author

ChiNoel-osu commented Jul 4, 2024

Hi @huangpan2507, they're just custom variables; change them to get different results.

@huangpan2507

@ChiNoel-osu Thanks for your reply, but I'm still unsure how to get the values for these custom variables. I ran into a problem with partition_pdf: I want to filter out the header and footer, but it extracts something like

"Page2of 29
2024-1-1
Leave Policy xxx in China
Human Resources"

This text comes from the header, and it ends up inside an element whose category is CompositeElement.
That is, when I use partition_pdf on a PDF file (with text, tables and pictures inside) and print element.category for each element, the result looks like this:
element.category: CompositeElement
element.category: Table
element.category: Table
element.category: CompositeElement
element.category: Table
element.category: CompositeElement
....
The header text quoted above sits inside one of the CompositeElement elements. How can I filter out the header and footer? Can you help me?

@ChiNoel-osu
Author

@huangpan2507 When you call partition_pdf(), each element has coordinates that you can find in element.metadata.coordinates.points. You can then check those numbers and filter out the elements you don't need.
coord[1] in my code is a point's y coordinate; cutcoord_top and cutcoord_bottom depend on where the header and footer are located in your PDF.
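
A minimal sketch of that workflow, for illustration only. To keep it runnable without a PDF, it builds stand-in objects that merely mimic the element.metadata.coordinates.points and element.metadata.page_number attributes; the make_element and inside_band helpers, the sample texts, and all coordinate values are made up. With real output from partition_pdf(), you would print the y values of a few header and footer elements once and pick the cut lines from that.

```python
from types import SimpleNamespace


def make_element(text, page_number, points):
    """Build a stand-in with the same attribute shape as a partitioned element."""
    coordinates = SimpleNamespace(points=points)
    metadata = SimpleNamespace(coordinates=coordinates, page_number=page_number)
    return SimpleNamespace(text=text, metadata=metadata)


def inside_band(element, cutcoord_top, cutcoord_bottom):
    """True only if every corner's y value lies strictly between the cut lines."""
    return all(
        cutcoord_top < y < cutcoord_bottom
        for _x, y in element.metadata.coordinates.points
    )


# Made-up elements: a logo near the top, body text, and a footer near the bottom.
elements = [
    make_element("Company logo", 1, ((100, 40), (100, 120), (400, 120), (400, 40))),
    make_element("Body paragraph", 1, ((100, 300), (100, 900), (1500, 900), (1500, 300))),
    make_element("Page 1 of 29", 1, ((700, 2100), (700, 2150), (900, 2150), (900, 2100))),
]

# Print each element's y range once to decide where the cut lines belong.
for element in elements:
    ys = [y for _x, y in element.metadata.coordinates.points]
    print(f"{element.text!r}: y from {min(ys)} to {max(ys)}")

# Hypothetical cut lines, chosen after reading the printout above.
cutcoord_top, cutcoord_bottom = 150, 2050
kept = [e for e in elements if inside_band(e, cutcoord_top, cutcoord_bottom)]
kept_texts = [e.text for e in kept]
print(kept_texts)
```

Like the snippet in the original post, this drops an element as soon as a single corner falls outside the band.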

@huangpan2507

Great! Thanks for your kind help, I will try it in my case.

@huangpan2507

Hi @ChiNoel-osu. By the way, I tried using the y coordinate to filter out the header and footer. But if I have a lot of different PDF files whose headers and footers are not in the same location, do I need to check the y coordinates for every PDF?

@ChiNoel-osu
Author

Unfortunately, yes.
However, if your headers and footers all have the same content, you could just leave them in and replace their text with an empty string later.
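
A rough sketch of that text-replacement approach, assuming the header text is known in advance. The strip_repeated_text helper and the PAGE_COUNTER pattern are hypothetical names made up for this example; the header lines come from the sample quoted earlier in this thread, and the body sentence is invented. It works on plain strings, so it could be applied to element.text after partitioning.

```python
import re

# Matches page counters such as "Page2of 29" or "Page 3 of 29" (pattern is an assumption).
PAGE_COUNTER = re.compile(r"Page\s*\d+\s*of\s*\d+")


def strip_repeated_text(text, repeated_lines):
    """Drop lines that match a known header/footer line or look like a page counter."""
    unwanted = {line.strip() for line in repeated_lines}
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in unwanted or PAGE_COUNTER.fullmatch(stripped):
            continue
        kept.append(line)
    return "\n".join(kept)


# Header lines taken from the sample in this thread; the last line of chunk_text is made up.
header_lines = ["2024-1-1", "Leave Policy xxx in China", "Human Resources"]
chunk_text = (
    "Page2of 29\n"
    "2024-1-1\n"
    "Leave Policy xxx in China\n"
    "Human Resources\n"
    "Employees may take annual leave."
)
cleaned = strip_repeated_text(chunk_text, header_lines)
print(cleaned)
```

Matching whole lines rather than substrings avoids deleting body text that merely happens to contain a header phrase.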
