Add manual coordinate constraints to partition_pdf(). #3072

Open
@ChiNoel-osu

Is your feature request related to a problem? Please describe.
I'm using the hi_res strategy for my PDF files because I need to extract all images, tables, etc. The PDF files have headers and footers that I want to remove.
Normally this wouldn't be a problem, because I could just use element.metadata.coordinates.points to filter them out. But in my documents the headers and footers contain logos, so they are detected as Image elements and partition_pdf() runs OCR on them. And OCR is slow.
If start/end coordinates could be specified, elements that fall outside them would not go through OCR, which would save time.

Describe the solution you'd like
The ability to specify start/end coordinates in the partition_pdf() function. It could perhaps be added to the other partition functions as well.
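
To make the requested behaviour concrete, here is a minimal sketch of gating OCR on a vertical band. Everything in it is hypothetical: `Region`, `ocr_inside_band`, and the `run_ocr` callback are illustration-only names, not part of unstructured, and the real hi_res pipeline is structured differently. It only shows the intended semantics: regions outside the band are dropped before any OCR call is made.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Region:
    """Hypothetical detected layout region (NOT an unstructured class)."""
    kind: str
    y_top: float
    y_bottom: float
    text: Optional[str] = None


def ocr_inside_band(
    regions: Iterable[Region],
    top: float,
    bottom: float,
    run_ocr: Callable[[Region], str],
) -> List[Region]:
    """Run OCR only on regions that lie entirely between top and bottom."""
    kept = []
    for region in regions:
        if not (top < region.y_top and region.y_bottom < bottom):
            # Outside the band (e.g. a header logo): skipped before OCR.
            continue
        if region.text is None:
            region.text = run_ocr(region)
        kept.append(region)
    return kept
```

With a header logo, a body table, and a footer logo on a page, only the table would reach `run_ocr`, which is where the time saving would come from.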

Describe alternatives you've considered
I came across #2455, but it's for the fast strategy and it doesn't allow manual constraints either.
Right now I have to wait for partition_pdf() to finish and then filter the headers and footers out using element.metadata.coordinates.points:

cleaned_elements = [
    element
    for element in elements
    # Nuke the element even if 1 point is outside the cutcoord.
    if all(
        cutcoord_top < coord[1] < cutcoord_bottom
        for coord in element.metadata.coordinates.points
    )
    and (start_page_number < element.metadata.page_number < stop_page_number)
]
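
For anyone who wants to reuse this workaround, the same filter can be wrapped in a small helper. This is only a sketch: `inside_band`, `filter_elements`, and `make_element` are names made up here, and `make_element` is just a stand-in that mimics the `metadata.coordinates.points` and `metadata.page_number` attributes so the snippet runs without unstructured installed. Dropping elements that have no coordinates is an assumption on my part; the snippet above would raise on them instead.

```python
from types import SimpleNamespace


def inside_band(element, top, bottom, first_page, last_page):
    """True if every point lies strictly between top and bottom and the
    page number lies strictly between first_page and last_page."""
    meta = element.metadata
    coords = getattr(meta, "coordinates", None)
    points = getattr(coords, "points", None)
    page = getattr(meta, "page_number", None)
    if not points or page is None:
        # Assumption: elements without layout info are dropped.
        return False
    return all(top < y < bottom for _, y in points) and first_page < page < last_page


def filter_elements(elements, top, bottom, first_page, last_page):
    """Keep only the elements that pass inside_band()."""
    return [e for e in elements if inside_band(e, top, bottom, first_page, last_page)]


def make_element(points, page):
    """Stand-in for a partitioned element (for illustration only)."""
    coords = SimpleNamespace(points=points) if points is not None else None
    return SimpleNamespace(
        metadata=SimpleNamespace(coordinates=coords, page_number=page)
    )
```

It would be called as `filter_elements(elements, cutcoord_top, cutcoord_bottom, start_page_number, stop_page_number)`. Note that, as in the snippet above, both comparisons are strict, so the start and stop pages themselves are excluded. This still only filters after OCR has already run, so it does not save any time.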

Additional context
Nope.
