Add manual coordinate constraints to `partition_pdf()`.

**Is your feature request related to a problem? Please describe.**
I'm using the `hi_res` strategy for my PDF files because I need to extract all images, tables, etc. The PDF files have headers and footers that I wish to remove.
Normally this wouldn't be a problem, because I would just use `element.metadata.coordinates.points` to filter them out. But in my documents the headers and footers contain logos, which are detected as `Image` elements, and `partition_pdf()` runs OCR on them. OCR is slow.
If start/end coordinates could be specified, elements that fall "outside" them would not go through OCR, which would save time.

**Describe the solution you'd like**
The ability to specify start/end coordinates in the `partition_pdf()` function. Maybe it could be added to other partition functions as well.
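
To illustrate the behaviour I have in mind, here is a rough sketch. It is not the `unstructured` API: `Region`, `inside_vertical_window()`, `ocr_regions_in_window()` and the `run_ocr` callback are all hypothetical names made up for this example. The idea is simply that the coordinate check happens before the expensive OCR step, so regions outside the window are never OCR'd.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Region:
    """Hypothetical stand-in for a detected layout region (not an unstructured class)."""

    kind: str
    points: Tuple[Point, ...]
    text: Optional[str] = None


def inside_vertical_window(points: Iterable[Point], top: float, bottom: float) -> bool:
    """Return True only if every point lies strictly between top and bottom."""
    return all(top < y < bottom for _, y in points)


def ocr_regions_in_window(
    regions: Iterable[Region],
    top: float,
    bottom: float,
    run_ocr: Callable[[Region], str],
) -> List[Region]:
    """Run the (expensive) OCR callback only on regions inside the window."""
    kept = []
    for region in regions:
        if not inside_vertical_window(region.points, top, bottom):
            continue  # outside the window: dropped before OCR is ever called
        region.text = run_ocr(region)
        kept.append(region)
    return kept
```

With something like this inside the `hi_res` pipeline, a logo in the header would be discarded right after layout detection instead of after OCR.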

**Describe alternatives you've considered**
I came across https://github.com/Unstructured-IO/unstructured/pull/2455, but it's for the `fast` strategy and it doesn't allow manual constraints either.
Right now I need to _**wait**_ for `partition_pdf()` to finish and then filter the headers and footers out using `element.metadata.coordinates.points`.
```python
cleaned_elements = [
    element
    for element in elements
    # Nuke the element even if 1 point is outside the cutcoord.
    if all(
        cutcoord_top < coord[1] < cutcoord_bottom
        for coord in element.metadata.coordinates.points
    )
    and (start_page_number < element.metadata.page_number < stop_page_number)
]
```
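
For anyone who wants to try the workaround without a PDF at hand, here is a self-contained version of the same filter. The `filter_elements()` and `make_element()` helpers are names I made up, and the `SimpleNamespace` objects only mimic the `metadata.coordinates.points` and `metadata.page_number` attributes used above. Dropping elements that have no coordinates is my assumption; the snippet above would raise on them instead. Both the coordinate and the page bounds are exclusive, as in the snippet.

```python
from types import SimpleNamespace


def filter_elements(elements, cutcoord_top, cutcoord_bottom,
                    start_page_number, stop_page_number):
    """Keep elements fully inside the vertical window and the page range."""
    cleaned = []
    for element in elements:
        coordinates = getattr(element.metadata, "coordinates", None)
        if coordinates is None or coordinates.points is None:
            continue  # assumption: drop elements whose position is unknown
        # Drop the element even if a single point is outside the window.
        in_window = all(
            cutcoord_top < y < cutcoord_bottom for _, y in coordinates.points
        )
        in_pages = start_page_number < element.metadata.page_number < stop_page_number
        if in_window and in_pages:
            cleaned.append(element)
    return cleaned


def make_element(name, page_number, points):
    """Build a stand-in object with the attributes the filter reads."""
    coordinates = SimpleNamespace(points=points) if points else None
    metadata = SimpleNamespace(page_number=page_number, coordinates=coordinates)
    return SimpleNamespace(name=name, metadata=metadata)


elements = [
    make_element("header logo", 2, ((10, 10), (10, 40), (200, 40), (200, 10))),
    make_element("body table", 2, ((10, 100), (10, 500), (500, 500), (500, 100))),
    make_element("footer", 2, ((10, 760), (10, 790), (200, 790), (200, 760))),
]
cleaned_elements = filter_elements(elements, 50, 750, 1, 10)
```

This still only runs after `partition_pdf()` has finished, so it does not save any OCR time.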

**Additional context**
Nope.


Add manual coordinate constraints to `partition_pdf()`. #3072
