Description
Is your feature request related to a problem? Please describe.
I'm using the hi_res strategy for my PDF files because I need to extract all images, tables, etc. The PDF files have headers and footers that I want to remove.
Normally this wouldn't be a problem, because I would just use element.metadata.coordinates.points to filter them out. In my documents, however, the headers and footers contain logos, so each one is detected as an Image and partition_pdf() runs OCR on it, and OCR is slow.
If start/end coordinates could be specified, elements that fall outside them would not need to go through OCR, which would save time.
Describe the solution you'd like
Ability to specify start/end coordinates in the partition_pdf() function. Maybe it can be added to other partition functions as well.
Describe alternatives you've considered
I came across #2455, but it's for the fast strategy and it doesn't allow manual constraints either.
Right now I need to wait for partition_pdf() to finish and then filter the headers and footers out using element.metadata.coordinates.points:
cleaned_elements = [
    element
    for element in elements
    # Nuke the element even if 1 point is outside the cutcoord.
    if all(
        cutcoord_top < coord[1] < cutcoord_bottom
        for coord in element.metadata.coordinates.points
    )
    and (start_page_number < element.metadata.page_number < stop_page_number)
]
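For anyone who wants to try the workaround without running partition_pdf(), here is a minimal, self-contained sketch of the same filter. The Element, Metadata, and Coordinates dataclasses are hypothetical stand-ins that only mimic the attributes used above (metadata.coordinates.points and metadata.page_number); they are not the library's classes, and the cut-off values are made up. It also assumes every element carries coordinates.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y)


# Hypothetical stand-ins that mimic only the attributes the filter reads.
@dataclass
class Coordinates:
    points: Sequence[Point]


@dataclass
class Metadata:
    coordinates: Coordinates
    page_number: int


@dataclass
class Element:
    text: str
    metadata: Metadata


def inside_band(element: Element, top: float, bottom: float) -> bool:
    """True only if every corner's y value lies strictly between top and bottom."""
    return all(top < y < bottom for _, y in element.metadata.coordinates.points)


def drop_headers_and_footers(
    elements: Sequence[Element],
    top: float,
    bottom: float,
    start_page: int,
    stop_page: int,
) -> List[Element]:
    """Keep elements fully inside the vertical band.

    Page bounds are exclusive, mirroring the snippet above.
    """
    return [
        e
        for e in elements
        if inside_band(e, top, bottom)
        and start_page < e.metadata.page_number < stop_page
    ]


def make_element(text: str, y0: float, y1: float, page: int) -> Element:
    # Build a rectangle spanning x = 50..300 and y = y0..y1 (illustrative values).
    box = [(50.0, y0), (50.0, y1), (300.0, y1), (300.0, y0)]
    return Element(text, Metadata(Coordinates(box), page))


sample = [
    make_element("logo in header", 10, 60, page=2),
    make_element("body paragraph", 100, 200, page=2),
    make_element("footer text", 760, 790, page=2),
    make_element("body on excluded page", 100, 200, page=1),
]

kept = drop_headers_and_footers(sample, top=70, bottom=750, start_page=1, stop_page=4)
```

This still runs after partitioning, so it does not avoid the OCR cost; it only shows the filtering logic that the requested option would ideally apply before OCR.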
Additional context
Nope.