Description
Is your feature request related to a problem? Please describe.
I'm using the hi_res strategy for my PDF files because I need to extract all images, tables, etc. The PDF files have headers and footers that I want to remove.
Normally this wouldn't be a problem, because I could just use element.metadata.coordinates.points to filter them out. But in my documents the headers and footers contain logos, which are detected as Image elements, so partition_pdf() runs OCR on them. And OCR is slow.
If start/end coordinates could be specified, elements that fall "outside" them would not go through OCR, which would save time.
Describe the solution you'd like
Ability to specify start/end coordinates in the partition_pdf() function. Maybe it could be added to the other partition functions as well.
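To illustrate the semantics being asked for, here is a minimal sketch of a "keep region" check. Everything in it is hypothetical: `Region`, its fields, and `contains` are illustrative names, not part of the unstructured API, and it assumes pixel coordinates with the origin at the top-left of the page. The idea is that an element whose bounding box is not fully inside the region would be skipped before OCR.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Region:
    """Hypothetical keep-region on a page (assumes origin at top-left)."""

    top: float
    bottom: float
    left: float = float("-inf")
    right: float = float("inf")

    def contains(self, points: Sequence[Point]) -> bool:
        # Keep an element only if every corner lies strictly inside the region;
        # a single corner outside is enough to reject it.
        return all(
            self.left < x < self.right and self.top < y < self.bottom
            for x, y in points
        )


# Illustrative values only: a body area between y=100 and y=700.
body = Region(top=100, bottom=700)
logo_box = [(40, 20), (40, 80), (200, 80), (200, 20)]            # header logo
paragraph_box = [(40, 150), (40, 300), (500, 300), (500, 150)]   # body text

keep_logo = body.contains(logo_box)
keep_paragraph = body.contains(paragraph_box)
```

With a check like this applied to the detected layout boxes, the logo in the header would be dropped before OCR while the paragraph would be processed as usual.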
Describe alternatives you've considered
I came across #2455, but it's for the fast strategy and it doesn't allow manual constraints either.
Right now I need to wait for partition_pdf() to finish and then filter the headers and footers out using element.metadata.coordinates.points.
cleaned_elements = [
    element
    for element in elements
    # Nuke the element even if 1 point is outside the cutcoord.
    if all(
        cutcoord_top < coord[1] < cutcoord_bottom
        for coord in element.metadata.coordinates.points
    )
    and (start_page_number < element.metadata.page_number < stop_page_number)
]
Additional context
Nope.