Add manual coordinate constraints to partition_pdf()
#3072
Comments
Hi @ChiNoel-osu! Thanks for submitting this. Would filtering on the coordinates after partitioning solve your use case?
@MthwRobinson Yes, and that's what I'm doing right now. But I think filtering out the elements early, before OCR takes place, could significantly speed up the partitioning process.
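For readers following along, the post-partition filtering discussed above can be sketched roughly as follows. This is only an illustration, not the library's own method: the `Element`, `Metadata`, and `Coordinates` classes are stand-ins that mimic the `element.metadata.coordinates.points` attribute path mentioned in this issue, and `y_min`, `y_max`, `inside_y_range`, and `drop_outside` are hypothetical names. It also assumes a coordinate system whose y values grow downward from the top of the page.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


# Stand-in classes (assumption): they mimic only the attribute path
# element.metadata.coordinates.points so the sketch runs on its own.
@dataclass
class Coordinates:
    points: Sequence[Point]


@dataclass
class Metadata:
    coordinates: Optional[Coordinates]


@dataclass
class Element:
    text: str
    metadata: Metadata


def inside_y_range(element, y_min: float, y_max: float) -> bool:
    """True if every corner of the element's box lies within [y_min, y_max]."""
    coords = element.metadata.coordinates
    if coords is None or not coords.points:
        return True  # no layout information: keep the element
    ys = [y for _, y in coords.points]
    return min(ys) >= y_min and max(ys) <= y_max


def drop_outside(elements, y_min: float, y_max: float) -> List:
    """Keep only elements whose box is fully inside the vertical band."""
    return [el for el in elements if inside_y_range(el, y_min, y_max)]


def box(top: float, bottom: float) -> Coordinates:
    # Four corners: top-left, bottom-left, bottom-right, top-right.
    return Coordinates(points=((0.0, top), (0.0, bottom), (100.0, bottom), (100.0, top)))


elements = [
    Element("Company logo", Metadata(box(10, 60))),
    Element("Body paragraph", Metadata(box(200, 400))),
    Element("Page 1 of 9", Metadata(box(1050, 1080))),
]

# y_min / y_max are hypothetical thresholds, chosen by inspecting one document.
kept = drop_outside(elements, y_min=100, y_max=1000)
print([el.text for el in kept])  # ['Body paragraph']
```

Because this runs after partitioning, any OCR work on the dropped elements has already been done, which is exactly the cost this issue asks to avoid.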
Hi, @ChiNoel-osu
Hi @huangpan2507, they're just custom variables; change them to get different results.
@ChiNoel-osu Thanks for your reply, but I'm still wondering how to obtain these custom variables. I ran into a problem with partition_pdf: I want to filter out the header and footer, but it extracts something like
This text belongs to the header, yet it ends up in an element whose .category is CompositeElement.
@huangpan2507 When you do
Great! Thanks for your kind help, I will try it in my case.
Hi, @ChiNoel-osu By the way, I tried using the y coordinate to filter out the header and footer. But if I have many different PDF files, and the headers and footers are not in the same position in each of them, do I need to check the y coordinate for every PDF?
Unfortunately, yes.
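One way to reduce (but not eliminate) the per-file tuning is to express the thresholds as fractions of the page height instead of absolute y values. The sketch below is an assumption-laden illustration: it presumes each element's coordinates carry a `system` object with a `height` attribute and a top-left origin, and the names `in_body_band`, `top_frac`, and `bottom_frac` and their default values are hypothetical. It only helps when headers and footers occupy a roughly similar share of the page across documents; otherwise each PDF still needs to be checked.

```python
from types import SimpleNamespace


def in_body_band(element, top_frac=0.08, bottom_frac=0.92):
    """True if the element's box lies between the two fractional margins.

    Assumes coordinates.system.height is the page height and that y grows
    downward from the top of the page (both are assumptions of this sketch).
    """
    coords = getattr(element.metadata, "coordinates", None)
    if coords is None or not coords.points or coords.system is None:
        return True  # no layout information: keep the element
    page_height = coords.system.height
    ys = [y for _, y in coords.points]
    return min(ys) >= top_frac * page_height and max(ys) <= bottom_frac * page_height


def make_element(text, top, bottom, page_height):
    # Mock object exposing only the attributes the filter reads.
    system = SimpleNamespace(width=1000, height=page_height)
    points = ((0, top), (0, bottom), (1000, bottom), (1000, top))
    coords = SimpleNamespace(points=points, system=system)
    return SimpleNamespace(text=text, metadata=SimpleNamespace(coordinates=coords))


# Two mock pages with different pixel heights share the same fractions.
small_page = [
    make_element("header", 20, 60, page_height=1000),
    make_element("body", 300, 500, page_height=1000),
]
large_page = [
    make_element("header", 40, 130, page_height=2200),
    make_element("body", 600, 900, page_height=2200),
    make_element("footer", 2100, 2180, page_height=2200),
]

for page in (small_page, large_page):
    print([el.text for el in page if in_body_band(el)])
```

As with any filter applied after partitioning, this does not avoid the OCR cost on the discarded header and footer images.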
Is your feature request related to a problem? Please describe.
I'm using the hi_res strategy for my PDF files because I need to extract all images, tables, etc., and the PDF files have headers and footers that I wish to remove. Normally this wouldn't be a problem, because I would just use element.metadata.coordinates.points to filter them out. In my documents, however, the headers and footers contain logos, which makes them Image elements, and partition_pdf() will run OCR on them. OCR is slow. If start/end coordinates could be specified, elements considered "outside" would not go through OCR, which would save time.
Describe the solution you'd like
The ability to specify start/end coordinates in the partition_pdf() function. Maybe it can be added to other partition functions as well.
Describe alternatives you've considered
I came across #2455, but it's for the fast strategy and it doesn't allow manual constraints either. Right now I need to wait for partition_pdf() to finish and then filter the headers and footers out using element.metadata.coordinates.points.
Additional context
Nope.