Add support for ImageElements -> Parse images #64
Thanks for the PR - this looks awesome! I will take a deeper look into this asap |
Okay. If you need anything, please let me know. I'm happy to discuss it with you. |
PdfMiner looks good - I changed the image schema to encode the image data as a base64 str to allow for easy serialization. We also added a MIME type. Finally, I added a test - that looks good.
Looks like the pymupdf implementation is failing for me.
    import openparse
    basic_doc_path = "/Users/sergey/Downloads/pdf-with-image.pdf"
    pdf_obj = openparse.Pdf(basic_doc_path)
    parsed_basic_doc = ingest(pdf_obj)
Returns
I noticed you flipped the coordinates? Maybe that's why?
    fy0 = page.rect.height - node["bbox"][1]
    fy1 = page.rect.height - node["bbox"][3]
To be honest, I'm not even sure how important it is to have this implemented for pymupdf, because those documents are already OCR'd and I'm not sure how that affects this. |
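On the coordinate question above: PDF-native coordinates (as reported by pdfminer) put the origin at the bottom-left of the page, while PyMuPDF puts it at the top-left, so a y-flip is the usual conversion between them. One thing worth checking is that subtracting each y value from the page height also swaps which edge is the upper one, so the flipped pair comes out inverted unless it is reordered. The sketch below illustrates this; `flip_bbox_y` is a hypothetical helper, not part of openparse, and it assumes the bbox is `(x0, y0, x1, y1)` with `y0 <= y1`. Whether this is the cause of the failure reported here is not confirmed.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def flip_bbox_y(bbox: BBox, page_height: float) -> BBox:
    """Convert a bbox between bottom-left and top-left origins.

    Hypothetical helper for illustration; assumes bbox is
    (x0, y0, x1, y1) with y0 <= y1 in the source coordinate system.
    """
    x0, y0, x1, y1 = bbox
    # The old upper edge (y1) becomes the new lower-valued edge,
    # so the two y values trade places to keep y0 <= y1.
    flipped_y0 = page_height - y1
    flipped_y1 = page_height - y0
    return (x0, flipped_y0, x1, flipped_y1)


if __name__ == "__main__":
    # A US Letter page is 792 points tall.
    print(flip_bbox_y((10, 20, 110, 70), 792))
```

Applying the flip twice returns the original box, which makes for a quick sanity check in a test.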
You can see the changes in the "parse-images-pdf-miner" branch - I can't seem to figure out how to merge it into this PR |
Indeed, for pymupdf, if the image has already been OCRed, then there is no need to parse the image anymore. However, I assumed you would handle OCR in a later step and only parse the text at this stage. |
How about this, I'll submit the image parsing part for pdfminer first. As for pymupdf, I'll keep it as it is, without adding an image parsing module. Is that okay? |
Sounds good! |
Previously, the images in a PDF were lost during parsing. RAG now needs to use images, so I added an image extraction function.
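As a rough illustration of the schema change mentioned in the review (image data stored as a base64 str alongside a MIME type), here is a minimal sketch. The `ImageRecord` class and its field names are assumptions made for this example and may not match the actual openparse schema. It only shows why base64 text makes the extracted image easy to serialize to JSON.

```python
import base64
import json
from dataclasses import asdict, dataclass


@dataclass
class ImageRecord:
    """Hypothetical container; not the real openparse image element."""

    image: str  # image bytes encoded as base64 text
    image_mimetype: str  # e.g. "image/png"


def encode_image(raw: bytes, mimetype: str) -> ImageRecord:
    # b64encode returns bytes; decode to str so the record is JSON-safe.
    return ImageRecord(
        image=base64.b64encode(raw).decode("ascii"),
        image_mimetype=mimetype,
    )


def decode_image(record: ImageRecord) -> bytes:
    # Recover the original image bytes from the base64 text.
    return base64.b64decode(record.image)


if __name__ == "__main__":
    png_signature = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of any PNG file
    record = encode_image(png_signature, "image/png")
    payload = json.dumps(asdict(record))  # plain JSON, no binary fields
    restored = ImageRecord(**json.loads(payload))
    assert decode_image(restored) == png_signature
```

Raw bytes cannot be passed to `json.dumps` directly, which is why the encoding step is needed before a parsed document can be stored or sent to a downstream RAG pipeline.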