dify doesn't seem to support parsing image-only PDF files #11063

Tuyohai · 2024-11-25T08:37:30Z

Self Checks

This is only for bug reports; if you would like to ask a question, please head to Discussions.
I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
Please do not modify this template :) and fill in all the required fields.

Dify version

0.11.2

Cloud or Self Hosted

Cloud

Steps to reproduce

Upload image-only PDF to the knowledge base.

✔️ Expected Behavior

Parse out the text in the image.

❌ Actual Behavior

The parsed result is empty.

dosubot · 2024-11-25T08:39:40Z

Hey @Tuyohai! I'm here to assist you with any bugs, questions, or contributions for Dify. Let's tackle this issue together!

I found a similar unsolved discussion regarding Dify's ability to parse images in PDFs and embed them into vector databases ^[1]. Additionally, there are no existing solutions or built-in OCR capabilities in Dify for extracting text from images within PDFs. The current PDF processing in Dify uses the pypdfium2 library, which does not support OCR for image-only PDFs ^[2].

To extend Dify's capabilities to include OCR for image-only PDFs, you might need to integrate an OCR library like Tesseract or a similar tool that can handle image-based text extraction.
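As a rough illustration of that suggestion, the sketch below reads each page's embedded text and runs OCR only on pages where that text is empty, which is the case for image-only PDFs. It is not Dify's extractor code. The names `extract_with_ocr_fallback`, `pdf_pages`, and `tesseract_ocr` are hypothetical, and the two helper functions assume the third-party packages `pypdfium2` and `pytesseract` (plus a local Tesseract install) are available.

```python
from typing import Any, Callable, Iterable, List, Tuple

# One page = (embedded_text, render). `render` is a zero-argument callable
# that returns an image; it is only invoked when OCR is actually needed.
Page = Tuple[str, Callable[[], Any]]


def extract_with_ocr_fallback(
    pages: Iterable[Page],
    ocr: Callable[[Any], str],
    min_chars: int = 1,
) -> List[str]:
    """Return one text string per page, using OCR for pages without a text layer."""
    results: List[str] = []
    for embedded_text, render in pages:
        if len(embedded_text.strip()) >= min_chars:
            # The page has a usable text layer, so no OCR is needed.
            results.append(embedded_text)
        else:
            # Image-only page: rasterize it and hand the image to the OCR engine.
            results.append(ocr(render()))
    return results


def pdf_pages(path: str, scale: float = 2.0) -> List[Page]:
    """Hypothetical helper: build Page tuples with pypdfium2 (assumed installed)."""
    import pypdfium2 as pdfium  # third-party; imported lazily

    pdf = pdfium.PdfDocument(path)
    pages: List[Page] = []
    for index in range(len(pdf)):
        page = pdf[index]
        text = page.get_textpage().get_text_range()
        # Bind `page` as a default argument so each closure renders its own page.
        pages.append((text, lambda page=page: page.render(scale=scale).to_pil()))
    return pages


def tesseract_ocr(image: Any) -> str:
    """Hypothetical helper: OCR a PIL image with pytesseract (assumed installed)."""
    import pytesseract  # third-party; requires the Tesseract binary

    return pytesseract.image_to_string(image)
```

With those assumptions, a call such as `extract_with_ocr_fallback(pdf_pages("scan.pdf"), tesseract_ocr)` would return one string per page, where `scan.pdf` is a placeholder path. Passing the OCR function in as a parameter keeps the fallback logic independent of the engine, so a hosted OCR or LLM service could be swapped in for Tesseract.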

To continue talking to Dosu, mention @dosu.

Tuyohai · 2024-11-25T08:42:27Z

Would you consider adding OCR support? It could be done by deploying an OCR model or by using a third-party service, such as an LLM service.

crazywoola · 2024-11-25T12:03:26Z

We do have a tool pending review: #10829.

dosubot · 2024-12-26T16:07:00Z

Hi, @Tuyohai. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary:

You reported that Dify version 0.11.2 cannot parse text from image-only PDF files.
I noted the lack of OCR capabilities and suggested integrating an OCR library.
You inquired about adding OCR recognition or using a third-party service.
Crazywoola mentioned a pending tool review that might resolve this issue, linking to a pull request.

Next Steps:

Please let us know if this issue is still relevant to the latest version of Dify. You can keep the discussion open by commenting on the issue.
If there is no further activity, this issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

dosubot bot added the "cloud" (When the version is cloud and it is a bug report) and "🐞 bug" (Something isn't working) labels Nov 25, 2024

crazywoola removed the "🐞 bug" (Something isn't working) label Nov 25, 2024

crazywoola mentioned this issue Nov 26, 2024, in: When the knowledge base performs segmentation analysis, pictures in the document are lost #11134 (closed)

dosubot bot added the "stale" (Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed) label Dec 26, 2024
