dify doesn't seem to support parsing image-only PDF files #11063

Open
5 tasks done
Tuyohai opened this issue Nov 25, 2024 · 4 comments
Labels
  • cloud: When the version is cloud and it is a bug report
  • stale: Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed

Comments

@Tuyohai
Contributor

Tuyohai commented Nov 25, 2024

Self Checks

  • This is only for bug report, if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0.11.2

Cloud or Self Hosted

Cloud

Steps to reproduce

Upload image-only PDF to the knowledge base.

✔️ Expected Behavior

Parse out the text in the image.

❌ Actual Behavior

The parsed result is empty.


dosubot bot commented Nov 25, 2024

Hey @Tuyohai! I'm here to assist you with any bugs, questions, or contributions for Dify. Let's tackle this issue together!

I found a similar unsolved discussion regarding Dify's ability to parse images in PDFs and embed them into vector databases [1]. Additionally, there are no existing solutions or built-in OCR capabilities in Dify for extracting text from images within PDFs. The current PDF processing in Dify uses the pypdfium2 library, which does not support OCR for image-only PDFs [2].

To extend Dify's capabilities to include OCR for image-only PDFs, you might need to integrate an OCR library like Tesseract or a similar tool that can handle image-based text extraction.
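As a rough illustration of that suggestion (not Dify's actual implementation), the sketch below reads each page's text layer first and falls back to OCR only for pages where the text layer is empty. The helper names `looks_image_only`, `extract_with_ocr_fallback`, and `extract_pdf` are hypothetical. The `extract_pdf` wiring assumes that both pypdfium2 and pytesseract are installed and that a Tesseract binary is available on the system.

```python
from typing import Callable, List, Sequence


def _has_text(text: str, min_chars: int = 1) -> bool:
    """True when `text` holds at least `min_chars` non-whitespace characters."""
    return len("".join(text.split())) >= min_chars


def looks_image_only(page_texts: Sequence[str], min_chars: int = 1) -> bool:
    """True when no page's text layer yields usable text (likely a scanned PDF)."""
    return not any(_has_text(t, min_chars) for t in page_texts)


def extract_with_ocr_fallback(
    page_count: int,
    read_text_layer: Callable[[int], str],
    run_ocr: Callable[[int], str],
    min_chars: int = 1,
) -> List[str]:
    """Prefer the embedded text layer; run OCR only on pages where it is empty."""
    results: List[str] = []
    for index in range(page_count):
        text = read_text_layer(index)
        if not _has_text(text, min_chars):
            text = run_ocr(index)
        results.append(text)
    return results


def extract_pdf(path: str, scale: float = 2.0) -> List[str]:
    """Hypothetical wiring: pypdfium2 for text and rendering, Tesseract for OCR."""
    # Imported lazily so the fallback logic above works without these packages.
    import pypdfium2 as pdfium
    import pytesseract

    pdf = pdfium.PdfDocument(path)

    def read_text_layer(index: int) -> str:
        return pdf[index].get_textpage().get_text_range()

    def run_ocr(index: int) -> str:
        # Render the page to a PIL image; a scale above 1 usually helps OCR.
        image = pdf[index].render(scale=scale).to_pil()
        return pytesseract.image_to_string(image)

    return extract_with_ocr_fallback(len(pdf), read_text_layer, run_ocr)
```

Passing the text-layer reader and the OCR step as callables keeps the fallback decision independent of any particular OCR engine, so a hosted OCR or LLM service could be swapped in for `run_ocr` without touching the rest.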


@dosubot dosubot bot added the cloud and 🐞 bug labels Nov 25, 2024
@Tuyohai
Contributor Author

Tuyohai commented Nov 25, 2024

Would you consider adding OCR support? It could be done either by deploying an OCR model or by calling a third-party service, such as an LLM service.

@crazywoola crazywoola removed the 🐞 bug label Nov 25, 2024
@crazywoola
Member

We do have a tool pending review: #10829.


dosubot bot commented Dec 26, 2024

Hi, @Tuyohai. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • You reported that Dify version 0.11.2 cannot parse text from image-only PDF files.
  • I noted the lack of OCR capabilities and suggested integrating an OCR library.
  • You inquired about adding OCR recognition or using a third-party service.
  • Crazywoola mentioned a pending tool review that might resolve this issue, linking to a pull request.

Next Steps:

  • Please let us know if this issue is still relevant to the latest version of Dify. You can keep the discussion open by commenting on the issue.
  • If there is no further activity, this issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

@dosubot dosubot bot added the stale label Dec 26, 2024