feat: enhance file processing capabilities with PDF support #7236

Open
wants to merge 1 commit into
base: main

Conversation

rodrigosnader
Contributor

  • Added support for processing PDF files using the unstructured library in the FileComponent.
  • Updated the process_file method to return a list of Data objects when processing PDFs.
  • Introduced a new private method _process_pdf_with_unstructured to handle PDF extraction and conversion to Data objects.
  • Updated pyproject.toml to include the unstructured library as a dependency for PDF processing.
  • Refactored starter project JSON files to ensure compatibility with the new data structure.
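
As a rough illustration of the element-to-Data conversion described above, here is a minimal sketch. The `Data` dataclass is a hypothetical stand-in for the project's own class, and `elements_to_data` and `process_pdf_with_unstructured` are illustrative names, not the code in this PR. The only library call assumed is `partition_pdf` from `unstructured.partition.pdf`.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable


@dataclass
class Data:
    """Hypothetical stand-in for the project's Data class: text plus a metadata dict."""

    text: str
    data: dict[str, Any] = field(default_factory=dict)


def elements_to_data(elements: Iterable[Any], file_path: str) -> list[Data]:
    """Wrap each non-empty partitioned element in a Data object."""
    results: list[Data] = []
    for index, element in enumerate(elements):
        text = (getattr(element, "text", "") or "").strip()
        if not text:
            continue  # skip elements that carry no text
        results.append(
            Data(
                text=text,
                data={
                    "file_path": file_path,
                    "element_index": index,
                    # unstructured elements expose a category such as "Title"
                    "category": getattr(element, "category", "Unknown"),
                },
            )
        )
    return results


def process_pdf_with_unstructured(file_path: str) -> list[Data]:
    """Partition a PDF and convert the elements; needs unstructured[pdf] installed."""
    # Imported lazily so this module still loads when unstructured is absent.
    from unstructured.partition.pdf import partition_pdf

    return elements_to_data(partition_pdf(filename=file_path), file_path)
```

Keeping the conversion separate from the partition call means the mapping can be tested with fake elements, without installing the PDF extras.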

@dosubot (bot) added the labels size:XL (this PR changes 500-999 lines, ignoring generated files) and enhancement (new feature or request) on Mar 23, 2025
@github-actions (bot) added and then removed the enhancement label on Mar 23, 2025
@rodrigosnader
Contributor Author

This adds the ability to partition PDFs using Unstructured.

Still to evaluate:

  1. Strategy: unstructured offers auto, fast, hi_res, etc. This is not yet exposed as an option in the component (hi_res may require additional dependencies).
  2. Other file types: this PR handles PDF only.
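
For point 1, one way the strategy could be exposed is sketched below. This is only an assumption about how an option might be wired in: `build_partition_kwargs` and `PDF_STRATEGIES` are hypothetical names, and the component does not currently have such an input. The strategy values are the ones `partition_pdf` accepts.

```python
from typing import Any

# Strategy names accepted by unstructured's PDF partitioner.
PDF_STRATEGIES = ("auto", "fast", "hi_res", "ocr_only")


def build_partition_kwargs(file_path: str, strategy: str = "auto") -> dict[str, Any]:
    """Validate a user-supplied strategy and build keyword arguments for partition_pdf."""
    normalized = strategy.strip().lower()
    if normalized not in PDF_STRATEGIES:
        allowed = ", ".join(PDF_STRATEGIES)
        raise ValueError(f"Unknown strategy {strategy!r}; expected one of: {allowed}")
    return {"filename": file_path, "strategy": normalized}
```

The result would be passed along as `partition_pdf(**build_partition_kwargs(path, "fast"))`. Validating up front gives a clear error before any heavy optional dependency is imported.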

@@ -120,6 +120,7 @@ dependencies = [
     "langchain-graph-retriever==0.6.1",
     "graph-retriever==0.6.1",
     "opik>=1.6.3",
+    "unstructured[pdf]>=0.17.2",
Contributor

This pulls in torch as a dependency, which we should never do. Please check whether a different extra avoids it.
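
A quick way to check this locally might look like the sketch below. It walks the declared requirements of installed distributions using the standard library's `importlib.metadata` and reports whether a target package is reachable. `pulls_in` and `requirement_names` are hypothetical helper names. The walk ignores environment and extra markers, so it over-approximates, and it only sees packages installed in the current environment.

```python
import re
from importlib import metadata
from typing import Callable, Iterable, Optional

_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")

RequiresFn = Callable[[str], Optional[Iterable[str]]]


def _normalize(name: str) -> str:
    """Normalize a distribution name (lowercase, runs of -, _ and . become -)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_names(dist: str, get_requires: RequiresFn = metadata.requires) -> set[str]:
    """Return the normalized names a distribution declares, markers ignored."""
    try:
        requirements = get_requires(dist) or []
    except metadata.PackageNotFoundError:
        return set()  # not installed, so nothing to inspect
    names = set()
    for requirement in requirements:
        match = _NAME.match(requirement)
        if match:
            names.add(_normalize(match.group(1)))
    return names


def pulls_in(root: str, target: str, get_requires: RequiresFn = metadata.requires) -> bool:
    """Report whether target is reachable from root through declared requirements."""
    wanted = _normalize(target)
    seen: set[str] = set()
    pending = [_normalize(root)]
    while pending:
        current = pending.pop()
        if current in seen:
            continue
        seen.add(current)
        deps = requirement_names(current, get_requires)
        if wanted in deps:
            return True
        pending.extend(deps - seen)
    return False
```

Running `pulls_in("unstructured", "torch")` in an environment where the extra is installed would show whether torch is reachable. Because markers are ignored, a `True` result still needs a manual look at which extra is responsible.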
