Skip to content

fix(build): consolidate pdf parsing dependencies, remove extraneous html deps#1212

Merged
waleedlatif1 merged 2 commits intostagingfrom
fix/build
Sep 1, 2025
Merged

fix(build): consolidate pdf parsing dependencies, remove extraneous html deps#1212
waleedlatif1 merged 2 commits intostagingfrom
fix/build

Conversation

@waleedlatif1
Copy link
Collaborator

Summary

consolidate pdf parsing dependencies, remove extraneous html deps, we had multiple pdf parse libraries but only really need one. we also explicitly declared a downstream dependency that we didn't need to declare, which caused conflicts and let to the failed gh action to publish the image

Type of Change

  • Bug fix

Testing

Tested manually.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel
Copy link

vercel bot commented Sep 1, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
docs Ready Ready Preview Comment Sep 1, 2025 5:19pm
sim Building Building Preview Comment Sep 1, 2025 5:19pm

@waleedlatif1 waleedlatif1 merged commit ea09fce into staging Sep 1, 2025
2 of 4 checks passed
@waleedlatif1 waleedlatif1 deleted the fix/build branch September 1, 2025 17:19
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR consolidates PDF parsing dependencies and removes extraneous HTML dependencies to fix build conflicts that were preventing GitHub Actions from successfully publishing Docker images. The changes eliminate redundant PDF parsing libraries and streamline the file parsing architecture.

Key Changes Made:

  1. PDF Parser Consolidation: Removed the complex dual PDF parsing approach that used both pdf-lib and a custom RawPdfParser class. The codebase now uses only the pdf-parse library through a simplified PdfParser implementation.

  2. Dependency Cleanup: Updated package.json to remove pdf-lib and add @types/pdf-parse for TypeScript support. Also added entities for HTML entity handling and sharp for image processing, while pinning @types/html-to-text to an exact version.

  3. Simplified Error Handling: Removed the complex fallback mechanism in the file parsing API route (apps/sim/app/api/files/parse/route.ts) that would attempt multiple PDF parsers. The system now relies on a single parsing approach with cleaner error handling.

  4. Code Cleanup: Removed the entire 467-line RawPdfParser class and updated the file parser index to only load the main PdfParser. Also removed an unnecessary comment in the Outlook polling service.

  5. Test Updates: Updated test mocks to reflect the removal of the RawPdfParser dependency.

The changes integrate well with the existing file processing system, maintaining the same API surface while reducing complexity. The file parser loader (apps/sim/lib/file-parsers/index.ts) continues to support multiple formats (PDF, CSV, DOCX, TXT, MD, PPTX, HTML) but now uses a simpler, more maintainable architecture for PDF processing.

Confidence score: 3/5

  • This PR addresses a real build issue but removes fallback mechanisms that may have provided better error recovery for problematic PDF files
  • Score reflects the trade-off between simplified maintenance and potentially reduced robustness in PDF parsing edge cases
  • Pay close attention to apps/sim/lib/file-parsers/pdf-parser.ts and apps/sim/app/api/files/parse/route.ts for potential PDF parsing failures with complex documents

7 files reviewed, no comments

Edit Code Review Bot Settings | Greptile

waleedlatif1 added a commit that referenced this pull request Sep 1, 2025
…tml deps (#1212)

* fix(build): consolidate pdf parsing dependencies, remove extraneous html deps

* add types
waleedlatif1 added a commit that referenced this pull request Sep 2, 2025
…tml deps (#1212)

* fix(build): consolidate pdf parsing dependencies, remove extraneous html deps

* add types
arenadeveloper02 pushed a commit to arenadeveloper02/p2-sim that referenced this pull request Sep 19, 2025
…tml deps (simstudioai#1212)

* fix(build): consolidate pdf parsing dependencies, remove extraneous html deps

* add types
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant