fix(build): consolidate pdf parsing dependencies, remove extraneous html deps #1212
waleedlatif1 merged 2 commits into staging
Greptile Summary
This PR consolidates PDF parsing dependencies and removes extraneous HTML dependencies to fix build conflicts that were preventing GitHub Actions from successfully publishing Docker images. The changes eliminate redundant PDF parsing libraries and streamline the file parsing architecture.
Key Changes Made:
- PDF Parser Consolidation: Removed the complex dual PDF parsing approach that used both pdf-lib and a custom RawPdfParser class. The codebase now uses only the pdf-parse library through a simplified PdfParser implementation.
- Dependency Cleanup: Updated package.json to remove pdf-lib and add @types/pdf-parse for TypeScript support. Also added entities for HTML entity handling and sharp for image processing, while pinning @types/html-to-text to an exact version.
- Simplified Error Handling: Removed the complex fallback mechanism in the file parsing API route (apps/sim/app/api/files/parse/route.ts) that would attempt multiple PDF parsers. The system now relies on a single parsing approach with cleaner error handling.
- Code Cleanup: Removed the entire 467-line RawPdfParser class and updated the file parser index to only load the main PdfParser. Also removed an unnecessary comment in the Outlook polling service.
- Test Updates: Updated test mocks to reflect the removal of the RawPdfParser dependency.
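For illustration, a single-parser PdfParser could look roughly like the sketch below. This is not the code from the PR: the class shape, the `FileParseResult` type, and the injected `PdfParseFn` are assumptions made so the example runs without installing anything. Only the `text`, `numpages`, and `info` field names are taken from the result object that pdf-parse resolves to.

```typescript
// Hypothetical result shape, modeled on the text, numpages, and info
// fields of the object that pdf-parse resolves to.
interface PdfParseResult {
  text: string
  numpages: number
  info?: unknown
}

// Assumption: the parsing function is injected rather than imported,
// so this sketch stays self-contained and easy to stub in tests.
type PdfParseFn = (data: Uint8Array) => Promise<PdfParseResult>

// Hypothetical return type for the simplified parser.
interface FileParseResult {
  content: string
  metadata: { pageCount: number; info?: unknown }
}

class PdfParser {
  private readonly parse: PdfParseFn

  constructor(parse: PdfParseFn) {
    this.parse = parse
  }

  // One parsing path only: no secondary parser is tried on failure.
  async parseBuffer(data: Uint8Array): Promise<FileParseResult> {
    if (data.length === 0) {
      throw new Error('Empty PDF buffer')
    }
    try {
      const result = await this.parse(data)
      return {
        content: result.text.trim(),
        metadata: { pageCount: result.numpages, info: result.info },
      }
    } catch (error) {
      // Wrap the underlying error so callers see a consistent message.
      const message = error instanceof Error ? error.message : String(error)
      throw new Error(`Failed to parse PDF: ${message}`)
    }
  }
}
```

In real use, the injected function would presumably be a thin wrapper around pdf-parse's default export, which takes a Buffer. Injecting it also keeps test mocks small, since a stub function can stand in for the library.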
The changes integrate well with the existing file processing system, maintaining the same API surface while reducing complexity. The file parser loader (apps/sim/lib/file-parsers/index.ts) continues to support multiple formats (PDF, CSV, DOCX, TXT, MD, PPTX, HTML) but now uses a simpler, more maintainable architecture for PDF processing.
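As a rough sketch of how a loader that supports several formats can keep exactly one parser per format, consider the registry below. The names `ParserRegistry`, `getExtension`, and `isSupported` are hypothetical and do not come from apps/sim/lib/file-parsers/index.ts. Only the list of formats is taken from the summary above.

```typescript
// Formats listed in the review summary; everything else here is illustrative.
const SUPPORTED_EXTENSIONS = ['pdf', 'csv', 'docx', 'txt', 'md', 'pptx', 'html'] as const
type SupportedExtension = (typeof SUPPORTED_EXTENSIONS)[number]

// Hypothetical parser signature: raw bytes in, extracted text out.
type Parser = (data: Uint8Array) => Promise<string>

// Returns the lowercased extension without the dot, or '' if there is none.
function getExtension(filename: string): string {
  const dot = filename.lastIndexOf('.')
  return dot === -1 ? '' : filename.slice(dot + 1).toLowerCase()
}

function isSupported(ext: string): ext is SupportedExtension {
  return (SUPPORTED_EXTENSIONS as readonly string[]).indexOf(ext) !== -1
}

class ParserRegistry {
  private readonly parsers = new Map<SupportedExtension, Parser>()

  // Registering the same extension twice replaces the earlier parser,
  // so each format has exactly one implementation at any time.
  register(ext: SupportedExtension, parser: Parser): void {
    this.parsers.set(ext, parser)
  }

  async parse(filename: string, data: Uint8Array): Promise<string> {
    const ext = getExtension(filename)
    if (!isSupported(ext)) {
      throw new Error(`Unsupported file type: ${ext || '(none)'}`)
    }
    const parser = this.parsers.get(ext)
    if (!parser) {
      throw new Error(`No parser registered for .${ext}`)
    }
    return parser(data)
  }
}
```

Keying the map by extension means dropping a redundant parser is a one-line change in the registration code, and callers of `parse` are unaffected.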
Confidence score: 3/5
- This PR addresses a real build issue but removes fallback mechanisms that may have provided better error recovery for problematic PDF files
- Score reflects the trade-off between simplified maintenance and potentially reduced robustness in PDF parsing edge cases
- Pay close attention to apps/sim/lib/file-parsers/pdf-parser.ts and apps/sim/app/api/files/parse/route.ts for potential PDF parsing failures with complex documents
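With no fallback parser, one option is for the caller to turn a parsing failure into an explicit result instead of an unhandled exception. The sketch below shows that idea under stated assumptions: `parsePdfOrReport`, `ParseOutcome`, and `PdfTextExtractor` are hypothetical names, and this is not how the route in the PR is necessarily written.

```typescript
// Hypothetical outcome type: either extracted text or a readable error.
type ParseOutcome =
  | { success: true; content: string }
  | { success: false; error: string }

// Assumption: some function that extracts text from PDF bytes.
type PdfTextExtractor = (data: Uint8Array) => Promise<string>

async function parsePdfOrReport(
  extract: PdfTextExtractor,
  data: Uint8Array,
  filePath: string
): Promise<ParseOutcome> {
  try {
    const content = await extract(data)
    // Some PDFs, such as image-only scans, parse without yielding any text.
    if (content.trim().length === 0) {
      return { success: false, error: `No extractable text in ${filePath}` }
    }
    return { success: true, content }
  } catch (error) {
    // Single parsing path: report the failure instead of trying another parser.
    const reason = error instanceof Error ? error.message : String(error)
    return { success: false, error: `Could not parse ${filePath}: ${reason}` }
  }
}
```

Returning a tagged result makes the failure case visible in the type, which may make problematic documents easier to spot in logs than a silent switch to another parser.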
7 files reviewed, no comments
…tml deps (#1212) * fix(build): consolidate pdf parsing dependencies, remove extraneous html deps * add types
Summary
Consolidate PDF parsing dependencies and remove extraneous HTML deps. We had multiple PDF parsing libraries but only really need one. We also explicitly declared a downstream dependency that we didn't need to declare, which caused conflicts and led to the failure of the GitHub Action that publishes the image.
Type of Change
Testing
Tested manually.
Checklist