feat(parsers): added pptx, md, & html parsers #1202
Greptile Summary
This PR adds support for three new file formats (PPTX, Markdown, and HTML) to the knowledge base system while upgrading existing parsers to "state-of-the-art" implementations. The changes include:
New Parser Implementations:
- PptxParser: Uses the officeparser library with a fallback text extraction mechanism for PowerPoint files
- HtmlParser: Leverages cheerio for comprehensive HTML parsing, extracting structured content while preserving document hierarchy and generating rich metadata (headings, links, images, tables)
- MdParser: Simple Markdown file parser that follows the established FileParser interface
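To make the parser pattern concrete, here is a minimal sketch of a FileParser-style interface, a Markdown parser built on it, and a fallback wrapper in the spirit of the PptxParser description. The PR does not show the real interface, so `FileParseResult`, `parse`, `MdParserSketch`, and `withFallback` are hypothetical names, and the synchronous string-based signature is an assumption made to keep the example self-contained.

```typescript
// Hypothetical shapes: the PR names a FileParser interface but does not show it.
interface FileParseResult {
  content: string
  metadata: Record<string, unknown>
}

interface FileParser {
  // Assumption: the real interface likely takes a file path or buffer and is async.
  parse(text: string): FileParseResult
}

// Minimal Markdown parser: returns the trimmed text and collects ATX headings.
// The heading scan is naive and would also match "# ..." lines inside code fences.
class MdParserSketch implements FileParser {
  parse(text: string): FileParseResult {
    const headings = text
      .split('\n')
      .filter((line) => /^#{1,6}\s/.test(line))
      .map((line) => line.replace(/^#{1,6}\s+/, '').trim())
    return { content: text.trim(), metadata: { type: 'md', headings } }
  }
}

// Fallback wrapper: use the primary parser, and switch to the fallback
// when the primary throws or yields no text.
function withFallback(primary: FileParser, fallback: FileParser): FileParser {
  return {
    parse(text: string): FileParseResult {
      try {
        const result = primary.parse(text)
        if (result.content.length > 0) return result
      } catch {
        // Ignore the primary failure and fall through to the fallback parser.
      }
      return fallback.parse(text)
    },
  }
}
```

A wrapper like this keeps the fallback decision in one place, so each parser only has to report failure by throwing or by returning empty content.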
Parser Upgrades:
- CSV parser migrated from csv-parser to PapaParse with synchronous processing
- PDF parser switched from pdf-parse to pdf-lib for better metadata extraction, maintaining RawPdfParser for text extraction
- DOC parser replaced word-extractor with officeparser for unified Office document handling
UI Integration:
The knowledge base upload modals (create-modal.tsx and upload-modal.tsx) were updated to accept the new MIME types and display updated file format lists to users.
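As a rough illustration of the upload check, the sketch below maps file extensions to the MIME types a modal might accept. The map name, the `isAcceptedUpload` helper, and the exact set of formats are assumptions; the PR only says that the new MIME types were added, not how the modals validate them. The empty-type branch is there because browsers sometimes report no MIME type for files such as `.md`.

```typescript
// Hypothetical allow-list: extension -> standard MIME type.
// Only formats mentioned in this PR are listed; the real modals may accept more.
const ACCEPTED_MIME_BY_EXTENSION = new Map<string, string>([
  ['pdf', 'application/pdf'],
  ['csv', 'text/csv'],
  ['doc', 'application/msword'],
  ['pptx', 'application/vnd.openxmlformats-officedocument.presentationml.presentation'],
  ['md', 'text/markdown'],
  ['html', 'text/html'],
])

// Returns true when the extension is known and the reported MIME type
// either matches it or is empty (no type reported by the browser).
function isAcceptedUpload(fileName: string, mimeType: string): boolean {
  const dot = fileName.lastIndexOf('.')
  if (dot === -1) return false
  const ext = fileName.slice(dot + 1).toLowerCase()
  const expected = ACCEPTED_MIME_BY_EXTENSION.get(ext)
  if (expected === undefined) return false
  return mimeType === '' || mimeType === expected
}

// Value for a file input's accept attribute, built from the same allow-list.
const acceptAttribute = Array.from(ACCEPTED_MIME_BY_EXTENSION.values()).join(',')
```

Deriving both the `accept` string and the validation from one map avoids the two upload modals drifting apart when a format is added.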
Dependency Management:
Package.json files were updated to replace older parsing libraries (csv-parser, word-extractor, pdf-parse) with modern alternatives (papaparse, officeparser, pdf-lib, cheerio) and added necessary type definitions.
The implementation follows the existing FileParser interface pattern, ensuring consistency with the established architecture. All new parsers include proper error handling, UTF-8 sanitization for safe database storage, and comprehensive test coverage.
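The sanitization step is not shown in this summary, so the following is only a sketch of what UTF-8 sanitization for database storage could look like. `sanitizeForStorage` is a hypothetical name, and the rules (drop NUL and other control characters except tab, newline, and carriage return; replace unpaired surrogates with U+FFFD) are assumptions. They are motivated by the fact that PostgreSQL text columns, for example, reject NUL characters, and that unpaired surrogates cannot be encoded as valid UTF-8.

```typescript
// Hypothetical helper: makes extracted text safe to encode as UTF-8 and store.
function sanitizeForStorage(input: string): string {
  let out = ''
  for (let i = 0; i < input.length; i++) {
    const code = input.charCodeAt(i)

    // High surrogate: keep it only when a low surrogate follows.
    if (code >= 0xd800 && code <= 0xdbff) {
      const next = i + 1 < input.length ? input.charCodeAt(i + 1) : 0
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += input.charAt(i) + input.charAt(i + 1)
        i++
      } else {
        out += '\uFFFD'
      }
      continue
    }

    // Low surrogate with no preceding high surrogate.
    if (code >= 0xdc00 && code <= 0xdfff) {
      out += '\uFFFD'
      continue
    }

    // Drop control characters, but keep tab, line feed, and carriage return.
    const isAllowedWhitespace = code === 0x09 || code === 0x0a || code === 0x0d
    if ((code < 0x20 && !isAllowedWhitespace) || code === 0x7f) continue

    out += input.charAt(i)
  }
  return out
}
```

Applying one shared helper in every parser would also cover the Markdown parser, which the review flags as missing sanitization.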
Confidence score: 3/5
- This PR introduces significant changes to core parsing functionality with some implementation issues that could affect reliability
- Score reflects concerns about hardcoded empty text extraction in PDF parser, memory usage issues in CSV parser, and missing sanitization in Markdown parser
- Pay close attention to pdf-parser.ts, csv-parser.ts, md-parser.ts, and doc-parser.ts for potential breaking changes and performance issues
18 files reviewed, 9 comments
* feat(parsers): added pptx, md, & html parsers
* ack PR comments
* file renaming, reorganization
Summary
Added the missing pptx, md, and html parsers and enabled them as valid upload options in the knowledge base. Also upgraded some of the existing parsers to state-of-the-art implementations.
Testing
Tested manually.