feat(parsers): added pptx, md, & html parsers #1202

Merged
waleedlatif1 merged 3 commits into staging from feat/kb-md on Aug 30, 2025

@waleedlatif1
Collaborator

Summary

Added PPTX, Markdown, and HTML parsers, which were previously missing, and registered the new formats as valid upload options in the knowledge base (KB). Also upgraded several of the existing parsers to state-of-the-art libraries.

Type of Change

  • New feature

Testing

Tested manually.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel bot commented Aug 30, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Preview | Comments | Updated (UTC)
sim | Ready | Preview | Comment | Aug 30, 2025 9:14am

1 Skipped Deployment

Project | Deployment | Preview | Comments | Updated (UTC)
docs | Skipped | | | Aug 30, 2025 9:14am

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Summary

This PR adds support for three new file formats (PPTX, Markdown, and HTML) to the knowledge base system while upgrading existing parsers to "state-of-the-art" implementations. The changes include:

New Parser Implementations:

  • PptxParser: Uses the officeparser library with a fallback text extraction mechanism for PowerPoint files
  • HtmlParser: Leverages cheerio for comprehensive HTML parsing, extracting structured content while preserving document hierarchy and generating rich metadata (headings, links, images, tables)
  • MdParser: Simple Markdown file parser that follows the established FileParser interface
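
The "fallback text extraction" idea mentioned for PptxParser can be sketched roughly as follows. This is not the PR's code: `Extractor`, `extractWithFallback`, and `printableRuns` are hypothetical names, the primary extractor is injected rather than calling officeparser, and the printable-ASCII fallback is only a placeholder for whatever the real fallback does.

```typescript
// Hypothetical types: the PR's actual parser signatures are not shown in this thread.
type Extractor = (data: Uint8Array) => string

interface ExtractionResult {
  content: string
  usedFallback: boolean
}

// Try the primary extractor first; use the fallback if it throws or yields no text.
function extractWithFallback(
  data: Uint8Array,
  primary: Extractor,
  fallback: Extractor
): ExtractionResult {
  try {
    const content = primary(data).trim()
    if (content.length > 0) {
      return { content, usedFallback: false }
    }
  } catch {
    // Primary extraction failed; fall through to the fallback below.
  }
  return { content: fallback(data).trim(), usedFallback: true }
}

// Placeholder fallback (an assumption, not the PR's logic):
// keep runs of at least 4 printable ASCII bytes and join them with spaces.
function printableRuns(data: Uint8Array): string {
  const runs: string[] = []
  let current = ''
  for (let i = 0; i < data.length; i++) {
    const byte = data[i]
    if (byte >= 0x20 && byte <= 0x7e) {
      current += String.fromCharCode(byte)
    } else {
      if (current.length >= 4) runs.push(current)
      current = ''
    }
  }
  if (current.length >= 4) runs.push(current)
  return runs.join(' ')
}
```

Returning `usedFallback` alongside the content lets a caller log or flag documents whose text came from the lower-quality path.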

Parser Upgrades:

  • CSV parser migrated from csv-parser to PapaParse with synchronous processing
  • PDF parser switched from pdf-parse to pdf-lib for better metadata extraction, maintaining RawPdfParser for text extraction
  • DOC parser replaced word-extractor with officeparser for unified Office document handling

UI Integration:
The knowledge base upload modals (create-modal.tsx and upload-modal.tsx) were updated to accept the new MIME types and display updated file format lists to users.
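
As a rough illustration of what accepting the new MIME types might involve, here is a minimal sketch. The names `ACCEPTED_MIME_TYPES` and `isAcceptedUpload` are hypothetical and do not come from create-modal.tsx or upload-modal.tsx; the MIME strings themselves are the standard ones for these formats. The empty-type branch assumes that some browsers report no type for .md files.

```typescript
// Hypothetical lookup table: only the three formats added in this PR are listed.
const ACCEPTED_MIME_TYPES = new Map<string, string>([
  ['pptx', 'application/vnd.openxmlformats-officedocument.presentationml.presentation'],
  ['md', 'text/markdown'],
  ['html', 'text/html'],
])

// Accept a file when its extension is known and the reported MIME type
// either matches the expected one or is empty.
function isAcceptedUpload(fileName: string, mimeType: string): boolean {
  const parts = fileName.split('.')
  // A name without a dot has no extension to check.
  if (parts.length < 2) return false
  const ext = parts[parts.length - 1].toLowerCase()
  const expected = ACCEPTED_MIME_TYPES.get(ext)
  if (expected === undefined) return false
  return mimeType === '' || mimeType === expected
}
```

A `Map` is used instead of a plain object so that extensions such as `constructor` cannot accidentally match inherited object properties.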

Dependency Management:
Package.json files were updated to replace older parsing libraries (csv-parser, word-extractor, pdf-parse) with modern alternatives (papaparse, officeparser, pdf-lib, cheerio) and added necessary type definitions.

The implementation follows the existing FileParser interface pattern, ensuring consistency with the established architecture. All new parsers include proper error handling, UTF-8 sanitization for safe database storage, and comprehensive test coverage.
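
To make the interface, error handling, and sanitization points concrete, here is a minimal sketch of a Markdown parser that sanitizes its output. The `FileParser` and `FileParseResult` shapes and the `sanitizeForStorage` helper are assumptions for illustration only; the project's real interface is not shown in this thread and may well be asynchronous or path-based.

```typescript
// Hypothetical shapes, not the project's actual FileParser interface.
interface FileParseResult {
  content: string
  metadata: Record<string, unknown>
}

interface FileParser {
  parseBuffer(buffer: Uint8Array): FileParseResult
}

// Remove characters that commonly cause trouble when storing text:
// NUL bytes (rejected by some databases, e.g. PostgreSQL text columns),
// other control characters except tab, newline, and carriage return,
// and U+FFFD replacement characters produced by invalid UTF-8 input.
function sanitizeForStorage(text: string): string {
  return text
    .replace(/\u0000/g, '')
    .replace(/[\u0001-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, '')
    .replace(/\uFFFD/g, '')
}

class MdParser implements FileParser {
  parseBuffer(buffer: Uint8Array): FileParseResult {
    try {
      // Non-fatal decoding: invalid byte sequences become U+FFFD, which is stripped next.
      const raw = new TextDecoder('utf-8').decode(buffer)
      const content = sanitizeForStorage(raw)
      return { content, metadata: { characterCount: content.length } }
    } catch (error) {
      // Wrap any failure in an error that names the format.
      const reason = error instanceof Error ? error.message : String(error)
      throw new Error(`Failed to parse Markdown: ${reason}`)
    }
  }
}
```

Keeping sanitization in one shared helper would make it harder for an individual parser to skip the step.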

Confidence score: 3/5

  • This PR introduces significant changes to core parsing functionality with some implementation issues that could affect reliability
  • Score reflects concerns about hardcoded empty text extraction in PDF parser, memory usage issues in CSV parser, and missing sanitization in Markdown parser
  • Pay close attention to pdf-parser.ts, csv-parser.ts, md-parser.ts, and doc-parser.ts for potential breaking changes and performance issues

18 files reviewed, 9 comments

@vercel vercel bot temporarily deployed to Preview – docs August 30, 2025 09:06 Inactive
@waleedlatif1 waleedlatif1 merged commit a969d09 into staging Aug 30, 2025
4 of 5 checks passed
@waleedlatif1 waleedlatif1 deleted the feat/kb-md branch August 30, 2025 09:11
Sg312 pushed a commit that referenced this pull request Aug 30, 2025
* feat(parsers): added pptx, md, & html parsers

* ack PR comments

* file renaming, reorganization
waleedlatif1 added a commit that referenced this pull request Sep 1, 2025
* feat(parsers): added pptx, md, & html parsers

* ack PR comments

* file renaming, reorganization
arenadeveloper02 pushed a commit to arenadeveloper02/p2-sim that referenced this pull request Sep 19, 2025
* feat(parsers): added pptx, md, & html parsers

* ack PR comments

* file renaming, reorganization