feat(parsers): added pptx, md, & html parsers by waleedlatif1 · Pull Request #1202 · simstudioai/sim

waleedlatif1 · 2025-08-30T08:44:09Z

Summary

Added PPTX, Markdown, and HTML parsers, which were previously missing, and registered them as valid upload options in the knowledge base (KB). Also upgraded some of the existing parsers to state-of-the-art (SOTA) implementations.

Type of Change

New feature

Testing

Tested manually.

Checklist

Code follows project style guidelines
Self-reviewed my changes
Tests added/updated and passing
No new warnings introduced
I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

vercel · 2025-08-30T08:44:15Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
sim	Ready	Preview	Comment	Aug 30, 2025 9:14am

1 Skipped Deployment

Project	Deployment	Preview	Comments	Updated (UTC)
docs	Skipped			Aug 30, 2025 9:14am

greptile-apps

Greptile Summary

This PR adds support for three new file formats (PPTX, Markdown, and HTML) to the knowledge base system while upgrading existing parsers to "state-of-the-art" implementations. The changes include:

New Parser Implementations:

PptxParser: Uses the officeparser library with a fallback text extraction mechanism for PowerPoint files
HtmlParser: Leverages cheerio for comprehensive HTML parsing, extracting structured content while preserving document hierarchy and generating rich metadata (headings, links, images, tables)
MdParser: Simple Markdown file parser that follows the established FileParser interface
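To make the "established FileParser interface" idea concrete, here is a minimal sketch of how a simple Markdown parser could plug into a shared parser contract. The names `ParseResult`, `TextParser`, and `MarkdownParserSketch` are hypothetical, and the sketch takes an already decoded string to stay dependency-free; the actual interface and `MdParser` in this PR may differ, and the heading extraction is only an illustration of optional metadata.

```typescript
// Hypothetical shapes for illustration; the repository's FileParser
// interface and result type may look different.
interface ParseResult {
  content: string
  metadata: { characterCount: number; headings: string[] }
}

interface TextParser {
  parseText(text: string): ParseResult
}

class MarkdownParserSketch implements TextParser {
  parseText(text: string): ParseResult {
    // Normalize line endings so later processing sees consistent input.
    const content = text.replace(/\r\n?/g, '\n')
    const headings: string[] = []
    let inFence = false

    for (const line of content.split('\n')) {
      // Toggle on fenced code so "# comment" lines are not counted as headings.
      if (/^\s*(`{3}|~{3})/.test(line)) {
        inFence = !inFence
        continue
      }
      if (inFence) continue
      // ATX headings: one to six '#' characters followed by whitespace and text.
      if (/^#{1,6}\s+\S/.test(line)) {
        headings.push(line.replace(/^#{1,6}\s+/, '').trim())
      }
    }

    return { content, metadata: { characterCount: content.length, headings } }
  }
}

// Usage: parse a small document with Windows line endings.
const mdResult = new MarkdownParserSketch().parseText(
  '# Title\r\n\r\nBody\r\n## Section\r\n'
)
```

Keeping every parser behind one small contract like this is what lets the upload flow treat PPTX, HTML, and Markdown files uniformly, whatever library does the extraction.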

Parser Upgrades:

CSV parser migrated from csv-parser to PapaParse with synchronous processing
PDF parser switched from pdf-parse to pdf-lib for better metadata extraction, maintaining RawPdfParser for text extraction
DOC parser replaced word-extractor with officeparser for unified Office document handling

UI Integration:
The knowledge base upload modals (create-modal.tsx and upload-modal.tsx) were updated to accept the new MIME types and display updated file format lists to users.
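As a rough illustration of what accepting the new MIME types can involve, the sketch below checks an upload against an extension-to-MIME allow-list. `ACCEPTED_UPLOADS` and `isAcceptedUpload` are hypothetical names and the list is deliberately limited to the three new formats; the real modals may structure this differently and accept more types.

```typescript
// Hypothetical allow-list covering only the formats added in this PR.
const ACCEPTED_UPLOADS: Record<string, string[]> = {
  pptx: ['application/vnd.openxmlformats-officedocument.presentationml.presentation'],
  md: ['text/markdown', 'text/x-markdown'],
  html: ['text/html'],
  htm: ['text/html'],
}

function isAcceptedUpload(fileName: string, mimeType: string): boolean {
  const dot = fileName.lastIndexOf('.')
  if (dot < 0) return false

  const ext = fileName.slice(dot + 1).toLowerCase()
  // Guard against inherited keys such as "constructor".
  if (!Object.prototype.hasOwnProperty.call(ACCEPTED_UPLOADS, ext)) return false

  // Browsers may report an empty type for some files (assumption: trust the
  // extension in that case rather than rejecting the upload).
  if (mimeType === '') return true

  return ACCEPTED_UPLOADS[ext].indexOf(mimeType.toLowerCase()) !== -1
}

// Usage: a PowerPoint file with an upper-case extension.
const pptxAccepted = isAcceptedUpload(
  'deck.PPTX',
  'application/vnd.openxmlformats-officedocument.presentationml.presentation'
)
```

Checking both the extension and the reported type is a design choice in this sketch: the type alone is unreliable on the client, and the extension alone says nothing about what the browser detected.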

Dependency Management:
Package.json files were updated to replace older parsing libraries (csv-parser, word-extractor, pdf-parse) with modern alternatives (papaparse, officeparser, pdf-lib, cheerio) and added necessary type definitions.

The implementation follows the existing FileParser interface pattern, ensuring consistency with the established architecture. All new parsers include proper error handling, UTF-8 sanitization for safe database storage, and comprehensive test coverage.
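The UTF-8 sanitization mentioned above is not shown on this page, so the following is only a sketch of what such a step might look like. `sanitizeForStorage` is a hypothetical name, and the exact rules (which control characters to drop, how to treat invalid sequences) are assumptions; some databases, PostgreSQL for example, reject NUL bytes in text columns, which is the usual motivation.

```typescript
// Hypothetical helper; the repository's own sanitization may apply different rules.
function sanitizeForStorage(text: string): string {
  return text
    // Keep valid surrogate pairs, replace lone surrogates with U+FFFD.
    .replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g, (match) =>
      match.length === 2 ? match : '\uFFFD'
    )
    // Drop NUL and other control characters, keeping tab, LF, and CR.
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, '')
}

// Usage: extracted text containing a NUL byte and a lone surrogate.
const cleaned = sanitizeForStorage('slide\u0000 one\uD800')
```

Applying a step like this in every parser, including the Markdown one flagged in the review notes, keeps extracted text safe to store regardless of which library produced it.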

Confidence score: 3/5

This PR introduces significant changes to core parsing functionality with some implementation issues that could affect reliability
Score reflects concerns about hardcoded empty text extraction in PDF parser, memory usage issues in CSV parser, and missing sanitization in Markdown parser
Pay close attention to pdf-parser.ts, csv-parser.ts, md-parser.ts, and doc-parser.ts for potential breaking changes and performance issues

18 files reviewed, 9 comments


apps/sim/lib/file-parsers/pdf-parser.ts

apps/sim/app/api/files/parse/route.ts

apps/sim/lib/file-parsers/doc-parser.ts

apps/sim/lib/file-parsers/csv-parser.ts

apps/sim/lib/file-parsers/md-parser.ts

apps/sim/lib/file-parsers/html-parser.ts

* feat(parsers): added pptx, md, & html parsers * ack PR comments * file renaming, reorganization

feat(parsers): added pptx, md, & html parsers

03d9011

greptile-apps bot reviewed Aug 30, 2025

vercel bot deployed to Preview – sim August 30, 2025 08:46 View deployment

vercel bot deployed to Preview – docs August 30, 2025 08:47 View deployment

ack PR comments

2de734d

vercel bot temporarily deployed to Preview – docs August 30, 2025 08:58 Inactive

vercel bot deployed to Preview – sim August 30, 2025 09:03 View deployment

file renaming, reorganization

b2011a6

vercel bot temporarily deployed to Preview – docs August 30, 2025 09:06 Inactive

waleedlatif1 merged commit a969d09 into staging Aug 30, 2025
4 of 5 checks passed

waleedlatif1 deleted the feat/kb-md branch August 30, 2025 09:11

vercel bot deployed to Preview – sim August 30, 2025 09:14 View deployment

Sg312 pushed a commit that referenced this pull request Aug 30, 2025

feat(parsers): added pptx, md, & html parsers (#1202)

2404f8a

* feat(parsers): added pptx, md, & html parsers * ack PR comments * file renaming, reorganization

waleedlatif1 mentioned this pull request Aug 31, 2025

v0.3.43: added additional parsers, mysql block improvements, billing fixes, permission fixes #1207

Merged

waleedlatif1 added a commit that referenced this pull request Sep 1, 2025

feat(parsers): added pptx, md, & html parsers (#1202)

60a8649

* feat(parsers): added pptx, md, & html parsers * ack PR comments * file renaming, reorganization

arenadeveloper02 pushed a commit to arenadeveloper02/p2-sim that referenced this pull request Sep 19, 2025

feat(parsers): added pptx, md, & html parsers (simstudioai#1202)

7ec8295

* feat(parsers): added pptx, md, & html parsers * ack PR comments * file renaming, reorganization

waleedlatif1 merged 3 commits into staging from feat/kb-md
