Feature/pdf extraction#58

Open
evansun06 wants to merge 22 commits into main from feature/pdf-extraction

evansun06 (Contributor) commented Nov 26, 2025

📝 Description

Implemented the PDF text extraction script for the pre-chunking stage, along with its associated tests.

Primary features:

  • newline normalization
  • hyphen repair
  • page-to-page repair (stitching text that continues across a page break)
  • quotation and character normalization
  • sha256 unique file identifier
  • EasyOCR fallback (configurable with --ocr)
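
For reviewers who want a feel for the in-page cleanup steps listed above, here is a minimal sketch of newline normalization, quotation/character normalization, and hyphen repair. The function name `normalize_text`, the character map, and the regexes are illustrative assumptions, not the code in this PR.

```python
import re

# Illustrative subset of typographic characters mapped to plain equivalents.
# ASSUMPTION: the real module may cover a different set.
_CHAR_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u00a0": " ",                  # non-breaking space
    "\ufb01": "fi", "\ufb02": "fl", # common ligatures
})


def normalize_text(raw: str) -> str:
    """Hypothetical helper: clean up text extracted from one PDF page."""
    # Newline normalization: unify CRLF and CR to LF.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Quotation and character normalization.
    text = text.translate(_CHAR_MAP)
    # Hyphen repair: rejoin a word split across a line break ("extrac-\ntion").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single newlines inside a paragraph into spaces; keep blank lines.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of 3+ newlines to one blank line, and squeeze spaces.
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

One known limitation of this naive hyphen repair: a genuine compound broken at its hyphen (for example "well-" / "known") is also joined, so a dictionary check may be worth considering.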

Added dependencies:

  • pymupdf
  • easyocr

Try Local Extraction

### local test extraction
python -m app.pdfx.pdfx extract app/path/to/pdf --out app/path/to/output_directory --page-range 3-4 --ocr True
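
As a rough illustration of how the flags in the command above could be parsed, the sketch below builds an `argparse` parser with an `extract` subcommand. The helper names (`parse_page_range`, `parse_bool`, `build_parser`) are hypothetical, and treating `--page-range` as 1-based and inclusive is an assumption; the actual CLI in `pdfx.py` may differ.

```python
import argparse
from pathlib import Path


def parse_page_range(value: str) -> tuple[int, int]:
    """Parse '3-4' into (3, 4); a single '5' becomes (5, 5).

    ASSUMPTION: pages are 1-based and the range is inclusive.
    """
    start_s, _, end_s = value.partition("-")
    start = int(start_s)
    end = int(end_s) if end_s else start
    if start < 1 or end < start:
        raise argparse.ArgumentTypeError(f"invalid page range: {value!r}")
    return start, end


def parse_bool(value: str) -> bool:
    """Accept the 'True'/'False' style value shown in the example command."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pdfx")
    sub = parser.add_subparsers(dest="command", required=True)
    extract = sub.add_parser("extract", help="extract text from a PDF")
    extract.add_argument("pdf", type=Path, help="path to the input PDF")
    extract.add_argument("--out", type=Path, default=None, help="output directory")
    extract.add_argument("--page-range", type=parse_page_range, default=None)
    extract.add_argument("--ocr", type=parse_bool, default=False)
    return parser
```

With this sketch, `build_parser().parse_args(["extract", "doc.pdf", "--page-range", "3-4", "--ocr", "True"])` yields a namespace whose `page_range` is a tuple and whose `ocr` is a real boolean rather than a string.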

Example Payload Structure

{
  "doc_uuid": "b6ccb18c-8581-54f2-ba30-a7d509a6483b",
  "page_count": ...,
  "processed_page_range": [...],
  "processed_pages": [...],
  "total_word_count": ...,
  "created_at": "2025-11-26T06:45:05.341631+00:00",
  "tool_version": "0.1.0",
  "skipped": true/false,
  "skipped_pages": [...],
  "text": "...",
  "pages": [
    {
      "page_num": ...,
      "word_count": ...,
      "used_ocr": ...,
      "text": ""
    }
  ]
}
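
To make the payload shape above concrete, here is a hedged sketch of how such a dictionary might be assembled. The PR does not show how `doc_uuid` is derived from the SHA-256 digest; feeding the digest into `uuid.uuid5` is an assumption chosen only because the example id has a version digit of 5. The `skipped` and `skipped_pages` fields are left out because the description does not say how they are populated, and `build_payload` is a hypothetical name.

```python
import hashlib
import uuid
from datetime import datetime, timezone


def doc_uuid_for(pdf_bytes: bytes) -> str:
    """Deterministic id for a file's contents.

    ASSUMPTION: SHA-256 hex digest fed into a name-based UUID (uuid5).
    The real derivation in this PR is not shown.
    """
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    return str(uuid.uuid5(uuid.NAMESPACE_URL, digest))


def build_payload(pdf_bytes, pages, page_count, tool_version="0.1.0"):
    """Assemble a payload from (page_num, text, used_ocr) tuples."""
    page_entries = [
        {
            "page_num": num,
            "word_count": len(text.split()),
            "used_ocr": used_ocr,
            "text": text,
        }
        for num, text, used_ocr in pages
    ]
    nums = [entry["page_num"] for entry in page_entries]
    return {
        "doc_uuid": doc_uuid_for(pdf_bytes),
        "page_count": page_count,
        # ASSUMPTION: the range is stored as [first, last] processed page.
        "processed_page_range": [min(nums), max(nums)] if nums else [],
        "processed_pages": nums,
        "total_word_count": sum(e["word_count"] for e in page_entries),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "tool_version": tool_version,
        # ASSUMPTION: document text is the page texts joined by blank lines.
        "text": "\n\n".join(e["text"] for e in page_entries),
        "pages": page_entries,
    }
```

Because the id depends only on the file bytes, re-extracting the same PDF produces the same `doc_uuid`, which is the property a content-based identifier is meant to provide.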

🎯 Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📚 Documentation update
  • 🔨 Refactoring (no functional changes)
  • 🧪 Tests (adding or updating tests)
  • 🔧 Chore (dependency updates, config changes, etc.)

🧪 Testing

  • I have tested this change locally
  • I have added/updated tests for this change
  • All existing tests pass

📋 Checklist

  • My code follows the code style of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings

📸 Screenshots (if applicable)

Add screenshots or GIFs to help explain your changes.

🔗 Related Issues

Closes #(issue number)

TonyLiu0226 previously approved these changes Nov 29, 2025
Contributor

Copilot AI left a comment

Pull request overview

This PR implements PDF text extraction functionality using PyMuPDF, featuring text normalization, hyphen repair, cross-page text stitching, OCR fallback, and SHA-256 file identification. The implementation adds a new pdfx module with extraction capabilities and associated tests.

Changes:

  • Adds PDF extraction module with text processing, normalization, and OCR support
  • Implements CLI interface for PDF extraction with page range support
  • Adds test suite for basic extraction, file output, and ranged extraction
  • Updates .gitignore to exclude PDF input/output directories
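
The cross-page text stitching mentioned in the overview can be sketched roughly as follows. This is an illustrative heuristic, not the logic in `pdfx.py`: `stitch_pages` is a hypothetical name, and the rules (join on a trailing hyphen, continue a sentence when the next page starts in lowercase) are assumptions.

```python
_SENTENCE_END = (".", "!", "?", ":", ";")


def stitch_pages(pages: list[str]) -> str:
    """Hypothetical helper: merge page texts, repairing breaks at page edges."""
    merged = ""
    for page in pages:
        nxt = page.strip()
        if not nxt:
            continue  # nothing to stitch from an empty page
        if not merged:
            merged = nxt
        elif merged.endswith("-") and nxt[0].islower():
            # Word hyphenated across the page break: drop the hyphen and join.
            merged = merged[:-1] + nxt
        elif not merged.endswith(_SENTENCE_END) and nxt[0].islower():
            # Sentence continues on the next page: join with a single space.
            merged = merged + " " + nxt
        else:
            # Otherwise treat the page boundary as a paragraph break.
            merged = merged + "\n\n" + nxt
    return merged
```

A heuristic like this will misjudge pages that end in headers, footers, or captions, so removing those before stitching is likely to matter in practice.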

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 14 comments.

| File | Description |
| --- | --- |
| backend/app/pdfx/pdfx.py | Core PDF extraction module with text normalization, hyphen repair, and OCR fallback |
| backend/tests/test_pdfx.py | Test suite covering basic extraction, output integrity, and page range functionality |
| backend/requirements-dev.txt | Adds development dependencies (pytest, black, ruff, mypy) |
| backend/.gitignore | Excludes PDF input/output directories from version control |

@@ -0,0 +1,246 @@
"""

Copilot AI Jan 27, 2026

The pdfx module is missing an `__init__.py` file. All other modules in the app directory (api, core, models, textGeneration, threadIngestion) have `__init__.py` files, following the established codebase convention for Python packages.
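
A quick way to check for this kind of omission is a small script that lists subdirectories containing Python files but no `__init__.py`. The sketch below is only an illustration; `packages_missing_init` is a hypothetical helper, and the `backend/app` location is inferred from the file paths in this PR.

```python
from pathlib import Path


def packages_missing_init(app_dir: Path) -> list[Path]:
    """Return direct subdirectories that hold .py files but no __init__.py."""
    missing = []
    for sub in sorted(p for p in app_dir.iterdir() if p.is_dir()):
        # Skip hidden directories and caches such as __pycache__.
        if sub.name.startswith((".", "__")):
            continue
        has_py = any(sub.glob("*.py"))
        if has_py and not (sub / "__init__.py").exists():
            missing.append(sub)
    return missing
```

Calling `packages_missing_init(Path("backend/app"))` from the repository root would be expected to report `pdfx` until the file is added. Note that Python 3.3+ can still import such a directory as a namespace package, so the missing file is mainly a consistency and tooling concern rather than a guaranteed import failure.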
