… into feature/pdf-extraction
… ranged extraction
Pull request overview
This PR implements PDF text extraction functionality using PyMuPDF, featuring text normalization, hyphen repair, cross-page text stitching, OCR fallback, and SHA-256 file identification. The implementation adds a new pdfx module with extraction capabilities and associated tests.
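As an illustration of the SHA-256 file identification mentioned above, here is a minimal sketch that hashes a file in chunks and maps the digest to a deterministic identifier. The function names, the chunk size, and the use of `uuid.uuid5` with `uuid.NAMESPACE_URL` are assumptions made for this example; the PR does not state how the module derives its identifier.

```python
import hashlib
import uuid
from pathlib import Path

# Hypothetical namespace: the PR does not say which one (if any) is used.
_NAMESPACE = uuid.NAMESPACE_URL


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def doc_id_from_digest(hex_digest: str) -> str:
    """Map a digest to a deterministic version-5 UUID string (assumed scheme)."""
    return str(uuid.uuid5(_NAMESPACE, hex_digest))
```

Reading in chunks keeps memory use flat for large PDFs, and the same file contents always yield the same identifier.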
Changes:
- Adds PDF extraction module with text processing, normalization, and OCR support
- Implements CLI interface for PDF extraction with page range support
- Adds test suite for basic extraction, file output, and ranged extraction
- Updates .gitignore to exclude PDF input/output directories
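To make the page range support in the list above concrete, the following sketch parses a specification such as `3-4` into page numbers. It assumes 1-based, inclusive ranges and a hypothetical helper name `parse_page_range`; the actual CLI may interpret its argument differently.

```python
def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Parse '3-4' or '7' into a list of 1-based page numbers (assumed semantics)."""
    start_text, sep, end_text = spec.strip().partition("-")
    start = int(start_text)
    # A bare number such as '7' selects a single page.
    end = int(end_text) if sep else start
    if not 1 <= start <= end <= page_count:
        raise ValueError(f"page range {spec!r} is outside 1-{page_count}")
    return list(range(start, end + 1))
```

Validating against `page_count` up front lets the caller reject a bad range before any page is opened.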
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| backend/app/pdfx/pdfx.py | Core PDF extraction module with text normalization, hyphen repair, and OCR fallback |
| backend/tests/test_pdfx.py | Test suite covering basic extraction, output integrity, and page range functionality |
| backend/requirements-dev.txt | Adds development dependencies (pytest, black, ruff, mypy) |
| backend/.gitignore | Excludes PDF input/output directories from version control |
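The text normalization, hyphen repair, and cross-page stitching described for `pdfx.py` could look roughly like the sketch below. It is an illustration under stated assumptions, not the module's code: the function names are hypothetical, and the heuristic (join a trailing hyphen to a following lowercase letter) is one common approach that will also merge genuine compounds such as "well-known" when they break across lines.

```python
import re

# A word character, a hyphen at end of line, then a lowercase continuation.
_HYPHEN_BREAK = re.compile(r"(\w)-\n(?=[a-z])")


def normalize_whitespace(text: str) -> str:
    """Unify newlines, collapse runs of spaces/tabs, and cap blank lines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def repair_hyphens(text: str) -> str:
    """Rejoin words split by an end-of-line hyphen."""
    return _HYPHEN_BREAK.sub(r"\1", text)


def stitch_pages(pages: list[str]) -> str:
    """Join page texts, merging a word hyphenated across a page boundary."""
    stitched = ""
    for raw in pages:
        page = repair_hyphens(normalize_whitespace(raw))
        if not page:
            continue  # skip empty pages
        if not stitched:
            stitched = page
        elif stitched.endswith("-") and page[0].islower():
            # Drop the trailing hyphen and continue the word on the next page.
            stitched = stitched[:-1] + page
        else:
            stitched += "\n\n" + page
    return stitched
```

Normalizing before hyphen repair matters here: trailing spaces after a hyphen would otherwise stop the pattern from matching.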
The pdfx module is missing an `__init__.py` file. All other modules in the app directory (api, core, models, textGeneration, threadIngestion) have `__init__.py` files, following the established codebase convention for Python packages.
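A quick way to check for this kind of omission is sketched below. The helper name `packages_missing_init` is hypothetical, and the rule it applies (a direct subdirectory that contains `.py` files but no `__init__.py`) is an assumption about what counts as a package in this codebase.

```python
from pathlib import Path


def packages_missing_init(root: Path) -> list[str]:
    """List direct subdirectories of root that hold .py files but no __init__.py."""
    missing = []
    for child in sorted(root.iterdir()):
        if not child.is_dir():
            continue
        has_modules = any(child.glob("*.py"))
        if has_modules and not (child / "__init__.py").exists():
            missing.append(child.name)
    return missing
```

Run from the repository root, `packages_missing_init(Path("backend/app"))` would be expected to include `pdfx` until the file is added.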
📝 Description
Implemented the PDF text extraction script for pre-chunking, along with its associated tests.
Primary features:
- … (`--ocr`)

Added dependencies:
- pymupdf
- easyocr

Try Local Extraction

    ### local test extraction
    python -m app.pdfx.pdfx extract app/path/to/pdf --out app/path/to/output_directory --page-range 3-4 --ocr True

Example Payload Structure

    {
      "doc_uuid": "b6ccb18c-8581-54f2-ba30-a7d509a6483b",
      "page_count": ...,
      "processed_page_range": [...],
      "processed_pages": [...],
      "total_word_count": ...,
      "created_at": "2025-11-26T06:45:05.341631+00:00",
      "tool_version": "0.1.0",
      "skipped": true/false,
      "skipped_pages": [...],
      "text": "...",
      "pages": [
        {
          "page_num": ...,
          "word_count": ...,
          "used_ocr": ...,
          "text": ""
        }
      ]
    }

🎯 Type of Change
🧪 Testing
📋 Checklist
📸 Screenshots (if applicable)
Add screenshots or GIFs to help explain your changes.
🔗 Related Issues
Closes #(issue number)