Feature/pdf extraction by evansun06 · Pull Request #58 · ubclaunchpad/Piazza-AI-Plugin

evansun06 · 2025-11-26T07:42:40Z

📝 Description

Implemented the PDF text extraction script for the pre-chunking stage, along with its associated tests.

Primary features:

newline normalization
hyphen repair
cross-page repair
quotation and character normalization
SHA-256 unique file identifier
EasyOCR fallback (configurable with --ocr)
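
The text-cleanup features listed above could look roughly like the following. This is a minimal sketch, not the code in this PR: the function names (`normalize_newlines`, `repair_hyphens`, `normalize_chars`, `stitch_pages`), the character map, and the joining rules are all assumptions made for illustration.

```python
import re
import unicodedata

# Illustrative subset of typographic characters folded to ASCII (assumed mapping).
_CHAR_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})


def normalize_newlines(text: str) -> str:
    """Convert CRLF/CR to LF and collapse runs of 3+ newlines into one blank line."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return re.sub(r"\n{3,}", "\n\n", text)


def repair_hyphens(text: str) -> str:
    """Rejoin lowercase words that were hyphenated across a line break."""
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)


def normalize_chars(text: str) -> str:
    """Apply NFKC (expands ligatures, folds NBSP), then fold quotes and dashes."""
    return unicodedata.normalize("NFKC", text).translate(_CHAR_MAP)


def stitch_pages(prev_page: str, next_page: str) -> str:
    """Join two consecutive pages, repairing a sentence or word split by the break."""
    head, tail = prev_page.rstrip(), next_page.lstrip()
    if not head or not tail:
        return head + tail
    if re.search(r"[a-z]-$", head) and tail[0].islower():
        return head[:-1] + tail        # word hyphenated across the page break
    if head[-1] not in ".!?:\"'":
        return head + " " + tail       # sentence continues on the next page
    return head + "\n\n" + tail        # page ended cleanly; keep a paragraph break
```

The hyphen rule only fires on a lowercase letter, a hyphen, and a line break followed by another lowercase letter, so in-line compounds such as "well-known" are left alone. A compound that happens to break at its hyphen would still be merged, which is a known limitation of this kind of heuristic.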

Added dependencies:

pymupdf
easyocr

Try Local Extraction

### local test extraction
python -m app.pdfx.pdfx extract app/path/to/pdf --out app/path/to/output_directory --page-range 3-4 --ocr True
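
For readers wondering how a value such as `--page-range 3-4` might be interpreted, here is a small sketch. It assumes the range is 1-based and inclusive, which the PR does not state explicitly; `parse_page_range` is a hypothetical helper, not a function taken from `pdfx.py`.

```python
def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Turn a spec such as '3-4' or '5' into a list of page numbers.

    Assumption: pages are 1-based and the range is inclusive on both ends.
    """
    start_s, _, end_s = spec.partition("-")
    start = int(start_s)
    end = int(end_s) if end_s else start
    if not (1 <= start <= end <= page_count):
        raise ValueError(f"page range {spec!r} is outside 1-{page_count}")
    return list(range(start, end + 1))
```

Under these assumptions, `parse_page_range("3-4", 10)` yields pages 3 and 4, and a reversed or out-of-bounds range raises `ValueError` rather than silently returning nothing.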

Example Payload Structure

{
  "doc_uuid": "b6ccb18c-8581-54f2-ba30-a7d509a6483b",
  "page_count": ...,
  "processed_page_range": [...],
  "processed_pages": [...],
  "total_word_count": ...,
  "created_at": "2025-11-26T06:45:05.341631+00:00",
  "tool_version": "0.1.0",
  "skipped": true/false,
  "skipped_pages": [...],
  "text": "...",
  "pages": [
    {
      "page_num": ...,
      "word_count": ...,
      "used_ocr": ...,
      "text": ""
    }
  ]
}
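
The sample `doc_uuid` above has a version-5 UUID layout (its third group starts with `5`). One plausible way to get such an identifier from the SHA-256 file hash is sketched below. This is an assumption about the approach, not a description of the actual implementation: the helper names, the `sha256:` prefix, and the choice of `uuid.NAMESPACE_URL` are all hypothetical.

```python
import hashlib
import uuid
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def doc_uuid_from_digest(hex_digest: str) -> str:
    """Derive a deterministic version-5 UUID from a content hash.

    Assumption: the namespace and the 'sha256:' name prefix are arbitrary
    choices for this sketch.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"sha256:{hex_digest}"))
```

Because both steps are deterministic, the same file bytes always map to the same identifier, which lets a pipeline detect files it has already processed.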

🎯 Type of Change

🐛 Bug fix (non-breaking change which fixes an issue)
✨ New feature (non-breaking change which adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📚 Documentation update
🔨 Refactoring (no functional changes)
🧪 Tests (adding or updating tests)
🔧 Chore (dependency updates, config changes, etc.)

🧪 Testing

I have tested this change locally
I have added/updated tests for this change
All existing tests pass

📋 Checklist

My code follows the code style of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings

📸 Screenshots (if applicable)

Add screenshots or GIFs to help explain your changes.

🔗 Related Issues

Closes #(issue number)


Copilot

Pull request overview

This PR implements PDF text extraction functionality using PyMuPDF, featuring text normalization, hyphen repair, cross-page text stitching, OCR fallback, and SHA-256 file identification. The implementation adds a new pdfx module with extraction capabilities and associated tests.
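
The OCR fallback mentioned in the overview could be structured along these lines. This is a sketch under stated assumptions rather than the PR's code: the `MIN_WORDS` threshold and the function names are hypothetical, `page` is expected to be a PyMuPDF page, and `reader` an `easyocr.Reader` created by the caller (or `None` when OCR is disabled).

```python
from typing import Any, Optional

MIN_WORDS = 5  # hypothetical threshold below which a page is treated as scanned


def native_text(page: Any) -> str:
    """Read the embedded text layer of a PyMuPDF page."""
    return page.get_text("text")


def ocr_text(page: Any, reader: Any) -> str:
    """Rasterize the page to PNG bytes and run an easyocr.Reader over it."""
    png = page.get_pixmap(dpi=200).tobytes("png")
    return " ".join(reader.readtext(png, detail=0))


def extract_page(page: Any, page_num: int, reader: Optional[Any] = None,
                 min_words: int = MIN_WORDS) -> dict:
    """Prefer the text layer; fall back to OCR only when it is nearly empty."""
    text = native_text(page)
    used_ocr = False
    if reader is not None and len(text.split()) < min_words:
        text = ocr_text(page, reader)
        used_ocr = True
    return {
        "page_num": page_num,
        "word_count": len(text.split()),
        "used_ocr": used_ocr,
        "text": text,
    }
```

The returned dictionary mirrors the per-page fields of the example payload (`page_num`, `word_count`, `used_ocr`, `text`). Passing `reader=None` corresponds to running without the OCR option, so the comparatively slow OCR model is never loaded unless requested.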

Changes:

Adds PDF extraction module with text processing, normalization, and OCR support
Implements CLI interface for PDF extraction with page range support
Adds test suite for basic extraction, file output, and ranged extraction
Updates .gitignore to exclude PDF input/output directories

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 14 comments.

File	Description
backend/app/pdfx/pdfx.py	Core PDF extraction module with text normalization, hyphen repair, and OCR fallback
backend/tests/test_pdfx.py	Test suite covering basic extraction, output integrity, and page range functionality
backend/requirements-dev.txt	Adds development dependencies (pytest, black, ruff, mypy)
backend/.gitignore	Excludes PDF input/output directories from version control


backend/app/pdfx/pdfx.py

Copilot · 2026-01-27T04:54:16Z

backend/app/pdfx/pdfx.py

@@ -0,0 +1,246 @@
+"""


The pdfx module is missing an __init__.py file. All other modules in the app directory (api, core, models, textGeneration, threadIngestion) have __init__.py files, following the established codebase convention for Python packages.


evansun06 added 14 commits November 3, 2025 21:41

feat(pdfx): setup and recorded dependencies and python module

7e6c1c0

feat(pdfx): created basic cli setup including health_check and extraction signature

ade88f4

feat(pdfx): added unique file hashing

ba1f768

feat(pdfx): added test pdfs (1 simple, 1 complex)

5e742af

feat(pdfx): Added write logic, and cli output path handling

1726621

added page variability to cli options, additionally most basic text parser

b7c072b

feat(pdfx): able to extract text, need to implement structure heuristics accurately

d4dd103

feat(pdfx): added block metrics and added hyphenation fixing + list identification

3e45845

Merge branch 'main' of https://github.com/ubclaunchpad/Piazza-AI-Plugin into feature/pdf-extraction

6225504

refactor(pdfx): refactored pdf extraction script to use PyMuPDF and OCR

ed63238

feat(pdfx): included response codes + CLI integrtion to script

6b23a6c

feat(pdfx): implemented correct ranged extraction

2c8caa5

test(pdfx): added tests for json structure, return path checking, and ranged extraction

e61a875

chore(pdfx): fixed linting + styling

4069ada

evansun06 requested review from TonyLiu0226 and hamin2006 as code owners November 26, 2025 07:42

chore(pdfx): applied ruff formatting

4ec9719

TonyLiu0226 previously approved these changes Nov 29, 2025


Merge branch 'main' of https://github.com/ubclaunchpad/Piazza-AI-Plugin into feature/pdf-extraction

cb44967

evansun06 dismissed TonyLiu0226’s stale review via cb44967 January 27, 2026 04:47

Copilot AI review requested due to automatic review settings January 27, 2026 04:47

Copilot started reviewing on behalf of evansun06 January 27, 2026 04:47

Copilot AI reviewed Jan 27, 2026


evansun06 added 6 commits January 26, 2026 21:19

feat(pdfx): integrate easyocr

ee7af49

feat(ingest): added tests for ocr

f2b6e58

chore(ingest): update dependency lockfile

e0f69c6

feat(pdfx): create optional ocr flag

062d38b

chore(pdfx): apply styling and linting

88d7f5c

Merge branch 'main' of https://github.com/ubclaunchpad/Piazza-AI-Plugin into feature/pdf-extraction

ef40cd9

evansun06 wants to merge 22 commits into main from feature/pdf-extraction

2 participants
