@jamesbraza
Collaborator

I realized we were passing duplicate images during Context creation. This PR:

  1. Adds an info_hash to ParsedMedia to allow looser deduplication without affecting the stored info
    • Now: we can store a high-resolution bbox while deduplicating on a low-resolution (rounded) bbox
  2. Uses ordered set-based deduplication of media for the Context creation prompt

This PR also makes a custom PDF for testing deduplication capabilities within multimodal PaperQA.
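
As a rough illustration of the info_hash idea, here is a minimal sketch. It is not the code from this PR: the helper name `make_info_hash`, the `bbox_ndigits` parameter, and the dict layout of `info` are all assumptions made for the example. It hashes a copy of the metadata with the bounding box rounded, so the stored bbox keeps its full resolution.

```python
import hashlib
import json


def make_info_hash(info: dict, bbox_ndigits: int = 0) -> str:
    """Hash media info with a rounded bbox (hypothetical helper, not PaperQA's API)."""
    loosened = dict(info)  # shallow copy so the stored info is left untouched
    if "bbox" in loosened:
        # Building a new list means the original high-resolution bbox is not mutated.
        loosened["bbox"] = [round(float(v), bbox_ndigits) for v in loosened["bbox"]]
    payload = json.dumps(loosened, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two extractions of the same picture whose coordinates differ slightly.
first = {"type": "picture", "bbox": [10.12, 20.41, 110.31, 220.02]}
second = {"type": "picture", "bbox": [10.14, 20.43, 110.29, 219.98]}
other = {"type": "picture", "bbox": [300.0, 20.41, 400.0, 220.02]}

same_hash = make_info_hash(first) == make_info_hash(second)
different_hash = make_info_hash(first) != make_info_hash(other)
```

One caveat with rounding in general: two near-identical coordinates that straddle a rounding boundary (for example 20.49 and 20.51) still round to different values, so this kind of loosening reduces duplicates rather than guaranteeing their removal.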

@jamesbraza jamesbraza self-assigned this Oct 27, 2025
Copilot AI review requested due to automatic review settings October 27, 2025 16:51
@jamesbraza jamesbraza added the bug Something isn't working label Oct 27, 2025
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Oct 27, 2025
Contributor

Copilot AI left a comment

Pull Request Overview

This PR implements media deduplication for Context creation to avoid passing duplicate images to the LLM. It introduces an info_hash field in ParsedMedia to enable deduplication based on rounded bounding boxes while preserving high-resolution metadata, and applies ordered set-based deduplication when creating context prompts.

Key changes:

  • Added info_hash to ParsedMedia for flexible deduplication without affecting stored metadata
  • Applied deduplication to media lists during context creation in _map_fxn_summary
  • Created test PDF with duplicate media to validate deduplication behavior
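
Ordered set-based deduplication can be sketched as follows. This is an assumption-laden illustration rather than the implementation in `_map_fxn_summary`: the `Media` class is a hypothetical stand-in for ParsedMedia whose equality and hash depend only on `info_hash`, and `dedupe_ordered` is a made-up helper name.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Media:
    """Hypothetical stand-in for ParsedMedia; only info_hash takes part in eq/hash."""

    info_hash: str
    bbox: tuple = field(default=(), compare=False)  # high-resolution data, ignored for dedup


def dedupe_ordered(items):
    """Drop duplicates while keeping first-seen order (dicts preserve insertion order)."""
    return list(dict.fromkeys(items))


media = [
    Media("hash-a", (10.12, 20.41)),
    Media("hash-b", (50.0, 60.0)),
    Media("hash-a", (10.14, 20.43)),  # same info_hash as the first entry
]
unique = dedupe_ordered(media)
```

Here `unique` holds two entries in their original order, and the first occurrence of each duplicate is the one kept, so its full-resolution bbox survives. A plain `set` would also remove duplicates but would not guarantee the order in which media are passed to the prompt.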

Reviewed Changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/paperqa/types.py | Added _get_info_hash() method to support deduplication via optional info_hash key |
| src/paperqa/core.py | Implemented ordered set deduplication for media during context creation |
| packages/paper-qa-docling/src/paperqa_docling/reader.py | Added info_hash generation with rounded bounding boxes for pictures and tables |
| tests/test_paperqa.py | Added test to verify media deduplication during context creation |
| packages/paper-qa-docling/tests/test_paperqa_docling.py | Added test to verify ParsedMedia deduplication behavior |
| tests/duplicate_media_template.md | Template for generating test PDF with duplicate images |
| src/paperqa/prompts.py | Removed redundant separator between text and tables |
| tests/test_agents.py | Updated expected file counts to include new test PDF |


@dosubot dosubot bot added the enhancement New feature or request label Oct 27, 2025
@dosubot

dosubot bot commented Oct 27, 2025

Documentation Updates

1 document(s) were updated by changes in this PR


@jamesbraza jamesbraza force-pushed the deduplicating-images branch from 1d66dd7 to 5299143 Compare October 27, 2025 16:57
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Oct 27, 2025
@jamesbraza jamesbraza force-pushed the deduplicating-images branch 2 times, most recently from ef669d6 to 4e954cc Compare October 27, 2025 20:07
@jamesbraza jamesbraza force-pushed the deduplicating-images branch from 4e954cc to 7b4e1f5 Compare October 27, 2025 20:11
@jamesbraza jamesbraza merged commit 80149a1 into main Oct 27, 2025
9 checks passed
@jamesbraza jamesbraza deleted the deduplicating-images branch October 27, 2025 20:38

Labels

  • bug: Something isn't working
  • enhancement: New feature or request
  • lgtm: This PR has been approved by a maintainer
  • size:L: This PR changes 100-499 lines, ignoring generated files.


3 participants