Deduplicating media on `Context` creation #1153

jamesbraza · 2025-10-27T16:51:20Z

I realized we were passing the same images more than once during `Context` creation. This PR:

- Adds an `info_hash` to `ParsedMedia` to allow for loosened deduplication without affecting stored info
  - Now: can store a high-resolution bbox, but deduplicate on a low-resolution bbox
- Uses ordered set-based deduplication for the `Context` creation prompt

This PR also makes a custom PDF for testing deduplication capabilities within multimodal PaperQA.
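
As a rough illustration of what "ordered set-based deduplication" can look like, here is a minimal sketch that drops repeats while keeping first-seen order. The helper name `dedupe_preserving_order` and the sample IDs are hypothetical and are not taken from the PaperQA codebase; the PR's actual implementation may differ.

```python
from collections.abc import Hashable, Iterable
from typing import TypeVar

T = TypeVar("T", bound=Hashable)


def dedupe_preserving_order(items: Iterable[T]) -> list[T]:
    """Drop repeated items while keeping first-seen order.

    Hypothetical helper: dicts preserve insertion order (Python 3.7+),
    so dict.fromkeys acts as an ordered set.
    """
    return list(dict.fromkeys(items))


# Made-up media identifiers, with two repeats
media_ids = ["fig1", "table2", "fig1", "fig3", "table2"]
unique_media_ids = dedupe_preserving_order(media_ids)
```

A plain `set` would also remove the repeats, but it would not guarantee the order in which media reach the prompt, which is presumably why an ordered variant is used.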

Copilot

Pull Request Overview

This PR implements media deduplication for Context creation to avoid passing duplicate images to the LLM. It introduces an info_hash field in ParsedMedia to enable deduplication based on rounded bounding boxes while preserving high-resolution metadata, and applies ordered set-based deduplication when creating context prompts.

Key changes:

- Added `info_hash` to `ParsedMedia` for flexible deduplication without affecting stored metadata
- Applied deduplication to media lists during context creation in `_map_fxn_summary`
- Created test PDF with duplicate media to validate deduplication behavior
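
The idea of deduplicating on a rounded bounding box while storing the full-precision one can be sketched as follows. This is an assumption-laden illustration: `Media`, `dedup_key`, and `dedupe_media` are hypothetical stand-ins, not PaperQA's `ParsedMedia` or its `_get_info_hash()` method, and the hashing scheme shown is not confirmed by the PR.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Media:
    """Hypothetical stand-in for a parsed media object (not PaperQA's ParsedMedia)."""

    data: bytes
    bbox: tuple[float, float, float, float]  # stored at full precision

    def dedup_key(self, ndigits: int = 0) -> str:
        """Hash the image bytes plus a rounded copy of the bbox."""
        rounded = [round(coord, ndigits) for coord in self.bbox]
        payload = json.dumps({"bbox": rounded}).encode() + self.data
        return hashlib.sha256(payload).hexdigest()


def dedupe_media(media: list[Media]) -> list[Media]:
    """Keep the first media object seen for each dedup key, in order."""
    seen: dict[str, Media] = {}
    for m in media:
        seen.setdefault(m.dedup_key(), m)
    return list(seen.values())


# Same image detected twice with slightly different bounding boxes
first = Media(data=b"image-bytes", bbox=(10.04, 20.1, 110.2, 220.3))
near_duplicate = Media(data=b"image-bytes", bbox=(10.31, 19.9, 109.8, 220.4))
other = Media(data=b"other-bytes", bbox=(10.04, 20.1, 110.2, 220.3))

unique = dedupe_media([first, near_duplicate, other])
```

Here `first` and `near_duplicate` collapse into one entry because their rounded boxes match, while the surviving object still carries its original, unrounded `bbox`.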

Reviewed Changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/paperqa/types.py | Added `_get_info_hash()` method to support deduplication via optional `info_hash` key |
| src/paperqa/core.py | Implemented ordered set deduplication for media during context creation |
| packages/paper-qa-docling/src/paperqa_docling/reader.py | Added `info_hash` generation with rounded bounding boxes for pictures and tables |
| tests/test_paperqa.py | Added test to verify media deduplication during context creation |
| packages/paper-qa-docling/tests/test_paperqa_docling.py | Added test to verify ParsedMedia deduplication behavior |
| tests/duplicate_media_template.md | Template for generating test PDF with duplicate images |
| src/paperqa/prompts.py | Removed redundant separator between text and tables |
| tests/test_agents.py | Updated expected file counts to include new test PDF |

dosubot · 2025-10-27T16:53:22Z

Documentation Updates

1 document(s) were updated by changes in this PR

Multimodal Support in PaperQA (View Changes)

jamesbraza requested review from maykcaldas, mskarlin, nadolskit, sidnarayanan and whitead October 27, 2025 16:51

jamesbraza self-assigned this Oct 27, 2025

Copilot AI review requested due to automatic review settings October 27, 2025 16:51

jamesbraza added the bug Something isn't working label Oct 27, 2025

dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Oct 27, 2025

Copilot AI reviewed Oct 27, 2025

dosubot bot added the enhancement New feature or request label Oct 27, 2025

jamesbraza force-pushed the deduplicating-images branch from 1d66dd7 to 5299143 Compare October 27, 2025 16:57

sidnarayanan approved these changes Oct 27, 2025

packages/paper-qa-docling/src/paperqa_docling/reader.py (outdated, resolved)

src/paperqa/prompts.py (resolved)

dosubot bot added the lgtm This PR has been approved by a maintainer label Oct 27, 2025

jamesbraza added 2 commits October 27, 2025 12:43

- 31ec905: Added 'info_hash' to ParsedMedia to enable controlled deduplication
- 6dc97aa: Created duplicate_media.pdf to integration test media deduplication, adjusting assertions as needed

jamesbraza force-pushed the deduplicating-images branch 2 times, most recently from ef669d6 to 4e954cc Compare October 27, 2025 20:07

jamesbraza added 5 commits October 27, 2025 13:11

- 1c6d490: Added info_hash to Docling's ParsedMedia, with a test using duplicate_media.pdf
- a0b808a: Added media deduplication to Context creation, with a test
- c77d0b3: Simplified prompting around Markdown table text
- 61864ad: Renamed info_hash to info_hashable to be clear
- 7b4e1f5: Revert "Simplified prompting around Markdown table text" (This reverts commit 6c628c033bfec8ca64e36887b3ed6004e11b609d.)

jamesbraza force-pushed the deduplicating-images branch from 4e954cc to 7b4e1f5 Compare October 27, 2025 20:11

jamesbraza merged commit 80149a1 into main Oct 27, 2025
9 checks passed

jamesbraza deleted the deduplicating-images branch October 27, 2025 20:38
