Conversation

@jamesbraza (Collaborator)

This PR is the first step toward PaperQA becoming multimodal:

  1. Adds reader support for images
  2. Converts Docs to also store images (with text and metadata) as ParsedMedia objects (a rough sketch of such a record follows this list)
  3. Expands the gather_evidence tool to include images in the Context-generation prompt
  4. Adds tests covering the prior three steps
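
A minimal sketch of what such a media record might hold, assuming a dataclass-like shape; the real `ParsedMedia` lives in `src/paperqa/types.py`, and every field name below is an illustrative assumption rather than the PR's actual API:

```python
import base64
from dataclasses import dataclass, field


@dataclass
class ParsedMediaSketch:
    """Illustrative stand-in for ParsedMedia: image bytes plus metadata."""

    data: bytes  # raw image bytes as read from the source document
    info: dict = field(default_factory=dict)  # assumed metadata, e.g. {"page": 3}

    def to_base64(self) -> str:
        # base64 keeps the payload JSON-serializable and usable in LLM prompts
        return base64.b64encode(self.data).decode("ascii")
```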

@jamesbraza jamesbraza self-assigned this Aug 5, 2025
@jamesbraza jamesbraza added the `enhancement` (New feature or request) label Aug 5, 2025
Copilot AI review requested due to automatic review settings on August 5, 2025 22:29
@dosubot dosubot bot added the `size:L` (This PR changes 100-499 lines, ignoring generated files) label Aug 5, 2025
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR introduces multimodal capabilities to PaperQA by adding support for images as a new media type. The key changes enable the system to parse, store, and utilize images alongside text content in the question-answering process.

  • Adds image parsing functionality with a new ParsedMedia class for storing image data and metadata
  • Extends the `Text` class to include associated media and updates the evidence-gathering process to incorporate images
  • Integrates image support into the LLM prompting system for multimodal question answering (a hedged sketch of this message shape follows)
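
One common way to attach images to a summarization prompt is the OpenAI-style "content parts" message format, sketched below; PaperQA's actual prompt assembly in `src/paperqa/core.py` and `src/paperqa/prompts.py` may differ, and `build_multimodal_message` is a hypothetical helper:

```python
def build_multimodal_message(summary_prompt: str, images_b64: list[str]) -> dict:
    """Assemble a user message whose content mixes text and image parts."""
    content: list[dict] = [{"type": "text", "text": summary_prompt}]
    for b64 in images_b64:
        content.append(
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            }
        )
    return {"role": "user", "content": content}
```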

Reviewed Changes

Copilot reviewed 8 out of 12 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/paperqa/types.py | Introduces the `ParsedMedia` class for image storage and extends `Text` with a media field |
| src/paperqa/readers.py | Adds a `parse_image` function and updates chunking logic to handle images |
| src/paperqa/core.py | Modifies evidence summarization to include images in LLM prompts |
| src/paperqa/prompts.py | Updates prompt templates to support multimodal content with image integration |
| src/paperqa/utils.py | Adds utility functions for base64 encoding/decoding of image data |
| src/paperqa/docs.py | Updates document-addition logic to handle image parsing metadata |
| src/paperqa/settings.py | Excludes image files from indexing until embedding support is added |
| tests/test_paperqa.py | Comprehensive tests for image parsing, storage, and multimodal querying |
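
For the `src/paperqa/utils.py` row above, a minimal sketch of the base64 round trip it describes; the actual helper names in PaperQA are not shown in this review, so `encode_image` and `decode_image` are illustrative stand-ins:

```python
import base64


def encode_image(image_bytes: bytes) -> str:
    """Encode raw image bytes as a base64 string for JSON/prompt transport."""
    return base64.b64encode(image_bytes).decode("ascii")


def decode_image(encoded: str) -> bytes:
    """Invert encode_image, recovering the original bytes."""
    return base64.b64decode(encoded)


assert decode_image(encode_image(b"\x89PNG")) == b"\x89PNG"  # round trip holds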
Comments suppressed due to low confidence (1)

tests/test_paperqa.py:1338

  • This test compares an object with itself, which is always true. Consider testing equality against a separate instance with the same content to properly exercise the `__eq__` method.
    assert parsed_image == parsed_image, "Expected equality"  # noqa: PLR0124
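
A hedged sketch of the fix the reviewer is suggesting, assuming `ParsedMedia` can be rebuilt from its own attributes (`data` and `info` are assumed field names, not confirmed by this review):

```python
# Build a second instance with identical content so __eq__ is exercised
# against a distinct object rather than the same one.
parsed_image_copy = ParsedMedia(data=parsed_image.data, info=parsed_image.info)
assert parsed_image == parsed_image_copy, "Expected content-based equality"
assert parsed_image is not parsed_image_copy  # equal content, distinct objects
```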

@jamesbraza jamesbraza force-pushed the multimodal-images branch 2 times, most recently from 12c1d82 to dd29e08, on August 5, 2025 22:34
@dosubot dosubot bot added the `lgtm` (This PR has been approved by a maintainer) label Aug 5, 2025
@dosubot dosubot bot added the `size:XL` (This PR changes 500-999 lines, ignoring generated files) label and removed the `size:L` (This PR changes 100-499 lines, ignoring generated files) label Aug 6, 2025
@jamesbraza jamesbraza merged commit 5675e97 into main on Aug 6, 2025 (7 checks passed)
@jamesbraza jamesbraza deleted the multimodal-images branch August 6, 2025 05:31