
Feature Request: Support multimodal extraction (text + image) #270

@ojipadeson

Description

It would be great if langextract could support multimodal input, allowing users to pass both textual and visual data directly to the model without pre-processing images through a separate VL (vision-language) model. This would take advantage of modern multimodal models that handle image understanding together with text.

Currently, when working with image-rich documents or visual datasets, we need to run a VL model to convert images to text before extraction (a sketch of this two-step workaround is shown after the questions below). This loses potential visual context information and adds extra processing steps. Many recent LLMs can directly accept images alongside text prompts.
Supporting this natively in langextract would:

  • Save processing time
  • Preserve visual context
  • Enable richer extraction capabilities from images

A few questions:

  1. Is there any plan to support multimodal extraction (text + image) in langextract?
  2. Do you have any recommended best practices or existing approaches for this scenario?
  3. If there’s no current plan, would you welcome community contributions for such a feature?
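
For reference, here is a minimal sketch of the two-step workaround described above, assuming the google-genai SDK for the image-to-text step and langextract's lx.extract API for the extraction step. The image file name, prompts, example schema, and model id are placeholders for illustration, not part of langextract:

```python
import textwrap

import langextract as lx
from google import genai
from google.genai import types

# Step 1: convert the image to text with a vision-language model
# (placeholder file name and prompt).
client = genai.Client()
with open("report_figure.png", "rb") as f:
    image_bytes = f.read()

caption = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Transcribe all text and describe the factual content of this image.",
    ],
).text

# Step 2: run langextract over the generated description.
prompt = textwrap.dedent("""\
    Extract product names and prices mentioned in the text.
    Use exact spans from the text for extraction_text.""")

examples = [
    lx.data.ExampleData(
        text="The flyer advertises a Widget Pro for $19.99.",
        extractions=[
            lx.data.Extraction(
                extraction_class="product",
                extraction_text="Widget Pro",
                attributes={"price": "$19.99"},
            ),
        ],
    ),
]

result = lx.extract(
    text_or_documents=caption,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)

for extraction in result.extractions:
    print(extraction.extraction_class, extraction.extraction_text, extraction.attributes)
```

Native multimodal support would collapse these two steps into a single call and keep the visual context available during extraction.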
