
Feature Request: Support multimodal extraction (text + image) #270

@ojipadeson

Description

It would be great if langextract could support multimodal input, allowing users to pass both textual and visual data directly to the model without pre-processing images through a separate VL (vision-language) model. This would take advantage of modern multimodal models that handle image understanding together with text.

Currently, when working with image-rich documents or visual datasets, we need to run a VL model to convert images to text before extraction (a sketch of this two-step workaround is shown after the questions below). This loses potential visual context information and adds extra processing steps. Many recent LLMs can directly accept images alongside text prompts.
Supporting this natively in langextract would:

  • Save processing time
  • Preserve visual context
  • Enable richer extraction capabilities from images

A few questions:

  1. Is there any plan to support multimodal extraction (text + image) in langextract?
  2. Do you have any recommended best practices or existing approaches for this scenario?
  3. If there’s no current plan, would you welcome community contributions for such a feature?
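
For reference, here is a minimal sketch of the two-step workaround described above, assuming the google-genai SDK for the image-to-text step and langextract's lx.extract API for the extraction step. The image file name, prompts, example schema, and model id are placeholders for illustration, not part of langextract:

```python
import textwrap

import langextract as lx
from google import genai
from google.genai import types

# Step 1: convert the image to text with a vision-language model
# (placeholder file name and prompt).
client = genai.Client()
with open("report_figure.png", "rb") as f:
    image_bytes = f.read()

caption = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Transcribe all text and describe the factual content of this image.",
    ],
).text

# Step 2: run langextract over the generated description.
prompt = textwrap.dedent("""\
    Extract product names and prices mentioned in the text.
    Use exact spans from the text for extraction_text.""")

examples = [
    lx.data.ExampleData(
        text="The flyer advertises a Widget Pro for $19.99.",
        extractions=[
            lx.data.Extraction(
                extraction_class="product",
                extraction_text="Widget Pro",
                attributes={"price": "$19.99"},
            ),
        ],
    ),
]

result = lx.extract(
    text_or_documents=caption,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)

for extraction in result.extractions:
    print(extraction.extraction_class, extraction.extraction_text, extraction.attributes)
```

Native multimodal support would collapse these two steps into a single call and keep the visual context available during extraction.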
