It would be great if langextract could support multimodal input, allowing users to pass both textual and visual data directly to the model without first pre-processing images with a separate VL model. This would take advantage of modern multimodal models that can handle image understanding together with text.
Currently, when working with image-rich documents or visual datasets, we have to run a VL model to convert images to text before extraction. This loses potential visual context and adds extra processing steps, even though many recent LLMs can accept images directly alongside text prompts.
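For context, the current workaround looks roughly like the sketch below. `describe_image` is a hypothetical placeholder for whatever VL model is used for captioning, and the `lx.extract` / `lx.data` calls follow the basic usage pattern from the README (treat exact parameter names as an assumption):

```python
import langextract as lx


def describe_image(image_path: str) -> str:
    """Hypothetical helper: call a separate vision-language model to turn
    the image into a textual description before extraction."""
    raise NotImplementedError  # e.g. a Gemini or GPT-4o captioning request


# Step 1: the extra VL pre-processing pass this issue would like to avoid.
image_text = describe_image("invoice_page_1.png")

# Step 2: run langextract on the resulting text only; any visual context
# the caption did not capture is already lost at this point.
result = lx.extract(
    text_or_documents=image_text,
    prompt_description="Extract the invoice number, total amount, and due date.",
    examples=[
        lx.data.ExampleData(
            text="Invoice INV-001, total $120.00, due 2024-05-01.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="invoice_number",
                    extraction_text="INV-001",
                ),
            ],
        ),
    ],
    model_id="gemini-2.5-flash",
)
```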
Supporting this natively in langextract would:
- Save processing time
- Preserve visual context
- Enable richer extraction capabilities from images
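To make the request concrete, here is a purely hypothetical sketch of what a multimodal call could look like. The `images` parameter does not exist in langextract today; it only illustrates the kind of interface being asked for.

```python
import langextract as lx

# Same few-shot examples as a text-only call.
examples = [
    lx.data.ExampleData(
        text="Revenue grew to $4.2M in Q3.",
        extractions=[
            lx.data.Extraction(extraction_class="metric",
                               extraction_text="$4.2M"),
        ],
    ),
]

# Purely illustrative: the `images` argument is NOT part of langextract
# today. It sketches passing visual inputs to a multimodal model
# alongside the text prompt, without a separate VL captioning step.
result = lx.extract(
    text_or_documents="Report text extracted from the PDF...",
    images=["report_page_3_chart.png"],  # hypothetical parameter
    prompt_description="Extract each metric name and its value, "
                       "using both the text and the chart.",
    examples=examples,
    model_id="gemini-2.5-flash",  # a model that accepts image input
)
```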
A few questions:
- Is there any plan to support multimodal extraction (text + image) in langextract?
- Do you have any recommended best practices or existing approaches for this scenario?
- If there’s no current plan, would you welcome community contributions for such a feature?