
feat(ai): add vision support for multimodal messages#683

Merged
willgriffin merged 1 commit into main from feat/ai-vision-support
Dec 19, 2025
Conversation

@willgriffin
Contributor

Summary

Add support for vision-capable LLMs by extending AIMessage.content to accept both simple strings and multimodal content arrays (ContentPart[]).

Changes:

  • Add ContentPart types (TextContentPart, ImageContentPart) for multimodal messages
  • Add extractTextContent() helper function for backward compatibility
  • Update AIMessage.content type to accept string | ContentPart[]
  • Update OpenAI provider to properly map ContentPart arrays to OpenAI format
  • Update all other providers (Anthropic, Bedrock, Claude-CLI, Gemini, HuggingFace) to use extractTextContent() for backward compatibility
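
A sketch of what these additions could look like. The names (`ContentPart`, `TextContentPart`, `ImageContentPart`, `extractTextContent`, `AIMessage`) come from the PR description, but the exact field layout is inferred from the usage example, and joining text parts with a newline in `extractTextContent` is an assumption, not confirmed by the PR:

```typescript
// Illustrative type sketch — field shapes inferred from the usage example.
interface TextContentPart {
  type: 'text';
  text: string;
}

interface ImageContentPart {
  type: 'image_url';
  image_url: {
    url: string;                      // data: URI or remote URL
    detail?: 'low' | 'high' | 'auto'; // optional resolution hint
  };
}

type ContentPart = TextContentPart | ImageContentPart;

interface AIMessage {
  role: 'system' | 'user' | 'assistant';
  content: string | ContentPart[];
}

// Backward-compatibility helper: flatten multimodal content to plain text
// so text-only providers keep working. The newline separator is assumed.
function extractTextContent(content: string | ContentPart[]): string {
  if (typeof content === 'string') return content;
  return content
    .filter((part): part is TextContentPart => part.type === 'text')
    .map((part) => part.text)
    .join('\n');
}
```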

Usage

This enables vision-capable models to accept images inline:

```typescript
import { getAI } from '@happyvertical/ai';

const ai = await getAI({ type: 'openai' });

const response = await ai.chat([
  {
    role: 'user',
    content: [
      { type: 'text', text: 'What is in this image?' },
      {
        type: 'image_url',
        image_url: {
          url: 'data:image/png;base64,iVBORw0KGgo...',
          detail: 'high'
        }
      }
    ]
  }
]);
```
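
For local images, the `image_url.url` field accepts a `data:` URI. A minimal helper for building one from raw bytes (illustrative only — `toDataUri` is not part of the package):

```typescript
// Hypothetical helper: encode raw image bytes as a data: URI suitable for
// the image_url content part. Assumes a Node.js runtime (uses Buffer).
function toDataUri(bytes: Uint8Array, mimeType = 'image/png'): string {
  const base64 = Buffer.from(bytes).toString('base64');
  return `data:${mimeType};base64,${base64}`;
}
```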

Motivation

Required for the @happyvertical/ocr LiteLLM provider, which uses vision-capable LLMs for OCR text extraction.

See: happyvertical/ocr#21

Test plan

  • All existing tests pass (71 tests; 10 skipped for lack of API keys)
  • TypeScript compilation passes
  • Manual testing with vision-capable model (GPT-4o, DeepSeek-VL)

@github-actions
Contributor

📦 Version Bump Preview

When this PR is merged, packages will receive a minor version bump based on your conventional commits.

What happens on merge?

  1. Tests run on main branch
  2. Packages are built
  3. Versions are bumped automatically
  4. Packages are published to GitHub Packages
  5. Git tags are created
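
The bump level is derived from the Conventional Commits header (this PR's `feat(ai): …` prefix yields the minor bump). A minimal sketch of that mapping, under the assumption the workflow follows the standard convention — this is not the actual Action code:

```typescript
// Illustrative mapping from a conventional commit header to a semver bump.
type Bump = 'major' | 'minor' | 'patch';

function bumpForCommit(header: string): Bump {
  // "feat!:" / "fix(scope)!:" or a BREAKING CHANGE footer → major
  if (/^[a-z]+(\([^)]*\))?!:/.test(header) || header.includes('BREAKING CHANGE')) {
    return 'major';
  }
  // "feat:" / "feat(scope):" → minor
  if (/^feat(\([^)]*\))?:/.test(header)) return 'minor';
  // everything else (fix:, chore:, docs:, …) → patch
  return 'patch';
}
```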

No manual intervention needed! 🎉

@willgriffin willgriffin merged commit 85dac4b into main Dec 19, 2025
6 checks passed
@willgriffin willgriffin deleted the feat/ai-vision-support branch December 19, 2025 02:48