
feat(ai): add vision support for multimodal messages#683

Merged
willgriffin merged 1 commit into main from feat/ai-vision-support
Dec 19, 2025
Conversation

@willgriffin
Contributor

Summary

Add support for vision-capable LLMs by extending AIMessage.content to accept both simple strings and multimodal content arrays (ContentPart[]).

Changes:

  • Add ContentPart types (TextContentPart, ImageContentPart) for multimodal messages
  • Add extractTextContent() helper function for backward compatibility
  • Update AIMessage.content type to accept string | ContentPart[]
  • Update OpenAI provider to properly map ContentPart arrays to OpenAI format
  • Update all other providers (Anthropic, Bedrock, Claude-CLI, Gemini, HuggingFace) to use extractTextContent() for backward compatibility
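
A sketch of what these additions could look like. The names (`ContentPart`, `TextContentPart`, `ImageContentPart`, `extractTextContent`, `AIMessage`) come from the PR description, but the exact field layout is inferred from the usage example, and joining text parts with a newline in `extractTextContent` is an assumption, not confirmed by the PR:

```typescript
// Illustrative type sketch — field shapes inferred from the usage example.
interface TextContentPart {
  type: 'text';
  text: string;
}

interface ImageContentPart {
  type: 'image_url';
  image_url: {
    url: string;                      // data: URI or remote URL
    detail?: 'low' | 'high' | 'auto'; // optional resolution hint
  };
}

type ContentPart = TextContentPart | ImageContentPart;

interface AIMessage {
  role: 'system' | 'user' | 'assistant';
  content: string | ContentPart[];
}

// Backward-compatibility helper: flatten multimodal content to plain text
// so text-only providers keep working. The newline separator is assumed.
function extractTextContent(content: string | ContentPart[]): string {
  if (typeof content === 'string') return content;
  return content
    .filter((part): part is TextContentPart => part.type === 'text')
    .map((part) => part.text)
    .join('\n');
}
```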

Usage

This enables vision-capable models to accept images inline:

```typescript
import { getAI } from '@happyvertical/ai';

const ai = await getAI({ type: 'openai' });

const response = await ai.chat([
  {
    role: 'user',
    content: [
      { type: 'text', text: 'What is in this image?' },
      {
        type: 'image_url',
        image_url: {
          url: 'data:image/png;base64,iVBORw0KGgo...',
          detail: 'high'
        }
      }
    ]
  }
]);
```
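
For local images, the `image_url.url` field accepts a `data:` URI. A minimal helper for building one from raw bytes (illustrative only — `toDataUri` is not part of the package):

```typescript
// Hypothetical helper: encode raw image bytes as a data: URI suitable for
// the image_url content part. Assumes a Node.js runtime (uses Buffer).
function toDataUri(bytes: Uint8Array, mimeType = 'image/png'): string {
  const base64 = Buffer.from(bytes).toString('base64');
  return `data:${mimeType};base64,${base64}`;
}
```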

Motivation

Required for the @happyvertical/ocr LiteLLM provider, which uses vision-capable LLMs for OCR text extraction.

See: happyvertical/ocr#21

Test plan

  • All existing tests pass (71 tests; 10 skipped for lack of API keys)
  • TypeScript compilation passes
  • Manual testing with vision-capable model (GPT-4o, DeepSeek-VL)

@github-actions
Contributor

📦 Version Bump Preview

When this PR is merged, packages will receive a minor version bump based on your conventional commits.

What happens on merge?

  1. Tests run on main branch
  2. Packages are built
  3. Versions are bumped automatically
  4. Packages are published to GitHub Packages
  5. Git tags are created
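
The bump level is derived from the Conventional Commits header (this PR's `feat(ai): …` prefix yields the minor bump). A minimal sketch of that mapping, under the assumption the workflow follows the standard convention — this is not the actual Action code:

```typescript
// Illustrative mapping from a conventional commit header to a semver bump.
type Bump = 'major' | 'minor' | 'patch';

function bumpForCommit(header: string): Bump {
  // "feat!:" / "fix(scope)!:" or a BREAKING CHANGE footer → major
  if (/^[a-z]+(\([^)]*\))?!:/.test(header) || header.includes('BREAKING CHANGE')) {
    return 'major';
  }
  // "feat:" / "feat(scope):" → minor
  if (/^feat(\([^)]*\))?:/.test(header)) return 'minor';
  // everything else (fix:, chore:, docs:, …) → patch
  return 'patch';
}
```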

No manual intervention needed! 🎉

@willgriffin willgriffin merged commit 85dac4b into main Dec 19, 2025
6 checks passed
@willgriffin willgriffin deleted the feat/ai-vision-support branch December 19, 2025 02:48