Attachments
adham90 edited this page Feb 16, 2026 · 5 revisions
Send images, PDFs, and other files to vision-capable models using the `with:` option.

class VisionAgent < ApplicationAgent
  model "gpt-4o" # Vision-capable model
  param :question, required: true
  user "{question}"
end

# Local file
VisionAgent.call(question: "Describe this image", with: "photo.jpg")

# URL
VisionAgent.call(
  question: "What architecture is shown?",
  with: "https://example.com/building.jpg"
)

VisionAgent.call(
  question: "Compare these screenshots",
  with: ["screenshot_v1.png", "screenshot_v2.png"]
)

RubyLLM automatically detects file types:
| Category | Extensions |
|---|---|
| Images | .jpg, .jpeg, .png, .gif, .webp, .bmp |
| Videos | .mp4, .mov, .avi, .webm |
| Audio | .mp3, .wav, .m4a, .ogg, .flac |
| Documents | .pdf, .txt, .md, .csv, .json, .xml |
| Code | .rb, .py, .js, .ts, .html, .css, and more |
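
To make the table concrete, here is a minimal sketch of extension-based categorization. It is not RubyLLM's actual detection logic: `ATTACHMENT_CATEGORIES` and `attachment_category` are hypothetical names introduced for illustration, and only the extensions listed above are covered.

```ruby
# Hypothetical lookup table mirroring the categories listed above.
# This is an illustration, not RubyLLM's internal implementation.
ATTACHMENT_CATEGORIES = {
  image:    %w[.jpg .jpeg .png .gif .webp .bmp],
  video:    %w[.mp4 .mov .avi .webm],
  audio:    %w[.mp3 .wav .m4a .ogg .flac],
  document: %w[.pdf .txt .md .csv .json .xml],
  code:     %w[.rb .py .js .ts .html .css]
}.freeze

# Returns the category symbol for a local path or URL, or :unknown.
def attachment_category(path)
  # Drop any query string so signed URLs still resolve to their extension.
  bare = path.to_s.split("?").first.to_s
  ext = File.extname(bare).downcase
  match = ATTACHMENT_CATEGORIES.find { |_category, extensions| extensions.include?(ext) }
  match ? match.first : :unknown
end

attachment_category("photo.JPG")                          # => :image
attachment_category("https://example.com/report.pdf?x=1") # => :document
attachment_category("archive.zip")                        # => :unknown
```

A check like this could run before a call to reject unsupported files early, instead of waiting for the provider to respond with an error.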
Not all models support vision. Use one of these:

| Provider | Models |
|---|---|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo |
| Anthropic | claude-3-5-sonnet, claude-3-opus, claude-3-haiku |
| Google | gemini-2.0-flash, gemini-1.5-pro |
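
One way to fail fast is to check the configured model against this table before attaching files. The sketch below assumes a hand-maintained allow-list; `VISION_MODELS`, `vision_capable?`, and `ensure_vision_model!` are hypothetical helpers, not part of the library.

```ruby
# Hypothetical allow-list copied from the table above. Keep it in sync
# with the models your providers actually offer.
VISION_MODELS = %w[
  gpt-4o gpt-4o-mini gpt-4-turbo
  claude-3-5-sonnet claude-3-opus claude-3-haiku
  gemini-2.0-flash gemini-1.5-pro
].freeze

# True when the model name appears in the allow-list.
def vision_capable?(model)
  VISION_MODELS.include?(model.to_s)
end

# Returns the model unchanged, or raises if it is not in the allow-list.
def ensure_vision_model!(model)
  return model if vision_capable?(model)

  raise ArgumentError, "#{model} is not in the vision-capable list"
end

vision_capable?("gpt-4o")        # => true
vision_capable?("gpt-3.5-turbo") # => false
```
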
class ImageDescriber < ApplicationAgent
  model "gpt-4o"
  param :detail_level, default: "medium"
  user "Describe this image in {detail_level} detail."
end

result = ImageDescriber.call(
  detail_level: "high",
  with: "product_photo.jpg"
)

class OCRAgent < ApplicationAgent
  model "gpt-4o"

  user do
    <<~S
      Extract all text from this image.
      Preserve the original formatting and structure.
      Return the text exactly as it appears.
    S
  end

  def schema
    @schema ||= RubyLLM::Schema.create do
      string :extracted_text, description: "All text found in image"
      array :text_blocks, of: :object do
        string :content
        string :location, description: "top/middle/bottom"
      end
    end
  end
end

result = OCRAgent.call(with: "document_scan.png")
puts result[:extracted_text]

class ImageComparator < ApplicationAgent
  model "claude-3-5-sonnet"

  user do
    <<~S
      Compare these two images and identify:
      1. Similarities
      2. Differences
      3. Which appears higher quality
    S
  end

  def schema
    @schema ||= RubyLLM::Schema.create do
      array :similarities, of: :string
      array :differences, of: :string
      string :quality_winner, enum: ["first", "second", "equal"]
      string :explanation
    end
  end
end

result = ImageComparator.call(with: ["design_v1.png", "design_v2.png"])

class PDFAnalyzer < ApplicationAgent
  model "gpt-4o"
  param :focus_area, default: "summary"

  user do
    <<~S
      Analyze this PDF document. Focus on: {focus_area}
      Provide:
      - Main topics covered
      - Key points
      - Any important figures or data
    S
  end
end

result = PDFAnalyzer.call(
  focus_area: "financial data",
  with: "annual_report.pdf"
)

class InvoiceExtractor < ApplicationAgent
  model "gpt-4o"
  user "Extract invoice details from this document."

  def schema
    @schema ||= RubyLLM::Schema.create do
      string :invoice_number
      string :date
      string :vendor_name
      number :total_amount
      string :currency, default: "USD"
      array :line_items, of: :object do
        string :description
        integer :quantity
        number :unit_price
        number :total
      end
    end
  end
end

result = InvoiceExtractor.call(with: "invoice.pdf")
# => { invoice_number: "INV-2024-001", total_amount: 1250.00, ... }

# Relative path (from Rails root)
result = VisionAgent.call(with: "storage/images/photo.jpg")
# Absolute path
result = VisionAgent.call(with: "/path/to/photo.jpg")
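# Active Storage (Disk service only: blobs have no #path, so resolve it via the service)
result = VisionAgent.call(with: ActiveStorage::Blob.service.path_for(user.avatar.key))

# Direct image URL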
result = VisionAgent.call(with: "https://example.com/image.jpg")
# S3 signed URL
url = document.file.url(expires_in: 1.hour)
result = VisionAgent.call(with: url)

result = VisionAgent.call(
  question: "test",
  with: ["image1.png", "image2.png"],
  dry_run: true
)
# => {
#   dry_run: true,
#   agent: "VisionAgent",
#   attachments: ["image1.png", "image2.png"],
#   ...
# }

begin
  result = VisionAgent.call(
    question: "Describe this",
    with: "missing_file.jpg"
  )
rescue Errno::ENOENT
  # File not found
  Rails.logger.error("Attachment file not found")
rescue => e
  # Other errors (network, invalid format, etc.)
  Rails.logger.error("Attachment error: #{e.message}")
end

Large images increase cost and latency:
# Resize before sending (requires the mini_magick gem)
require "mini_magick"

image = MiniMagick::Image.open("large_photo.jpg")
image.resize "1024x1024>" # ">" shrinks only images larger than 1024x1024
image.write "optimized_photo.jpg"
result = VisionAgent.call(with: "optimized_photo.jpg")

Some providers support detail levels:
# OpenAI specific - in your prompt
user "Using high detail analysis, describe every element in this image."

Group related images in a single call:
# One call with multiple images (cheaper than multiple calls)
result = CompareAgent.call(
  with: ["before.jpg", "after.jpg"]
)

For large PDFs, raise the timeout and consider splitting the document into chunks:
class LargeDocumentAgent < ApplicationAgent
  model "gpt-4o"
  timeout 180 # Longer timeout for large docs
  user "Analyze this document page by page. Focus on key information."
end
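
The agent above only extends the timeout. If you also want to chunk a large PDF, a first step is deciding which pages go into each chunk. The following is a minimal sketch of that step only; `page_batches` is a hypothetical helper, and the page count and batch size are assumed inputs.

```ruby
# Hypothetical helper: split a page count into consecutive 1-based page ranges.
# Splitting the PDF itself is left to whichever PDF library you use.
def page_batches(total_pages, per_batch)
  raise ArgumentError, "per_batch must be positive" unless per_batch.positive?

  (1..total_pages).each_slice(per_batch).map { |pages| pages.first..pages.last }
end

page_batches(25, 10) # => [1..10, 11..20, 21..25]
```

Each range could then be written out as its own PDF and sent in a separate call through `with:`, keeping every request smaller at the cost of losing context across chunks.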