Attachments

adham90 edited this page Feb 16, 2026 · 5 revisions

Send images, PDFs, and other files to vision-capable models using the with: option.

Basic Usage

Single File

class VisionAgent < ApplicationAgent
  model "gpt-4o"  # Vision-capable model
  param :question, required: true
  user "{question}"
end

# Local file
VisionAgent.call(question: "Describe this image", with: "photo.jpg")

# URL
VisionAgent.call(
  question: "What architecture is shown?",
  with: "https://example.com/building.jpg"
)

Multiple Files

VisionAgent.call(
  question: "Compare these screenshots",
  with: ["screenshot_v1.png", "screenshot_v2.png"]
)

Supported File Types

RubyLLM automatically detects file types:

| Category | Extensions |
|----------|------------|
| Images | .jpg, .jpeg, .png, .gif, .webp, .bmp |
| Videos | .mp4, .mov, .avi, .webm |
| Audio | .mp3, .wav, .m4a, .ogg, .flac |
| Documents | .pdf, .txt, .md, .csv, .json, .xml |
| Code | .rb, .py, .js, .ts, .html, .css, and more |
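
If you want to check a file's category in your own code before sending it, the table above can be mirrored in a small lookup. This is a minimal sketch: `ATTACHMENT_CATEGORIES` and `attachment_category` are hypothetical names, not part of RubyLLM, and RubyLLM's own detection may work differently.

```ruby
# Hypothetical lookup that mirrors the table above; not part of RubyLLM.
ATTACHMENT_CATEGORIES = {
  image:    %w[.jpg .jpeg .png .gif .webp .bmp],
  video:    %w[.mp4 .mov .avi .webm],
  audio:    %w[.mp3 .wav .m4a .ogg .flac],
  document: %w[.pdf .txt .md .csv .json .xml],
  code:     %w[.rb .py .js .ts .html .css]
}.freeze

# Returns the category symbol for a local path or URL, or :unknown.
def attachment_category(source)
  # Drop any query string so signed URLs still expose their extension.
  path = source.to_s.split("?", 2).first.to_s
  ext = File.extname(path).downcase
  match = ATTACHMENT_CATEGORIES.find { |_category, exts| exts.include?(ext) }
  match ? match.first : :unknown
end

puts attachment_category("photo.JPG")
puts attachment_category("https://example.com/report.pdf?token=abc")
puts attachment_category("archive.zip")
```

A guard like this can reject unsupported files early, before any request is made.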

Vision-Capable Models

Not all models accept image input. The models below do:

| Provider | Models |
|----------|--------|
| OpenAI | gpt-4o, gpt-4o-mini, gpt-4-turbo |
| Anthropic | claude-3-5-sonnet, claude-3-opus, claude-3-haiku |
| Google | gemini-2.0-flash, gemini-1.5-pro |
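
To fail fast when an agent is configured with a model that is not on this list, a simple allow-list check can be sketched as follows. `VISION_MODELS`, `vision_capable?`, and `ensure_vision_model!` are hypothetical helpers, and the list only copies the table above; it is not exhaustive and is not read from the library.

```ruby
# Hypothetical allow-list copied from the table above; extend it as needed.
VISION_MODELS = %w[
  gpt-4o gpt-4o-mini gpt-4-turbo
  claude-3-5-sonnet claude-3-opus claude-3-haiku
  gemini-2.0-flash gemini-1.5-pro
].freeze

# True when the model id appears in the allow-list.
def vision_capable?(model_id)
  VISION_MODELS.include?(model_id.to_s)
end

# Returns the model id unchanged, or raises if it is not in the list.
def ensure_vision_model!(model_id)
  return model_id if vision_capable?(model_id)

  raise ArgumentError, "#{model_id} is not in the known vision-capable list"
end

puts vision_capable?("gpt-4o")
puts vision_capable?("gpt-3.5-turbo")
```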

Image Analysis Examples

Describe an Image

class ImageDescriber < ApplicationAgent
  model "gpt-4o"
  param :detail_level, default: "medium"
  user "Describe this image in {detail_level} detail."
end

result = ImageDescriber.call(
  detail_level: "high",
  with: "product_photo.jpg"
)

Extract Text (OCR)

class OCRAgent < ApplicationAgent
  model "gpt-4o"

  user do
    <<~S
      Extract all text from this image.
      Preserve the original formatting and structure.
      Return the text exactly as it appears.
    S
  end

  def schema
    @schema ||= RubyLLM::Schema.create do
      string :extracted_text, description: "All text found in image"
      array :text_blocks, of: :object do
        string :content
        string :location, description: "top/middle/bottom"
      end
    end
  end
end

result = OCRAgent.call(with: "document_scan.png")
puts result[:extracted_text]

Compare Images

class ImageComparator < ApplicationAgent
  model "claude-3-5-sonnet"

  user do
    <<~S
      Compare these two images and identify:
      1. Similarities
      2. Differences
      3. Which appears higher quality
    S
  end

  def schema
    @schema ||= RubyLLM::Schema.create do
      array :similarities, of: :string
      array :differences, of: :string
      string :quality_winner, enum: ["first", "second", "equal"]
      string :explanation
    end
  end
end

result = ImageComparator.call(with: ["design_v1.png", "design_v2.png"])

Document Analysis

PDF Analysis

class PDFAnalyzer < ApplicationAgent
  model "gpt-4o"
  param :focus_area, default: "summary"

  user do
    <<~S
      Analyze this PDF document. Focus on: {focus_area}

      Provide:
      - Main topics covered
      - Key points
      - Any important figures or data
    S
  end
end

result = PDFAnalyzer.call(
  focus_area: "financial data",
  with: "annual_report.pdf"
)

Invoice Processing

class InvoiceExtractor < ApplicationAgent
  model "gpt-4o"
  user "Extract invoice details from this document."

  def schema
    @schema ||= RubyLLM::Schema.create do
      string :invoice_number
      string :date
      string :vendor_name
      number :total_amount
      string :currency, default: "USD"
      array :line_items, of: :object do
        string :description
        integer :quantity
        number :unit_price
        number :total
      end
    end
  end
end

result = InvoiceExtractor.call(with: "invoice.pdf")
# => { invoice_number: "INV-2024-001", total_amount: 1250.00, ... }

URLs vs Local Files

Local Files

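# Relative path (from Rails root)
result = VisionAgent.call(question: "Describe this image", with: "storage/images/photo.jpg")

# Absolute path
result = VisionAgent.call(question: "Describe this image", with: "/path/to/photo.jpg")

# Active Storage: blobs have no #path, so download to a tempfile first
user.avatar.blob.open do |file|
  result = VisionAgent.call(question: "Describe this image", with: file.path)
end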

URLs

# Direct image URL
result = VisionAgent.call(question: "Describe this image", with: "https://example.com/image.jpg")

# S3 signed URL
url = document.file.url(expires_in: 1.hour)
result = VisionAgent.call(question: "Describe this image", with: url)
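
When attachments come from user input, it can help to know which kind of source you are holding before passing it to with:. The following is a minimal sketch using Ruby's standard `uri` library; `remote_attachment?` is a hypothetical helper and does not reflect how the library itself tells URLs from paths.

```ruby
require "uri"

# Hypothetical helper: true for http(s) URLs, false for anything else
# (relative paths, absolute paths, or strings that are not valid URIs).
def remote_attachment?(source)
  uri = URI.parse(source.to_s)
  # URI::HTTPS is a subclass of URI::HTTP, so this covers both schemes.
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  false
end

puts remote_attachment?("https://example.com/image.jpg")
puts remote_attachment?("storage/images/photo.jpg")
puts remote_attachment?("/path/to/photo.jpg")
```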

Debug Mode

result = VisionAgent.call(
  question: "test",
  with: ["image1.png", "image2.png"],
  dry_run: true
)

# => {
#   dry_run: true,
#   agent: "VisionAgent",
#   attachments: ["image1.png", "image2.png"],
#   ...
# }

Error Handling

begin
  result = VisionAgent.call(
    question: "Describe this",
    with: "missing_file.jpg"
  )
rescue Errno::ENOENT
  # File not found
  Rails.logger.error("Attachment file not found")
rescue => e
  # Other errors (network, invalid format, etc.)
  Rails.logger.error("Attachment error: #{e.message}")
end
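
As an alternative to rescuing after the fact, local paths can be checked before the call. This is a sketch under stated assumptions: `partition_attachments` is a hypothetical helper, and URLs are passed through unchecked because their availability cannot be verified on disk.

```ruby
require "tempfile"

# Hypothetical helper: splits sources into [usable, missing].
# http(s) URLs count as usable; local paths must point at an existing file.
def partition_attachments(sources)
  Array(sources).partition do |source|
    text = source.to_s
    text.start_with?("http://", "https://") || File.file?(text)
  end
end

# Demonstration with one real temporary file, one missing file, and one URL.
Tempfile.create(["photo", ".jpg"]) do |file|
  usable, missing = partition_attachments(
    [file.path, "missing_file.jpg", "https://example.com/a.png"]
  )
  puts "usable: #{usable.size}, missing: #{missing.inspect}"
end
```

The missing list can then be logged or reported to the user without making a request.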

Best Practices

Optimize Image Size

Large images increase cost and latency:

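# Resize before sending (requires the mini_magick gem)
require "mini_magick"

image = MiniMagick::Image.open("large_photo.jpg")
image.resize "1024x1024>"  # ">" shrinks only images larger than 1024x1024
image.write "optimized_photo.jpg"

result = VisionAgent.call(question: "Describe this image", with: "optimized_photo.jpg")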
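
The "1024x1024>" geometry shrinks an image to fit inside a 1024x1024 box while keeping its aspect ratio, and leaves smaller images untouched. The arithmetic can be sketched in plain Ruby; `fit_within` is a hypothetical helper for illustration, and its rounding may differ by a pixel from what ImageMagick produces.

```ruby
# Hypothetical helper: approximates the output size of a "WxH>" resize
# with a square bounding box of max_side pixels.
def fit_within(width, height, max_side = 1024)
  # Images that already fit are left unchanged (the ">" flag).
  return [width, height] if width <= max_side && height <= max_side

  # Scale so the longer side lands exactly on max_side.
  scale = max_side.to_f / [width, height].max
  [[(width * scale).round, 1].max, [(height * scale).round, 1].max]
end

p fit_within(4000, 3000)  # => [1024, 768]
p fit_within(800, 600)    # => [800, 600]
```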

Use Appropriate Detail Level

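OpenAI's API accepts a detail setting (low, high, or auto) for image inputs. The example below does not set it; it only asks for a more thorough analysis in the prompt:

# In your prompt (this does not change the provider's detail setting)
user "Using high detail analysis, describe every element in this image."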

Batch Related Images

Group related images in a single call:

# One call with both images: the prompt is sent once and the model sees them together
result = ImageComparator.call(
  with: ["before.jpg", "after.jpg"]
)

Handle Large Documents

For large PDFs, allow a longer timeout and consider splitting the document into chunks:

class LargeDocumentAgent < ApplicationAgent
  model "gpt-4o"
  timeout 180  # Longer timeout for large docs
  user "Analyze this document page by page. Focus on key information."
end
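
The agent above only raises the timeout. If you also split the document, the page ranges for each chunk can be planned as in the following sketch. `page_chunks` is a hypothetical helper; extracting each range into its own PDF needs a PDF library and is not shown here.

```ruby
# Hypothetical helper: splits pages 1..total_pages into consecutive ranges
# of at most pages_per_chunk pages each.
def page_chunks(total_pages, pages_per_chunk)
  raise ArgumentError, "pages_per_chunk must be positive" unless pages_per_chunk.positive?

  (1..total_pages).each_slice(pages_per_chunk).map do |pages|
    pages.first..pages.last
  end
end

p page_chunks(25, 10)  # => [1..10, 11..20, 21..25]
```

Each range could then be sent as a separate call, with the per-chunk results combined afterwards.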
