Skip to content

Python: [Feature]: Allow @tool functions to return image content that the model can analyze #4272

@droideronline

Description

@droideronline

Description

Currently, @tool-decorated functions are text-in / text-out by design. When a tool returns a Content object (e.g. Content.from_data(image_bytes, "image/png")), FunctionTool.parse_result() serializes it to a JSON string via _make_dumpable() and stores it as plain text in Content.from_function_result(result=...). When sent to the API, it becomes {"type": "function_call_output", "output": "<json string>"} — so the model sees a text description of the image, not the actual visual content.

This creates a gap for agentic use-cases where a tool needs to capture a screenshot, render a chart, fetch an image from an external service, or generate a diagram, and the model should then be able to visually analyze or describe that image as part of the same reasoning loop — without requiring the developer to manually wire up a multi-turn workaround.

What problem does it solve?

  • Enables vision-in-the-loop workflows without manual multi-turn orchestration
  • Allows tools like capture_screenshot(), render_chart(), or fetch_image() to feed image content back into the model natively
  • Unblocks agentic computer-use and data-visualization scenarios where the image is produced dynamically at runtime (not known ahead of time as a user message)

Expected behavior

When a @tool function returns a Content object with an image media type (or a list[Content] containing image content), the framework should forward that content as a visual input item to the next model call — not as a serialized JSON string — so the model can perceive and reason about the image.

Possible implementation approaches:

  1. A dedicated content type (e.g. "function_result_with_content") that carries structured Content items alongside the text result, allowing provider-specific serialization to emit both function_call_output and input_image items
  2. A new @tool option (e.g. return_content=True) that opts the function into rich-content return handling
  3. Middleware/hook support that intercepts tool results containing image Content and injects them as user-turn vision messages automatically

Alternatives considered

  • Multi-turn workaround: Tool returns image bytes → developer extracts them from the function_result → injects into the next Message as Content.from_data(...). Works but requires boilerplate and breaks the natural tool abstraction.
  • MCP tools: MCP server tools that return ImageContent go through _parse_tool_result_from_mcp which also serializes to JSON text — same limitation.
  • Provider-managed tools (ImageGenTool, code interpreter): These work but only cover built-in provider capabilities, not custom user-defined functions.

Code Sample

Current workaround (manual multi-turn):

@tool(approval_mode="never_require")
async def capture_screenshot(url: Annotated[str, Field(description="URL to screenshot")]) -> str:
    image_bytes = await take_screenshot(url)
    # Must return text; store image out-of-band
    store_image("last_screenshot", image_bytes)
    return "Screenshot captured."

# Then manually inject the image into the next turn:
response = await client.get_response(messages, tools=capture_screenshot)
image_bytes = retrieve_image("last_screenshot")
messages.append(Message(role="user", contents=[
    Content.from_text("Now analyze this screenshot:"),
    Content.from_data(data=image_bytes, media_type="image/png"),
]))
response2 = await client.get_response(messages)

Desired (proposed) API:

@tool(approval_mode="never_require")
async def capture_screenshot(url: Annotated[str, Field(description="URL to screenshot")]) -> Content:
    image_bytes = await take_screenshot(url)
    return Content.from_data(data=image_bytes, media_type="image/png")

# Framework automatically forwards the image content to the next model call
response = await client.get_response(messages, tools=capture_screenshot)
print(response.text)  # Model's description/analysis of the screenshot

Language/SDK

Both

Metadata

Metadata

Assignees

Labels

pythonv1.0Features being tracked for the version 1.0 GA

Type

No type

Projects

Status

In Review

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions