-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
Description
Currently, @tool-decorated functions are text-in / text-out by design. When a tool returns a Content object (e.g. Content.from_data(image_bytes, "image/png")), FunctionTool.parse_result() serializes it to a JSON string via _make_dumpable() and stores it as plain text in Content.from_function_result(result=...). When sent to the API, it becomes {"type": "function_call_output", "output": "<json string>"} — so the model sees a text description of the image, not the actual visual content.
This creates a gap for agentic use-cases where a tool needs to capture a screenshot, render a chart, fetch an image from an external service, or generate a diagram, and the model should then be able to visually analyze or describe that image as part of the same reasoning loop — without requiring the developer to manually wire up a multi-turn workaround.
What problem does it solve?
- Enables vision-in-the-loop workflows without manual multi-turn orchestration
- Allows tools like
capture_screenshot(),render_chart(), orfetch_image()to feed image content back into the model natively - Unblocks agentic computer-use and data-visualization scenarios where the image is produced dynamically at runtime (not known ahead of time as a user message)
Expected behavior
When a @tool function returns a Content object with an image media type (or a list[Content] containing image content), the framework should forward that content as a visual input item to the next model call — not as a serialized JSON string — so the model can perceive and reason about the image.
Possible implementation approaches:
- A dedicated content type (e.g.
"function_result_with_content") that carries structuredContentitems alongside the text result, allowing provider-specific serialization to emit bothfunction_call_outputandinput_imageitems - A new
@tooloption (e.g.return_content=True) that opts the function into rich-content return handling - Middleware/hook support that intercepts tool results containing image
Contentand injects them as user-turn vision messages automatically
Alternatives considered
- Multi-turn workaround: Tool returns image bytes → developer extracts them from the
function_result→ injects into the nextMessageasContent.from_data(...). Works but requires boilerplate and breaks the natural tool abstraction. - MCP tools: MCP server tools that return
ImageContentgo through_parse_tool_result_from_mcpwhich also serializes to JSON text — same limitation. - Provider-managed tools (
ImageGenTool, code interpreter): These work but only cover built-in provider capabilities, not custom user-defined functions.
Code Sample
Current workaround (manual multi-turn):
@tool(approval_mode="never_require")
async def capture_screenshot(url: Annotated[str, Field(description="URL to screenshot")]) -> str:
image_bytes = await take_screenshot(url)
# Must return text; store image out-of-band
store_image("last_screenshot", image_bytes)
return "Screenshot captured."
# Then manually inject the image into the next turn:
response = await client.get_response(messages, tools=capture_screenshot)
image_bytes = retrieve_image("last_screenshot")
messages.append(Message(role="user", contents=[
Content.from_text("Now analyze this screenshot:"),
Content.from_data(data=image_bytes, media_type="image/png"),
]))
response2 = await client.get_response(messages)Desired (proposed) API:
@tool(approval_mode="never_require")
async def capture_screenshot(url: Annotated[str, Field(description="URL to screenshot")]) -> Content:
image_bytes = await take_screenshot(url)
return Content.from_data(data=image_bytes, media_type="image/png")
# Framework automatically forwards the image content to the next model call
response = await client.get_response(messages, tools=capture_screenshot)
print(response.text) # Model's description/analysis of the screenshotLanguage/SDK
Both
Metadata
Metadata
Assignees
Labels
Type
Projects
Status