Open
Description
The Omniparser agent loop incorrectly uses a single mapping table (derived from only the most recent screenshot) to convert all historical computer_call messages to function_call format. This ignores UI changes between different screenshots.
Impact
- Inaccurate historical context for LLM
- Wrong element IDs assigned to past interactions
- Potential task execution failures due to incorrect context
- Debugging difficulties
Code References
/cua/libs/python/agent/agent/loops/omniparser.py
Only processes latest screenshot:
# In predict_step() - lines 318-320
last_computer_call_output = get_last_computer_call_output(messages)
if last_computer_call_output:
    image_url = last_computer_call_output.get("output", {}).get("image_url", "")
    # Only processes this single screenshot
Uses single mapping for all messages:
# Lines 340-352
xy2id = {v: k for k, v in id2xy.items()}  # Single mapping from latest screenshot
messages_with_element_ids = []
for i, message in enumerate(messages):
    # ...
    converted = await replace_computer_call_with_function(message, xy2id)  # Same mapping for all
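Possible direction
One way to address this might be to rebuild the mapping each time a screenshot appears in the history, so every computer_call is converted against the screenshot that was on screen when it was issued. The sketch below is only an illustration under stated assumptions: `convert_history`, the `parse` callable, and the simplified message shapes are hypothetical and not taken from omniparser.py, and the real loop is async and goes through `replace_computer_call_with_function`.

```python
from typing import Callable, Dict, List, Tuple

XY = Tuple[int, int]
Id2XY = Dict[int, XY]


def convert_history(messages: List[dict], parse: Callable[[str], Id2XY]) -> List[dict]:
    """Convert each computer_call with the mapping of the screenshot preceding it.

    `parse` is a hypothetical stand-in for the Omniparser step that turns a
    screenshot into an element-id -> (x, y) table.
    """
    converted: List[dict] = []
    xy2id: Dict[XY, int] = {}  # mapping for the screenshot currently "on screen"
    for message in messages:
        kind = message.get("type")
        if kind == "computer_call_output":
            image_url = message.get("output", {}).get("image_url", "")
            if image_url:
                # New screenshot: rebuild the mapping from this point onward.
                id2xy = parse(image_url)
                xy2id = {xy: element_id for element_id, xy in id2xy.items()}
            converted.append(message)
        elif kind == "computer_call":
            action = message.get("action", {})
            xy = (action.get("x"), action.get("y"))
            # Assumed, simplified function_call shape for illustration only.
            converted.append({
                "type": "function_call",
                "name": "computer",
                "arguments": {
                    "action": action.get("type"),
                    "element_id": xy2id.get(xy),
                },
            })
        else:
            converted.append(message)
    return converted


# Two screenshots where the same coordinates belong to different element IDs.
fake_screens = {
    "shot-1": {1: (10, 20), 2: (30, 40)},
    "shot-2": {1: (30, 40), 2: (50, 60)},
}
history = [
    {"type": "computer_call_output", "output": {"image_url": "shot-1"}},
    {"type": "computer_call", "action": {"type": "click", "x": 30, "y": 40}},
    {"type": "computer_call_output", "output": {"image_url": "shot-2"}},
    {"type": "computer_call", "action": {"type": "click", "x": 30, "y": 40}},
]
result = convert_history(history, fake_screens.__getitem__)
```

In this example the two clicks at (30, 40) resolve to element 2 and element 1 respectively, whereas a single mapping from the latest screenshot would label both as element 1. Since parsing a screenshot is likely expensive, a real fix would probably want to cache the per-screenshot mapping rather than re-parse the whole history on every step.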