A task-oriented, OS-based benchmark for vision-based desktop GUI-navigation agents.
- May 6, 2025: Our paper is published on arXiv.
- Install Docker – you need it unless you use your own agent runner (see below).
Clone the repository:

```bash
git clone https://github.com/agentsea/osuniverse.git
cd osuniverse
```

Export keys:

```bash
export GEMINI_API_KEY=your_api_key   # for the validator
export OPENAI_API_KEY=your_api_key   # for the agent based on GPT-4o
```

Install dependencies:

```bash
cd agents/react
poetry install
cd ../..
poetry install  # to lock the versions of the dependencies
```

Check the list of tests to run:
```bash
python benchmark.py --agent-yaml agents/react/agent.yaml --agent-model gpt-4o --results results/run-001 --dry-run
```

Specify additional parameters (categories, levels, max steps):

```bash
python benchmark.py --agent-yaml agents/react/agent.yaml --agent-model gpt-4o --results results/run-001 --categories desktop,terminal --levels paper,wood --max-steps 10,30 --dry-run
```

Run the benchmark with the specified parameters and 4 workers:

```bash
python benchmark.py --agent-yaml agents/react/agent.yaml --agent-model gpt-4o --results-dir results/run-001 --categories desktop,terminal --levels paper,wood --max-steps 10,30 --runners 4
```
View the results (you can run this command as soon as the benchmark is started and observe the progress as the tests are being run):

```bash
streamlit run viewer.py -- --dir ./results/run-001
```

- To continue the benchmark from a previous run, run the same command: only tests that were not run before will be run.
- To rerun the failed tests, use the `--mode rerun-failed` flag.
- To validate the test cases without running them (given that the test cases were already run), use the `--mode validate-only` flag.
Defaults:

- `--agent-yaml` is `agents/react/agent.yaml`
- `--agent-model` is `gpt-4o`
- `--categories` is none (all categories); possible values: `desktop`, `browser`, `gym`, `terminal`, `libreoffice_calc`, `libreoffice_writer`, `multiapp`; should be separated by commas.
- `--levels` is none (all levels); possible values: `paper`, `wood`, `bronze`, `silver`, `gold`; should be separated by commas.
- `--max-steps` is none; the default values in the config are 5, 25, 50, 100, 200; if there are several values separated by commas, they are applied to levels from paper to gold, with the last value applied to all remaining levels; for example, `--max-steps 10,20,30` will set 10 steps for paper, 20 for wood, and 30 for bronze, silver, and gold (see the sketch after this list).
- `--testcases-dir` is `testcases`
- `--results-dir` is `results`
- `--runners` is 1
- `--dry-run` is false (use it to see the list of test cases without running them).
- `--mode` is `run-all` (possible values: `run-all`, `rerun-failed`, `validate-only`). `run-all` runs all the test cases that have not been run yet. `rerun-failed` reruns the test cases that have been run and failed, and runs the test cases that have not been run yet. `validate-only` only validates the test cases that have already been run.
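The `--max-steps` mapping above can be summarized with a minimal sketch (a hypothetical helper for illustration only, not code from the repository):

```python
# Hypothetical illustration of the --max-steps mapping described above;
# the actual logic lives in the benchmark's own configuration handling.
LEVELS = ["paper", "wood", "bronze", "silver", "gold"]

def map_max_steps(values: list[int]) -> dict[str, int]:
    """Assign a step budget to each level; the last value covers all remaining levels."""
    return {
        level: values[i] if i < len(values) else values[-1]
        for i, level in enumerate(LEVELS)
    }

# Example: --max-steps 10,20,30
print(map_max_steps([10, 20, 30]))
# {'paper': 10, 'wood': 20, 'bronze': 30, 'silver': 30, 'gold': 30}
```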
The following agent configurations are supported:

- `--agent-yaml agents/cua/agent.yaml`:
    - the model is locked to `computer-use-preview-2025-03-11` with `OPENAI_API_KEY`
    - Note: This agent is tailored for the Computer Use Preview model and is not compatible with other models.
- `--agent-yaml agents/claude_computer_use/agent.yaml`:
    - the model is locked to `claude-3-5-sonnet-20241022`; use `--agent-model claude-3-5-sonnet-20241022` with `ANTHROPIC_API_KEY`
    - Note: This agent is tailored for the Claude 3.5 Sonnet model and is not compatible with other models.
- `--agent-yaml agents/qwen/agent.yaml`:
    - `--agent-model qwen2.5-vl-72b-instruct` with `DASHSCOPE_API_KEY`
    - `--agent-model qwen2.5-vl-7b-instruct` with `DASHSCOPE_API_KEY`
    - Note: This agent is tailored for the Qwen 2.5 VL model family and is not compatible with other models.
- `--agent-yaml agents/react/agent.yaml`:
    - `--agent-model gpt-4o-2024-11-20` with `OPENAI_API_KEY`
    - `--agent-model claude-3-5-sonnet-20241022` with `ANTHROPIC_API_KEY`
    - `--agent-model gemini/gemini-1.5-pro-002` with `GEMINI_API_KEY`
    - Note: This agent works with any major VLM through the LiteLLM API. Feel free to use it with other models, as long as LiteLLM is compatible with them and they are capable of generating valid JSON (see the sketch below).
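As a quick sanity check before pointing the ReAct agent at a new model, a sketch like the one below (not part of the benchmark; the prompt and helper name are placeholders) can confirm that the model is reachable through LiteLLM and returns parseable JSON:

```python
# Hypothetical pre-flight check, not part of the benchmark code:
# verify that a LiteLLM-compatible model can return parseable JSON.
import json

import litellm

def returns_valid_json(model: str) -> bool:
    response = litellm.completion(
        model=model,  # any model id that LiteLLM knows how to route
        messages=[{
            "role": "user",
            "content": 'Reply with exactly one JSON object: {"status": "ok"}',
        }],
    )
    content = response.choices[0].message.content
    try:
        json.loads(content)
        return True
    except (json.JSONDecodeError, TypeError):
        return False

# Example (requires the corresponding API key to be exported):
print(returns_valid_json("gpt-4o"))
```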
Use the Viewer to see the results of the benchmark and, optionally, to run the human evaluation:

```bash
streamlit run viewer.py -- --dir ./results/v001
```

`helper.py` contains a set of helper functions that you can use alongside the benchmark:

- `python helper.py distribution` will show the distribution of the tests across the categories and levels.
- `python helper.py cleanup` will clean up stuck Surfkit devices and trackers (this is useful if you run the benchmark with several runners and some of them get stuck).
You can run the benchmark on some Surfkit-based agents; see the list of compatible agents below. The agents themselves should be compatible with `surfkit==0.1.396`.
- Clone the agent's repo alongside the `osuniverse` repo.
- Run `poetry install` in the agent's directory.
- Export the API keys the agent needs (e.g. `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GEMINI_API_KEY`, etc.).
- Run `poetry install` in the `osuniverse` directory.
- Run `export GEMINI_API_KEY=your_api_key` in the `osuniverse` directory.
- If you want to allow passing a model to your agent, you should read it inside the agent via `os.environ['SURFKIT_AGENT_MODEL']`. It's up to you to define the logic of what to do with this parameter. Also, you may want to pass the `--agent-model-base-url` parameter to the benchmark; inside the agent, you can read it via `os.environ['SURFKIT_AGENT_MODEL_BASE_URL']` (see the sketch after this list).
- Run `python benchmark.py --agent-yaml <path_to_agent_yaml> --agent-model <agent_model> --categories desktop --levels paper --results ./results/<your_run_name> --max-steps 10` in the `osuniverse` directory.
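Inside the agent, reading these values might look like the sketch below; only the environment variable names come from the benchmark, while the fallback default and the way the values are used are up to your agent:

```python
# Minimal sketch of reading the model parameters inside a custom Surfkit-based agent.
# Only the environment variable names come from the benchmark; the fallback value
# and the way the values are used here are illustrative assumptions.
import os

model = os.environ.get("SURFKIT_AGENT_MODEL", "gpt-4o")    # hypothetical default
base_url = os.environ.get("SURFKIT_AGENT_MODEL_BASE_URL")  # may be unset

if base_url:
    # e.g. point an OpenAI-compatible client at a self-hosted endpoint
    print(f"Using model {model} served at {base_url}")
else:
    print(f"Using model {model} via the provider's default endpoint")
```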
Both Surfkit and Taskara are being developed rapidly. Because of this, we have to pin the versions of these two core dependencies, the specific Docker images we use, and the specific commits or branches of the agents. Therefore:
- Surfkit is pinned to `0.1.396` in the `pyproject.toml` file.
- Taskara is pinned to `0.1.228` in the `pyproject.toml` file.
- AgentDesk is pinned to `0.2.120` in the `pyproject.toml` file.
- The Docker image for the Desktop is pinned to `us-docker.pkg.dev/agentsea-dev/agentd/desktop-webtop:8ed7f4e` in each test case.
- The Taskara Docker image is pinned to `us-central1-docker.pkg.dev/agentsea-dev/taskara/api:884e381` in `runners/surfkit_agent_runner.py`.
You are welcome to extend the benchmark with new test cases. See the `testcases` directory for examples. Also, please refer to our paper for more information about the test case classification.
You can extend the benchmark by writing your own agents. See the Writing Custom Surfkit-based Agents guide for more information.
You can also extend the benchmark by writing your own runner. See the Writing Custom Runners guide for more information.
Occasionally, agents get stuck. If you run the benchmark with several runners, some of them may get stuck due to limited resources or conflicting ports. If this happens, we recommend stopping the benchmark (Ctrl+C) and running it again: only the tests that were not run before will be run.
If you use this workaround, please don't forget to clean up the stuck Surfkit devices and trackers by running `python helper.py cleanup` or removing them manually in the Docker Desktop application.
If you use OSUniverse in your research, please cite our paper:
```bibtex
@misc{davydova2025osuniversebenchmarkmultimodalguinavigation,
      title={OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents},
      author={Mariya Davydova and Daniel Jeffries and Patrick Barker and Arturo Márquez Flores and Sinéad Ryan},
      year={2025},
      eprint={2505.03570},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.03570},
}
```