- [October 28, 2025]: 🔥 We are honored to be featured as Hugging Face Daily Paper #1.
- [October 27, 2025]: 📄 Our paper is now available on arXiv and Hugging Face.
- [October 27, 2025]: 🚀 Our codebase is released. You can now deploy DeepAgent with reasoning models such as QwQ and Qwen3, together with your own toolsets.
[Demo video: rapidapi.mp4]
DeepAgent is a reasoning agent with scalable toolsets, capable of tackling general tasks by searching for and using the appropriate tools from over 16,000 RapidAPIs in an end-to-end agentic reasoning process. (Note: because some APIs in ToolBench are no longer available, the API responses in this demo are LLM-simulated to demonstrate normal system behavior.)
[Demo video: alfworld.mp4]
DeepAgent also excels at navigation-based tasks (e.g., web browsing, OS interaction, and embodied AI) by using a versatile set of pluggable actions such as moving, looking, and taking.
[Demo video: deep_research.mp4]
DeepAgent can also serve as a powerful research assistant, equipped with specialized tools for web search, browsing, code execution, visual QA, and file processing.
DeepAgent is an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution within a single, coherent reasoning process. This paradigm shifts away from traditional, predefined workflows (e.g., ReAct's "Reason-Act-Observe" cycle), allowing the agent to maintain a global perspective on the entire task and dynamically discover tools on an as-needed basis.
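To make this concrete, here is a minimal, hypothetical sketch of such a single-stream loop. The `llm.generate` interface, the `<tool_call>` tags, and the tool registry are illustrative assumptions for exposition, not DeepAgent's actual implementation:

```python
import re

def agentic_reasoning_loop(llm, tools, task, max_actions=20):
    """One continuous reasoning stream: the model thinks freely, and whenever
    it emits a tool request, the runtime executes it and feeds the result back."""
    context = f"Task: {task}\n"
    for _ in range(max_actions):
        # Let the model reason until it either finishes or requests a tool.
        output = llm.generate(context, stop=["</tool_call>"])  # assumed interface
        context += output
        match = re.search(r"<tool_call>\s*(\S+)\s*(.*)", output, re.DOTALL)
        if match is None:
            return context  # no tool request: the stream ends with a final answer
        name, args = match.group(1), match.group(2)
        result = tools[name](args)  # execute the dynamically discovered tool
        context += f"</tool_call>\n<result>{result}</result>\n"
    return context
```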
To handle long-horizon interactions and prevent getting stuck in incorrect exploration paths, we introduce an Autonomous Memory Folding mechanism. This allows DeepAgent to "take a breath" by compressing its interaction history into a structured, brain-inspired memory schema, enabling it to reconsider its strategy and proceed efficiently.
Furthermore, we propose ToolPO, an end-to-end reinforcement learning (RL) training method tailored for general tool use, which enhances the agent's proficiency in mastering these complex mechanisms.
We conduct extensive experiments on a wide range of benchmarks:
- (1) General Tool-Use Tasks: We evaluate DeepAgent on ToolBench, API-Bank, TMDB, Spotify, and ToolHop, which feature toolsets scaling from tens to over ten thousand distinct tools.
- (2) Downstream Applications: We test its performance on ALFWorld, WebShop, GAIA, and Humanity's Last Exam (HLE), which require the use of domain-specific toolsets. The overall results show that DeepAgent achieves superior performance across all of these scenarios.
- Unified Agentic Reasoning: DeepAgent departs from rigid, predefined workflows. It operates in a single stream of thought, autonomously reasoning about the task, dynamically discovering necessary tools, and executing actions. This allows the LRM to maintain a global perspective and unlock its full autonomous potential.
- Autonomous Memory Folding & Brain-Inspired Memory: When facing complex problems, DeepAgent can autonomously trigger memory folding. This process consolidates the interaction history into a structured memory, allowing the agent to restart its reasoning with a condensed yet comprehensive understanding of its progress. The memory architecture is brain-inspired (see the first sketch after this list) and consists of:
  - Episodic Memory: A high-level log of key events, decisions, and sub-task completions.
  - Working Memory: Contains the most recent information, including the current sub-goal and near-term plans.
  - Tool Memory: Consolidates tool-related interactions, allowing the agent to learn from experience and refine its strategies.
- End-to-End RL Training with ToolPO: To effectively train the agent, we introduce ToolPO, a policy optimization method (see the second sketch after this list) featuring:
  - An LLM-based Tool Simulator that mimics real-world APIs, ensuring stable and efficient training.
  - Tool-Call Advantage Attribution, which assigns fine-grained credit to correct tool invocation tokens, providing a more precise learning signal.
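Returning to the memory architecture above: as a rough illustration, the three folded-memory components could be represented as a structured record like the following. The field names, the `fold` helper, and the auxiliary-LLM call are our own hypothetical sketch, not the exact schema from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class FoldedMemory:
    """Illustrative container for the three brain-inspired memory components."""
    episodic: list[str] = field(default_factory=list)  # key events, decisions, completed sub-tasks
    working: dict = field(default_factory=dict)        # current sub-goal and near-term plans
    tool: dict = field(default_factory=dict)           # consolidated tool-use experience

def fold(aux_llm, history: str) -> FoldedMemory:
    """Compress the raw interaction history into the structured schema,
    delegating summarization to the auxiliary LLM (hypothetical interface)."""
    summary = aux_llm.summarize(history)  # assumed to return a structured dict
    return FoldedMemory(
        episodic=summary["events"],
        working={"sub_goal": summary["sub_goal"], "plan": summary["plan"]},
        tool=summary["tool_notes"],
    )
```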
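And for intuition about tool-call advantage attribution: the idea can be sketched as adding a localized advantage on exactly the tokens that form a correct tool invocation, on top of the trajectory-level signal. This is a simplified illustration under our own assumptions, not the actual ToolPO objective:

```python
import torch

def attribute_tool_call_advantage(global_adv, tool_call_mask, tool_call_reward):
    """Fine-grained credit assignment (illustrative).

    global_adv:       (seq_len,) advantage broadcast from the final task reward
    tool_call_mask:   (seq_len,) 1.0 for tokens inside a tool call, else 0.0
    tool_call_reward: (seq_len,) reward on correct-invocation tokens, else 0.0
    """
    # Tokens inside correct tool calls receive an extra, localized advantage,
    # so the policy gets a more precise signal than the final reward alone.
    return global_adv + tool_call_mask * tool_call_reward

# Toy usage: a 6-token rollout where tokens 2-4 form a correct tool call.
adv = attribute_tool_call_advantage(
    global_adv=torch.full((6,), 0.5),
    tool_call_mask=torch.tensor([0., 0., 1., 1., 1., 0.]),
    tool_call_reward=torch.tensor([0., 0., 0.2, 0.2, 0.2, 0.]),
)
```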
```bash
# Create conda environment
conda create -n deepagent python=3.10
conda activate deepagent

# Install requirements
cd DeepAgent-main
pip install -r requirements.txt
```

The benchmarks we utilize are categorized into several types:
- General Tool-Use Benchmarks:
  - ToolBench: Features 16,000+ real-world RapidAPIs requiring multi-step, multi-tool reasoning.
  - API-Bank: Evaluates planning, retrieval, and calling with 73 APIs across 314 human-annotated dialogues.
  - RestBench: Simulates REST API applications with TMDB (54 tools) and Spotify (40 tools) scenarios.
  - ToolHop: Tests multi-hop reasoning across 3,912 locally executable tools requiring 3-7 sequential calls.
- Embodied Agent Benchmarks:
  - ALFWorld: Text-based embodied AI environment where agents complete household tasks using 9 basic actions.
- Web Navigation Benchmarks:
  - WebShop: Online shopping simulation requiring agents to search and navigate products to fulfill user requirements.
- Deep Research Benchmarks:
  - GAIA: Complex information-seeking tasks requiring web search, browsing, VQA, code execution, and file processing.
  - Humanity's Last Exam (HLE): Extremely challenging reasoning problems testing advanced capabilities with code, search, and VQA tools. For efficient testing, we sample 500 questions from the full set of 2,500.
All the pre-processed data can be found in the ./data/ directory, except for ToolBench, which must be downloaded from ToolBench's official repository because it is too large to include in ours.
Before running DeepAgent, ensure your reasoning model and auxiliary model are served using vLLM. DeepAgent is designed to work with powerful reasoning models as the main agent and can use an auxiliary model for tasks like memory generation and tool selection. For more details, please refer to [vLLM](https://github.com/vllm-project/vllm).
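For example, once a model is served through vLLM's OpenAI-compatible server, you can verify the endpoint with a quick request. The port, model name, and `empty` API key below are placeholders matching the configuration described later; adjust them to your own deployment:

```python
from openai import OpenAI

# Point the OpenAI client at the vLLM OpenAI-compatible endpoint.
client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key="empty")

response = client.chat.completions.create(
    model="QwQ-32B",  # must match the name the model was served under
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```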
For the main reasoning model, we recommend using the following models. Performance improves from top to bottom, but computational cost also increases accordingly. You can choose a cost-effective model based on your needs:
| Model | Size | Type | Link |
|---|---|---|---|
| Qwen3-4B-Thinking | 4B | Thinking | 🤗 HuggingFace |
| Qwen3-8B | 8B | Hybrid | 🤗 HuggingFace |
| Qwen3-30B-A3B-Thinking | 30B | Thinking | 🤗 HuggingFace |
| QwQ-32B | 32B | Thinking | 🤗 HuggingFace |
| Qwen3-235B-A22B-Thinking | 235B | Thinking | 🤗 HuggingFace |
For the auxiliary model, we recommend the Qwen2.5-Instruct or Qwen3-Instruct series at a parameter count similar to the main reasoning model's, but without thinking capabilities, for faster inference.
All configurations live in ./config/base_config.yaml, including API keys, service URLs, and paths. You need to modify them to match your actual setup.
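As a quick sanity check that your edits parse, you can load the file before launching anything. The snippet below only assumes the file is valid YAML; the key names are examples drawn from the sections that follow, and the flat layout is an assumption about the file's structure:

```python
import yaml

with open("./config/base_config.yaml") as f:
    cfg = yaml.safe_load(f)

# Example keys documented below; adjust to the file's actual structure.
for key in ("model_name", "base_url", "aux_model_name", "aux_base_url"):
    print(f"{key}: {cfg.get(key, '<missing>')}")
```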
Choose your task and configure the corresponding APIs:
- ToolBench (RapidAPI):
  - `toolbench_api`: RapidAPI key used in ToolBench. You can get it from ToolBench's official repository.
  - `toolbench_service_url`: ToolBench service URL. Keep the default to use ToolBench's official service.
- Deep Research:
- RestBench (TMDB & Spotify):
- WebShop:
  - `webshop_service_url`: WebShop service URL. You can create a new environment and serve it locally following the instructions in WebShop's official repository.
Configure your model endpoints in the config file:
- Main Reasoning LLM:
  - `model_name`: The name of your served reasoning model (e.g., `QwQ-32B`).
  - `base_url`: API endpoint for your reasoning model service (e.g., `http://0.0.0.0:8080/v1`).
  - `api_key`: API key for accessing the reasoning model service. Set to `empty` if you are using vLLM.
  - `tokenizer_path`: Local path to the tokenizer files for the reasoning model.
- Auxiliary LLM:
  - `aux_model_name`: The name of your served auxiliary model (e.g., `Qwen2.5-32B-Instruct`).
  - `aux_base_url`: API endpoint for the auxiliary model service.
  - `aux_api_key`: API key for the auxiliary model. Set to `empty` if you are using vLLM.
  - `aux_tokenizer_path`: Local path to the tokenizer files for the auxiliary model.
- VQA Model (for GAIA & HLE with image input):
  - `vqa_model_name`: The name of your served vision-language model (e.g., `Qwen2.5-VL-32B-Instruct`), served with vLLM like the other models.
  - `vqa_base_url`: API endpoint for the VQA model service.
  - `vqa_api_key`: API key for the VQA model. Set to `empty` if you are using vLLM.
- Tool Retriever:
  - `tool_retriever_model_path`: Local path to the tool retriever model (e.g., `./models/bge-large-en-v1.5`).
  - `tool_retriever_api_base`: API endpoint for the tool retriever service. Serving it in advance avoids reloading the retriever model every time you run the system. You can deploy it with the following command:

```bash
python src/run_tool_search_server.py \
    --base_config_path ./config/base_config.yaml \
    --datasets toolbench,toolhop,tmdb,spotify,api_bank \
    --host 0.0.0.0 \
    --port 8001
```
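Under the hood, this is standard dense retrieval over tool descriptions. A minimal sketch with sentence-transformers, using made-up tool descriptions rather than DeepAgent's actual indexing code:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the same embedding model the retriever is configured with.
model = SentenceTransformer("./models/bge-large-en-v1.5")

# Hypothetical tool descriptions standing in for the real toolset index.
tool_docs = [
    "Search for flights by date, origin, and destination.",
    "Look up the current weather for a given city.",
    "Convert an amount between two currencies.",
]
doc_emb = model.encode(tool_docs, normalize_embeddings=True)

query = "find a flight to Tokyo next week"
query_emb = model.encode(query, normalize_embeddings=True)

scores = doc_emb @ query_emb      # cosine similarity (embeddings are normalized)
top_k = np.argsort(-scores)[:2]   # indices of the best-matching tools
print([tool_docs[i] for i in top_k])
```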
To run on a benchmark dataset with tool search enabled, use the following command:
```bash
python src/run_deep_agent.py \
    --config_path ./config/base_config.yaml \
    --dataset_name toolbench \
    --enable_tool_search \
    --eval
```

To run on a benchmark dataset in closed-set mode, use the following command:

```bash
python src/run_deep_agent.py \
    --config_path ./config/base_config.yaml \
    --dataset_name gaia \
    --eval
```

Parameters Explanation:

- `--config_path`: Path to the main configuration file.
- `--dataset_name`: Name of the dataset to use (e.g., `toolbench`, `api_bank`, `tmdb`, `spotify`, `toolhop`, `gaia`, `hle`, `alfworld`, `webshop`).
- `--subset_num`: Number of samples to run from the dataset.
- `--concurrent_limit`: Maximum number of concurrent requests. Default is 32.
- `--enable_tool_search`: Allow the agent to search for tools. If disabled, it will only use the tools provided for the task (closed-set).
- `--enable_thought_folding`: Allow the agent to use the thought-folding mechanism.
- `--max_action_limit`: Maximum number of actions (tool search and tool call) per question.
- `--max_fold_limit`: Maximum number of thought folds per question.
- `--top_k`: Maximum number of tools to return per search.
- `--eval`: Run evaluation on the results after generation.
Our inference script automatically saves the model's inputs and outputs for evaluation. To run the evaluation, pass the `--eval` flag when running ./src/run_deep_agent.py. The evaluation scripts for each dataset are located in ./src/evaluate/.
Feel free to try our deep research agent series:
DeepAgent: A General Reasoning Agent with Scalable Toolsets (New!)
TLDR: An end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution with brain-inspired memory folding mechanism.
Agentic Entropy-Balanced Policy Optimization
TLDR: An agentic RL algorithm designed to balance entropy in both the rollout and policy update phases.
Agentic Reinforced Policy Optimization
TLDR: An agentic RL algorithm that encourages the policy model to adaptively branch its sampling during high-entropy tool-call rounds.
Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search
TLDR: This framework hierarchically decouples deep search into strategic planning and domain-specific execution by specialized agents.
Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning
TLDR: An end-to-end TIR post-training framework that empowers LLMs to autonomously interact with multi-tool environments through a Self-Critic RL design.
WebThinker: Empowering Large Reasoning Models with Deep Research Capability (NeurIPS 2025)
TLDR: A deep research agent that empowers large reasoning models with autonomous search, web browsing, and research report drafting capabilities.
Search-o1: Agentic Search-Enhanced Large Reasoning Models (EMNLP 2025)
TLDR: An agentic search-enhanced framework that integrates autonomous knowledge retrieval with large reasoning models through Agentic RAG and reasoning-in-documents modules.
If you find this work helpful, please cite our paper:
```bibtex
@misc{deepagent,
      title={DeepAgent: A General Reasoning Agent with Scalable Toolsets},
      author={Xiaoxi Li and Wenxiang Jiao and Jiarui Jin and Guanting Dong and Jiajie Jin and Yinuo Wang and Hao Wang and Yutao Zhu and Ji-Rong Wen and Yuan Lu and Zhicheng Dou},
      year={2025},
      eprint={2510.21618},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.21618},
}
```

This project is released under the MIT License.
For any questions or feedback, please reach out to us at xiaoxi_li@ruc.edu.cn.


