📄 Paper • 🌐 Website • 🎮 Demo • 📖 Documentation • 👥 About Us
This repo supports:
- ✅ Single-agent (SA) RL training
- ✅ Multi-agent RL training (one role-sharing policy)
- ✅ Multi-agent RL training (role-specialized policies using different LoRA adapters or different LLMs)
- [2025.10] 🚀 GitHub repository open-sourced and publicly available
- [2025.10] 🎉 Paper released! Check out our arXiv preprint
- [2025.10] 🔥 Support for different LoRA adapters per agent role, enabling efficient role-specialized training
- [2025.09] 🌍 Multi-environment support added: Game (Sudoku, Sokoban), Code (APPS, CodeContests), and Math (AIME, OlympiadBench)
- [2025.08] 🤖 Multi-agent framework implementation: support for both a shared single model and role-specific models
- Multi-Level Agent Specialization: Train and specialize agents at any level, from lightweight prompt adjustments, to role-specific LoRA adapters, to full model fine-tuning with reinforcement learning.
- Novel RL Algorithm: Implements agent- and turn-wise GRPO (AT-GRPO) for efficient and stable multi-agent training (see the sketch after this list).
- Built-in Multi-Turn MAS Workflows: Comes with predefined, reproducible benchmarks and environments for a variety of domains:
  - 🎮 Games: Sudoku (4x4), Sokoban (6x6)
  - 📐 Planning: Plan-Path (10x10 grid)
  - 💻 Coding: APPS, CodeContests, LiveCodeBench
  - 🔢 Math: AIME24/25, OlympiadBench
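GRPO estimates advantages by normalizing rewards within a group of sampled rollouts; going by its name, AT-GRPO forms those groups per agent role and per turn. The snippet below is a minimal, hypothetical sketch of that grouping idea only; it is not the repo's implementation, the role names and field names are assumptions, and the paper is the authoritative reference.

```python
# Hypothetical sketch: group-relative advantages computed per (role, turn)
# group, in the spirit of agent- and turn-wise GRPO. Not PettingLLMs code.
from collections import defaultdict
from statistics import mean, pstdev

def group_relative_advantages(samples, eps=1e-6):
    """samples: list of dicts with keys 'role', 'turn', 'reward'."""
    groups = defaultdict(list)
    for i, s in enumerate(samples):
        groups[(s["role"], s["turn"])].append(i)   # group rollouts by agent role and turn index

    advantages = [0.0] * len(samples)
    for indices in groups.values():
        rewards = [samples[i]["reward"] for i in indices]
        mu, sigma = mean(rewards), pstdev(rewards)
        for i in indices:
            # normalize each reward within its own (role, turn) group
            advantages[i] = (samples[i]["reward"] - mu) / (sigma + eps)
    return advantages

# Toy example: two rollouts each for a "code" agent and a "tool" agent at turn 0
rollouts = [
    {"role": "code", "turn": 0, "reward": 1.0},
    {"role": "code", "turn": 0, "reward": 0.0},
    {"role": "tool", "turn": 0, "reward": 0.5},
    {"role": "tool", "turn": 0, "reward": 0.5},
]
print(group_relative_advantages(rollouts))
```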
 
- More Environments: Verilog design, web search, robotics, database query, scientific discovery
- Multi-Modal Support: Vision-language models, audio processing, mixed-modal tasks
- Agentic Framework Integration: AutoGen, LangGraph, CrewAI, and custom framework APIs
| Method | Acc. (%) | Δ |
|---|---|---|
| Single agent | 5.00 | – | 
| Training tool agent in SA, eval in SA | 11.00 | +6.00 | 
| Training code agent in SA, eval in SA | 14.50 | +9.50 | 
| Training in SA, eval in MAS | 16.00 | +11.00 | 
| MAS RL (role specific policies), eval in MAS | 96.00 | +91.00 | 
| w/ Swapped Policies | 6.00 | +1.00 | 
```bash
git clone https://github.com/pettingllms-ai/PettingLLMs.git
cd PettingLLMs
bash setup.bash
```

Prepare datasets for different tasks:
```bash
# Code tasks (APPS, CodeContests, LiveCodeBench)
python scripts/dataprocess/load_code.py

# Math tasks (AIME24/25, OlympiadBench)
python scripts/dataprocess/load_math.py

# Game/Planning tasks (Sokoban, Sudoku)
python scripts/dataprocess/load_sokoban.py
```

Datasets will be saved to `datasets/code/`, `datasets/math/`, and `datasets/sudoku_environments/`.
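As an optional sanity check, you can confirm the processed datasets landed in those directories; the paths below are taken from the text above, not from the prep scripts themselves:

```python
# Optional sanity check: confirm the dataset prep scripts produced output in the
# directories mentioned above (paths assumed from the README text).
from pathlib import Path

for d in ("datasets/code", "datasets/math", "datasets/sudoku_environments"):
    path = Path(d)
    n_files = sum(1 for p in path.rglob("*") if p.is_file()) if path.exists() else 0
    print(f"{d}: {'ok' if n_files else 'missing or empty'} ({n_files} files)")
```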
Example: train a multi-agent system on math tasks

```bash
bash scripts/train/math/math_L1_prompt.sh
```

Other training scripts are available in scripts/train/:
- Code domain: `code_single_policy.sh`, `code_two_policy.sh`
- Planning domain: `plan_path_single.sh`, `plan_path_two_policy.sh`
- Game domain: `sokoban_two_policy.sh`, `sokodu_single.sh`
Example: evaluate a trained model

Edit scripts/evaluate/evaluate.sh to set your model path and config:

```bash
MODEL_PATHS=("/path/to/your/model")
CONFIG_NAME="math_single_policy"
```

Then run:
```bash
bash scripts/evaluate/evaluate.sh
```
PettingLLMs uses a tiered approach to define agent roles, ranging from simple instructions to deep model specialization.
| Level | Role Specialization Method | Description | 
|---|---|---|
| L0 | Shared model | Roles are defined solely through instructions in the prompt. The base model is identical for all agents, offering a flexible but performance-limited baseline. | 
| L1 | Role-specific LoRA | Each role is specialized using a unique, lightweight LoRA adapter. This creates distinct, cost-effective agent "personalities" on top of a shared base model. | 
| L2 | Role-specific Model | The entire model's weights are optimized for a specific role using reinforcement learning. This creates a highly specialized expert agent for maximum performance on complex tasks. | 
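To make the L1 idea concrete, here is a hedged sketch using Hugging Face PEFT of how one shared base model can carry a lightweight LoRA adapter per role and switch between them. This is a generic illustration, not PettingLLMs' actual training code; the model name, role names, and LoRA hyperparameters are placeholders.

```python
# Generic illustration of L1 (role-specific LoRA adapters on a shared base model)
# using Hugging Face PEFT. Placeholder model and roles; not the repo's implementation.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])

# One shared base model, one lightweight adapter per agent role.
model = get_peft_model(base, lora_cfg, adapter_name="planner")
model.add_adapter("coder", lora_cfg)

model.set_adapter("planner")  # route a planner-role turn through its own adapter
# ... generate / update with the planner adapter active ...
model.set_adapter("coder")    # switch adapters when the coder role takes its turn
```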
If you find PettingLLMs useful for your research or projects, please cite:
```bibtex
@article{zhao2025stronger,
  title={Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs},
  author={Zhao, Yujie and Hu, Lanxiang and Wang, Yang and Hou, Minmin and Zhang, Hao and Ding, Ke and Zhao, Jishen},
  journal={arXiv preprint arXiv:2510.11062},
  year={2025}
}
```

This work was primarily conducted by Yujie Zhao during her summer internship at Intel Corporation. We gratefully acknowledge Intel's support and resources that made this research possible.
- VERL (Efficient RL Training for LLMs): for the efficient distributed RL training infrastructure
- RLLM (Reinforcement Learning with Language Models): for foundational RL algorithms for LLMs
Released under the MIT license. See LICENSE for details.

