This repository provides the extended environments for CoMLRL and contains the code-generation experiments from [AAAI26] *LLM Collaboration with Multi-Agent Reinforcement Learning*.
Install CoMLRL:

```bash
pip install comlrl
# Install PyTorch compatible with your device
```

Or via conda-forge:

```bash
conda install -c conda-forge comlrl
# Install PyTorch compatible with your device
```

The experiments cover three datasets:

- MBPP: 427 problems on split `sanitized`
- HumanEval: 164 problems on split `test`
- CoopHumanEval: 82 problems on split `test`
Train GRPO on HumanEval:

```bash
python LLM_Collab_Code_Generation/train_grpo.py \
    --config LLM_Collab_Code_Generation/configs/grpo_he_config.yaml
```

Train MAGRPO on CoopHumanEval:

```bash
python LLM_Collab_Code_Generation/train_magrpo.py \
    --config LLM_Collab_Code_Generation/configs/magrpo_che_config.yaml
```

Override any configuration value inline with `--override`:

```bash
python LLM_Collab_Code_Generation/train_magrpo.py \
    --config LLM_Collab_Code_Generation/configs/magrpo_he_config.yaml \
    --override model.name='bigcode/starcoder2-3b' magrpo.num_turns=1
```

`magrpo.joint_mode` controls how each agent's `magrpo.num_generations = G` samples combine into joint actions. `align` (the default) pairs same-index samples across agents, while `cross` forms the Cartesian product, yielding every combination of the agents' samples.
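A minimal sketch of the two joint modes, assuming each agent contributes a list of G sampled responses (the helper below is illustrative, not the trainer's actual API):

```python
from itertools import product

def joint_actions(agent_samples, mode="align"):
    """Combine per-agent samples into joint actions.

    agent_samples: N lists (one per agent), each holding G responses.
    "align" pairs same-index samples; "cross" takes the Cartesian product.
    """
    if mode == "align":
        return list(zip(*agent_samples))
    if mode == "cross":
        return list(product(*agent_samples))
    raise ValueError(f"unknown joint mode: {mode}")

# Two agents, G = 2:
# align -> [(a1, b1), (a2, b2)]
# cross -> [(a1, b1), (a1, b2), (a2, b1), (a2, b2)]
```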
Advantages use the mean return of the sibling set at each node as the baseline, matching the Dr. GRPO update. No standard-deviation normalization, importance sampling ratios, or epsilon clipping are applied because training is strictly on-policy.
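A minimal sketch of that advantage computation, assuming `returns` holds the return of every joint rollout expanded from the same node (illustrative, not the trainer's exact code):

```python
import torch

def sibling_advantages(returns: torch.Tensor) -> torch.Tensor:
    """Dr. GRPO-style advantages: subtract the sibling-set mean return.

    No std normalization, importance-sampling ratios, or clipping,
    since training is strictly on-policy.
    """
    return returns - returns.mean()
```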
`magrpo.num_turns` controls tree depth and `magrpo.num_generations` controls the branching factor. Under `align`, each node expands into G children; under `cross`, each node expands into one child per Cartesian combination of the agents' samples.
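As a rough illustration of how depth and branching interact (assuming a fully expanded tree with no early termination):

```python
def leaf_rollouts(num_generations: int, num_agents: int, num_turns: int,
                  joint_mode: str = "align") -> int:
    """Count leaf rollouts in a fully expanded tree (no early termination)."""
    branching = num_generations if joint_mode == "align" else num_generations ** num_agents
    return branching ** num_turns

# e.g. G=4, 2 agents, 2 turns: align -> 16 leaves, cross -> 256 leaves
```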
`magrpo.termination_threshold` compares the mean immediate reward of the current sibling set against the threshold. When the threshold is exceeded, the branch stops expanding at that turn and the trainer backpropagates from the truncated subtree while other branches continue exploring.
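A sketch of that check with hypothetical names (the trainer's internals may differ):

```python
def branch_terminates(sibling_rewards: list[float], threshold: float) -> bool:
    """Stop expanding a branch when the sibling set's mean immediate
    reward exceeds the termination threshold."""
    return sum(sibling_rewards) / len(sibling_rewards) > threshold
```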
Each agent receives its own full history at every turn: all of its past prompts and responses plus the new external prompt/context. No cross-agent transcripts are injected; coordination happens only through the shared reward.
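A minimal sketch of how such a per-agent history could be assembled into a chat-style prompt (the message structure is assumed for illustration):

```python
def build_agent_prompt(own_history, new_external_prompt):
    """Concatenate this agent's own past turns with the new external prompt.

    own_history: list of (prompt, response) pairs from this agent only;
    no other agent's transcript is included.
    """
    messages = []
    for past_prompt, past_response in own_history:
        messages.append({"role": "user", "content": past_prompt})
        messages.append({"role": "assistant", "content": past_response})
    messages.append({"role": "user", "content": new_external_prompt})
    return messages
```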
`external.mode` shapes the prompt that seeds each turn:

- `level_feedback` (default) appends syntax/test diagnostics and asks for a code-only revision. `external.sandbox_slice` controls how many eval asserts to surface: 1 by default, 0/None/'all' to show every test, negatives to index from the end (see the sketch after this list).
- `expert_edits` calls an external LLM (configured via `external.expert_edits_model`, default DeepSeek-Coder) to produce compact edit suggestions before asking for a revision.
- `plain` simply restates the prior attempt and asks for a concise "revise, code-only" answer.
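A sketch of the `sandbox_slice` selection described above, assuming the eval asserts are already extracted into a list (the slicing interpretation of negative values is an assumption):

```python
def select_asserts(asserts: list[str], sandbox_slice=1) -> list[str]:
    """Choose which eval asserts to surface in level_feedback prompts.

    1 (default) -> the first assert; 0, None, or 'all' -> every assert;
    negative values -> asserts taken from the end (assumed interpretation).
    """
    if sandbox_slice in (0, None, "all"):
        return asserts
    if sandbox_slice < 0:
        return asserts[sandbox_slice:]
    return asserts[:sandbox_slice]
```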
`rewards/code_rewards.py` implements the aux/main collaboration reward in three levels (a simplified sketch follows the list):
- Function definitions: +0.4 when the auxiliary function is well defined with a return value, +0.6 when the entry-point function is defined correctly. Missing the main definition halts scoring.
- Syntax: Both functions are concatenated with any imports extracted from the prompt; a clean syntax check yields +0.5 and allows execution.
- Execution: Prompt-derived tests (10 s timeout per assert, up to three timeouts) run inside a sandbox. Passing assertions earn up to +1.0, with a +0.5 bonus when the main function uses the aux function, +1.0 when the main is not a thin wrapper, and −0.5 if the aux return value is ignored. The same rewards also back evaluation during logging.
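A simplified sketch of how those three levels could compose into a scalar reward, using the point values from the list (the helper and its boolean inputs are hypothetical; the actual logic lives in `rewards/code_rewards.py`):

```python
def collaboration_reward(aux_defined_with_return: bool,
                         main_defined: bool,
                         syntax_ok: bool,
                         pass_fraction: float,
                         main_calls_aux: bool,
                         main_not_thin_wrapper: bool,
                         aux_return_used: bool) -> float:
    """Compose the three reward levels from precomputed checks."""
    reward = 0.4 if aux_defined_with_return else 0.0
    if not main_defined:           # missing main definition halts scoring
        return reward
    reward += 0.6
    if not syntax_ok:              # only syntactically clean code is executed
        return reward
    reward += 0.5
    reward += 1.0 * pass_fraction  # up to +1.0 for passing assertions
    if main_calls_aux:
        reward += 0.5
    if main_not_thin_wrapper:
        reward += 1.0
    if not aux_return_used:
        reward -= 0.5
    return reward
```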
`loggers/` adapts the shared MAGRPOTrainer utilities to emit pass rates, syntax success, reward decomposition, termination statistics, and aux-usage diagnostics. Configure Weights & Biases by setting `wandb.project`, `wandb.entity`, and `wandb.name` in YAML or through overrides. `output.save_model` stays `false` by default to avoid storing large checkpoints, while `output.verbose` toggles the detailed per-episode traces used when debugging on a cluster.
