This repository provides the extended writing environments for CoMLRL. It contains the writing-task experiments from [AAAI26] LLM Collaboration with Multi‑Agent Reinforcement Learning.
Install CoMLRL:

```bash
pip install comlrl
# Install PyTorch compatible with your device
```

Or via conda-forge:

```bash
conda install -c conda-forge comlrl
# Install PyTorch compatible with your device
```

Datasets:

- ArXiv Abstract Expansion: `LovelyBuggies/arXiv_abstract` (train[:1000], val[:1000])
- TLDR Summarization: `trl-lib/tldr` (train[:1000], test[:1000])
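To inspect the data, here is a minimal sketch using the Hugging Face `datasets` library. The split slices mirror the sizes listed above; the exact split names on the Hub may differ (e.g. `validation` instead of `val`):

```python
from datasets import load_dataset

# ArXiv abstract expansion: first 1,000 train and validation examples.
arxiv_train = load_dataset("LovelyBuggies/arXiv_abstract", split="train[:1000]")
arxiv_val = load_dataset("LovelyBuggies/arXiv_abstract", split="val[:1000]")

# TLDR summarization: first 1,000 train and test examples.
tldr_train = load_dataset("trl-lib/tldr", split="train[:1000]")
tldr_test = load_dataset("trl-lib/tldr", split="test[:1000]")

print(arxiv_train)
print(tldr_train[0])
```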
Train the single-agent GRPO baseline:

```bash
python LLM_Collab_Writing/train_grpo.py \
    --config LLM_Collab_Writing/configs/grpo_arxiv_config.yaml
```

Train the two-agent MAGRPO setup:

```bash
python LLM_Collab_Writing/train_magrpo.py \
    --config LLM_Collab_Writing/configs/magrpo_tldr_config.yaml
```

Override any configuration value inline with `--override`:

```bash
python LLM_Collab_Writing/train_magrpo.py \
    --config LLM_Collab_Writing/configs/magrpo_arxiv_config.yaml \
    --override model.name='Qwen/Qwen3-7B' magrpo.learning_rate=3e-6
```

Writing runs are strictly single-turn: both training entrypoints enforce `num_turns=1`, and configs that specify any other value raise an error.
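As a rough illustration of that guard, the sketch below is hypothetical: it is not the entrypoints' actual code, and where `num_turns` lives in the YAML is an assumption.

```python
import yaml

def load_single_turn_config(config_path: str) -> dict:
    """Load a YAML config and reject anything other than num_turns == 1."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    # Assumes num_turns sits under the algorithm section; adjust to the real schema.
    num_turns = config.get("magrpo", {}).get("num_turns", 1)
    if num_turns != 1:
        raise ValueError(f"Writing tasks are single-turn; got num_turns={num_turns}")
    return config
```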
Agent roles:

- ArXiv: Agent 1 writes background/motivation; Agent 2 writes methodology/implications.
- TLDR: Agent 1 produces a concise summary; Agent 2 expands with additional details and vocabulary diversity.
- GRPO mode: a single agent emits both paragraphs separated by `[PARAGRAPH_SPLIT]`, which the reward splits internally (see the sketch below).
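For the GRPO mode, the splitting step might look like the sketch below; it is illustrative only, and only the `[PARAGRAPH_SPLIT]` marker comes from the description above:

```python
SPLIT_TOKEN = "[PARAGRAPH_SPLIT]"

def split_paragraphs(completion: str) -> tuple[str, str]:
    """Split a single-agent completion into the two per-role paragraphs."""
    first, sep, second = completion.partition(SPLIT_TOKEN)
    if not sep:
        # Marker missing: keep everything in the first paragraph. How the
        # actual reward handles this case is not specified here.
        return completion.strip(), ""
    return first.strip(), second.strip()
```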
Rewards reuse the level-based metrics from the paper:
- Structural token limits.
- Relative length coordination.
- Vocabulary diversity (unique word ratios).
- Style mix (transition-word coverage + Jaccard overlap).
The same functions also back the evaluation loggers for the baselines.
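The sketch below shows what the diversity and style metrics could compute; the tokenization, transition-word list, and exact formulas in the repo may differ:

```python
import re

# Example transition words; the real list used by the reward is not shown here.
TRANSITIONS = {"however", "moreover", "therefore", "furthermore", "additionally"}

def words(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

def unique_word_ratio(text: str) -> float:
    """Vocabulary diversity: unique words over total words."""
    toks = words(text)
    return len(set(toks)) / len(toks) if toks else 0.0

def transition_coverage(text: str) -> float:
    """Fraction of the transition-word list that appears in the text."""
    return len(set(words(text)) & TRANSITIONS) / len(TRANSITIONS)

def jaccard_overlap(text_a: str, text_b: str) -> float:
    """Word-set overlap between the two agents' paragraphs."""
    a, b = set(words(text_a)), set(words(text_b))
    return len(a & b) / len(a | b) if a | b else 0.0
```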
Evaluation wrappers adapt the original logging utilities to the unified `MAGRPOTrainer` API, yielding aggregated metrics such as token ratios, transition coverage, and gated vs. ungated rewards.

Weights & Biases configs mirror the code-generation project: set `wandb.project`, `wandb.entity`, and `wandb.name` in YAML or via overrides.
