v1.1.1
Changelogs
Pair with CoMLRL v1.1.1.
- Same as v1.0.0, which enables using returns to optimize LLMs.
- Fixes the extremely long training time in cross joint mode and changes the default hyperparameter accordingly.
- Supports early termination with a threshold.
Aligned Joint by Default
In cross joint mode, each node expands into K^N joint actions (K generations, N agents), so turn t has K^{tN} samples, up to K^{TN} at the final turn T if there is no early termination, which is very slow. With 2 agents trained on an H100 (an H200 is 3 times faster):
- joint_mode=align, num_turns=2, num_generations=3: 9 + 3 = 12 samples in the MC tree; takes 10 hours and 76G VRAM, expect_return=-1.9
- joint_mode=cross, num_turns=2, num_generations=3: 81 + 9 = 90 samples in the MC tree; takes 23 hours and 86G VRAM, expect_return=-1.2
- joint_mode=cross, num_turns=3, num_generations=2: 64 + 32 + 8 = 104 samples in the MC tree; estimated to take 60+ hours and 80G VRAM
Cross (cyan) learns faster than align (orange) thanks to more accurate value estimation, with nearly the same VRAM usage (75-90G), but training takes much longer.
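The sample counts for the two num_turns=2 runs can be reproduced with a short sketch. It is not part of the library: the function name `mc_tree_samples` is hypothetical, and it assumes align expands each node into K joint actions while cross expands each node into K^N, which is inferred from the 12 and 90 totals reported above.

```python
def mc_tree_samples(num_agents, num_turns, num_generations, joint_mode):
    """Return the number of joint samples at each turn of a full MC tree.

    Assumption (inferred from the reported counts, not from the library code):
    - "align": each node expands into K joint actions
    - "cross": each node expands into K**N joint actions
    where K = num_generations and N = num_agents.
    """
    if joint_mode == "align":
        branching = num_generations
    elif joint_mode == "cross":
        branching = num_generations ** num_agents
    else:
        raise ValueError(f"unknown joint_mode: {joint_mode!r}")
    # Turn t holds branching**t samples when no branch terminates early.
    return [branching ** t for t in range(1, num_turns + 1)]


align_counts = mc_tree_samples(2, 2, 3, "align")  # [3, 9]
cross_counts = mc_tree_samples(2, 2, 3, "cross")  # [9, 81]
print(sum(align_counts), sum(cross_counts))       # 12 90
```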

Termination
magrpo.termination_threshold incentivizes agents to find high-reward solutions quickly instead of expanding the full Monte Carlo tree.
At each node (branch, turn), compute the mean immediate reward across the sibling joint actions at that node. If the mean exceeds the threshold, that branch stops expanding at this turn; training backpropagates from the truncated subtree. Other branches continue.
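The rule can be illustrated with a minimal sketch that only counts how many joint actions get sampled. It is not the CoMLRL implementation: `expand`, `toy_rewards`, and the `reward_fn` callback are hypothetical names, and the rewards are made-up toy values.

```python
from statistics import mean


def expand(path, num_turns, threshold, reward_fn):
    """Count the joint actions sampled in the subtree rooted at `path`.

    path: tuple of joint-action indices leading to this node.
    reward_fn(path): hypothetical callback returning the immediate rewards
    of the sibling joint actions sampled at this node.
    """
    rewards = reward_fn(path)
    count = len(rewards)
    turn = len(path) + 1
    if turn >= num_turns:
        return count  # last turn: nothing left to expand
    if threshold is not None and mean(rewards) > threshold:
        return count  # mean exceeds the threshold: this branch stops here
    for i in range(len(rewards)):
        # Other branches are unaffected and keep expanding.
        count += expand(path + (i,), num_turns, threshold, reward_fn)
    return count


def toy_rewards(path):
    # Toy data: only the node reached via joint action 0 scores well.
    return [0.0, 0.0, 0.0] if path == (0,) else [-1.0, -1.0, -1.0]


full = expand((), 3, None, toy_rewards)    # 3 + 9 + 27 = 39
pruned = expand((), 3, -0.5, toy_rewards)  # 3 + 9 + 18 = 30
print(full, pruned)
```

In this toy run, one of the three turn-2 nodes exceeds the threshold, so its 9 turn-3 samples are never generated while the other two branches expand as usual.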