v1.1.1

@LovelyBuggies LovelyBuggies released this 26 Sep 15:56
· 31 commits to main since this release

Changelogs

Pairs with CoMLRL v1.1.1.

  1. As in v1.0.0, this version uses the return to optimize LLMs.
  2. This version fixes the extremely long training time in cross joint mode and changes the default hyperparameters accordingly.
  3. This version supports early termination with a threshold.

Aligned Joint by Default

With the cross joint mode and no early termination, the Monte Carlo tree grows to K^{TN} samples at the last turn (K generations per agent, T turns, N agents), which makes training very slow. For 2 agents trained on an H100 (an H200 is 3 times faster):

  • joint_mode=align, num_turns=2, num_generations=3: 9 + 3 = 12 samples in the MC tree; takes 10 hours and 76 GB VRAM; expect_return=-1.9
  • joint_mode=cross, num_turns=2, num_generations=3: 81 + 9 = 90 samples in the MC tree; takes 23 hours and 86 GB VRAM; expect_return=-1.2
  • joint_mode=cross, num_turns=3, num_generations=2: 64 + 32 + 8 = 104 samples in the MC tree; estimated to take 60+ hours and 80 GB VRAM
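
The sample counts for the two num_turns=2 configurations above can be reproduced with a short sketch. It is not CoMLRL code: the function name `mc_tree_samples` and the `num_agents` argument are hypothetical, and it assumes that every node branches into K joint actions in align mode and K^N joint actions in cross mode, with no early termination.

```python
def mc_tree_samples(joint_mode: str, num_turns: int,
                    num_generations: int, num_agents: int = 2) -> list[int]:
    """Return the number of samples at each turn of a full Monte Carlo tree.

    Assumption (not taken from the CoMLRL source): in "align" mode the i-th
    generation of every agent is paired, giving K joint actions per node;
    in "cross" mode all combinations are used, giving K**N per node.
    """
    if joint_mode == "align":
        branching = num_generations
    elif joint_mode == "cross":
        branching = num_generations ** num_agents
    else:
        raise ValueError(f"unknown joint_mode: {joint_mode!r}")
    # Turn t (1-indexed) holds branching**t samples when nothing terminates early.
    return [branching ** t for t in range(1, num_turns + 1)]


align = mc_tree_samples("align", num_turns=2, num_generations=3)
cross = mc_tree_samples("cross", num_turns=2, num_generations=3)
print(align, sum(align))  # per-turn counts and total for align
print(cross, sum(cross))  # per-turn counts and total for cross
```

Under these assumptions the last turn in cross mode holds K^{TN} samples, which is why the tree size grows so quickly with num_turns.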

Cross (cyan) can learn faster than align (orange) because its value estimation is more accurate, and VRAM usage is almost the same (75-90 GB), but training takes much longer.
[image: cross (cyan) vs. align (orange)]

Termination

magrpo.termination_threshold is used to incentivize agents to find high-reward solutions quickly, instead of expanding the full Monte Carlo tree.

At each node (branch, turn), compute the mean immediate reward across the sibling joint actions at that node. If the mean exceeds the threshold, that branch stops expanding at this turn; training backpropagates from the truncated subtree. Other branches continue.
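
The rule can be illustrated with a minimal sketch. It is not the CoMLRL implementation: `should_stop`, `count_samples`, and the `reward_fn` callback are hypothetical names, and the tree is reduced to a fixed branching factor so that only the effect of the threshold on tree size is visible.

```python
from statistics import mean
from typing import Callable, Optional, Sequence, Tuple

Path = Tuple[int, ...]


def should_stop(sibling_rewards: Sequence[float],
                threshold: Optional[float]) -> bool:
    """True if the mean immediate reward of the sibling joint actions
    at one node exceeds the threshold (None disables termination)."""
    if threshold is None:
        return False
    return mean(sibling_rewards) > threshold


def count_samples(reward_fn: Callable[[Path], float], num_turns: int,
                  branching: int, threshold: Optional[float],
                  path: Path = ()) -> int:
    """Count the joint actions sampled below `path` with early termination.

    `reward_fn(path)` is a stand-in for the immediate reward of the joint
    action reached by following `path` from the root.
    """
    if len(path) == num_turns:
        return 0
    children = [path + (i,) for i in range(branching)]
    rewards = [reward_fn(child) for child in children]
    total = len(children)
    if should_stop(rewards, threshold):
        return total  # this branch stops expanding at this turn
    for child in children:  # other branches continue
        total += count_samples(reward_fn, num_turns, branching, threshold, child)
    return total


# Toy reward: only joint actions under the first root branch score 1.0.
toy_reward = lambda p: 1.0 if p[0] == 0 else 0.0
full = count_samples(toy_reward, num_turns=3, branching=3, threshold=None)
pruned = count_samples(toy_reward, num_turns=3, branching=3, threshold=0.5)
print(full, pruned)
```

In this toy setting only the subtree under the high-reward branch is truncated, while the low-reward branches are still expanded to the final turn.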