Releases: OpenMLRL/LLM_Collab_Code_Generation
v1.1.7
v1.1.6
v1.1.5
Ryan fixed the joint mode for the single-agent case, and I updated the README.
v1.1.4
v1.1.3
This version should work with CoMLRL v1.1.3.
Changelog
- Remove the hard-coded per-node code-level logging, since we don't expect users to inspect those details during training and it incurred huge VRAM usage.
- Change the default hyperparameter values to follow the Dr. GRPO style, set the learning rate to 2e-5, and drop the bandit external mode, since it is equivalent to MAGRPO in the single-turn setting.
- Clean up the code formatting and group closely related parameters together.
- Add MBPP dataset.
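For reference, MBPP is available on the Hugging Face Hub; a minimal sketch of loading it with `datasets` (the split and field names follow the public `mbpp` dataset card, not necessarily how this repo wires it in):

```python
from datasets import load_dataset

# Default "full" config of MBPP; splits are train / test / validation / prompt.
mbpp = load_dataset("mbpp")
print(mbpp["train"][0]["text"])       # natural-language task description
print(mbpp["train"][0]["test_list"])  # unit tests usable for reward/eval
```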
Plain fails 2; expert fails 1; level feedback has not failed yet.
v1.1.2
Changelog
This version adds normalized advantage and epsilon-clip support, and pairs with CoMLRL v1.1.2.
Normalizing the advantage can make convergence more stable (though not significantly) at the cost of slightly higher VRAM use, and clipping too strictly hurts; a minimal sketch of both is given after the legend below.
- Gray - Repeated Bandit
- Light Green - Plain
- Red - Level Feedback
- Blue - Expert Edits
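A minimal sketch of the two options, assuming the usual GRPO-style group normalization and PPO-style epsilon clipping (CoMLRL's actual implementation may differ):

```python
import torch

def normalized_advantages(group_returns: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize returns within a group of sampled generations (GRPO-style)."""
    return (group_returns - group_returns.mean()) / (group_returns.std() + eps)

def clipped_objective(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """PPO-style clipped surrogate; a very small clip_eps ("clipping too
    strictly") throws away most of the gradient signal."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()
```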
v1.1.1
Changelog
Pairs with CoMLRL v1.1.1.
- Same as v1.0.0 in that it enables using returns to optimize the LLMs.
- This version fixes the extremely long training time in cross-joint mode and changes the default hyperparameters accordingly.
- This version supports early termination with a threshold.
Aligned Joint by Default
If using cross joint mode, there would be up to K^{TN} samples at the final turn (without early termination), which is very slow. With 2 agents training on an H100 (an H200 is 3 times faster):
- joint_mode=align, num_turns=2, num_generations=3: 9 + 3 = 12 samples in the MC tree; takes 10 hours and 76 GB VRAM, expect_return=-1.9
- joint_mode=cross, num_turns=2, num_generations=3: 81 + 9 = 90 samples in the MC tree; takes 23 hours and 86 GB VRAM, expect_return=-1.2
- joint_mode=cross, num_turns=3, num_generations=2: 64 + 32 + 8 = 104 samples in the MC tree; estimated to take 60+ hours and 80 GB VRAM
Cross (cyan) can learn faster than align (orange) thanks to more accurate value estimation, with VRAM use roughly the same (75-90 GB), but training takes much longer.
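A minimal sketch of how the two joint modes could differ in joint actions per expansion, consistent with the num_turns=2 counts above (align: K joint samples per node, cross: K^N); this is an illustration, not CoMLRL's actual code:

```python
from itertools import product

def joint_actions(per_agent_generations, joint_mode="align"):
    """per_agent_generations: N lists (one per agent), each with K candidates."""
    if joint_mode == "align":
        # Pair the i-th generation of every agent -> K joint actions.
        return list(zip(*per_agent_generations))
    if joint_mode == "cross":
        # Cartesian product across agents -> K^N joint actions.
        return list(product(*per_agent_generations))
    raise ValueError(f"unknown joint_mode: {joint_mode}")

# 2 agents, num_generations=3: align -> 3 joint actions, cross -> 9.
gens = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
print(len(joint_actions(gens, "align")), len(joint_actions(gens, "cross")))  # 3 9
```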

Termination
magrpo.termination_threshold is used to incentivize agents to find high-reward solutions quickly, instead of expanding the full Monte Carlo tree.
At each node (branch, turn), compute the mean immediate reward across the sibling joint actions at that node. If the mean exceeds the threshold, that branch stops expanding at this turn; training backpropagates from the truncated subtree. Other branches continue.
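A minimal sketch of that rule (not CoMLRL's actual code; the node structure and the sample_children hook are hypothetical, and termination_threshold plays the role of magrpo.termination_threshold):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    turn: int
    immediate_reward: float = 0.0
    children: list = field(default_factory=list)

def expand(node, sample_children, termination_threshold, max_turns):
    """Expand one branch of the Monte Carlo tree, stopping early when the mean
    immediate reward of the sibling joint actions exceeds the threshold."""
    if node.turn >= max_turns:
        return
    children = sample_children(node)  # sample the sibling joint actions at this node
    node.children = children
    mean_reward = sum(c.immediate_reward for c in children) / len(children)
    if mean_reward > termination_threshold:
        return  # truncate this subtree; returns are backpropagated from here
    for child in children:
        expand(child, sample_children, termination_threshold, max_turns)
```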
v1.1.0
This version makes the update based on returns rather than rewards.
Works with CoMLRL v1.1.0.
But cross-joint mode takes so long to train that this version should be considered deprecated in favor of v1.1.1.
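For reference, a minimal sketch of a return-based target, assuming the return at turn t is the (discounted) sum of immediate rewards from turn t onward; gamma and the helper name are illustrative, not CoMLRL's API:

```python
def returns_from_rewards(rewards, gamma=1.0):
    """rewards[t] is the immediate reward at turn t; returns[t] replaces it
    as the optimization target."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(returns_from_rewards([-1.0, 0.5, 1.0]))  # [0.5, 1.5, 1.0]
```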
v1.0.6
Changelog
- Allow the reward to be negative by adding a shift processor (see the sketch below).
- Don't forget to also set the early termination threshold to enable early termination.
Works with CoMLRL v1.0.5.
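A minimal sketch of a reward shift processor as described above; the name and signature are illustrative, not CoMLRL's real API:

```python
def shift_reward(raw_reward: float, shift: float = 1.0) -> float:
    """Subtract a constant shift from the raw (non-negative) reward so the
    processed reward can go negative, e.g. failing all tests scores below zero."""
    return raw_reward - shift

# Example: a raw reward in [0, 2] is shifted into [-1, 1].
print(shift_reward(0.0), shift_reward(2.0))  # -1.0 1.0
```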

v1.0.5
Changelog
- Fix the random handoff, and allow wandb to log more configs such as handoff, expert.mode, ...
- Also make the train split larger, so the dataset is just split into train and eval/test.
Works with CoMLRL v1.0.4.
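A minimal sketch of logging extra run settings to wandb; the project name and config keys/values below are illustrative, not the repo's actual ones:

```python
import wandb

run = wandb.init(
    project="llm-collab-code-generation",  # hypothetical project name
    config={
        "handoff": "best",       # which handoff strategy was used
        "expert.mode": "edits",  # illustrative expert-feedback setting
        "learning_rate": 2e-5,
    },
)
```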
Best Handoff and Expert Are Helpful


