
Releases: OpenMLRL/LLM_Collab_Code_Generation

v1.1.7

31 Oct 16:20


Remove the memory modes from v1.1.6; just use the full history.

60% success rate (12/20).

v1.1.6

29 Oct 13:58
e95e44f


Update 3 memory modes; should work together with CoMLRL v1.1.6.


v1.1.5

15 Oct 21:12


Ryan fixed the joint mode for the single-agent case, and I updated the README.

v1.1.4

05 Oct 14:47


Clean up unused variables from v1.1.3; the code logic has not changed.


v1.1.3

05 Oct 00:46
d775bec


This version should work with CoMLRL v1.1.3.

Changelog

  1. Remove the hard-coded code-level logging at each node; we don't expect users to inspect those details during training, and the logging came at the cost of huge VRAM usage.
  2. Change the default hyperparameter values to the Dr. GRPO style, set the learning rate to 2e-5, and drop the bandit external mode, since it is equivalent to MAGRPO in the single-turn setting (see the sketch after this list).
  3. Clean up the code formatting and group related parameters together.
  4. Add the MBPP dataset.
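If "Dr. GRPO style" here means dropping the per-group standard-deviation normalization (as in the Dr. GRPO paper), the new default amounts to something like the sketch below. The helper name is hypothetical; only the mean-only normalization and the 2e-5 learning rate come from this changelog.

```python
# New default learning rate mentioned in item 2 above.
LEARNING_RATE = 2e-5

def dr_grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Dr. GRPO-style advantage: subtract the group mean, but do NOT divide by the group std."""
    baseline = sum(group_rewards) / len(group_rewards)
    return [r - baseline for r in group_rewards]

print(dr_grpo_advantages([0.0, 0.5, 1.0]))  # [-0.5, 0.0, 0.5]
```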
Plain fails 2; expert fails 1; level feedback has not failed yet.

v1.1.2

29 Sep 12:14


Changelog

This version adds normalized-advantage and epsilon-clip support, and pairs with CoMLRL v1.1.2.

Normalizing the advantage can make convergence more stable (though not significantly so), at the cost of slightly higher VRAM use. Clipping too strictly hurts.
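A minimal sketch of the two options, assuming GRPO-style per-group rewards and a PPO-style ratio clip; the function names and signatures below are illustrative, not the repo's actual API.

```python
import torch

def normalized_advantage(group_rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Center by the group mean and scale by the group std (the 'normalized advantage' option)."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

def clipped_policy_loss(log_probs: torch.Tensor,
                        old_log_probs: torch.Tensor,
                        advantages: torch.Tensor,
                        epsilon: float = 0.2) -> torch.Tensor:
    """PPO-style epsilon clip; a very small epsilon ('clipping too strictly') keeps updates tiny."""
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()
```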

Legend for the training curves:
  • Gray - Repeated Bandit
  • Light Green - Plain
  • Red - Level Feedback
  • Blue - Expert Edits

v1.1.1

26 Sep 15:56


Changelog

Pair with CoMLRL v1.1.1.

  1. Same as v1.0.0 in that it enables using returns to optimize the LLMs.
  2. This version fixes the extremely long training time in cross-joint mode and changes the default hyperparameters accordingly.
  3. This version supports early termination with a threshold.

Aligned Joint by Default

With the cross joint mode there would be K^{TN} samples at the final turn (without early termination), which is very slow; a rough counting sketch follows the list below. For 2 agents trained on an H100 (an H200 is about 3 times faster):

  • joint_mode=align, num_turns=2, num_generations=3: 9 + 3 = 12 samples in the MC tree; takes about 10 hours and 76 GB VRAM; expect_return = -1.9
  • joint_mode=cross, num_turns=2, num_generations=3: 81 + 9 = 90 samples in the MC tree; takes about 23 hours and 86 GB VRAM; expect_return = -1.2
  • joint_mode=cross, num_turns=3, num_generations=2: 64 + 32 + 8 = 104 samples in the MC tree; estimated to take 60+ hours and 80 GB VRAM
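A back-of-the-envelope counting sketch, assuming aligned mode expands K joint samples per node and cross mode expands K^N (every combination of the N agents' K generations); this reproduces the per-turn counts of the first two configurations above, but it is an assumption about the tree, not the repo's actual accounting.

```python
def samples_per_turn(num_generations: int, num_agents: int, num_turns: int, joint_mode: str) -> list[int]:
    """Joint samples per turn of the MC tree, ignoring early termination (assumed counting rule)."""
    branching = num_generations if joint_mode == "align" else num_generations ** num_agents
    return [branching ** t for t in range(1, num_turns + 1)]

print(samples_per_turn(3, 2, 2, "align"))  # [3, 9]  -> 12 samples total
print(samples_per_turn(3, 2, 2, "cross"))  # [9, 81] -> 90 samples total
```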

Cross (cyan) can learn faster than align (orange) thanks to more accurate value estimation, with roughly the same VRAM (75-90 GB), but training takes much longer.

Termination

magrpo.termination_threshold is used to incentivize agents to find high-reward solutions quickly, instead of expanding the full Monte Carlo tree.

At each node (branch, turn), compute the mean immediate reward across the sibling joint actions at that node. If the mean exceeds the threshold, that branch stops expanding at this turn; training backpropagates from the truncated subtree. Other branches continue.
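A minimal sketch of that rule; `sibling_rewards` and `termination_threshold` are illustrative names for the quantities described above.

```python
from statistics import mean

def should_terminate(sibling_rewards: list[float], termination_threshold: float) -> bool:
    """Stop expanding a (branch, turn) node when the mean immediate reward of its
    sibling joint actions exceeds the threshold; other branches keep expanding."""
    return mean(sibling_rewards) > termination_threshold

# Example: with a threshold of 0.8, this node stops expanding and training
# backpropagates from the truncated subtree.
print(should_terminate([0.9, 0.7, 1.0], termination_threshold=0.8))  # True
```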

v1.1.0

26 Sep 01:23
cd060e2


This version makes the update based on returns rather than rewards.
Works with CoMLRL v1.1.0.
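A minimal sketch of the difference, assuming the return at turn t is the undiscounted sum of rewards from turn t onward (reward-to-go); whether any discounting is applied is not stated here.

```python
def returns_from_rewards(turn_rewards: list[float]) -> list[float]:
    """Reward-to-go: return[t] = sum of rewards from turn t to the final turn (no discounting assumed)."""
    returns, running = [], 0.0
    for r in reversed(turn_rewards):
        running += r
        returns.append(running)
    return list(reversed(returns))

# The update now uses these returns instead of the per-turn rewards themselves.
print(returns_from_rewards([-1.0, 0.5, 1.0]))  # [0.5, 1.5, 1.0]
```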

However, the cross-joint mode takes a very long time to train, so this version is deprecated in favor of v1.1.1.

v1.0.6

24 Sep 23:02
9a8608d


Changelog

  • Allow the reward to be negative by adding a shift processor (a minimal sketch follows this list).
  • Don't forget to also set the early-termination threshold to enable early termination.
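A minimal sketch of such a shift processor; the class name and the offset value are illustrative, not the repo's actual implementation.

```python
class RewardShiftProcessor:
    """Subtract a fixed offset so that raw non-negative rewards can become negative
    (e.g. a completely failing solution lands below zero)."""

    def __init__(self, shift: float = 1.0):
        self.shift = shift  # illustrative offset

    def __call__(self, raw_reward: float) -> float:
        return raw_reward - self.shift

shift = RewardShiftProcessor(shift=1.0)
print(shift(0.0))  # -1.0 for a failing solution
print(shift(1.0))  #  0.0 for a passing one
```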

Works with CoMLRL v1.0.5.


v1.0.5

24 Sep 17:08
9331ca2


Changelog

  • Fix the random handoff, and allow wandb to log more configs such as handoff, expert.mode, ... (a logging sketch follows this list).
  • Also make the training split larger, so the dataset is just split into train and eval/test.
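A minimal sketch of passing the extra fields to wandb; the config keys mirror the ones named above, but their values and the project name are placeholders.

```python
import wandb

# Including the handoff and expert settings in the run config makes them
# visible and filterable in the wandb dashboard.
run = wandb.init(
    project="LLM_Collab_Code_Generation",  # placeholder project name
    config={
        "handoff": "random",     # handoff strategy being logged
        "expert.mode": "edits",  # illustrative value for expert.mode
    },
)
run.finish()
```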

Works with CoMLRL v1.0.4.

Best Handoff and Expert is Helpful
