Releases: OpenMLRL/LLM_Collab_Code_Generation
v1.1.7
v1.1.6
v1.1.5
Ryan fixed the joint mode for the single-agent case, and I updated the README.
v1.1.4
v1.1.3
This version should work with CoMLRL v1.1.3.
Changelog
- Remove the hard-coded per-node code-level logging, since we don't expect users to inspect those details during training and it incurred huge VRAM usage.
- Change the default hyperparameter values to follow the Dr. GRPO style, set the learning rate to 2e-5, and drop the bandit external mode, since it is equivalent to MAGRPO in the single-turn setting.
- Clean up the code formatting and group closely related parameters together.
- Add MBPP dataset.
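For reference, MBPP is available on the Hugging Face Hub; a minimal sketch of loading it with `datasets` (the split and field names follow the public `mbpp` dataset card, not necessarily how this repo wires it in):

```python
from datasets import load_dataset

# Default "full" config of MBPP; splits are train / test / validation / prompt.
mbpp = load_dataset("mbpp")
print(mbpp["train"][0]["text"])       # natural-language task description
print(mbpp["train"][0]["test_list"])  # unit tests usable for reward/eval
```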
Plain fails 2; expert fails 1; level feedback has not failed yet.
v1.1.2
Changelog
This version adds normalized advantage and epsilon-clip support, and pairs with CoMLRL v1.1.2.
Normalizing the advantage can make convergence more stable (though not significantly) at the cost of slightly higher VRAM use, and clipping too strictly hurts; a minimal sketch of both is given after the legend below.
- Gray - Repeated Bandit
- Light Green - Plain
- Red - Level Feedback
- Blue - Expert Edits
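A minimal sketch of the two options, assuming the usual GRPO-style group normalization and PPO-style epsilon clipping (CoMLRL's actual implementation may differ):

```python
import torch

def normalized_advantages(group_returns: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize returns within a group of sampled generations (GRPO-style)."""
    return (group_returns - group_returns.mean()) / (group_returns.std() + eps)

def clipped_objective(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """PPO-style clipped surrogate; a very small clip_eps ("clipping too
    strictly") throws away most of the gradient signal."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()
```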
v1.1.1
Changelog
Pairs with CoMLRL v1.1.1.
- Same as v1.0.0 in that it enables using returns to optimize the LLMs.
- This version fixes the extremely long training time in cross-joint mode and changes the default hyperparameters accordingly.
- This version supports early termination with a threshold.
Aligned Joint by Default
If using cross joint mode, there would be up to K^{TN} samples at the final turn (without early termination), which is very slow. With 2 agents training on an H100 (an H200 is 3 times faster):
- joint_mode=align, num_turns=2, num_generations=3: 9 + 3 = 12 samples in the MC tree; takes 10 hours and 76 GB VRAM, expect_return=-1.9
- joint_mode=cross, num_turns=2, num_generations=3: 81 + 9 = 90 samples in the MC tree; takes 23 hours and 86 GB VRAM, expect_return=-1.2
- joint_mode=cross, num_turns=3, num_generations=2: 64 + 32 + 8 = 104 samples in the MC tree; estimated to take 60+ hours and 80 GB VRAM
Cross (cyan) can learn faster than align (orange) thanks to more accurate value estimation, with VRAM use roughly the same (75-90 GB), but training takes much longer.
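A minimal sketch of how the two joint modes could differ in joint actions per expansion, consistent with the num_turns=2 counts above (align: K joint samples per node, cross: K^N); this is an illustration, not CoMLRL's actual code:

```python
from itertools import product

def joint_actions(per_agent_generations, joint_mode="align"):
    """per_agent_generations: N lists (one per agent), each with K candidates."""
    if joint_mode == "align":
        # Pair the i-th generation of every agent -> K joint actions.
        return list(zip(*per_agent_generations))
    if joint_mode == "cross":
        # Cartesian product across agents -> K^N joint actions.
        return list(product(*per_agent_generations))
    raise ValueError(f"unknown joint_mode: {joint_mode}")

# 2 agents, num_generations=3: align -> 3 joint actions, cross -> 9.
gens = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
print(len(joint_actions(gens, "align")), len(joint_actions(gens, "cross")))  # 3 9
```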

Termination
magrpo.termination_threshold is used to incentivize agents to find high-reward solutions quickly, instead of expanding the full Monte Carlo tree.
At each node (branch, turn), compute the mean immediate reward across the sibling joint actions at that node. If the mean exceeds the threshold, that branch stops expanding at this turn; training backpropagates from the truncated subtree. Other branches continue.
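A minimal sketch of that rule (not CoMLRL's actual code; the node structure and the sample_children hook are hypothetical, and termination_threshold plays the role of magrpo.termination_threshold):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    turn: int
    immediate_reward: float = 0.0
    children: list = field(default_factory=list)

def expand(node, sample_children, termination_threshold, max_turns):
    """Expand one branch of the Monte Carlo tree, stopping early when the mean
    immediate reward of the sibling joint actions exceeds the threshold."""
    if node.turn >= max_turns:
        return
    children = sample_children(node)  # sample the sibling joint actions at this node
    node.children = children
    mean_reward = sum(c.immediate_reward for c in children) / len(children)
    if mean_reward > termination_threshold:
        return  # truncate this subtree; returns are backpropagated from here
    for child in children:
        expand(child, sample_children, termination_threshold, max_turns)
```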
v1.1.0
This version makes the update based on returns rather than rewards.
Works with CoMLRL v1.1.0.
But cross-joint mode takes so long to train that this version should be considered deprecated in favor of v1.1.1.
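For reference, a minimal sketch of a return-based target, assuming the return at turn t is the (discounted) sum of immediate rewards from turn t onward; gamma and the helper name are illustrative, not CoMLRL's API:

```python
def returns_from_rewards(rewards, gamma=1.0):
    """rewards[t] is the immediate reward at turn t; returns[t] replaces it
    as the optimization target."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(returns_from_rewards([-1.0, 0.5, 1.0]))  # [0.5, 1.5, 1.0]
```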
v1.0.6
Changelog
- Allow the reward to be negative by adding a shift processor (see the sketch below).
- Don't forget to also set the early termination threshold to enable early termination.
Works with CoMLRL v1.0.5.
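A minimal sketch of a reward shift processor as described above; the name and signature are illustrative, not CoMLRL's real API:

```python
def shift_reward(raw_reward: float, shift: float = 1.0) -> float:
    """Subtract a constant shift from the raw (non-negative) reward so the
    processed reward can go negative, e.g. failing all tests scores below zero."""
    return raw_reward - shift

# Example: a raw reward in [0, 2] is shifted into [-1, 1].
print(shift_reward(0.0), shift_reward(2.0))  # -1.0 1.0
```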

v1.0.5
Changelog
- Fix the random handoff, and allow wandb to log more configs such as handoff, expert.mode, ...
- Also make the train split larger, so the dataset is just split into train and eval/test.
Works with CoMLRL v1.0.4.
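A minimal sketch of logging extra run settings to wandb; the project name and config keys/values below are illustrative, not the repo's actual ones:

```python
import wandb

run = wandb.init(
    project="llm-collab-code-generation",  # hypothetical project name
    config={
        "handoff": "best",       # which handoff strategy was used
        "expert.mode": "edits",  # illustrative expert-feedback setting
        "learning_rate": 2e-5,
    },
)
```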
Best Handoff and Expert Are Helpful


