Release v1.1.1 · OpenMLRL/LLM_Collab_Code_Generation

Changelogs

As in v1.0.0, this version uses the return to optimize LLMs.
This version addresses the extremely long training time in cross-joint mode and changes the default hyperparameter accordingly.
This version supports early termination with a threshold.

Aligned Joint by Default

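With the cross joint, the number of samples per turn grows to K^{TN} at the final turn (if there is no early termination), where K is num_generations, T is num_turns, and N is the number of agents. This is very slow. With 2 agents trained on an H100 (an H200 is 3 times faster):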

joint_mode=align, num_turns=2, num_generations=3, 9 + 3 = 12 samples in the MC tree, takes 10 hours and 76G VRAM, expect_return=-1.9
joint_mode=cross, num_turns=2, num_generations=3, 81 + 9 = 90 samples in the MC tree, takes 23 hours and 86G VRAM, expect_return=-1.2
~~joint_mode=cross, num_turns=3, num_generations=2, 64 + 32 + 8 = 104 samples in MC tree, estimated to take 60+ hours and 80G vram~~

Cross (cyan) can learn faster than align (orange), with more accurate value estimation and almost the same VRAM usage (75-90G), but training takes much longer.
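
The sample counts for the two measured configurations can be reproduced with a small sketch. It assumes that each node branches into num_generations joint actions in align mode and num_generations ** num_agents in cross mode; this rule is inferred from the 12 and 90 totals above, not taken from the library code. The function name `samples_per_turn` is hypothetical.

```python
def samples_per_turn(joint_mode, num_turns, num_generations, num_agents=2):
    """Return the number of samples at each turn of a full MC tree.

    Assumption (inferred from the figures in these notes, not from the
    library source): every node branches into K joint actions in align
    mode and K ** N joint actions in cross mode.
    """
    if joint_mode == "align":
        branching = num_generations
    elif joint_mode == "cross":
        branching = num_generations ** num_agents
    else:
        raise ValueError(f"unknown joint_mode: {joint_mode!r}")
    # Turn t holds branching ** t samples when no branch terminates early.
    return [branching ** t for t in range(1, num_turns + 1)]


align = samples_per_turn("align", num_turns=2, num_generations=3)
cross = samples_per_turn("cross", num_turns=2, num_generations=3)
print(align, sum(align))  # [3, 9] 12
print(cross, sum(cross))  # [9, 81] 90
```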

Termination

magrpo.termination_threshold is used to incentivize agents to find high-reward solutions quickly, instead of expanding the full Monte Carlo tree.

At each node (branch, turn), compute the mean immediate reward across the sibling joint actions at that node. If the mean exceeds the threshold, that branch stops expanding at this turn; training backpropagates from the truncated subtree. Other branches continue.
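
The rule above can be illustrated with a minimal sketch that counts how many samples a tree generates with and without a threshold. It is not the library's implementation: `should_stop`, `count_samples`, and `reward_fn` are hypothetical names, and `branching` stands for the number of sibling joint actions sampled at each node.

```python
from statistics import mean


def should_stop(sibling_rewards, threshold):
    """True if the mean immediate reward of the siblings exceeds the threshold."""
    if threshold is None or not sibling_rewards:
        return False
    return mean(sibling_rewards) > threshold


def count_samples(reward_fn, num_turns, branching, threshold, path=()):
    """Count the samples generated at and below the node identified by `path`.

    `reward_fn` maps a path (tuple of action indices) to an immediate reward;
    it is a stand-in for the real reward model.
    """
    turn = len(path) + 1
    rewards = [reward_fn(path + (i,)) for i in range(branching)]
    total = branching
    # Stop at the last turn, or when this node's siblings are good enough.
    if turn == num_turns or should_stop(rewards, threshold):
        return total
    # Otherwise every sibling is expanded; other branches are unaffected.
    for i in range(branching):
        total += count_samples(reward_fn, num_turns, branching, threshold, path + (i,))
    return total


# Toy reward: the branch starting with action 0 is good, the other is bad.
toy_reward = lambda path: 1.0 if path[0] == 0 else -1.0
print(count_samples(toy_reward, num_turns=3, branching=2, threshold=None))  # 14
print(count_samples(toy_reward, num_turns=3, branching=2, threshold=0.0))   # 10
```

In the toy run, the good branch stops at turn 2 because its mean reward (1.0) exceeds the threshold, while the bad branch is still expanded to the final turn.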
