On policy distillation #1458

Merged
faresobeid merged 37 commits into main from on-policy-distill
Dec 31, 2025
Conversation

@faresobeid
Contributor

@faresobeid faresobeid commented Dec 19, 2025

Can be used by setting the following in an rl.toml config:

teacher_gpu_ids = ...

[teacher_inference.model]
name = ...

[trainer.loss]
adv_tau = ...
teacher_tau = ...

An existing teacher inference instance can also be used with:

[orchestrator.teacher_model.client]
base_url = ["http://localhost:8001/v1"]

[orchestrator.teacher_model.model]
name = ...
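The loss knobs above can be sketched as follows. This is a minimal illustration only: it assumes, as in common on-policy distillation setups, that the per-token reverse KL to the teacher is estimated as `student_logprob - teacher_logprob` on the sampled tokens and mixed with the task advantage. The function name and exact weighting are assumptions, not the PR's implementation, and importance ratios are omitted for brevity.

```python
def distillation_loss(student_logprobs, teacher_logprobs, advantages,
                      adv_tau=1.0, teacher_tau=1.0):
    """Hypothetical sketch: mix RL advantages with a per-token
    distillation signal from the teacher (importance ratios omitted)."""
    surrogate_terms, kl_terms = [], []
    for s, t, a in zip(student_logprobs, teacher_logprobs, advantages):
        kl = s - t                    # reverse-KL estimate on the sampled token
        kl_terms.append(kl)
        mixed = adv_tau * a - teacher_tau * kl  # distillation acts as a dense per-token bonus
        surrogate_terms.append(-s * mixed)      # REINFORCE-style surrogate term
    loss = sum(surrogate_terms) / len(surrogate_terms)
    teacher_kl = sum(kl_terms) / len(kl_terms)  # the quantity logged as teacher_kl
    return loss, teacher_kl
```

Setting `adv_tau = 0` would give pure distillation (which pairs with skipping verification), while `teacher_tau = 0` recovers plain RL.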

Note

Adds end-to-end support for on-policy distillation and an option to skip verification for pure distillation.

  • Teacher model pipeline: New orchestrator.teacher_model config; RL runner can auto-start a teacher inference server via teacher_gpu_ids/teacher_inference; orchestrator computes teacher_logprobs via prefill and attaches to TrainingSample/MicroBatch
  • Loss integration: Trainer loss now supports teacher_logprobs with trainer.loss.teacher_tau and trainer.loss.adv_tau; logs teacher_kl; minor fixes to importance ratio/sequence handling
  • Buffer control: orchestrator.buffer.skip_verification disables rubric scoring and related features (rewards set to 0)
  • Batch/transport updates: Carry temperature explicitly and include optional teacher_logprobs through packing/padding to GPU tensors
  • Docs and validation: New docs/on_policy_distillation.md; config validators to auto-configure teacher server/client and prevent invalid combos; tests updated

Written by Cursor Bugbot for commit 9b6383c.

@faresobeid faresobeid marked this pull request as draft December 19, 2025 15:13
@willccbb
Member

Verifiers PR for skipping rubric scoring: PrimeIntellect-ai/verifiers#645

@willccbb willccbb self-assigned this Dec 20, 2025
@faresobeid faresobeid marked this pull request as ready for review December 22, 2025 11:55
faresobeid and others added 4 commits December 26, 2025 14:54
@faresobeid
Contributor Author

[image: reward curves] Purple is the run with a bit of teacher_kl. reverse-text is a special env, as its rewards are already non-binary and therefore dense to some extent.

Member

@samsja samsja left a comment


lgtm great work

@faresobeid faresobeid merged commit 1a5841b into main Dec 31, 2025
11 of 12 checks passed