On policy distillation #1458

Merged
faresobeid merged 37 commits into main from on-policy-distill
Dec 31, 2025
Conversation

@faresobeid
Contributor

@faresobeid faresobeid commented Dec 19, 2025

Can be used by setting the following in an rl.toml config:

teacher_gpu_ids = ...

[teacher_inference.model]
name = ...

[trainer.loss]
adv_tau = ...
teacher_tau = ...

An existing teacher inference instance can also be used with:

[orchestrator.teacher_model.client]
base_url = ["http://localhost:8001/v1"]

[orchestrator.teacher_model.model]
name = ...
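The loss knobs above can be sketched as follows. This is a minimal illustration only: it assumes, as in common on-policy distillation setups, that the per-token reverse KL to the teacher is estimated as `student_logprob - teacher_logprob` on the sampled tokens and mixed with the task advantage. The function name and exact weighting are assumptions, not the PR's implementation, and importance ratios are omitted for brevity.

```python
def distillation_loss(student_logprobs, teacher_logprobs, advantages,
                      adv_tau=1.0, teacher_tau=1.0):
    """Hypothetical sketch: mix RL advantages with a per-token
    distillation signal from the teacher (importance ratios omitted)."""
    surrogate_terms, kl_terms = [], []
    for s, t, a in zip(student_logprobs, teacher_logprobs, advantages):
        kl = s - t                    # reverse-KL estimate on the sampled token
        kl_terms.append(kl)
        mixed = adv_tau * a - teacher_tau * kl  # distillation acts as a dense per-token bonus
        surrogate_terms.append(-s * mixed)      # REINFORCE-style surrogate term
    loss = sum(surrogate_terms) / len(surrogate_terms)
    teacher_kl = sum(kl_terms) / len(kl_terms)  # the quantity logged as teacher_kl
    return loss, teacher_kl
```

Setting `adv_tau = 0` would give pure distillation (which pairs with skipping verification), while `teacher_tau = 0` recovers plain RL.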

Note

Adds end-to-end support for on-policy distillation and an option to skip verification for pure distillation.

  • Teacher model pipeline: New orchestrator.teacher_model config; RL runner can auto-start a teacher inference server via teacher_gpu_ids/teacher_inference; orchestrator computes teacher_logprobs via prefill and attaches to TrainingSample/MicroBatch
  • Loss integration: Trainer loss now supports teacher_logprobs with trainer.loss.teacher_tau and trainer.loss.adv_tau; logs teacher_kl; minor fixes to importance ratio/sequence handling
  • Buffer control: orchestrator.buffer.skip_verification disables rubric scoring and related features (rewards set to 0)
  • Batch/transport updates: Carry temperature explicitly and include optional teacher_logprobs through packing/padding to GPU tensors
  • Docs and validation: New docs/on_policy_distillation.md; config validators to auto-configure teacher server/client and prevent invalid combos; tests updated

Written by Cursor Bugbot for commit 9b6383c.

@faresobeid faresobeid marked this pull request as draft December 19, 2025 15:13
@willccbb
Member

Verifiers PR for skipping rubric scoring: PrimeIntellect-ai/verifiers#645

@willccbb willccbb self-assigned this Dec 20, 2025
@faresobeid faresobeid marked this pull request as ready for review December 22, 2025 11:55
faresobeid and others added 4 commits December 26, 2025 14:54
@faresobeid
Contributor Author

[image: reward curves] Purple is the run with a bit of teacher_kl. reverse-text is a special env, as its rewards are already non-binary and therefore dense to some extent.

Member

@samsja samsja left a comment


lgtm great work

@faresobeid faresobeid merged commit 1a5841b into main Dec 31, 2025
11 of 12 checks passed