Reinforcement Learning via Self-Distillation (SDPO)

Paper Github W&B Logs

📖 Introduction

Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as Reinforcement Learning with Rich Feedback (RLRF):

Reinforcement Learning with Rich Feedback

We propose Self-Distilled Policy Optimization (SDPO), a reinforcement learning framework that augments on-policy optimization with self-distillation from the model’s own high-reward trajectories.

SDPO converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context.

SDPO
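
As a rough illustration of the self-distillation step described above, here is a minimal conceptual sketch (not the repository's implementation; the tensor shapes and the use of a forward KL are simplifying assumptions):

import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, response_mask):
    # student_logits: policy logits on the original prompt                     [B, T, V]
    # teacher_logits: logits of the same model conditioned on feedback or a
    #                 successful demonstration (the self-teacher)              [B, T, V]
    # response_mask:  1.0 for response tokens, 0.0 for padding (float)         [B, T]
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits.detach(), dim=-1)  # no gradient to the teacher

    # Forward KL(teacher || student): pull the policy toward the feedback-informed
    # next-token predictions. SDPO's alpha setting also covers reverse-KL/JSD variants.
    kl = F.kl_div(student_logp, teacher_logp, log_target=True, reduction="none").sum(-1)
    return (kl * response_mask).sum() / response_mask.sum().clamp(min=1.0)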


📊 Main Results

Learning without Rich Environment Feedback

When environment feedback is sparse or rule-based, standard reinforcement learning methods struggle to propagate learning signals efficiently. SDPO addresses this by reusing high-reward rollouts as implicit feedback, providing dense supervision even in the absence of rich environment feedback.

SDPO Performance vs. Training Steps

Training progression of Olmo3-7B-Instruct on Chemistry. We report the average accuracy across 16 samples per question and a rolling average of response lengths over 5 steps. We report GRPO with the optimal hyperparameters for this model and task.

SDPO Performance without Rich Environment Feedback

Comparison of SDPO and GRPO on reasoning-related benchmarks. We report the highest achieved avg@16 within 1 hour and 5 hours of wall-clock training time, respectively. Both SDPO and on-policy GRPO perform one gradient step per generation batch, while standard GRPO performs 4 off-policy mini-batch steps. We select optimal hyperparameters for SDPO and the baselines based on 5h accuracy. Each run is performed on a node with 4 NVIDIA GH200 GPUs. Together with initialization and validation, each run takes approximately 6 hours.


Learning with Rich Environment Feedback

In settings where environments provide structured or textual feedback, SDPO naturally incorporates this information into self-distillation. By conditioning future attempts on both successful demonstrations and feedback from failed attempts, SDPO achieves faster convergence and more stable training.

SDPO Performance with Rich Environment Feedback

SDPO with rich environment feedback. Left: SDPO benefits from denser credit assignment (logit > token > sequence-level) and consistently outperforms GRPO when rich feedback is available. Right: The self-teacher improves throughout training, and the final student substantially surpasses the initial teacher. Error bars show variability across seeds.


Solving Hard Questions via Test-Time Self-Distillation

SDPO also enables test-time self-distillation. By generating multiple candidate solutions, identifying high-quality responses, and reusing them as demonstrations, the model can iteratively refine its outputs at inference time. This leads to substantial gains on hard reasoning tasks without additional training.

Test-Time Self-Distillation

Test-time self-distillation on hard coding problems. SDPO solves questions that neither the base model nor multi-turn interaction can solve, achieving higher solution discovery rates across generation budgets.


🚀 Getting Started

System Requirements

  • Operating System: Linux (Tested on SLES 15 SP5 and Ubuntu 22.04)
  • Hardware: NVIDIA GPUs (CUDA compatible)
  • Python: 3.12 (Tested on 3.12.3)
  • CUDA Driver: Compatible with the PyTorch version installed (see below).

Installation

Option 1: Docker (Recommended for HPC/GH200 Clusters)

For NVIDIA GH200 (aarch64) clusters with CUDA 13.1, we provide a pre-configured Dockerfile based on the NGC vLLM container.

Build and deploy:

# Build the image
podman build . -f Dockerfile.gh200 -t sdpo-gh200

# Export for cluster use (enroot/squashfs)
enroot import -x mount -o sdpo-gh200.sqsh podman://localhost/sdpo-gh200:latest

Note

The Docker images use requirements-gh200.txt which contains pinned versions from requirements-full.txt, excluding packages pre-installed in the NGC vLLM container (torch, vllm, flash-attn, xformers, triton).


Option 2: Local Installation

  1. Install PyTorch:
  • For Ampere/Hopper (RTX 30/40, H100):

    pip install torch==2.5.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
  • For Blackwell (RTX 50, RTX PRO 2000 Blackwell):

    pip install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
  2. Install SDPO and Dependencies:
# Install core dependencies (pinned versions)
pip install -r requirements.txt

# Install SDPO (verl) in editable mode
pip install -e .

# Install Flash Attention 2 (compiled from source)
pip install flash-attn --no-build-isolation
  3. Optional: Install SGLang/vLLM for high-throughput inference:
pip install -r requirements_sglang.txt
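
After installation, a generic sanity check (not a project-specific script) confirms that the installed PyTorch build sees your GPUs:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"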

Requirement Files

File                      Description
requirements.txt          Core dependencies with pinned versions
requirements-gh200.txt    For NGC vLLM container (excludes pre-installed packages)
requirements-full.txt     Complete pip freeze from working environment
requirements_sglang.txt   SGLang/vLLM stack for local inference
requirements-cuda.txt     Flash Attention (for non-Docker installs)

vLLM Version Note:

# vllm==0.8.4       # GH200 cluster
# vllm>=0.12.0      # Blackwell (RTX 50 series, B100/B200) - NOT FULLY TESTED

Warning

Blackwell architecture support (RTX 50 series, B100/B200) has not been fully tested.

Tip

For reproducibility, use requirements-full.txt which contains the exact versions from a tested environment.

Note

For more specific instructions on verl architecture and advanced configuration, refer to the official verl repository.


Data Preparation

Data Loading

The detailed instructions for loading the data are provided in data/README.md.

One example is provided below:

python data/load_dataset.py \
    --dataset_name Chemistry \
    --output_path datasets/chemistry.json

To split the data into train and test sets, run the following command:

python data/split_tasks.py \
    --json_path datasets/chemistry.json \
    --output_dir datasets/sciknoweval/chemistry \
    --test_ratio 0.1 \
    --seed 42

For LiveCodeBench v6, to split the unit tests into train and test sets, run the following command:

python data/split_tests.py \
    --json_path datasets/lcb_v6.json \
    --output_dir datasets/lcb_v6

Data Preprocessing

Our implementation uses the parquet format for the data. To preprocess the data, run the following command:

python data/preprocess.py \
    --data_source DATASET_PATH

DATASET_PATH should contain the train.json and test.json files.
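
For example, using the chemistry split produced by the commands above (the path is illustrative and depends on where you wrote the split):

# datasets/sciknoweval/chemistry should contain train.json and test.json
python data/preprocess.py \
    --data_source datasets/sciknoweval/chemistry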


Configuration

Before running experiments, adapt the paths in verl/trainer/config/user.yaml to your environment:

vars:
  dir: /path/to/your/SDPO              # Path to the SDPO repository
  log_dir: /path/to/your/logs          # Directory for logs
  ckpt_dir: /path/to/your/checkpoints  # Directory for model checkpoints

Training

Reproducing Results (Without Rich Environment Feedback)

Run the following commands to reproduce the results without rich environment feedback.

GRPO baseline:

bash experiments/generalization/run_baseline_grpo_all.sh

SDPO:

bash experiments/generalization/run_sdpo_all.sh

Reproducing Results (With Rich Environment Feedback)

Run the following commands to reproduce the results with rich environment feedback.

GRPO baseline:

bash experiments/rich_feedback/run_baseline_grpo.sh

SDPO:

bash experiments/rich_feedback/run_sdpo.sh

Multi-turn Baseline of Section 5

Prepare the data by splitting it into individual tasks:

export MY_DATA_SPLITS_DIR=lcb_v6
export MY_DATA_SINGLES_DIR=lcb_v6_singles
bash data/prepare_data_splits.sh datasets/lcb_v6.json

Run the multi-turn baseline for, e.g., question 120:

python baseline_multiturn/multiturn.py --data-dir=lcb_v6_singles/q_120 --run-name multiturn_q120

Or, for all hard questions:

bash experiments/ttt/run_multiturn_all.sh

Usage Documentation

This section documents the configuration options added by SDPO on top of the base verl framework.

Policy Loss Configuration

Located at actor.policy_loss in the config.

  • loss_mode (str, default: "vanilla"): Loss function mode. Set to "sdpo" to enable self-distillation. Options: vanilla, sdpo.
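
For illustration, enabling self-distillation could look like the following fragment (the key nesting mirrors the path given above; its exact location inside your trainer config may differ):

actor:
  policy_loss:
    loss_mode: sdpo   # "vanilla" keeps the standard on-policy loss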

Self-Distillation Configuration

Located at actor.self_distillation in the config. Only active when actor.policy_loss.loss_mode = "sdpo".

Core Settings

  • full_logit_distillation (bool, default: True): Whether to use full-logit KL distillation.

  • alpha (float, default: 0.5): KL interpolation coefficient. 0.0 = forward KL, 1.0 = reverse KL, 0.5 = JSD.

  • success_reward_threshold (float, default: 1.0): Minimum sequence reward to be considered a successful demonstration.

  • teacher_regularization (str, default: "ema"): Teacher regularization mode. Options: ema, trust-region. Note: if ema is used, the model on the RefWorker is updated as an exponential moving average. trust-region requires use_fused_kernels = False.

  • teacher_update_rate (float, default: 0.05): EMA update rate for teacher weights, or trust-region mixing coefficient.

  • distillation_topk (int | None, default: 100): If set, use top-k logits for distillation instead of full distribution.

  • distillation_add_tail (bool, default: True): Whether to add a tail bucket for top-k distillation.

  • is_clip (float | None, default: 2.0): Clip value for importance sampling ratio. None disables IS weighting.
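
A hypothetical fragment showing the core keys at their documented defaults (for orientation only; adapt to your config layout):

actor:
  self_distillation:
    full_logit_distillation: true
    alpha: 0.5                     # 0.0 = forward KL, 1.0 = reverse KL, 0.5 = JSD
    success_reward_threshold: 1.0
    teacher_regularization: ema    # or trust-region (requires use_fused_kernels = False)
    teacher_update_rate: 0.05
    distillation_topk: 100         # set to null to distill the full distribution
    distillation_add_tail: true
    is_clip: 2.0                   # set to null to disable IS weighting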

Reprompting Settings

  • max_reprompt_len (int, default: 10240): Maximum token length of the reprompted prompt.

  • reprompt_truncation (str, default: "right"): Truncation method for reprompted prompts. Options: left, right, error.

  • dont_reprompt_on_self_success (bool, default: True): If True, don't use a sample's own successful response as demonstration.

  • remove_thinking_from_demonstration (bool, default: True): Whether to remove <think>...</think> tags from demonstrations.
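
The reprompting defaults, spelled out in the same illustrative form:

actor:
  self_distillation:
    max_reprompt_len: 10240
    reprompt_truncation: right     # or left / error
    dont_reprompt_on_self_success: true
    remove_thinking_from_demonstration: true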

Template Settings

  • reprompt_template (str): Main template for reprompting. Placeholders: {prompt}, {solution}, {feedback}.

  • solution_template (str): Template for the solution section. Placeholder: {successful_previous_attempt}.

  • feedback_template (str): Template for the feedback section. Placeholder: {feedback_raw}.
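
The templates are free-form strings; the wording below is hypothetical (the repository ships its own defaults) and only shows how the placeholders fit together:

reprompt_template: |
  {prompt}
  {solution}
  {feedback}
solution_template: |
  Here is a previous successful attempt:
  {successful_previous_attempt}
feedback_template: |
  The previous attempt failed with the following feedback:
  {feedback_raw}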

Feedback Settings

  • include_environment_feedback (bool, default: True): Whether to include environment feedback (e.g., test errors) in reprompting.

  • environment_feedback_only_without_solution (bool, default: True): If True, only use feedback when no successful solution is available.


Citation

If you find this work helpful, please cite us.

@article{hubotter2026reinforcement,
  title = {Reinforcement Learning via Self-Distillation},
  author = {Hübotter, Jonas and Lübeck, Frederike and Behric, Lejs and Baumann, Anton and Bagatella, Marco and Marta, Daniel and Hakimi, Ido and Shenfeld, Idan and Kleine Buening, Thomas and Guestrin, Carlos and Krause, Andreas},
  year = {2026},
  journal = {arXiv preprint arXiv:2601.20802},
}

Attribution

Our implementation is based on a recent version of verl.
