
Conversation


@jb3618columbia jb3618columbia commented Jan 24, 2026

Summary

Implements curriculum learning for GRPO training via pass-rate-based weighted sampling.

Changes

  • Add a PassRateTracker class to track attempt and success counts for each prompt. The tracker is sampler-agnostic and can back multiple curriculum samplers
  • Add a PassRateWeightedSampler class, a weighted sampler that adjusts each prompt's probability of being drawn into a batch based on its historical pass rate (optionally smoothed with an exponential moving average); a minimal sketch of the weighting idea follows this list
  • Update DAPOTrainer to maintain the pass rate tracker during training and log curriculum metrics
  • Minor edits to make the integration work
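
For context, the weighting idea can be sketched as follows. This is a minimal, self-contained illustration only; `SimplePassRateTracker`, `sampling_weights`, and their parameters are hypothetical names, not the actual verl classes introduced in this PR.

```python
import numpy as np

class SimplePassRateTracker:
    """Hypothetical stand-in for a pass rate tracker: counts attempts/successes per prompt."""

    def __init__(self, num_prompts: int):
        self.attempts = np.zeros(num_prompts, dtype=np.int64)
        self.successes = np.zeros(num_prompts, dtype=np.int64)

    def update(self, indices, rewards):
        # A rollout with reward > 0 is counted as a success for its prompt.
        for idx, r in zip(indices, rewards):
            self.attempts[idx] += 1
            self.successes[idx] += int(r > 0)

    def pass_rates(self) -> np.ndarray:
        # Untried prompts get pass rate 0 so they are still favored early on.
        return np.where(self.attempts > 0,
                        self.successes / np.maximum(self.attempts, 1),
                        0.0)

def sampling_weights(pass_rates: np.ndarray, temperature: float = 0.5) -> np.ndarray:
    """Lower pass rate -> higher sampling weight, via a numerically stable negative exponential."""
    x = -pass_rates / max(temperature, 0.01)
    w = np.exp(x - x.max())  # subtracting the max avoids overflow; relative weights are unchanged
    return w / w.sum()

# Example: draw a batch of 4 prompt indices, biased toward harder prompts.
tracker = SimplePassRateTracker(num_prompts=8)
tracker.update(indices=[0, 1, 2], rewards=[1.0, 0.0, 1.0])
probs = sampling_weights(tracker.pass_rates())
batch = np.random.choice(len(probs), size=4, replace=False, p=probs)
print(batch, probs.round(3))
```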

Testing

  • Tested with local single-node runs

  • Tested with multi-node SLURM runs (2 nodes, 8 GPUs each)

  • Logs curriculum metrics: hardest_10pct/25pct/50pct/75pct pass rates, batch-level statistics

  • See curriculum learning runs: https://wandb.ai/mbzuai-llm/Reasoning360/runs/qab27nv0?nw=nwuserjalajbhandari

  • Example run: Curriculum-1435219-qwen2.5-32b-base-fsdp-temp_0.5_data_mixtures_round2_train_prompt_bsz_32

  1. Effective batch size increases over training, dips when hard samples are drawn into a batch, and rises again as the model learns to solve them
  2. Pass rates of hard examples increase over training: the model focuses on harder problems and starts to solve them (hardness is tracked as the percentile of prompts by historical pass rate)
  3. The attempt-count distribution is right-skewed: easy prompts are attempted only a few times, while hard prompts are attempted many more times


Copilot AI left a comment


Pull request overview

This PR implements pass rate-based curriculum learning for GRPO training by introducing weighted sampling that prioritizes harder samples (those with lower historical success rates).

Changes:

  • Added PassRateTracker class to track attempt counts and success rates for each prompt in the dataset
  • Added PassRateWeightedSampler class that implements curriculum learning through dynamic weighted sampling based on historical pass rates
  • Integrated curriculum learning into the DAPO trainer with pass rate tracking and curriculum-specific metrics logging
  • Updated configuration files and training scripts with curriculum learning examples

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 18 comments.

Show a summary per file

  • verl/utils/pass_rate_tracker.py: core tracker for maintaining historical pass rates and attempt counts per sample
  • verl/utils/pass_rate_weighted_sampler.py: weighted sampler that adjusts sampling probabilities based on pass rates
  • verl/utils/dataset/rl_dataset.py: added dataset_index field to enable sample tracking
  • verl/trainer/ppo/ray_trainer.py: added a comment clarifying sampler creation
  • verl/trainer/ppo/metric_utils.py: added a reward standard deviation metric
  • verl/trainer/config/data/legacy_data.yaml: added curriculum sampler configuration parameters
  • recipe/dapo/dapo_ray_trainer.py: integrated pass rate tracking and curriculum metrics logging into the training loop
  • scripts/*: added example training scripts demonstrating curriculum learning usage


point['step'] = self.global_steps

metrics['curriculum/weight_distribution_3d'] = wandb.Table(
dataframe=__import__('pandas').DataFrame(weight_3d_data)

Copilot AI Jan 24, 2026


The inline imports of wandb and pandas (via `__import__`) inside the metrics logging code are an anti-pattern. These should be imported at the module level or handled more cleanly: the dynamic import pattern can break IDE autocomplete and type checking, and it makes dependencies less clear. If wandb/pandas are optional dependencies, consider a try-except block at the module level that sets an availability flag.
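
If wandb and pandas are optional, one common module-level pattern looks like the following. This is a rough sketch; `_WANDB_AVAILABLE` and `log_weight_distribution` are hypothetical names, not part of the PR.

```python
# Module-level optional-dependency handling (illustrative sketch).
try:
    import pandas as pd
    import wandb
    _WANDB_AVAILABLE = True
except ImportError:
    pd = None
    wandb = None
    _WANDB_AVAILABLE = False


def log_weight_distribution(metrics: dict, weight_3d_data: list, step: int) -> None:
    """Attach the 3D weight-distribution table only when wandb/pandas are installed."""
    if not _WANDB_AVAILABLE:
        return
    for point in weight_3d_data:
        point["step"] = step
    metrics["curriculum/weight_distribution_3d"] = wandb.Table(
        dataframe=pd.DataFrame(weight_3d_data)
    )
```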

num_gen_batches = 0

# Add curriculum learning metrics to W&B
if self.data_sampler is not None:

Copilot AI Jan 24, 2026


The condition if self.data_sampler is not None is insufficient. It should check if the sampler is an instance of PassRateWeightedSampler specifically, since get_wandb_3d_plot_data is only available on that class. Other samplers will cause an AttributeError. Change to: if isinstance(self.data_sampler, PassRateWeightedSampler).

Suggested change
if self.data_sampler is not None:
if isinstance(self.data_sampler, PassRateWeightedSampler):

# Batch-level statistics
metrics['curriculum/min_batch_pass_rate'] = float(np.min(aggregated_successes))
metrics['curriculum/mean_batch_pass_rate'] = float(np.mean(aggregated_successes))
metrics['curriculum/effective_batch_size'] = np.sum(aggregated_successes > 0)/len(unique_indices)

Copilot AI Jan 24, 2026


Division by zero is possible when len(unique_indices) is 0, though this is unlikely in practice. Consider adding a check to avoid potential runtime errors in edge cases.
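
A minimal sketch of such a guard, assuming the metric is the fraction of unique prompts with at least one success; the helper and its arguments are hypothetical, not code from the PR.

```python
import numpy as np

def effective_batch_size(aggregated_successes: np.ndarray, num_unique: int) -> float:
    """Fraction of unique prompts with at least one success; 0.0 for an empty batch."""
    if num_unique == 0:
        return 0.0
    return float(np.sum(aggregated_successes > 0) / num_unique)

# Example
print(effective_batch_size(np.array([0.0, 1.0, 2.0]), num_unique=3))  # ~0.667
```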

Comment on lines +68 to +71
# Option 2: negative exponential scaling
x = -pass_rates / max(self.temperature, 0.01)
weights = np.exp(x)
# weights = np.exp(x - x.max()) # stable softmax exp(-(y - max_y)/temperature)

Copilot AI Jan 24, 2026


The get_weights method can produce extremely large weight values when pass_rates=-5.0 (untried samples). With the formula x = -pass_rates / temperature, this gives x=5.0/temperature (e.g., x=10.0 when temperature=0.5), leading to exp(10.0) ≈ 22,026. This can cause numerical instability and dominate sampling. Consider using the commented-out stable softmax implementation (line 71) or clipping the weights to prevent extreme values.

Suggested change
# Option 2: negative exponential scaling
x = -pass_rates / max(self.temperature, 0.01)
weights = np.exp(x)
# weights = np.exp(x - x.max()) # stable softmax exp(-(y - max_y)/temperature)
# Option 2: negative exponential scaling (use stable exponential to avoid overflow)
x = -pass_rates / max(self.temperature, 0.01)
# Use numerically stable form: subtract max before exponentiating
weights = np.exp(x - x.max())


# --- External Services ---
export STEM_LLM_JUDGE_URL="<STEM_LLM_JUDGE_URL>" # Optional: Fill in the llm-as-judge hosted URL for 'STEM' domain evaluation
export MATH_LLM_JUDGE_URL="http://azure-uk-hpc-H200-instance-853:8000" # Fill in the OmniMATH llm-as-judge hosted URL, only used to score OmniMATH dataset if not empty

Copilot AI Jan 24, 2026


The MATH_LLM_JUDGE_URL contains a hardcoded internal hostname 'azure-uk-hpc-H200-instance-853:8000'. This should be replaced with a placeholder (like the STEM_LLM_JUDGE_URL pattern) to avoid leaking internal infrastructure details and to make the script more portable across different environments.

Suggested change
export MATH_LLM_JUDGE_URL="http://azure-uk-hpc-H200-instance-853:8000" # Fill in the OmniMATH llm-as-judge hosted URL, only used to score OmniMATH dataset if not empty
export MATH_LLM_JUDGE_URL="<MATH_LLM_JUDGE_URL>" # Optional: Fill in the OmniMATH llm-as-judge hosted URL, only used to score OmniMATH dataset if not empty

Comment on lines +16 to +19
Pass Rate = successes / attempts
Weight = (1 - pass_rate) ^ (1 / temperature)
Harder samples (lower pass rate) get higher weight and are sampled more.

Copilot AI Jan 24, 2026


The class docstring describes the Weight formula as "(1 - pass_rate) ^ (1 / temperature)", but this formula is not implemented in the actual code. The PassRateWeightedSampler uses "exp(-pass_rate / temperature)" instead. This documentation is misleading and should either be removed from this class (since weighting is done in the sampler) or updated to reflect that this is just an example formula.

Suggested change
Pass Rate = successes / attempts
Weight = (1 - pass_rate) ^ (1 / temperature)
Harder samples (lower pass rate) get higher weight and are sampled more.
Pass rate is defined as: Pass Rate = successes / attempts.
This class only tracks pass rates; weighting and sampling strategies are implemented
separately (see `PassRateWeightedSampler` for an example of how pass rates can be
converted into sampling weights).
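
For reference, a toy comparison of the two formulas mentioned above (illustrative values only, not code from the PR). Both assign weight 1 to a pass rate of 0 and decay as the pass rate rises, but they decay differently, so the docstring and the sampler should agree on which one is used.

```python
import numpy as np

pass_rates = np.array([0.0, 0.25, 0.5, 0.9])
temperature = 0.5

# Formula in the tracker docstring: (1 - pass_rate) ** (1 / temperature)
docstring_weights = (1.0 - pass_rates) ** (1.0 / temperature)

# Formula the sampler uses, per this review comment: exp(-pass_rate / temperature)
sampler_weights = np.exp(-pass_rates / temperature)

print(docstring_weights)  # [1.     0.5625 0.25   0.01  ]
print(sampler_weights)    # [1.     0.6065 0.3679 0.1653]
```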

data.sampler.pass_rate_temperature=0.5 \
data.sampler.use_ema=False \
data.sampler.ema_alpha=0.1 \
data.prompt_key=prompt \git stat

Copilot AI Jan 24, 2026


There's a typo in line 221: \git stat should be removed. This appears to be an accidentally included git command that would cause a syntax error in the script.

Suggested change
data.prompt_key=prompt \git stat
data.prompt_key=prompt \

from verl.utils.profiler import marked_timer
from verl.utils.rollout_skip import RolloutSkip

from verl.utils.pass_rate_tracker import PassRateTracker

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'PassRateTracker' is not used.

Suggested change
from verl.utils.pass_rate_tracker import PassRateTracker


# --- External Services ---
export STEM_LLM_JUDGE_URL="<STEM_LLM_JUDGE_URL>" # Optional: Fill in the llm-as-judge hosted URL for 'STEM' domain evaluation
export MATH_LLM_JUDGE_URL="http://azure-uk-hpc-H200-instance-853:8000" # Fill in the OmniMATH llm-as-judge hosted URL, only used to score OmniMATH dataset if not empty

Copilot AI Jan 24, 2026


The MATH_LLM_JUDGE_URL is configured to use plain http://, so all OmniMATH scoring requests (including prompts, model outputs, and scores) will be transmitted in cleartext over the cluster network. An attacker or malicious tenant with network access could sniff or tamper with this traffic, corrupting evaluation results or exfiltrating potentially sensitive data. Use an HTTPS endpoint for the math judge service and ensure TLS certificate validation is enabled so these requests are encrypted and integrity-protected.

Suggested change
export MATH_LLM_JUDGE_URL="http://azure-uk-hpc-H200-instance-853:8000" # Fill in the OmniMATH llm-as-judge hosted URL, only used to score OmniMATH dataset if not empty
export MATH_LLM_JUDGE_URL="https://azure-uk-hpc-H200-instance-853:8000" # Fill in the OmniMATH llm-as-judge HTTPS URL, only used to score OmniMATH dataset if not empty
