
Conversation


@jb3618columbia jb3618columbia commented Jan 24, 2026

Summary

Implements curriculum learning for GRPO training via pass-rate-based weighted sampling.

Changes

  • Add a PassRateTracker class to track attempt and success counts for each prompt. The tracker is sampler-agnostic and can back multiple curriculum samplers
  • Add a PassRateWeightedSampler class, a weighted sampler that adjusts each prompt's probability of being drawn into a batch based on its historical pass rate (optionally smoothed with an exponential moving average); a minimal sketch of the weighting idea follows this list
  • Update DAPOTrainer to maintain the pass rate tracker during training and log curriculum metrics
  • Minor edits to make the integration work
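
For context, the weighting idea can be sketched as follows. This is a minimal, self-contained illustration only; `SimplePassRateTracker`, `sampling_weights`, and their parameters are hypothetical names, not the actual verl classes introduced in this PR.

```python
import numpy as np

class SimplePassRateTracker:
    """Hypothetical stand-in for a pass rate tracker: counts attempts/successes per prompt."""

    def __init__(self, num_prompts: int):
        self.attempts = np.zeros(num_prompts, dtype=np.int64)
        self.successes = np.zeros(num_prompts, dtype=np.int64)

    def update(self, indices, rewards):
        # A rollout with reward > 0 is counted as a success for its prompt.
        for idx, r in zip(indices, rewards):
            self.attempts[idx] += 1
            self.successes[idx] += int(r > 0)

    def pass_rates(self) -> np.ndarray:
        # Untried prompts get pass rate 0 so they are still favored early on.
        return np.where(self.attempts > 0,
                        self.successes / np.maximum(self.attempts, 1),
                        0.0)

def sampling_weights(pass_rates: np.ndarray, temperature: float = 0.5) -> np.ndarray:
    """Lower pass rate -> higher sampling weight, via a numerically stable negative exponential."""
    x = -pass_rates / max(temperature, 0.01)
    w = np.exp(x - x.max())  # subtracting the max avoids overflow; relative weights are unchanged
    return w / w.sum()

# Example: draw a batch of 4 prompt indices, biased toward harder prompts.
tracker = SimplePassRateTracker(num_prompts=8)
tracker.update(indices=[0, 1, 2], rewards=[1.0, 0.0, 1.0])
probs = sampling_weights(tracker.pass_rates())
batch = np.random.choice(len(probs), size=4, replace=False, p=probs)
print(batch, probs.round(3))
```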

Testing

  • Tested with local single-node runs

  • Tested with multi-node SLURM runs (2 nodes, 8 GPUs each)

  • Logs curriculum metrics: hardest_10pct/25pct/50pct/75pct pass rates, batch-level statistics

  • See curriculum learning runs: https://wandb.ai/mbzuai-llm/Reasoning360/runs/qab27nv0?nw=nwuserjalajbhandari

  • Example run: Curriculum-1435219-qwen2.5-32b-base-fsdp-temp_0.5_data_mixtures_round2_train_prompt_bsz_32

  1. Effective batch size increases over training, dips when hard samples are drawn into a batch, and rises again as the model learns to solve them
  2. Pass rates of hard examples increase over training: the model focuses on harder problems and starts to solve them (hardness is tracked as the percentile of prompts by historical pass rate)
  3. The attempt-count distribution is right-skewed: easy prompts are attempted only a few times, while hard prompts are attempted many more times


Copilot AI left a comment


Pull request overview

This PR implements pass rate-based curriculum learning for GRPO training by introducing weighted sampling that prioritizes harder samples (those with lower historical success rates).

Changes:

  • Added PassRateTracker class to track attempt counts and success rates for each prompt in the dataset
  • Added PassRateWeightedSampler class that implements curriculum learning through dynamic weighted sampling based on historical pass rates
  • Integrated curriculum learning into the DAPO trainer with pass rate tracking and curriculum-specific metrics logging
  • Updated configuration files and training scripts with curriculum learning examples

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 18 comments.

Show a summary per file

  • verl/utils/pass_rate_tracker.py: core tracker for maintaining historical pass rates and attempt counts per sample
  • verl/utils/pass_rate_weighted_sampler.py: weighted sampler that adjusts sampling probabilities based on pass rates
  • verl/utils/dataset/rl_dataset.py: added dataset_index field to enable sample tracking
  • verl/trainer/ppo/ray_trainer.py: added a comment clarifying sampler creation
  • verl/trainer/ppo/metric_utils.py: added a reward standard deviation metric
  • verl/trainer/config/data/legacy_data.yaml: added curriculum sampler configuration parameters
  • recipe/dapo/dapo_ray_trainer.py: integrated pass rate tracking and curriculum metrics logging into the training loop
  • scripts/*: added example training scripts demonstrating curriculum learning usage


point['step'] = self.global_steps

metrics['curriculum/weight_distribution_3d'] = wandb.Table(
dataframe=__import__('pandas').DataFrame(weight_3d_data)

Copilot AI Jan 24, 2026


The inline imports of wandb and pandas (via `__import__`) inside the metrics logging code are an anti-pattern. These should be imported at the module level or handled more cleanly: the dynamic import pattern can break IDE autocomplete and type checking, and it makes dependencies less clear. If wandb/pandas are optional dependencies, consider a try-except block at the module level that sets an availability flag.
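
If wandb and pandas are optional, one common module-level pattern looks like the following. This is a rough sketch; `_WANDB_AVAILABLE` and `log_weight_distribution` are hypothetical names, not part of the PR.

```python
# Module-level optional-dependency handling (illustrative sketch).
try:
    import pandas as pd
    import wandb
    _WANDB_AVAILABLE = True
except ImportError:
    pd = None
    wandb = None
    _WANDB_AVAILABLE = False


def log_weight_distribution(metrics: dict, weight_3d_data: list, step: int) -> None:
    """Attach the 3D weight-distribution table only when wandb/pandas are installed."""
    if not _WANDB_AVAILABLE:
        return
    for point in weight_3d_data:
        point["step"] = step
    metrics["curriculum/weight_distribution_3d"] = wandb.Table(
        dataframe=pd.DataFrame(weight_3d_data)
    )
```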

num_gen_batches = 0

# Add curriculum learning metrics to W&B
if self.data_sampler is not None:

Copilot AI Jan 24, 2026


The condition if self.data_sampler is not None is insufficient. It should check if the sampler is an instance of PassRateWeightedSampler specifically, since get_wandb_3d_plot_data is only available on that class. Other samplers will cause an AttributeError. Change to: if isinstance(self.data_sampler, PassRateWeightedSampler).

Suggested change
if self.data_sampler is not None:
if isinstance(self.data_sampler, PassRateWeightedSampler):

# Batch-level statistics
metrics['curriculum/min_batch_pass_rate'] = float(np.min(aggregated_successes))
metrics['curriculum/mean_batch_pass_rate'] = float(np.mean(aggregated_successes))
metrics['curriculum/effective_batch_size'] = np.sum(aggregated_successes > 0)/len(unique_indices)

Copilot AI Jan 24, 2026


Division by zero is possible when len(unique_indices) is 0, though this is unlikely in practice. Consider adding a check to avoid potential runtime errors in edge cases.
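
A minimal sketch of such a guard, assuming the metric is the fraction of unique prompts with at least one success; the helper and its arguments are hypothetical, not code from the PR.

```python
import numpy as np

def effective_batch_size(aggregated_successes: np.ndarray, num_unique: int) -> float:
    """Fraction of unique prompts with at least one success; 0.0 for an empty batch."""
    if num_unique == 0:
        return 0.0
    return float(np.sum(aggregated_successes > 0) / num_unique)

# Example
print(effective_batch_size(np.array([0.0, 1.0, 2.0]), num_unique=3))  # ~0.667
```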

Comment on lines +68 to +71
# Option 2: negative exponential scaling
x = -pass_rates / max(self.temperature, 0.01)
weights = np.exp(x)
# weights = np.exp(x - x.max()) # stable softmax exp(-(y - max_y)/temperature)

Copilot AI Jan 24, 2026


The get_weights method can produce extremely large weight values when pass_rates=-5.0 (untried samples). With the formula x = -pass_rates / temperature, this gives x=5.0/temperature (e.g., x=10.0 when temperature=0.5), leading to exp(10.0) ≈ 22,026. This can cause numerical instability and dominate sampling. Consider using the commented-out stable softmax implementation (line 71) or clipping the weights to prevent extreme values.

Suggested change
# Option 2: negative exponential scaling
x = -pass_rates / max(self.temperature, 0.01)
weights = np.exp(x)
# weights = np.exp(x - x.max()) # stable softmax exp(-(y - max_y)/temperature)
# Option 2: negative exponential scaling (use stable exponential to avoid overflow)
x = -pass_rates / max(self.temperature, 0.01)
# Use numerically stable form: subtract max before exponentiating
weights = np.exp(x - x.max())


# --- External Services ---
export STEM_LLM_JUDGE_URL="<STEM_LLM_JUDGE_URL>" # Optional: Fill in the llm-as-judge hosted URL for 'STEM' domain evaluation
export MATH_LLM_JUDGE_URL="http://azure-uk-hpc-H200-instance-853:8000" # Fill in the OmniMATH llm-as-judge hosted URL, only used to score OmniMATH dataset if not empty

Copilot AI Jan 24, 2026


The MATH_LLM_JUDGE_URL contains a hardcoded internal hostname 'azure-uk-hpc-H200-instance-853:8000'. This should be replaced with a placeholder (like the STEM_LLM_JUDGE_URL pattern) to avoid leaking internal infrastructure details and to make the script more portable across different environments.

Suggested change
export MATH_LLM_JUDGE_URL="http://azure-uk-hpc-H200-instance-853:8000" # Fill in the OmniMATH llm-as-judge hosted URL, only used to score OmniMATH dataset if not empty
export MATH_LLM_JUDGE_URL="<MATH_LLM_JUDGE_URL>" # Optional: Fill in the OmniMATH llm-as-judge hosted URL, only used to score OmniMATH dataset if not empty

Comment on lines +16 to +19
Pass Rate = successes / attempts
Weight = (1 - pass_rate) ^ (1 / temperature)
Harder samples (lower pass rate) get higher weight and are sampled more.

Copilot AI Jan 24, 2026


The class docstring describes the Weight formula as "(1 - pass_rate) ^ (1 / temperature)", but this formula is not implemented in the actual code. The PassRateWeightedSampler uses "exp(-pass_rate / temperature)" instead. This documentation is misleading and should either be removed from this class (since weighting is done in the sampler) or updated to reflect that this is just an example formula.

Suggested change
Pass Rate = successes / attempts
Weight = (1 - pass_rate) ^ (1 / temperature)
Harder samples (lower pass rate) get higher weight and are sampled more.
Pass rate is defined as: Pass Rate = successes / attempts.
This class only tracks pass rates; weighting and sampling strategies are implemented
separately (see `PassRateWeightedSampler` for an example of how pass rates can be
converted into sampling weights).
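
For reference, a toy comparison of the two formulas mentioned above (illustrative values only, not code from the PR). Both assign weight 1 to a pass rate of 0 and decay as the pass rate rises, but they decay differently, so the docstring and the sampler should agree on which one is used.

```python
import numpy as np

pass_rates = np.array([0.0, 0.25, 0.5, 0.9])
temperature = 0.5

# Formula in the tracker docstring: (1 - pass_rate) ** (1 / temperature)
docstring_weights = (1.0 - pass_rates) ** (1.0 / temperature)

# Formula the sampler uses, per this review comment: exp(-pass_rate / temperature)
sampler_weights = np.exp(-pass_rates / temperature)

print(docstring_weights)  # [1.     0.5625 0.25   0.01  ]
print(sampler_weights)    # [1.     0.6065 0.3679 0.1653]
```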

data.sampler.pass_rate_temperature=0.5 \
data.sampler.use_ema=False \
data.sampler.ema_alpha=0.1 \
data.prompt_key=prompt \git stat

Copilot AI Jan 24, 2026


There's a typo in line 221: \git stat should be removed. This appears to be an accidentally included git command that would cause a syntax error in the script.

Suggested change
data.prompt_key=prompt \git stat
data.prompt_key=prompt \

from verl.utils.profiler import marked_timer
from verl.utils.rollout_skip import RolloutSkip

from verl.utils.pass_rate_tracker import PassRateTracker

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'PassRateTracker' is not used.

Suggested change
from verl.utils.pass_rate_tracker import PassRateTracker


# --- External Services ---
export STEM_LLM_JUDGE_URL="<STEM_LLM_JUDGE_URL>" # Optional: Fill in the llm-as-judge hosted URL for 'STEM' domain evaluation
export MATH_LLM_JUDGE_URL="http://azure-uk-hpc-H200-instance-853:8000" # Fill in the OmniMATH llm-as-judge hosted URL, only used to score OmniMATH dataset if not empty

Copilot AI Jan 24, 2026


The MATH_LLM_JUDGE_URL is configured to use plain http://, so all OmniMATH scoring requests (including prompts, model outputs, and scores) will be transmitted in cleartext over the cluster network. An attacker or malicious tenant with network access could sniff or tamper with this traffic, corrupting evaluation results or exfiltrating potentially sensitive data. Use an HTTPS endpoint for the math judge service and ensure TLS certificate validation is enabled so these requests are encrypted and integrity-protected.

Suggested change
export MATH_LLM_JUDGE_URL="http://azure-uk-hpc-H200-instance-853:8000" # Fill in the OmniMATH llm-as-judge hosted URL, only used to score OmniMATH dataset if not empty
export MATH_LLM_JUDGE_URL="https://azure-uk-hpc-H200-instance-853:8000" # Fill in the OmniMATH llm-as-judge HTTPS URL, only used to score OmniMATH dataset if not empty
