Implement pass rate-based curriculum learning with weighted sampling #153
base: verl-latest-cispo
Conversation
Pull request overview
This PR implements pass rate-based curriculum learning for GRPO training by introducing weighted sampling that prioritizes harder samples (those with lower historical success rates).
Changes:
- Added `PassRateTracker` class to track attempt counts and success rates for each prompt in the dataset
- Added `PassRateWeightedSampler` class that implements curriculum learning through dynamic weighted sampling based on historical pass rates
- Integrated curriculum learning into the DAPO trainer with pass rate tracking and curriculum-specific metrics logging
- Updated configuration files and training scripts with curriculum learning examples
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| verl/utils/pass_rate_tracker.py | Core tracker for maintaining historical pass rates and attempt counts per sample |
| verl/utils/pass_rate_weighted_sampler.py | Weighted sampler that adjusts sampling probabilities based on pass rates |
| verl/utils/dataset/rl_dataset.py | Added dataset_index field to enable sample tracking |
| verl/trainer/ppo/ray_trainer.py | Added comment clarifying sampler creation |
| verl/trainer/ppo/metric_utils.py | Added reward standard deviation metric |
| verl/trainer/config/data/legacy_data.yaml | Added curriculum sampler configuration parameters |
| recipe/dapo/dapo_ray_trainer.py | Integrated pass rate tracking and curriculum metrics logging into training loop |
| scripts/* | Added example training scripts demonstrating curriculum learning usage |
```python
point['step'] = self.global_steps

metrics['curriculum/weight_distribution_3d'] = wandb.Table(
    dataframe=__import__('pandas').DataFrame(weight_3d_data)
```
Copilot AI (Jan 24, 2026):
The inline imports of wandb and pandas (using `__import__`) inside the metrics logging code are an anti-pattern. These should be imported at the module level or handled more cleanly. The dynamic import pattern breaks IDE autocomplete and type checking, and makes dependencies less clear. If wandb/pandas are optional dependencies, consider using a try-except block at the module level and setting a flag.
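One possible restructuring, sketched under the assumption that wandb and pandas are optional dependencies (the helper name and the `HAS_WANDB_DEPS` flag are illustrative, not from the PR):

```python
# Module-level optional imports guarded by a try/except and an availability flag.
try:
    import pandas as pd
    import wandb
    HAS_WANDB_DEPS = True
except ImportError:
    HAS_WANDB_DEPS = False


def log_weight_distribution(metrics, weight_3d_data):
    """Attach the 3D weight-distribution table only when wandb/pandas are available."""
    if not HAS_WANDB_DEPS:
        return metrics  # silently skip optional logging when deps are missing
    metrics['curriculum/weight_distribution_3d'] = wandb.Table(
        dataframe=pd.DataFrame(weight_3d_data)
    )
    return metrics
```

This keeps the dependency visible at the top of the module and makes the logging call a no-op rather than a crash when the optional packages are absent.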
```python
num_gen_batches = 0

# Add curriculum learning metrics to W&B
if self.data_sampler is not None:
```
Copilot AI (Jan 24, 2026):
The condition `if self.data_sampler is not None` is insufficient. It should check whether the sampler is an instance of `PassRateWeightedSampler` specifically, since `get_wandb_3d_plot_data` is only available on that class; other samplers will raise an `AttributeError`. Change to: `if isinstance(self.data_sampler, PassRateWeightedSampler)`.
```diff
- if self.data_sampler is not None:
+ if isinstance(self.data_sampler, PassRateWeightedSampler):
```
```python
# Batch-level statistics
metrics['curriculum/min_batch_pass_rate'] = float(np.min(aggregated_successes))
metrics['curriculum/mean_batch_pass_rate'] = float(np.mean(aggregated_successes))
metrics['curriculum/effective_batch_size'] = np.sum(aggregated_successes > 0) / len(unique_indices)
```
Copilot AI (Jan 24, 2026):
Division by zero is possible when `len(unique_indices)` is 0, though this is unlikely in practice. Consider adding a check to avoid potential runtime errors in edge cases.
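A guarded version of the effective-batch-size computation could look like this sketch (the function wrapper is illustrative; `aggregated_successes` and `unique_indices` are the names used in the snippet above):

```python
import numpy as np


def effective_batch_size(aggregated_successes, unique_indices):
    """Fraction of unique prompts with at least one success; 0.0 for empty batches."""
    if len(unique_indices) == 0:
        return 0.0  # avoid ZeroDivisionError on an empty batch
    return float(np.sum(np.asarray(aggregated_successes) > 0) / len(unique_indices))
```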
```python
# Option 2: negative exponential scaling
x = -pass_rates / max(self.temperature, 0.01)
weights = np.exp(x)
# weights = np.exp(x - x.max())  # stable softmax exp(-(y - max_y)/temperature)
```
Copilot AI (Jan 24, 2026):
The `get_weights` method can produce extremely large weight values when `pass_rates = -5.0` (the sentinel for untried samples). With the formula `x = -pass_rates / temperature`, this gives `x = 5.0 / temperature` (e.g., `x = 10.0` when `temperature = 0.5`), leading to `exp(10.0) ≈ 22,026`. This can cause numerical instability and dominate sampling. Consider using the commented-out stable softmax implementation (line 71) or clipping the weights to prevent extreme values.
```diff
- # Option 2: negative exponential scaling
- x = -pass_rates / max(self.temperature, 0.01)
- weights = np.exp(x)
- # weights = np.exp(x - x.max())  # stable softmax exp(-(y - max_y)/temperature)
+ # Option 2: negative exponential scaling (use stable exponential to avoid overflow)
+ x = -pass_rates / max(self.temperature, 0.01)
+ # Use numerically stable form: subtract max before exponentiating
+ weights = np.exp(x - x.max())
```
```bash
# --- External Services ---
export STEM_LLM_JUDGE_URL="<STEM_LLM_JUDGE_URL>" # Optional: Fill in the llm-as-judge hosted URL for 'STEM' domain evaluation
export MATH_LLM_JUDGE_URL="http://azure-uk-hpc-H200-instance-853:8000" # Fill in the OmniMATH llm-as-judge hosted URL, only used to score OmniMATH dataset if not empty
```
Copilot AI (Jan 24, 2026):
The `MATH_LLM_JUDGE_URL` contains a hardcoded internal hostname `azure-uk-hpc-H200-instance-853:8000`. This should be replaced with a placeholder (like the `STEM_LLM_JUDGE_URL` pattern) to avoid leaking internal infrastructure details and to make the script more portable across different environments.
```diff
- export MATH_LLM_JUDGE_URL="http://azure-uk-hpc-H200-instance-853:8000" # Fill in the OmniMATH llm-as-judge hosted URL, only used to score OmniMATH dataset if not empty
+ export MATH_LLM_JUDGE_URL="<MATH_LLM_JUDGE_URL>" # Optional: Fill in the OmniMATH llm-as-judge hosted URL, only used to score OmniMATH dataset if not empty
```
```text
Pass Rate = successes / attempts
Weight = (1 - pass_rate) ^ (1 / temperature)
Harder samples (lower pass rate) get higher weight and are sampled more.
```
Copilot AI (Jan 24, 2026):
The class docstring describes the weight formula as `(1 - pass_rate) ^ (1 / temperature)`, but this formula is not implemented in the actual code. The `PassRateWeightedSampler` uses `exp(-pass_rate / temperature)` instead. This documentation is misleading and should either be removed from this class (since weighting is done in the sampler) or updated to reflect that this is just an example formula.
```diff
- Pass Rate = successes / attempts
- Weight = (1 - pass_rate) ^ (1 / temperature)
- Harder samples (lower pass rate) get higher weight and are sampled more.
+ Pass rate is defined as: Pass Rate = successes / attempts.
+ This class only tracks pass rates; weighting and sampling strategies are implemented
+ separately (see `PassRateWeightedSampler` for an example of how pass rates can be
+ converted into sampling weights).
```
```bash
data.sampler.pass_rate_temperature=0.5 \
data.sampler.use_ema=False \
data.sampler.ema_alpha=0.1 \
data.prompt_key=prompt \git stat
```
Copilot AI (Jan 24, 2026):
There's a typo in line 221: `\git stat` should be removed. This appears to be an accidentally included git command that would cause a syntax error in the script.
```diff
- data.prompt_key=prompt \git stat
+ data.prompt_key=prompt \
```
```python
from verl.utils.profiler import marked_timer
from verl.utils.rollout_skip import RolloutSkip

from verl.utils.pass_rate_tracker import PassRateTracker
```
Copilot AI (Jan 24, 2026):
Import of `PassRateTracker` is not used.
```diff
- from verl.utils.pass_rate_tracker import PassRateTracker
```
```bash
# --- External Services ---
export STEM_LLM_JUDGE_URL="<STEM_LLM_JUDGE_URL>" # Optional: Fill in the llm-as-judge hosted URL for 'STEM' domain evaluation
export MATH_LLM_JUDGE_URL="http://azure-uk-hpc-H200-instance-853:8000" # Fill in the OmniMATH llm-as-judge hosted URL, only used to score OmniMATH dataset if not empty
```
Copilot AI (Jan 24, 2026):
The `MATH_LLM_JUDGE_URL` is configured to use plain `http://`, so all OmniMATH scoring requests (including prompts, model outputs, and scores) will be transmitted in cleartext over the cluster network. An attacker or malicious tenant with network access could sniff or tamper with this traffic, corrupting evaluation results or exfiltrating potentially sensitive data. Use an HTTPS endpoint for the math judge service and ensure TLS certificate validation is enabled so these requests are encrypted and integrity-protected.
```diff
- export MATH_LLM_JUDGE_URL="http://azure-uk-hpc-H200-instance-853:8000" # Fill in the OmniMATH llm-as-judge hosted URL, only used to score OmniMATH dataset if not empty
+ export MATH_LLM_JUDGE_URL="https://azure-uk-hpc-H200-instance-853:8000" # Fill in the OmniMATH llm-as-judge HTTPS URL, only used to score OmniMATH dataset if not empty
```
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Summary

Implements curriculum learning using pass rate-based weighted sampling for GRPO training.

Changes
- `PassRateTracker` class to track attempt and success counts for each prompt. This tracker can be used by multiple curriculum samplers.
- `PassRateWeightedSampler` class, a weighted sampler that adjusts sampling probabilities (the probability of a prompt being sampled into a batch) based on historical pass rates, optionally smoothed with an exponential moving average.

Testing
- Tested with local single-node runs
- Tested with multi-node SLURM runs (2 nodes, 8 GPUs each)
- Logs curriculum metrics: hardest_10pct/25pct/50pct/75pct pass rates, batch-level statistics
See curriculum learning runs: https://wandb.ai/mbzuai-llm/Reasoning360/runs/qab27nv0?nw=nwuserjalajbhandari
Example run: Curriculum-1435219-qwen2.5-32b-base-fsdp-temp_0.5_data_mixtures_round2_train_prompt_bsz_32
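As a closing illustration, the overall mechanism this PR describes (convert historical pass rates into weights, then sample a batch of prompts) can be sketched as below. This is a hedged toy, not the PR's actual `PassRateWeightedSampler`; the `-5.0` sentinel and temperature floor mirror the snippets quoted in the review above, while the function name and signature are invented for the example:

```python
import numpy as np


def sample_batch(pass_rates, batch_size, temperature=0.5, rng=None):
    """Draw a batch of prompt indices, favoring harder (low pass rate) prompts."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Negative-exponential curriculum weighting, computed stably.
    x = -np.asarray(pass_rates, dtype=np.float64) / max(temperature, 0.01)
    weights = np.exp(x - x.max())  # subtract max to avoid overflow; cancels on normalize
    probs = weights / weights.sum()
    # Sample distinct prompt indices for one batch.
    return rng.choice(len(pass_rates), size=batch_size, replace=False, p=probs)
```

Untried prompts (sentinel pass rate `-5.0`) and low-pass-rate prompts receive the highest probabilities, which is the curriculum effect the PR aims for.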