[WIP] R1-Zero-like experiments #569

lewtun · 2025-04-01T08:36:40Z

Context: https://huggingface.co/spaces/open-r1/README/discussions/20

Slurm commands

--- v00 ablations ---

# v00.0X (all)
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.00 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.00 --output_dir=data/R1-Zero-Qwen-7B-v00.00 --run_name=v00.00 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# equal reward weights
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.01 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_weights 1.0 1.0 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.01 --output_dir=data/R1-Zero-Qwen-7B-v00.01 --run_name=v00.01 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# re-run baseline with new TRL
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.02 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_weights 1.0 1.0 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.02 --output_dir=data/R1-Zero-Qwen-7B-v00.02 --run_name=v00.02 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# baseline with soft format reward
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.02-soft-format --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_funcs accuracy format soft_format --reward_weights 1.0 1.0 1.0 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.02-soft-format --output_dir=data/R1-Zero-Qwen-7B-v00.02-soft-format --run_name=v00.02-soft-format --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# baseline with soft format reward 0.5
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.02-soft-format-0.5 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_funcs accuracy format soft_format --reward_weights 1.0 0.5 0.5 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.02-soft-format-0.5 --output_dir=data/R1-Zero-Qwen-7B-v00.02-soft-format-0.5 --run_name=v00.02-soft-format-0.5 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# baseline with soft format reward 0.25
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.02-soft-format-0.25 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_funcs accuracy format soft_format --reward_weights 1.0 0.25 0.25 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.02-soft-format-0.25 --output_dir=data/R1-Zero-Qwen-7B-v00.02-soft-format-0.25 --run_name=v00.02-soft-format-0.25 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# baseline with soft format reward 0.125
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.02-soft-format-0.125 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_funcs accuracy format soft_format --reward_weights 1.0 0.125 0.125 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.02-soft-format-0.125 --output_dir=data/R1-Zero-Qwen-7B-v00.02-soft-format-0.125 --run_name=v00.02-soft-format-0.125 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# baseline with overlong mask
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.02-with-mask --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_weights 1.0 1.0 --mask_truncated_completions=true --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.02-with-mask --output_dir=data/R1-Zero-Qwen-7B-v00.02-with-mask --run_name=v00.02-with-mask --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# clip higher
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.03 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_weights 1.0 1.0 --epsilon_high=0.28 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.03 --output_dir=data/R1-Zero-Qwen-7B-v00.03 --run_name=v00.03 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# mu=2
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.04 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_weights 1.0 1.0 --num_iterations=2 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.04 --output_dir=data/R1-Zero-Qwen-7B-v00.04 --run_name=v00.04 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# mu=4
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.05 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_weights 1.0 1.0 --num_iterations=4 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.05 --output_dir=data/R1-Zero-Qwen-7B-v00.05 --run_name=v00.05 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# mu=4 with mask
# sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.05-with-mask --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_weights 1.0 1.0 --num_iterations=4 --mask_truncated_completions=true --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.05-with-mask --output_dir=data/R1-Zero-Qwen-7B-v00.05-with-mask --run_name=v00.05-with-mask --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# beta=0
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.06 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_weights 1.0 1.0 --beta=0.0 --sync_ref_model=false --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.06 --output_dir=data/R1-Zero-Qwen-7B-v00.06 --run_name=v00.06 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# beta=0 with mask
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.06-with-mask --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_weights 1.0 1.0 --beta=0.0 --mask_truncated_completions=true --sync_ref_model=false --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.06-with-mask --output_dir=data/R1-Zero-Qwen-7B-v00.06-with-mask --run_name=v00.06-with-mask --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# baseline with no sync
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.07 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_weights 1.0 1.0 --sync_ref_model=false --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.07 --output_dir=data/R1-Zero-Qwen-7B-v00.07 --run_name=v00.07 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# dr grpo loss (no scaling)
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.08 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_weights 1.0 1.0 --scale_rewards=false --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.08 --output_dir=data/R1-Zero-Qwen-7B-v00.08 --run_name=v00.08 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# dr grpo (no ref model)
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.09 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_weights 1.0 1.0 --scale_rewards=false --sync_ref_model=false --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.09 --output_dir=data/R1-Zero-Qwen-7B-v00.09 --run_name=v00.09 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# mu=4, beta=0
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.10 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_weights 1.0 1.0 --num_iterations=4 --beta=0.0 --sync_ref_model=false --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.10 --output_dir=data/R1-Zero-Qwen-7B-v00.10 --run_name=v00.10 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# mu=4, beta=0 with mask
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.10-with-mask --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_weights 1.0 1.0 --num_iterations=4 --beta=0.0 --sync_ref_model=false --mask_truncated_completions=true --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.10-with-mask --output_dir=data/R1-Zero-Qwen-7B-v00.10-with-mask --run_name=v00.10-with-mask --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# mu=2, beta=0
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v00.11 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_weights 1.0 1.0 --num_iterations=2 --beta=0.0 --sync_ref_model=false --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v00.11 --output_dir=data/R1-Zero-Qwen-7B-v00.11 --run_name=v00.11 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
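
The three weighted soft-format runs above (0.5, 0.25, 0.125) differ only in the reward weight and the tag derived from it, so they can be rendered from a loop. The following is a minimal dry-run sketch, not part of the repository: `render_soft_format_sweep` is a hypothetical helper name, and it only prints the commands so they can be reviewed before being piped to a shell.

```shell
# Hypothetical helper (not in open-r1): print, but do not submit, one sbatch
# command per soft-format weight. All flags are copied from the commands above.
render_soft_format_sweep() {
  slurm_flags="--mail-type=ALL --mail-user=lewis+hfc@huggingface.co --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err"
  hub_flags="--hub_model_id=open-r1/R1-Zero-Qwen-7B-Math"
  wandb_flags="--wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math"
  for w in 0.5 0.25 0.125; do
    # The weight is reused as the suffix of the job name, revision and run name.
    tag="v00.02-soft-format-${w}"
    printf '%s\n' "sbatch ${slurm_flags} --job-name=r1-zero-7b-${tag} --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v00.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --reward_funcs accuracy format soft_format --reward_weights 1.0 ${w} ${w} ${hub_flags} --hub_model_revision=${tag} --output_dir=data/R1-Zero-Qwen-7B-${tag} --run_name=${tag} ${wandb_flags}'"
  done
}

render_soft_format_sweep
```

The run with all weights at 1.0 is left out on purpose because its tag (`v00.02-soft-format`) carries no weight suffix.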

--- v01.0X (best settings) ---

# v01.0X (baselines)
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v01.00 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v01.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v01.00 --output_dir=data/R1-Zero-Qwen-7B-v01.00 --run_name=v01.00_baseline --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# baseline with 64 unique prompts per batch
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v01.00-bs-64 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v01.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --gradient_accumulation_steps=32 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v01.00-bs-64 --output_dir=data/R1-Zero-Qwen-7B-v01.00-bs-64 --run_name=v01.00_baseline_bs-64 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# baseline with 128 unique prompts per batch
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v01.00-bs-128 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v01.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --gradient_accumulation_steps=64 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v01.00-bs-128 --output_dir=data/R1-Zero-Qwen-7B-v01.00-bs-128 --run_name=v01.00_baseline_bs-128 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# mu=4
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v01.01 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v01.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --num_iterations=4 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v01.01 --output_dir=data/R1-Zero-Qwen-7B-v01.01 --run_name=v01.01_mu-4 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# mu=4 with format=0.5
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v01.02 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v01.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --num_iterations=4 --reward_weights 1.0 0.5 0.5 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v01.02 --output_dir=data/R1-Zero-Qwen-7B-v01.02 --run_name=v01.02_mu-4_format-0.5 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# mu=2 with format=0.5
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v01.03 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v01.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --num_iterations=2 --reward_weights 1.0 0.5 0.5 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v01.03 --output_dir=data/R1-Zero-Qwen-7B-v01.03 --run_name=v01.03_mu-2_format-0.5 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# dr grpo
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v01.04 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v01.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --loss_type=dr_grpo --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v01.04 --output_dir=data/R1-Zero-Qwen-7B-v01.04 --run_name=v01.04_dr-grpo --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
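
The "64 / 128 unique prompts per batch" runs above change only `--gradient_accumulation_steps`. A rough sketch of the arithmetic behind those labels follows. It assumes the number of completions per optimizer step is per-device batch size × training processes × accumulation steps, and that each prompt is sampled `num_generations` times. The three constants are assumptions picked for illustration; the real values live in the `v01.00` recipe and the Slurm script, which are not shown here.

```shell
# ASSUMED values for illustration only; check the recipe config for the real ones.
per_device_train_batch_size=4   # assumption
num_training_processes=8        # assumption
num_generations=16              # assumption

# Hypothetical helper: unique prompts per optimizer step.
# $1 = gradient_accumulation_steps
unique_prompts() {
  echo $(( per_device_train_batch_size * num_training_processes * $1 / num_generations ))
}

unique_prompts 32   # prints 64 under the assumptions above
unique_prompts 64   # prints 128 under the assumptions above
```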


--- v02.0X (levels 2-5) ---
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v02.00 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v02.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v02.00 --output_dir=data/R1-Zero-Qwen-7B-v02.00 --run_name=v02.00_level-2-5 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'

--- v03.0X (levels 3-5) ---
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v03.00 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v03.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v03.00 --output_dir=data/R1-Zero-Qwen-7B-v03.00 --run_name=v03.00_level-3-5 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'

--- v04.0X (levels 4-5) ---
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v04.00 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v04.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v04.00 --output_dir=data/R1-Zero-Qwen-7B-v04.00 --run_name=v04.00_level-4-5 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# level 5 only
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v04.10 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v04.10 --accelerator zero3 --args '--learning_rate=1.0e-6 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v04.10 --output_dir=data/R1-Zero-Qwen-7B-v04.10 --run_name=v04.10_level-5 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'


--- v05.0X (DAPO) ---
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.00 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.00 --output_dir=data/R1-Zero-Qwen-7B-v05.00 --run_name=v05.00_baseline --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# clip higher
# sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.01 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --epsilon_high=0.28 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.01 --output_dir=data/R1-Zero-Qwen-7B-v05.01 --run_name=v05.01_eps-high-0.28 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# dr grpo
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.02 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --loss_type=dr_grpo --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.02 --output_dir=data/R1-Zero-Qwen-7B-v05.02 --run_name=v05.02_dr-grpo --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# w/out scaled rewards
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.03 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --scale_rewards=false --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.03 --output_dir=data/R1-Zero-Qwen-7B-v05.03 --run_name=v05.03_dr-grpo --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# dr grpo w/out scaled rewards
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.04 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --loss_type=dr_grpo --scale_rewards=false --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.04 --output_dir=data/R1-Zero-Qwen-7B-v05.04 --run_name=v05.04_dr-grpo_scale-reward-false --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# no masking
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.05 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --mask_truncated_completions=false --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.05 --output_dir=data/R1-Zero-Qwen-7B-v05.05 --run_name=v05.05_no-mask --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# # mu=2
# sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.06 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --scale_rewards=false --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.06 --output_dir=data/R1-Zero-Qwen-7B-v05.06 --run_name=v05.06_mu-2 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# # mu=4
# sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.07 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --scale_rewards=false --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.07 --output_dir=data/R1-Zero-Qwen-7B-v05.07 --run_name=v05.07_mu-4 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# mu=2
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.08 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --num_iterations=2 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.08 --output_dir=data/R1-Zero-Qwen-7B-v05.08 --run_name=v05.08_mu-2 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# mu=4
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.09 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --num_iterations=4 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.09 --output_dir=data/R1-Zero-Qwen-7B-v05.09 --run_name=v05.09_mu-4 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# bs=64
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.10 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --gradient_accumulation_steps=32 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.10 --output_dir=data/R1-Zero-Qwen-7B-v05.10 --run_name=v05.10_bs-64 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# bs=128
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.11 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --gradient_accumulation_steps=64 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.11-fix --output_dir=data/R1-Zero-Qwen-7B-v05.11-fix --run_name=v05.11_bs-128 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# max_tokens=16k
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.12 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --max_completion_length=16384 --gradient_accumulation_steps=32 --per_device_train_batch_size=2 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.12 --output_dir=data/R1-Zero-Qwen-7B-v05.12 --run_name=v05.12_ctx-16k --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# mu=4, eps=0.28
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.13 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --num_iterations=4 --epsilon_high=0.28 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.13 --output_dir=data/R1-Zero-Qwen-7B-v05.13 --run_name=v05.13_mu-4_eps-0.28 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# baseline with beta=0.001
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.14 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --beta=0.001 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.14 --output_dir=data/R1-Zero-Qwen-7B-v05.14 --run_name=v05.14_beta-0.001 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# baseline once per batch
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.15 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.15 --output_dir=data/R1-Zero-Qwen-7B-v05.15 --run_name=v05.15_baseline_once-per-batch --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# baseline with DP=4, TP=2
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.16 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.16 --output_dir=data/R1-Zero-Qwen-7B-v05.16 --run_name=v05.16_dp-4-tp-2 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# baseline with DP=2, TP=4
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.17 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.17 --output_dir=data/R1-Zero-Qwen-7B-v05.17 --run_name=v05.17_dp-2-tp-4 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# baseline with DP=1, TP=4
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.18 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.18 --output_dir=data/R1-Zero-Qwen-7B-v05.18 --run_name=v05.18_dp-1-tp-4 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# baseline with DP=8, TP=1
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.19 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --args '--learning_rate=1.0e-6 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.19 --output_dir=data/R1-Zero-Qwen-7B-v05.19 --run_name=v05.19_dp-8-tp-1 --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
# baseline with 3 epochs
sbatch --mail-type=ALL --mail-user=lewis+hfc@huggingface.co  --output=/fsx/h4/logs/%x-%j.out --err=/fsx/h4/logs/%x-%j.err --job-name=r1-zero-7b-v05.20 --nodes=2 slurm/train.slurm --model OpenR1-Zero-7B-Math --task grpo --config v05.00 --accelerator zero3 --dp 4 --tp 2 --args '--learning_rate=1.0e-6 --num_train_epochs=3 --hub_model_id=open-r1/R1-Zero-Qwen-7B-Math --hub_model_revision=v05.20 --output_dir=data/R1-Zero-Qwen-7B-v05.20 --run_name=v05.20_3-epochs --wandb_entity=huggingface --wandb_project=open-r1 --wandb_run_group=r1-zero-qwen-7b-math'
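
Every command above writes its logs to `/fsx/h4/logs/%x-%j.out` and `/fsx/h4/logs/%x-%j.err`; Slurm expands `%x` to the job name and `%j` to the job ID. A small sketch of that expansion, handy for locating the logs of a given run, is shown below. `log_paths` is a hypothetical helper and the job ID is made up.

```shell
# Hypothetical helper: expand the %x-%j filename pattern by hand.
# $1 = job name (the --job-name value), $2 = Slurm job ID
log_paths() {
  job_name="$1"
  job_id="$2"
  printf '/fsx/h4/logs/%s-%s.out\n' "$job_name" "$job_id"
  printf '/fsx/h4/logs/%s-%s.err\n' "$job_name" "$job_id"
}

# 123456 is a made-up job ID for illustration.
log_paths r1-zero-7b-v05.20 123456
```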

lewtun added 7 commits March 29, 2025 12:05

Add R1 Zero 7B

b5e6f9c

Fix chat template

8a4af61

Add new difficulty levels

9e0e478

Add medium, hard, ultra hard recipes

b35213c

Fix accuracy rewards

1d6c0bb

Return None for invalid samples

5747cfc

Fix order of inputs

1078b73

lewtun marked this pull request as draft April 1, 2025 08:37

lewtun added 22 commits April 1, 2025 08:37

Use None for unverified

d9c8cd8

Merge branch 'main' into r1-zero

8f26046

Pin trl

5fe41f0

Set defaults

f22657b

Log unique only

82a1167

Revert config

2897519

Use proper dataset

d51de45

Pin TRL

f1832c5

Clean up

995beb8

Merge branch 'main' into r1-zero

1d7d66a

Add soft format reward

10a555b

Fix soft reward to be really soft

0f98a5a

Merge branch 'main' into r1-zero

23b7b69

Pin TRL for overlong masking

f62e42a

Fix liger

939c74c

Add v01

9bed487

Add level configs and DAPO

b29e672

Fix

7a8dead

Merge branch 'main' into r1-zero

2d74588

Add q3

c1d2352

Parse GAS

8500f41

Add hack for lighteval

3c312f8

lewtun and others added 19 commits April 16, 2025 14:44

Merge branch 'main' into r1-zero

b6a73c0

Merge branch 'main' into r1-zero

a5f3baa

Pin TRL

f3920f8

Merge branch 'main' into r1-zero

06bdd50

Add 32B recipe

2f0b983

Fix sharding in Slurm

be72ce6

Tune recipe

0df1654

Fix attempt on Slurm

c24ffd7

Hack

2715d31

Wait

cebaad5

Revert slurm

2f4b0da

Fix

f27c732

Remove hf-transfer in favour of hf-xet

5f0b8f8

Pin transformers

46c1656

Merge branch 'main' into r1-zero

2c0cac5

add gen batch exp config

8d993d5

adds weighted code reward

a82c1fd

add latest configs

d9a6c08

Merge branch 'main' into r1-zero

464d951

lewtun mentioned this pull request May 9, 2025

When I run the GRPO demo, I find that format_reward is always 0！！！ #235

Open

edbeeching and others added 3 commits May 9, 2025 20:09

Merge branch 'main' into r1-zero

b430693

Merge branch 'main' into r1-zero

0ed9ea3

Merge branch 'main' into r1-zero

a401d64
