Description
Thank you for the excellent work!
I trained the Qwen2.5-7B-Instruct model using the provided training script on 4 H100 GPUs. To prevent out-of-memory errors, I set micro_batch_size to 2 while keeping all other parameters at their default values. Below is my training script (a short note on the resulting effective batch size follows it):
uid="$(date +%Y%m%d_%H%M%S)"
base_model="Qwen/Qwen2.5-7B-Instruct"
lr=1e-5
min_lr=0
epochs=5
weight_decay=1e-4 # -> the same training pipe as slurm_training
micro_batch_size=2 # -> batch_size will be 16 if 8 gpus
push_to_hub=false
gradient_accumulation_steps=1
max_steps=-1
gpu_count=$(nvidia-smi -L | wc -l)
torchrun --nproc-per-node ${gpu_count} --master_port 12345 \
    train/sft.py \
    --per_device_train_batch_size=${micro_batch_size} \
    --per_device_eval_batch_size=${micro_batch_size} \
    --gradient_accumulation_steps=${gradient_accumulation_steps} \
    --num_train_epochs=${epochs} \
    --max_steps=${max_steps} \
    --train_file_path="simplescaling/s1K_tokenized" \
    --model_name=${base_model} \
    --warmup_ratio=0.05 \
    --fsdp="full_shard auto_wrap" \
    --fsdp_config="train/fsdp_config_qwen.json" \
    --bf16=True \
    --eval_strategy="no" \
    --eval_steps=50 \
    --logging_steps=1 \
    --save_strategy="no" \
    --lr_scheduler_type="cosine" \
    --learning_rate=${lr} \
    --weight_decay=${weight_decay} \
    --adam_beta1=0.9 \
    --adam_beta2=0.95 \
    --output_dir="ckpts/s1_${uid}" \
    --hub_model_id="simplescaling7b/s1-${uid}" \
    --push_to_hub=${push_to_hub} \
    --save_only_model=True \
    --gradient_checkpointing=True
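For reference, the effective batch size is per_device_train_batch_size * gpu_count * gradient_accumulation_steps, so this run trains with 2 * 4 * 1 = 8 rather than the 16 that the script comment assumes for 8 GPUs. The lines below are only a sketch of how the 8-GPU effective batch size could be matched on 4 GPUs; I have not verified that this is what the paper's setting requires.

# Effective batch size = per_device_train_batch_size * gpu_count * gradient_accumulation_steps.
# This run: 2 * 4 * 1 = 8; the script comment assumes 2 * 8 * 1 = 16 on 8 GPUs.
# Hypothetical, untested adjustment to keep the effective batch size at 16 on 4 GPUs:
micro_batch_size=2
gradient_accumulation_steps=2  # 2 * 4 * 2 = 16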
Here is my training loss curve, which closely resembles the one reported in the paper.
However, performance on the evaluation benchmarks has not improved noticeably. Specifically, the original Qwen2.5-7B-Instruct model achieves 16.67% on AIME 2024, 33.84% on GPQA Diamond, and 77.0% on MATH500. After fine-tuning, the results are 16.67% on AIME 2024, 37.37% on GPQA Diamond, and 75.2% on MATH500: a modest gain on GPQA Diamond, no change on AIME 2024, and a slight drop on MATH500.
I’m wondering whether s1K is specifically designed for the Qwen2.5-32B-Instruct model or whether it can generalize to models of other sizes. Thank you!
Initial Qwen2.5-7B-Instruct results:
"results": {
"aime24_nofigures": {
"alias": "aime24_nofigures",
"exact_match,none": 0.16666666666666666,
"exact_match_stderr,none": "N/A",
"extracted_answers,none": -1,
"extracted_answers_stderr,none": "N/A"
},
"gpqa_diamond_openai": {
"alias": "gpqa_diamond_openai",
"exact_match,none": 0.3383838383838384,
"exact_match_stderr,none": "N/A",
"extracted_answers,none": -1,
"extracted_answers_stderr,none": "N/A"
},
"openai_math": {
"alias": "openai_math",
"exact_match,none": 0.77,
"exact_match_stderr,none": "N/A",
"extracted_answers,none": -1,
"extracted_answers_stderr,none": "N/A"
}
},
Results after fine-tuning:
"results": {
"aime24_nofigures": {
"alias": "aime24_nofigures",
"exact_match,none": 0.16666666666666666,
"exact_match_stderr,none": "N/A",
"extracted_answers,none": -1,
"extracted_answers_stderr,none": "N/A"
},
"gpqa_diamond_openai": {
"alias": "gpqa_diamond_openai",
"exact_match,none": 0.37373737373737376,
"exact_match_stderr,none": "N/A",
"extracted_answers,none": -1,
"extracted_answers_stderr,none": "N/A"
},
"openai_math": {
"alias": "openai_math",
"exact_match,none": 0.752,
"exact_match_stderr,none": "N/A",
"extracted_answers,none": -1,
"extracted_answers_stderr,none": "N/A"
}
},
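For convenience, here is a minimal sketch for pulling the exact_match scores out of the two lm-eval result files and printing them side by side; results_base.json and results_sft.json are placeholder names for wherever the JSON excerpts above were saved.

# Placeholder file names; point these at the two result files quoted above.
for task in aime24_nofigures gpqa_diamond_openai openai_math; do
    before=$(jq -r ".results[\"${task}\"][\"exact_match,none\"]" results_base.json)
    after=$(jq -r ".results[\"${task}\"][\"exact_match,none\"]" results_sft.json)
    echo "${task}: ${before} -> ${after}"
done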