
Performance on Qwen2.5-7B-Instruct #42

Open
@lichangh20

Description


Thank you for the excellent work!

I trained the Qwen2.5-7B-Instruct model using the provided training script on 4 H100 GPUs. To prevent out-of-memory errors, I set micro_batch_size to 2 while keeping all other parameters at their default values. Below is my training script:

uid="$(date +%Y%m%d_%H%M%S)"
base_model="Qwen/Qwen2.5-7B-Instruct"
lr=1e-5
min_lr=0
epochs=5
weight_decay=1e-4 # -> the same training pipe as slurm_training
micro_batch_size=2 # -> batch_size will be 16 if 8 gpus
push_to_hub=false
gradient_accumulation_steps=1
max_steps=-1
gpu_count=$(nvidia-smi -L | wc -l)

torchrun --nproc-per-node ${gpu_count} --master_port 12345 \
train/sft.py \
--per_device_train_batch_size=${micro_batch_size} \
--per_device_eval_batch_size=${micro_batch_size} \
--gradient_accumulation_steps=${gradient_accumulation_steps} \
--num_train_epochs=${epochs} \
--max_steps=${max_steps} \
--train_file_path="simplescaling/s1K_tokenized" \
--model_name=${base_model} \
--warmup_ratio=0.05 \
--fsdp="full_shard auto_wrap" \
--fsdp_config="train/fsdp_config_qwen.json" \
--bf16=True \
--eval_strategy="no" \
--eval_steps=50 \
--logging_steps=1 \
--save_strategy="no" \
--lr_scheduler_type="cosine" \
--learning_rate=${lr} \
--weight_decay=${weight_decay} \
--adam_beta1=0.9 \
--adam_beta2=0.95 \
--output_dir="ckpts/s1_${uid}" \
--hub_model_id="simplescaling7b/s1-${uid}" \
--push_to_hub=${push_to_hub} \
--save_only_model=True \
--gradient_checkpointing=True
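For reference, the effective global batch size implied by these flags follows the standard data-parallel product (this is a generic sanity-check sketch, not code from this repo; the function name is made up):

```python
def effective_batch_size(per_device: int, gpu_count: int, grad_accum: int) -> int:
    """Global batch size under data parallelism (FSDP included):
    per-device batch x number of GPUs x gradient accumulation steps."""
    return per_device * gpu_count * grad_accum

# My run: micro_batch_size=2 on 4 H100s, no accumulation -> global batch of 8.
print(effective_batch_size(2, 4, 1))  # -> 8

# The script comment's reference setup (8 GPUs, micro batch 2) -> 16.
print(effective_batch_size(2, 8, 1))  # -> 16
```

So my global batch size (8) is half of the 8-GPU reference setup (16), which may itself matter for reproducing the reported numbers.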

Here is my training loss curve, which closely resembles the one reported in the paper.

[Image: training loss curve]

However, performance on the evaluation benchmarks has not improved significantly. Specifically, the original Qwen2.5-7B-Instruct model achieves 16.67% on AIME 2024, 33.84% on GPQA Diamond, and 77% on MATH500. After fine-tuning, the results are 16.67% on AIME 2024, 37.37% on GPQA Diamond, and 75.2% on MATH500 — AIME is unchanged, GPQA Diamond is up about 3.5 points, and MATH500 is down about 1.8 points, i.e. a marginal and mixed change overall.

I’m wondering whether s1K is specifically designed for the Qwen2.5-32B-Instruct model or whether it can generalize to models of other sizes. Thank you!

Initial Qwen2.5-7B-Instruct results:

"results": {
    "aime24_nofigures": {
      "alias": "aime24_nofigures",
      "exact_match,none": 0.16666666666666666,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    },
    "gpqa_diamond_openai": {
      "alias": "gpqa_diamond_openai",
      "exact_match,none": 0.3383838383838384,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    },
    "openai_math": {
      "alias": "openai_math",
      "exact_match,none": 0.77,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    }
  },

Results after model fine-tuning:

"results": {
    "aime24_nofigures": {
      "alias": "aime24_nofigures",
      "exact_match,none": 0.16666666666666666,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    },
    "gpqa_diamond_openai": {
      "alias": "gpqa_diamond_openai",
      "exact_match,none": 0.37373737373737376,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    },
    "openai_math": {
      "alias": "openai_math",
      "exact_match,none": 0.752,
      "exact_match_stderr,none": "N/A",
      "extracted_answers,none": -1,
      "extracted_answers_stderr,none": "N/A"
    }
  },
