Introduction:
First, thank you for the excellent paper and codebase.
Goal:
I am attempting to reproduce the training results reported for the s1.1-32B experiment.
My Setup:
- Model: Qwen/Qwen2.5-32B-Instruct
- Key parameter: block_size=20000 (see the length check below)
- Hardware: 16 x A100 80GB GPUs
- Script: train/sft_multinode.sh for training; no other code was modified.
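Since block_size is the one parameter I set explicitly, I also counted how many training examples exceed that length, in case truncation matters here. This is a rough sketch only: the dataset name and the "text" field are my assumptions; substitute whatever train/sft_multinode.sh actually loads and tokenizes.

```python
# Rough sketch: count training examples longer than my block_size.
# Assumptions: "simplescaling/s1K-1.1" is the training set, and "text" is a
# placeholder field -- replace it with the concatenated prompt + reasoning
# trace that the training script actually feeds to the tokenizer.
from datasets import load_dataset
from transformers import AutoTokenizer

BLOCK_SIZE = 20000

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
dataset = load_dataset("simplescaling/s1K-1.1", split="train")

too_long = 0
for example in dataset:
    n_tokens = len(tokenizer(example["text"]).input_ids)  # placeholder field
    if n_tokens > BLOCK_SIZE:
        too_long += 1

print(f"{too_long}/{len(dataset)} examples exceed block_size={BLOCK_SIZE}")
```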
Problem:
The training loss curve from my retraining run does not match the curve in your primary WandB report for this experiment.
- Expected behavior (based on the WandB report): training loss decreases steadily and settles around 0.4.
- Actual behavior: my loss curve looks more like the example curve in the paper's Figure 9; it decreases in step-like drops and eventually settles below 0.1.
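For reference, this is roughly how I exported my loss curve via the WandB public API so the two runs can be compared number-by-number; the run path and metric key below are placeholders for my actual run.

```python
# Sketch: export my run's training-loss history for side-by-side comparison.
# "my-entity/s1-repro/my-run-id" and "train/loss" are placeholders -- use the
# actual entity/project/run-id and whatever metric key the run logs.
import wandb

api = wandb.Api()
run = api.run("my-entity/s1-repro/my-run-id")        # placeholder run path
history = run.history(keys=["train/loss"])           # metric key assumed

print(history.tail(20))                              # region where loss settles
history.to_csv("repro_train_loss.csv", index=False)  # attachable CSV
```

Happy to attach the exported CSV or share the full WandB run if that helps with debugging.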
Request:
Could you clarify whether there are known configuration or setup differences between the run in the WandB report (which reaches ~0.4 loss) and the conditions that produce the curve shape shown in the appendix (Figure 9)? Any guidance on replicating the ~0.4 loss result would be appreciated.
Thank you for your time and assistance.