Issue Reproducing s1.1-32B Training Loss (Observed vs. WandB) #108

Open
@dzh19990407

Description

Introduction:
First, thank you for the excellent paper and codebase.

Goal:
I am attempting to reproduce the training results for the s1.1-32B experiment as reported.

My Setup:

  • Model: `Qwen/Qwen2.5-32B-Instruct`
  • Key parameter: `block_size=20000`
  • Hardware: 16 × A100 80GB GPUs
  • Script: I used `train/sft_multinode.sh` for training and did not modify any other code. (A sanity-check sketch of the effective batch math follows this list.)
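
For context, here is the effective-batch arithmetic I am assuming when comparing setups; `per_device_batch_size` and `grad_accum` are placeholders for whatever `train/sft_multinode.sh` ultimately passes to the trainer, since I have not read those values out of the script:

```python
# Hypothetical sanity check: sequences and token budget per optimizer step
# under the setup above. per_device_batch_size and grad_accum are assumptions,
# not values confirmed from the repo.
num_gpus = 16
block_size = 20_000        # max packed/truncated sequence length
per_device_batch_size = 1  # assumption
grad_accum = 1             # assumption

effective_batch = num_gpus * per_device_batch_size * grad_accum
print(f"sequences per optimizer step: {effective_batch}")
print(f"token budget per step (upper bound): {effective_batch * block_size:,}")
```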

Problem:
The training loss curve I observe in my reproduction run does not match the curve in your primary WandB report for this experiment.

  • Expected behavior (based on the WandB report): training loss decreases steadily and settles around 0.4.
  • Actual behavior: my loss curve instead resembles the example curve in the paper's Figure 9: it decreases in step-like drops and eventually settles below 0.1. (A sketch for overlaying the two curves follows this list.)
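
To make the comparison concrete, this is roughly how I am pulling the reference curve for an overlay; the run path `ENTITY/PROJECT/RUN_ID` and the metric name `train/loss` are placeholders, since I don't know the exact values your report uses:

```python
# Sketch: fetch the reference run's loss history via the public WandB API and
# overlay it on a local curve. Run path and metric name are assumptions.
import wandb
import matplotlib.pyplot as plt

api = wandb.Api()
ref = api.run("ENTITY/PROJECT/RUN_ID")   # placeholder path to the s1.1-32B run
hist = ref.history(keys=["train/loss"])  # DataFrame including a "_step" column

plt.plot(hist["_step"], hist["train/loss"], label="reference (WandB report)")
# plt.plot(my_steps, my_losses, label="my run")  # local curve goes here
plt.xlabel("step")
plt.ylabel("training loss")
plt.legend()
plt.show()
```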

Request:
Could you clarify whether there are known differences in configuration or setup between the run shown in the WandB report (which reaches ~0.4 loss) and the conditions that produce the curve shape shown in the appendix (Figure 9)? Any guidance on replicating the ~0.4 loss result would be appreciated.

Thank you for your time and assistance.
