use shared storage for megatron checkpointing #22
base: main
Changes from all commits: cae4421, d270e7a, f01a105, 66ee7ed, c92d3ff, b28469b, d930989, 578c3c1, b561135
@@ -13,6 +13,28 @@ git clone https://github.com/NVIDIA/Megatron-LM.git Megatron-LM --branch core_r0
 cd /root/
 export DATASET="zai-org/LongAlign-10k"
 export MODEL_ID="Qwen/Qwen3-30B-A3B-Instruct-2507"
+export CKPT_DIR=${BT_RW_CACHE_DIR}/${BT_TRAINING_JOB_ID}
+mkdir -p $CKPT_DIR
+
+# Begin setup of rsync
+if ! command -v rsync &> /dev/null; then
+    echo "Installing rsync..."
+    apt-get update && apt-get install -y rsync
+fi
+
+# Set up rsync in the background to sync checkpoints to the checkpointing directory
+if [[ "${BT_NODE_RANK}" == "0" ]]; then
+    echo "Setting up continuous rsync from shared file system to checkpointing directory"
+    # Start a background loop that continuously syncs
+    (
+        while true; do
+            rsync -avz --delete $CKPT_DIR/ $BT_CHECKPOINT_DIR/
+            sleep 30  # Sync every 30 seconds
+        done
+    ) &
+    RSYNC_PID=$!
+    echo "Continuous rsync started with PID: $RSYNC_PID"
+fi
Comment on lines +26 to +37

We set up rsync in the background to move data from the cache to the checkpointing volume.
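As a side note (not from the PR), here is a tiny self-contained illustration of what `rsync -avz --delete` does; the /tmp paths below are made up for the example.

```bash
mkdir -p /tmp/ckpt_src/iter_000040 /tmp/ckpt_dst/stale_dir
touch /tmp/ckpt_src/iter_000040/model.pt

# -a: recurse and preserve metadata; -v: verbose; -z: compress in transit;
# --delete: remove anything in the destination that is no longer in the source.
rsync -avz --delete /tmp/ckpt_src/ /tmp/ckpt_dst/

ls /tmp/ckpt_dst   # iter_000040 is copied over; stale_dir has been removed
```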
 
 export MCORE_MODEL_DIR="Converted/Qwen3-30B-A3B-Instruct-2507-mcore"
 swift export \
@@ -24,43 +46,95 @@ swift export \
 
 echo "Done converting ckpt"
 
-PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True NPROC_PER_NODE=$BT_NUM_GPUS NNODES=$BT_GROUP_SIZE NODE_RANK=$BT_NODE_RANK MASTER_ADDR=$BT_LEADER_ADDR megatron sft \
-    --load $MCORE_MODEL_DIR \
-    --dataset $DATASET \
-    --no_initialization false \
-    --split_dataset_ratio 0.01 \
-    --tensor_model_parallel_size 2 \
-    --pipeline_model_parallel_size 2 \
-    --expert_model_parallel_size 2 \
-    --moe_permute_fusion true \
-    --moe_grouped_gemm true \
-    --moe_shared_expert_overlap true \
-    --moe_aux_loss_coeff 1e-3 \
-    --micro_batch_size 1 \
-    --global_batch_size 8 \
-    --packing true \
-    --recompute_granularity full \
-    --recompute_method uniform \
-    --recompute_num_layers 4 \
-    --train_iters 200 \
-    --eval_iters 40 \
-    --finetune true \
-    --cross_entropy_loss_fusion true \
-    --lr 1e-5 \
-    --lr_warmup_fraction 0.05 \
-    --min_lr 1e-6 \
-    --save $BT_CHECKPOINT_DIR \
-    --eval_interval 40 \
-    --save_interval 40 \
-    --max_length 32000 \
-    --num_workers 8 \
-    --dataset_num_proc 8 \
-    --no_save_optim true \
-    --no_save_rng true \
-    --sequence_parallel true \
-    --attention_backend flash \
-    --optimizer_cpu_offload true \
-    --use_precision_aware_optimizer true \
-    --use_hf 1 \
-    --wandb_project qwen3_moe_megatron \
-    --wandb_exp_name all_training_b10f
+run_megatron_training() {
+    PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True NPROC_PER_NODE=$BT_NUM_GPUS NNODES=$BT_GROUP_SIZE NODE_RANK=$BT_NODE_RANK MASTER_ADDR=$BT_LEADER_ADDR megatron sft \
+        --load $MCORE_MODEL_DIR \
+        --dataset $DATASET \
+        --no_initialization false \
+        --split_dataset_ratio 0.01 \
+        --tensor_model_parallel_size 2 \
+        --pipeline_model_parallel_size 2 \
+        --expert_model_parallel_size 2 \
+        --moe_permute_fusion true \
+        --moe_grouped_gemm true \
+        --moe_shared_expert_overlap true \
+        --moe_aux_loss_coeff 1e-3 \
+        --micro_batch_size $MICRO_BATCH_SIZE \
+        --global_batch_size $GLOBAL_BATCH_SIZE \
+        --packing true \
+        --recompute_granularity full \
+        --recompute_method uniform \
+        --recompute_num_layers 4 \
+        --train_iters 5 \
+        --eval_iters 5 \
+        --finetune true \
+        --cross_entropy_loss_fusion true \
+        --lr 1e-5 \
+        --lr_warmup_fraction 0.05 \
+        --min_lr 1e-6 \
+        --save $CKPT_DIR \
+        --eval_interval 5 \
+        --save_interval 5 \
+        --max_length 32000 \
+        --num_workers 8 \
+        --dataset_num_proc 8 \
+        --no_save_optim true \
+        --no_save_rng true \
+        --sequence_parallel true \
+        --attention_backend flash \
+        --optimizer_cpu_offload true \
+        --use_precision_aware_optimizer true \
+        --use_hf 1 \
+        --wandb_project qwen3_moe_megatron \
+        --wandb_exp_name $BT_TRAINING_JOB_NAME
+}
+
+set +e
+run_megatron_training 2>&1 | tee training.log
+EXIT_CODE=$?
+set -e  # Re-enable exit on error
Comment on lines +92 to +95

Megatron, in highly distributed workloads, might hang at the end. I did some searching, and it looks like this isn't uncommon (#1541, #735, #1207). So what we're doing here is allowing a non-zero exit code, capturing it, and then re-enabling error detection. Piping to training.log was just something Claude suggested; I don't know the utility.
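As a sketch (not part of the diff, assuming bash), this is one way the same pattern can capture the training status itself: tee keeps a copy of all output in training.log while still streaming it to the console, but `$?` after a pipeline reports the status of the last command in the pipe (tee), so `PIPESTATUS` is needed to see what `run_megatron_training` returned.

```bash
set +e                                   # tolerate a non-zero exit from the training run
run_megatron_training 2>&1 | tee training.log
EXIT_CODE=${PIPESTATUS[0]}               # status of run_megatron_training, not of tee
set -e                                   # re-enable exit on error
echo "Training completed with exit code: $EXIT_CODE"
```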
+
+echo "Training completed with exit code: $EXIT_CODE"
+
+if [[ "${BT_NODE_RANK}" == "0" ]]; then
+    echo "Stopping continuous rsync and performing final synchronization..."
+
+    # Kill the continuous rsync process
+    if [[ -n "$RSYNC_PID" ]]; then
+        echo "Killing continuous rsync process (PID: $RSYNC_PID)"
+        kill $RSYNC_PID 2>/dev/null || true
+        # Wait a moment for the process to terminate
+        sleep 2
+    fi
+
+    # Perform final synchronization to ensure everything is synced
+    echo "Performing final rsync..."
+    rsync -avz --delete $CKPT_DIR/ $BT_CHECKPOINT_DIR/
Comment on lines +111 to +113

Make sure everything is synced by doing a blocking call to sync the directories.
+
+    echo "Uploading checkpoints to hub..."
+    pushd $CKPT_DIR
+    ls -la
+    V0_DIR=$(echo v0-*)
+    popd
+    echo "V0_DIR: $V0_DIR"
Comment on lines +116 to +120

The checkpoints land under an intermediate directory whose name carries the datestamp of the training run. One run looks like:

training_dir/v0-20250925-123459/iter_000040/...
training_dir/v0-20250925-123459/iter_000080/...

This block is us figuring out "v0-20250925-123459".
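As a sketch (not from the PR), `V0_DIR=$(echo v0-*)` assumes exactly one match; if several `v0-*` run directories ever accumulate in `$CKPT_DIR`, something like the following would pick the most recent one.

```bash
pushd "$CKPT_DIR"
V0_DIR=$(ls -td v0-*/ | head -n 1)   # newest run directory, e.g. v0-20250925-123459/
V0_DIR=${V0_DIR%/}                   # strip the trailing slash
popd
echo "V0_DIR: $V0_DIR"
```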
+    swift export \
+        --mcore_model "${CKPT_DIR}/${V0_DIR}" \
+        --to_hf true \
+        --torch_dtype bfloat16 \
+        --output_dir megatron_output/hf_converted \
+        --push_to_hub true \
+        --hub_token $HF_TOKEN \
+        --hub_model_id rayraycano/megatron-qwen3-30b-a3b
+
+    echo "Final synchronization complete!"
Comment on lines +121 to +130

We convert the weights back to HF format here, but let's avoid uploading to the HF repository.
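As a sketch of what that might look like (not part of the diff, reusing only flags that already appear in it), here is the conversion with the hub upload dropped; whether omitting `--push_to_hub` alone is enough depends on the swift CLI's defaults.

```bash
swift export \
    --mcore_model "${CKPT_DIR}/${V0_DIR}" \
    --to_hf true \
    --torch_dtype bfloat16 \
    --output_dir megatron_output/hf_converted   # converted HF weights stay local
```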
+    # Optionally clear out cache. Set this in your config.py
+    if [[ "${SHOULD_CLEAR_CACHE}" == "true" ]]; then
+        echo "Clearing out cache..."
+        rm -rf $CKPT_DIR
+    fi
+else
+    echo "Worker waiting for leader to rsync..."
+    sleep infinity
+fi
Here, the checkpoint dir is in the cache and namespaced to the training job ID, so the next job doesn't overwrite the data.