Checklist
- 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- 2. Please use English, otherwise it will be closed.
Motivation
Hi, I'm trying to run DeepSeek-V3 GRPO training using verl with the sglang backend.
Image and code
whatcanyousee/verl:ngc-cu124-vllm0.8.5-sglang0.4.6-mcore0.12.0-te2.3
Launch configs
train_files=$gsm8k_train_path
test_files=$gsm8k_test_path
ALL_OFFLOAD=${ALL_OFFLOAD:-True}
COMMON_PARAM_OFFLOAD=${COMMON_PARAM_OFFLOAD:-$ALL_OFFLOAD}
COMMON_GRAD_OFFLOAD=${COMMON_GRAD_OFFLOAD:-$ALL_OFFLOAD}
COMMON_OPTIMIZER_OFFLOAD=${COMMON_OPTIMIZER_OFFLOAD:-$ALL_OFFLOAD}
ACTOR_PARAM_OFFLOAD=${ACTOR_PARAM_OFFLOAD:-$COMMON_PARAM_OFFLOAD}
ACTOR_GRAD_OFFLOAD=${ACTOR_GRAD_OFFLOAD:-$COMMON_GRAD_OFFLOAD}
ACTOR_OPTIMIZER_OFFLOAD=${ACTOR_OPTIMIZER_OFFLOAD:-$COMMON_OPTIMIZER_OFFLOAD}
REF_PARAM_OFFLOAD=${REF_PARAM_OFFLOAD:-$COMMON_PARAM_OFFLOAD}
CRITIC_PARAM_OFFLOAD=${CRITIC_PARAM_OFFLOAD:-$COMMON_PARAM_OFFLOAD}
CRITIC_GRAD_OFFLOAD=${CRITIC_GRAD_OFFLOAD:-$COMMON_GRAD_OFFLOAD}
CRITIC_OPTIMIZER_OFFLOAD=${CRITIC_OPTIMIZER_OFFLOAD:-$COMMON_OPTIMIZER_OFFLOAD}
RM_PARAM_OFFLOAD=${RM_PARAM_OFFLOAD:-$COMMON_PARAM_OFFLOAD}
NODES=32
PP=16
TP=1
EP=16
ETP=1
INFER_TP=16
n_resp_per_prompt=4
# RAY_ADDRESS='auto' ray job submit --working-dir . --
python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megatron_trainer' \
algorithm.adv_estimator=grpo \
data.train_files="$train_files" \
data.val_files="$test_files" \
data.train_batch_size=512 \
data.max_prompt_length=1024 \
data.max_response_length=512 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=$LLM \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=128 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.actor.use_torch_compile=False \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
actor_rollout_ref.rollout.name=sglang \
actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
actor_rollout_ref.rollout.temperature=1.0 \
actor_rollout_ref.rollout.top_p=1.0 \
actor_rollout_ref.rollout.top_k=-1 \
actor_rollout_ref.rollout.tensor_model_parallel_size=$INFER_TP \
+actor_rollout_ref.rollout.engine_kwargs.sglang.enable_torch_compile=False \
+actor_rollout_ref.rollout.engine_kwargs.sglang.cuda_graph_max_bs=2 \
algorithm.use_kl_in_reward=False \
trainer.logger=['console','tensorboard'] \
trainer.project_name='verl_megatron_gsm8k_examples' \
trainer.experiment_name='dsv3-32nodes' \
trainer.n_gpus_per_node=8 \
trainer.nnodes=$NODES \
trainer.save_freq=-1 \
trainer.test_freq=5 \
+actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_first_pipeline_stage=4 \
+actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_last_pipeline_stage=1 \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=$PP \
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=$PP \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=$TP \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=$TP \
actor_rollout_ref.actor.megatron.expert_model_parallel_size=$EP \
actor_rollout_ref.ref.megatron.expert_model_parallel_size=$EP \
actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=$ETP \
actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=$ETP \
actor_rollout_ref.actor.megatron.param_offload=${ACTOR_PARAM_OFFLOAD} \
actor_rollout_ref.actor.megatron.optimizer_offload=${ACTOR_OPTIMIZER_OFFLOAD} \
actor_rollout_ref.actor.megatron.grad_offload=${ACTOR_GRAD_OFFLOAD} \
actor_rollout_ref.ref.megatron.param_offload=${REF_PARAM_OFFLOAD} \
actor_rollout_ref.actor.megatron.use_dist_checkpointing=True \
actor_rollout_ref.ref.megatron.use_dist_checkpointing=True \
actor_rollout_ref.actor.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
actor_rollout_ref.ref.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
trainer.default_hdfs_dir=null \
trainer.default_local_dir=$CKPT_DIR \
trainer.val_before_train=False \
trainer.total_epochs=100 $@
Issues
- Using verl's original code, sglang raises
RuntimeError: Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries. If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).
- gpu_memory_utilization can only be set to 0.8, otherwise it fails. A more suitable configuration list for the sglang server is needed.
- I tried to fix issue 1 by detaching the tensors in https://github.com/volcengine/verl/blob/96903e0e97af6f645610bbe45f255a6c76f46a89/verl/utils/megatron_utils.py#L723-L821 (though I think it should be fixed on the sglang side), but I still hit performance issues when verl tries to update_weights_from_tensor; see the sketch after this list.
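For reference, here is a minimal sketch of the kind of change I made, not verl's actual code: the helper name detach_for_transfer and the (name, tensor) pair shape are assumptions for illustration. It just detaches parameters before they are serialized across process boundaries, which is what the RuntimeError above asks for.

```python
# Minimal sketch (assumed names, not verl's actual API): detach tensors before
# they cross a process boundary, so torch does not refuse to serialize a
# non-leaf tensor that requires grad.
from typing import Iterable, List, Tuple

import torch


def detach_for_transfer(
    named_params: Iterable[Tuple[str, torch.Tensor]],
) -> List[Tuple[str, torch.Tensor]]:
    """Return (name, tensor) pairs that are safe to serialize.

    detach() drops the autograd graph, so each tensor becomes a leaf that no
    longer requires grad; the underlying storage is shared, so no extra copy
    is made on the sending side.
    """
    return [(name, tensor.detach()) for name, tensor in named_params]


if __name__ == "__main__":
    # A non-leaf tensor that requires grad, similar to what the per-rank
    # gather/concat in megatron_utils.py produces before the weight transfer.
    w = torch.randn(4, 4, requires_grad=True) * 2.0  # result of an op -> non-leaf
    safe = detach_for_transfer([("layers.0.mlp.weight", w)])
    assert not safe[0][1].requires_grad  # now serializable across processes
```

This removes the serialization error for me, but the subsequent update_weights_from_tensor step still shows the performance problem described below.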
There are not many useful logs; I'm seeing many lines like
Cloning 16 replicas of the shared expert into MoE:   0%|          | 0/58 [00:00<?, ?it/s]
Cloning 16 replicas of the shared expert into MoE: 100%|██████████| 58/58 [00:00<00:00, 701065.22it/s]
repeated for several hours, before the run eventually dies with
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Related resources
No response