
Ray OOM causes the process to be killed #429

Closed
@PKU-Fgx

Description


I found that as training progressed, System Memory Utilization (%) kept climbing, and past a certain point Ray reported an out-of-memory error that killed the training process.

  • Error (screenshot of the Ray out-of-memory message)

  • System Memory Utilization (%) (screenshot of the memory-usage curve)

  • Script
set -x
MODEL_PATH=<local_path>
export VLLM_ATTENTION_BACKEND=XFORMERS
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=<local_path> \
    data.val_files=<local_path> \
    data.train_batch_size=64 \
    data.val_batch_size=64 \
    data.max_prompt_length=768 \
    data.max_response_length=3328 \
    actor_rollout_ref.model.path=$MODEL_PATH \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=12288 \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.actor.kl_loss_coef=0. \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=1 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.temperature=1.0 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.72 \
    actor_rollout_ref.rollout.n=32 \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=12288 \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=False \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=12288 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.kl_ctrl.kl_coef=0. \
    trainer.critic_warmup=0 \
    trainer.logger=['wandb'] \
    trainer.project_name=<wandb> \
    trainer.experiment_name=<wandb> \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.default_local_dir=<local_path> \
    trainer.default_hdfs_dir=null \
    +trainer.val_before_train=False \
    trainer.save_freq=200 \
    trainer.test_freq=200 \
    trainer.total_epochs=3
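
To narrow down where the growth comes from, it can help to run a simple host-memory probe alongside the trainer and correlate it with the curve above. The sketch below only assumes standard Linux tools plus the Ray CLI are available on the node; it is not verl-specific.

# Generic host-memory probe (assumption: `free` and the Ray CLI are available).
# Logs system memory and Ray's node status once a minute while training runs.
while true; do
    date
    free -m        # system-wide memory usage in MB
    ray status     # Ray's view of cluster resources, including memory
    sleep 60
done >> mem_probe.log 2>&1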

Or is there some parameter I have misconfigured that is causing memory usage to keep increasing?
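
As a stop-gap while the underlying growth is tracked down, Ray's memory monitor can be relaxed so the job is not killed at the default threshold. This is only a sketch assuming Ray >= 2.x (where the monitor is controlled by environment variables); it delays the kill but does not fix the leak.

# Stop-gap sketch, assuming Ray >= 2.x: relax the memory monitor before launching.
export RAY_memory_usage_threshold=0.99     # default is 0.95
# export RAY_memory_monitor_refresh_ms=0   # setting this to 0 disables the monitor
# ...then launch verl.trainer.main_ppo exactly as in the script above.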
