Description
What happened + What you expected to happen
When I run the TinyZero training script on H20 GPUs, Ray reports the following error:
(WorkerDict pid=129668) Fatal Python error: Floating point exception
(WorkerDict pid=129668)
(WorkerDict pid=129668) Stack (most recent call first):
(WorkerDict pid=129668) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 40 in apply
(WorkerDict pid=129668) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/vllm/model_executor/layers/logits_processor.py", line 83 in _get_logits
(WorkerDict pid=129668) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/vllm/model_executor/layers/logits_processor.py", line 61 in forward
(WorkerDict pid=129668) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
(WorkerDict pid=129668) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
(WorkerDict pid=129668) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/vllm/model_executor/models/qwen2.py", line 424 in compute_logits
(WorkerDict pid=129668) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1698 in execute_model
(WorkerDict pid=129668) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/vllm/worker/model_runner_base.py", line 116 in _wrapper
(WorkerDict pid=129668) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(WorkerDict pid=129668) File "/home/zhuoli.lb/code_linlin/TinyZero/verl/third_party/vllm/vllm_v_0_6_3/worker.py", line 267 in execute_model
(WorkerDict pid=129668) File "/home/zhuoli.lb/code_linlin/TinyZero/verl/third_party/vllm/vllm_v_0_6_3/spmd_gpu_executor.py", line 163 in execute_model
(WorkerDict pid=129668) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 1386 in step
(WorkerDict pid=129668) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 879 in _run_engine
(WorkerDict pid=129668) File "/home/zhuoli.lb/code_linlin/TinyZero/verl/third_party/vllm/vllm_v_0_6_3/llm.py", line 161 in _run_engine
(WorkerDict pid=129668) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 353 in generate
(WorkerDict pid=129668) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/vllm/utils.py", line 1063 in inner
(WorkerDict pid=129668) File "/home/zhuoli.lb/code_linlin/TinyZero/verl/workers/rollout/vllm_rollout/vllm_rollout.py", line 175 in generate_sequences
(WorkerDict pid=129668) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(WorkerDict pid=129668) File "/home/zhuoli.lb/code_linlin/TinyZero/verl/workers/fsdp_workers.py", line 421 in generate_sequences
(WorkerDict pid=129668) File "/home/zhuoli.lb/code_linlin/TinyZero/verl/single_controller/base/decorator.py", line 404 in inner
(WorkerDict pid=129668) File "/home/zhuoli.lb/code_linlin/TinyZero/verl/single_controller/ray/base.py", line 399 in func
(WorkerDict pid=129668) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 463 in _resume_span
(WorkerDict pid=129668) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/ray/_private/function_manager.py", line 696 in actor_method_executor
(WorkerDict pid=129668) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/ray/_private/worker.py", line 935 in main_loop
(WorkerDict pid=129668) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/ray/_private/workers/default_worker.py", line 297 in <module>
(WorkerDict pid=130001)
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: fffffffffffffffff548ff83ed723f0f3ee7fe7301000000 Worker ID: 207613ad0e09b9bddfb08339ea9cea7c917d31cdcd013b943ec73c4a Node ID: 00f0d6f4148b7b054259cabd8800eda88411eb532636dd05a078c73c Worker IP address: 33.197.124.208 Worker port: 46715 Worker PID: 130001 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Error executing job with overrides: ['data.train_files=/home/zhuoli.lb/code_linlin/TinyZero/data/countdown_instruct/train.parquet', 'data.val_files=/home/zhuoli.lb/code_linlin/TinyZero/data/countdown_instruct/test.parquet', 'data.train_batch_size=256', 'data.val_batch_size=1312', 'data.max_prompt_length=256', 'data.max_response_length=1024', 'actor_rollout_ref.model.path=/home/zhuoli.lb/LLM_models/qwen3b-instruct', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=128', 'actor_rollout_ref.actor.ppo_micro_batch_size=8', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=8', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.4', 'actor_rollout_ref.ref.log_prob_micro_batch_size=4', 'critic.optim.lr=1e-5', 'critic.model.path=/home/zhuoli.lb/LLM_models/qwen3b-instruct', 'critic.ppo_micro_batch_size=8', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.logger=[wandb]', '+trainer.val_before_train=False', 'trainer.default_hdfs_dir=null', 'trainer.n_gpus_per_node=2', 'trainer.nnodes=1', 'trainer.save_freq=100', 'trainer.test_freq=100', 'trainer.project_name=TinyZero', 'trainer.experiment_name=countdown-qwen2.5-3b-instruct', 'trainer.total_epochs=15']
(WorkerDict pid=129668) /home/zhuoli.lb/.conda/envs/zero/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
(WorkerDict pid=129668) warnings.warn('resource_tracker: There appear to be %d '
(WorkerDict pid=130001) *** SIGFPE received at time=1739261389 on cpu 28 ***
(WorkerDict pid=130001) PC: @ 0x7ed039242921 (unknown) (unknown)
(WorkerDict pid=130001) @ 0x7f00475ee100 (unknown) (unknown)
(WorkerDict pid=130001) [2025-02-11 16:09:49,867 E 130001 130001] logging.cc:460: *** SIGFPE received at time=1739261389 on cpu 28 ***
(WorkerDict pid=130001) [2025-02-11 16:09:49,867 E 130001 130001] logging.cc:460: PC: @ 0x7ed039242921 (unknown) (unknown)
(WorkerDict pid=130001) [2025-02-11 16:09:49,867 E 130001 130001] logging.cc:460: @ 0x7f00475ee100 (unknown) (unknown)
(WorkerDict pid=130001) Fatal Python error: Floating point exception
(WorkerDict pid=130001) Stack (most recent call first):
(WorkerDict pid=130001) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 40 in apply
(WorkerDict pid=130001) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/vllm/model_executor/layers/logits_processor.py", line 83 in _get_logits
(WorkerDict pid=130001) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/vllm/model_executor/layers/logits_processor.py", line 61 in forward
(WorkerDict pid=130001) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562 in _call_impl
(WorkerDict pid=130001) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553 in _wrapped_call_impl
(WorkerDict pid=130001) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/vllm/model_executor/models/qwen2.py", line 424 in compute_logits
(WorkerDict pid=130001) File "/home/zhuoli.lb/code_linlin/TinyZero/verl/third_party/vllm/vllm_v_0_6_3/spmd_gpu_executor.py", line 163 in execute_model [repeated 3x across cluster]
(WorkerDict pid=130001) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/vllm/worker/model_runner_base.py", line 116 in _wrapper
(WorkerDict pid=130001) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context [repeated 2x across cluster]
(WorkerDict pid=130001) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 1386 in step
(WorkerDict pid=130001) File "/home/zhuoli.lb/code_linlin/TinyZero/verl/third_party/vllm/vllm_v_0_6_3/llm.py", line 161 in _run_engine [repeated 2x across cluster]
(WorkerDict pid=130001) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 353 in generate
(WorkerDict pid=130001) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/vllm/utils.py", line 1063 in inner
(WorkerDict pid=130001) File "/home/zhuoli.lb/code_linlin/TinyZero/verl/workers/rollout/vllm_rollout/vllm_rollout.py", line 175 in generate_sequences
(WorkerDict pid=130001) File "/home/zhuoli.lb/code_linlin/TinyZero/verl/workers/fsdp_workers.py", line 421 in generate_sequences
(WorkerDict pid=130001) File "/home/zhuoli.lb/code_linlin/TinyZero/verl/single_controller/base/decorator.py", line 404 in inner
(WorkerDict pid=130001) File "/home/zhuoli.lb/code_linlin/TinyZero/verl/single_controller/ray/base.py", line 399 in func
(WorkerDict pid=130001) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/ray/util/tracing/tracing_helper.py", line 463 in _resume_span
(WorkerDict pid=130001) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/ray/_private/function_manager.py", line 696 in actor_method_executor
(WorkerDict pid=130001) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/ray/_private/worker.py", line 935 in main_loop
(WorkerDict pid=130001) File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/ray/_private/workers/default_worker.py", line 297 in <module>
Traceback (most recent call last):
File "/home/zhuoli.lb/code_linlin/TinyZero/verl/trainer/main_ppo.py", line 103, in main
ray.get(main_task.remote(config))
File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/ray/_private/worker.py", line 2772, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/home/zhuoli.lb/.conda/envs/zero/lib/python3.9/site-packages/ray/_private/worker.py", line 919, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::main_task() (pid=129027, ip=33.197.124.208)
File "/home/zhuoli.lb/code_linlin/TinyZero/verl/trainer/main_ppo.py", line 189, in main_task
trainer.fit()
File "/home/zhuoli.lb/code_linlin/TinyZero/verl/trainer/ppo/ray_trainer.py", line 589, in fit
gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
File "/home/zhuoli.lb/code_linlin/TinyZero/verl/single_controller/ray/base.py", line 42, in func
output = ray.get(output)
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: create_colocated_worker_cls.<locals>.WorkerDict
actor_id: f548ff83ed723f0f3ee7fe7301000000
pid: 130001
name: HyzVTgWorkerDict_0:1
namespace: bfb6e12e-0fd1-462b-8e2a-500f13ac396b
ip: 33.197.124.208
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: fffffffffffffffffa4673f2a706c30ff98d176901000000 Worker ID: 8555992942dd0ee6787b516700b9350490481915eafb9d2be81005d9 Node ID: 00f0d6f4148b7b054259cabd8800eda88411eb532636dd05a078c73c Worker IP address: 33.197.124.208 Worker port: 41507 Worker PID: 129668 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
How can I solve this problem? Thanks for your help!
Versions / Dependencies
My system info is as follows:
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
Clang version: /bin/sh: clang: command not found
CMake version: cmake version 3.22.1
Libc version: ldd (GNU libc) 2.32
Python version: 3.9.21 (main, Dec 11 2024, 16:24:11)
[GCC 11.2.0]
Python platform: linux
Is CUDA available: True
CUDA runtime version: 12.1
CUDA_MODULE_LOADING set to: _LinalgBackend.Default
GPU models and configuration:
GPU 0: NVIDIA H20
GPU 1: NVIDIA H20
GPU 2: NVIDIA H20
GPU 3: NVIDIA H20
GPU 4: NVIDIA H20
GPU 5: NVIDIA H20
GPU 6: NVIDIA H20
GPU 7: NVIDIA H20
Nvidia driver version: 550.54.15
Reproduction script
https://github.com/Jiayi-Pan/TinyZero
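Roughly, the failing run corresponds to the launch command below, reconstructed from the Hydra overrides in the error output above. The verl.trainer.main_ppo entrypoint is an assumption based on the traceback (TinyZero's wrapper script may set additional variables), and all paths are local to my machine.

# Reconstructed launch command; entrypoint assumed from the traceback (verl/trainer/main_ppo.py).
python3 -m verl.trainer.main_ppo \
    data.train_files=/home/zhuoli.lb/code_linlin/TinyZero/data/countdown_instruct/train.parquet \
    data.val_files=/home/zhuoli.lb/code_linlin/TinyZero/data/countdown_instruct/test.parquet \
    data.train_batch_size=256 \
    data.val_batch_size=1312 \
    data.max_prompt_length=256 \
    data.max_response_length=1024 \
    actor_rollout_ref.model.path=/home/zhuoli.lb/LLM_models/qwen3b-instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=128 \
    actor_rollout_ref.actor.ppo_micro_batch_size=8 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size=8 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size=4 \
    critic.optim.lr=1e-5 \
    critic.model.path=/home/zhuoli.lb/LLM_models/qwen3b-instruct \
    critic.ppo_micro_batch_size=8 \
    algorithm.kl_ctrl.kl_coef=0.001 \
    'trainer.logger=[wandb]' \
    +trainer.val_before_train=False \
    trainer.default_hdfs_dir=null \
    trainer.n_gpus_per_node=2 \
    trainer.nnodes=1 \
    trainer.save_freq=100 \
    trainer.test_freq=100 \
    trainer.project_name=TinyZero \
    trainer.experiment_name=countdown-qwen2.5-3b-instruct \
    trainer.total_epochs=15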
Issue Severity
High: It blocks me from completing my task.