Closed
Anything you want to discuss about vllm.
The distributed comm ops test failed on Buildkite with the stack trace below.
[2024-06-25T12:58:33Z] distributed/test_shm_broadcast.py:72:
[2024-06-25T12:58:33Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2024-06-25T12:58:33Z]
[2024-06-25T12:58:33Z] fn = <function worker_fn_wrapper.<locals>.wrapped_fn at 0x7f8cc92afa30>
[2024-06-25T12:58:33Z] world_size = 4
[2024-06-25T12:58:33Z]
[2024-06-25T12:58:33Z]     def distributed_run(fn, world_size):
[2024-06-25T12:58:33Z]         number_of_processes = world_size
[2024-06-25T12:58:33Z]         processes = []
[2024-06-25T12:58:33Z]         for i in range(number_of_processes):
[2024-06-25T12:58:33Z]             env = {}
[2024-06-25T12:58:33Z]             env['RANK'] = str(i)
[2024-06-25T12:58:33Z]             env['LOCAL_RANK'] = str(i)
[2024-06-25T12:58:33Z]             env['WORLD_SIZE'] = str(number_of_processes)
[2024-06-25T12:58:33Z]             env['LOCAL_WORLD_SIZE'] = str(number_of_processes)
[2024-06-25T12:58:33Z]             env['MASTER_ADDR'] = 'localhost'
[2024-06-25T12:58:33Z]             env['MASTER_PORT'] = '12345'
[2024-06-25T12:58:33Z]             p = multiprocessing.Process(target=fn, args=(env, ))
[2024-06-25T12:58:33Z]             processes.append(p)
[2024-06-25T12:58:33Z]             p.start()
[2024-06-25T12:58:33Z]
[2024-06-25T12:58:33Z]         for p in processes:
[2024-06-25T12:58:33Z]             p.join()
[2024-06-25T12:58:33Z]
[2024-06-25T12:58:33Z]         for p in processes:
[2024-06-25T12:58:33Z] >           assert p.exitcode == 0
[2024-06-25T12:58:33Z] E           AssertionError: assert 1 == 0
[2024-06-25T12:58:33Z] E            +  where 1 = <Process name='Process-1' pid=15885 parent=7 stopped exitcode=1>.exitcode
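
The assertion in the trace stops at the first process with a non-zero exit code, so the log shows only that `Process-1` exited with 1 and says nothing about the other three ranks. As a debugging aid, here is a minimal sketch of a variant that collects every rank's exit code before asserting. It assumes the same environment setup as `distributed_run` in the trace; `failed_ranks` and `distributed_run_verbose` are hypothetical names, not part of vLLM.

```python
import multiprocessing


def failed_ranks(processes):
    """Return {rank: exitcode} for every joined process that did not exit with 0.

    A negative exit code -N means the child was terminated by signal N.
    """
    return {
        rank: p.exitcode
        for rank, p in enumerate(processes)
        if p.exitcode != 0
    }


def distributed_run_verbose(fn, world_size):
    """Hypothetical variant of distributed_run that reports all failing ranks."""
    processes = []
    for rank in range(world_size):
        # Same environment variables as in the traceback above.
        env = {
            "RANK": str(rank),
            "LOCAL_RANK": str(rank),
            "WORLD_SIZE": str(world_size),
            "LOCAL_WORLD_SIZE": str(world_size),
            "MASTER_ADDR": "localhost",
            "MASTER_PORT": "12345",
        }
        p = multiprocessing.Process(target=fn, args=(env,))
        processes.append(p)
        p.start()

    for p in processes:
        p.join()

    # Check all ranks at once instead of stopping at the first failure.
    failures = failed_ranks(processes)
    assert not failures, f"ranks failed (rank -> exit code): {failures}"
```

This does not identify the root cause on its own. The traceback of the failing child process still has to be found in that worker's output earlier in the log, but knowing whether one rank or all ranks failed narrows the search.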