[CI] [Flaky test] distributed/test_shm_broadcast.py is flaky #5848

@cadedaniel

The distributed comm ops test failed on Buildkite with the stack trace below.

[2024-06-25T12:58:33Z] distributed/test_shm_broadcast.py:72:
[2024-06-25T12:58:33Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2024-06-25T12:58:33Z]
[2024-06-25T12:58:33Z] fn = <function worker_fn_wrapper.<locals>.wrapped_fn at 0x7f8cc92afa30>
[2024-06-25T12:58:33Z] world_size = 4
[2024-06-25T12:58:33Z]
[2024-06-25T12:58:33Z]     def distributed_run(fn, world_size):
[2024-06-25T12:58:33Z]         number_of_processes = world_size
[2024-06-25T12:58:33Z]         processes = []
[2024-06-25T12:58:33Z]         for i in range(number_of_processes):
[2024-06-25T12:58:33Z]             env = {}
[2024-06-25T12:58:33Z]             env['RANK'] = str(i)
[2024-06-25T12:58:33Z]             env['LOCAL_RANK'] = str(i)
[2024-06-25T12:58:33Z]             env['WORLD_SIZE'] = str(number_of_processes)
[2024-06-25T12:58:33Z]             env['LOCAL_WORLD_SIZE'] = str(number_of_processes)
[2024-06-25T12:58:33Z]             env['MASTER_ADDR'] = 'localhost'
[2024-06-25T12:58:33Z]             env['MASTER_PORT'] = '12345'
[2024-06-25T12:58:33Z]             p = multiprocessing.Process(target=fn, args=(env, ))
[2024-06-25T12:58:33Z]             processes.append(p)
[2024-06-25T12:58:33Z]             p.start()
[2024-06-25T12:58:33Z]
[2024-06-25T12:58:33Z]         for p in processes:
[2024-06-25T12:58:33Z]             p.join()
[2024-06-25T12:58:33Z]
[2024-06-25T12:58:33Z]         for p in processes:
[2024-06-25T12:58:33Z] >           assert p.exitcode == 0
[2024-06-25T12:58:33Z] E           AssertionError: assert 1 == 0
[2024-06-25T12:58:33Z] E            +  where 1 = <Process name='Process-1' pid=15885 parent=7 stopped exitcode=1>.exitcode
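The trace above only shows that `Process-1` exited with code 1; it does not say why. As a rough sketch for narrowing this down (not the actual vLLM test helper, and not a confirmed fix), a variant of `distributed_run` could report every failing rank at once and ask the OS for a free port instead of hardcoding `'12345'`. The helpers `find_free_port` and `failed_ranks` are hypothetical names introduced here, and the idea that the fixed port contributes to the flakiness is only an assumption.

```python
import multiprocessing
import socket


def find_free_port() -> int:
    """Ask the OS for a TCP port that is unused right now (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        return s.getsockname()[1]


def failed_ranks(exitcodes):
    """Return {rank: exitcode} for every worker that did not exit with 0."""
    return {rank: code for rank, code in enumerate(exitcodes) if code != 0}


def distributed_run(fn, world_size):
    # Assumption: a fixed MASTER_PORT may collide with another job or a stale
    # process on the CI host. The port could still be taken between this call
    # and the workers binding it, so this narrows the window but does not close it.
    port = str(find_free_port())
    processes = []
    for i in range(world_size):
        env = {
            "RANK": str(i),
            "LOCAL_RANK": str(i),
            "WORLD_SIZE": str(world_size),
            "LOCAL_WORLD_SIZE": str(world_size),
            "MASTER_ADDR": "localhost",
            "MASTER_PORT": port,
        }
        p = multiprocessing.Process(target=fn, args=(env,))
        processes.append(p)
        p.start()

    for p in processes:
        p.join()

    # Report all failing ranks together instead of stopping at the first one.
    failures = failed_ranks([p.exitcode for p in processes])
    assert not failures, f"workers failed (rank -> exitcode): {failures}"
```

With this shape, a failing run would list each non-zero exit code by rank, which may help tell a single crashing worker apart from a failure that takes down the whole group.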
