Open
Labels
bug: Something isn't working
Description
Describe the bug
How to Reproduce:
- export USE_FLAGGEMS=1 USE_C_EXTENSION=1
- vllm serve Qwen/Qwen3-Coder-Next --tensor-parallel-size 4 --max-num-batched-tokens 16384 --max-num-seqs 2048 --served-model-name qwencf --port 9011
- Running vllm bench serve intermittently triggers an out-of-shared-memory exception, especially when concurrency >= 128. The error does not occur on CUDA.
- FlagEval benchmarks consistently trigger the same out-of-shared-memory exception, but at random points in the run.
The original error log is:
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824] File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 774, in run
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824] launch_metadata = kernel.launch_metadata(grid, stream, *bound_args.values())
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824] File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 490, in launch_metadata
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824] self._init_handles()
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824] File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 464, in _init_handles
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824] raise_(OutOfResources(self.metadata.shared, max_shared, "shared memory"))
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824] File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 456, in raise_
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824] raise err
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824] triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 294936, Hardware limit: 232448. Reducing block sizes or `num_stages` may help.
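To make the size of the overshoot explicit, here is a minimal sketch that pulls the numbers out of the final log line. The helper name parse_out_of_resources is hypothetical and the regular expression only assumes the message format shown in the log above; it is not part of Triton.

```python
import re

# Matches the message format seen in the log above (an assumption about
# the format; other Triton versions may word it differently).
_OUT_OF_RESOURCES = re.compile(
    r"out of resource: (?P<resource>[\w ]+), "
    r"Required: (?P<required>\d+), "
    r"Hardware limit: (?P<limit>\d+)"
)


def parse_out_of_resources(line):
    """Return the resource name, requested bytes, limit and excess, or None."""
    match = _OUT_OF_RESOURCES.search(line)
    if match is None:
        return None
    required = int(match["required"])
    limit = int(match["limit"])
    return {
        "resource": match["resource"],
        "required": required,
        "limit": limit,
        "excess": required - limit,
    }


log_line = (
    "triton.runtime.errors.OutOfResources: out of resource: shared memory, "
    "Required: 294936, Hardware limit: 232448. "
    "Reducing block sizes or `num_stages` may help."
)
info = parse_out_of_resources(log_line)
print(info["resource"], info["excess"])  # shared memory 62488
```

In this log the kernel asks for 62488 bytes more than the 232448-byte (227 KiB) limit reported for the device.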
FlagGems with triton 3.5.0 triggers a similar problem. What could be the cause?
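The hint in the error message (reduce block sizes or num_stages) can be reasoned about with a rough estimate. The sketch below assumes a matmul-style kernel that software-pipelines two input tiles per stage; the failing kernel is not identified in the log, so the tile shapes, the 2-byte element size, and the function name estimate_shared_bytes are all illustrative assumptions rather than the actual FlagGems configuration.

```python
# Shared-memory limit reported in the error log above.
HARDWARE_LIMIT = 232448


def estimate_shared_bytes(block_m, block_n, block_k, num_stages, elem_bytes=2):
    """Rough estimate for a pipelined matmul: each stage buffers one
    (block_m x block_k) tile and one (block_k x block_n) tile."""
    per_stage = (block_m * block_k + block_k * block_n) * elem_bytes
    return per_stage * num_stages


# Hypothetical candidate configurations, not taken from FlagGems.
candidates = [
    dict(block_m=128, block_n=256, block_k=64, num_stages=4),
    dict(block_m=128, block_n=256, block_k=64, num_stages=5),
    dict(block_m=128, block_n=128, block_k=64, num_stages=5),
]

for cfg in candidates:
    used = estimate_shared_bytes(**cfg)
    verdict = "fits" if used <= HARDWARE_LIMIT else "exceeds limit"
    print(cfg, used, verdict)
```

Under this model, adding one pipeline stage or doubling one tile dimension is enough to push a configuration over the limit. That is why an autotuned or heuristically chosen configuration that depends on input shape could fail only for some request batches.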
Environment details
- 4 x H20 GPUs
- nvcr.io/nvidia/pytorch:25.12-py3, with the following libraries installed one by one:
- vllm==0.13.0
- flag-gems 0574a3db0929, compiled with the C++ wrapper (libtriton_jit)
- flagtree==0.4.0+3.5
- vllm-plugin-fl 37db70d1061b99