[Bug] Shared-memory hardware limit exceeded when serving Qwen3-Next in vLLM #352

@shh2000

Description

Describe the bug

How to Reproduce:

  1. export USE_FLAGGEMS=1 USE_C_EXTENSION=1
  2. vllm serve Qwen/Qwen3-Coder-Next --tensor-parallel-size 4 --max-num-batched-tokens 16384 --max-num-seqs 2048 --served-model-name qwencf --port 9011
  3. vllm bench serve probabilistically triggers an out-of-shared-memory exception, especially at concurrency >= 128; the same workload runs without error on the plain CUDA path.
  4. FlagEval benchmarks trigger the out-of-shared-memory exception consistently, but at random points in the run.
    The original error log is:
    The origin error log is:
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 774, in run
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824]     launch_metadata = kernel.launch_metadata(grid, stream, *bound_args.values())
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 490, in launch_metadata
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824]     self._init_handles()
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 464, in _init_handles
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824]     raise_(OutOfResources(self.metadata.shared, max_shared, "shared memory"))
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 456, in raise_
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824]     raise err
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824] triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 294936, Hardware limit: 232448. Reducing block sizes or `num_stages` may help.
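For reference, the "Hardware limit" in the traceback appears to match the per-block opt-in shared-memory cap on Hopper-class SMs (227 KiB), which the H20 shares with the H100. A quick check of the two byte counts reported by Triton:

```python
# Sanity-check the numbers from the OutOfResources traceback above.
# Only `required` and `hw_limit` come from the log; the 227 KiB figure
# is the Hopper per-block opt-in shared-memory cap.

required = 294936   # "Required" (bytes), from the error log
hw_limit = 232448   # "Hardware limit" (bytes), from the error log

assert hw_limit == 227 * 1024   # matches Hopper's per-block opt-in cap

deficit = required - hw_limit
print(deficit)                  # 62488 bytes over budget
```

So the kernel overshoots the cap by roughly 61 KiB, which is why Triton suggests shrinking block sizes or `num_stages`.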

FlagGems with Triton 3.5.0 triggers a similar problem. What could be the cause?
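One possible mitigation, assuming the offending kernel is autotuned: Triton's `@triton.autotune` accepts a `prune_configs_by` hook, so candidate configs whose estimated shared-memory footprint exceeds the device cap can be dropped before compilation. Below is a minimal plain-Python sketch of such a filter; the block shapes and the double-buffered-tile cost model are illustrative, not the actual Qwen3-Next kernel parameters:

```python
# Hedged sketch of an early-config-prune filter of the kind Triton's
# autotuner accepts via prune_configs_by={"early_config_prune": fn}.
# Configs are modeled as plain dicts here; the shared-memory estimate
# assumes one A tile (BLOCK_M x BLOCK_K) and one B tile (BLOCK_K x
# BLOCK_N) buffered per pipeline stage, in fp16/bf16 (2 bytes).

SMEM_LIMIT = 232448  # per-kernel shared-memory cap from the error log (bytes)

def estimate_smem(cfg, dtype_size=2):
    """Rough footprint: per-stage tile bytes times the stage count."""
    kw = cfg["kwargs"]
    per_stage = (kw["BLOCK_M"] * kw["BLOCK_K"]
                 + kw["BLOCK_K"] * kw["BLOCK_N"]) * dtype_size
    return per_stage * cfg["num_stages"]

def prune_oversized(configs, limit=SMEM_LIMIT):
    """Keep only configs whose estimated footprint fits the cap."""
    return [c for c in configs if estimate_smem(c) <= limit]

configs = [
    {"kwargs": {"BLOCK_M": 128, "BLOCK_N": 256, "BLOCK_K": 64}, "num_stages": 6},
    {"kwargs": {"BLOCK_M": 128, "BLOCK_N": 256, "BLOCK_K": 64}, "num_stages": 4},
    {"kwargs": {"BLOCK_M": 64,  "BLOCK_N": 128, "BLOCK_K": 64}, "num_stages": 4},
]
kept = prune_oversized(configs)
print(len(kept))  # the 6-stage config (~288 KiB) is pruned; 2 remain
```

Notably, the pruned 6-stage config in this toy model lands within a few bytes of the 294936 in the log, consistent with a `num_stages` that is one stage too deep for the H20's cap.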

Environment details

  1. 4 × NVIDIA H20
  2. Base image nvcr.io/nvidia/pytorch:25.12-py3, with the following libraries installed on top, one by one:
  3. vllm==0.13.0
  4. flag-gems 0574a3db0929, compiled with the C++ wrapper (libtriton_jit)
  5. flagtree==0.4.0+3.5
  6. vllm-plugin-fl 37db70d1061b99

Metadata

Labels

bug (Something isn't working)
