[Bug] Shared-memory hardware limit exceeded when serving Qwen3-Next in vLLM #352

@shh2000

Description

Describe the bug

How to Reproduce:

  1. export USE_FLAGGEMS=1 USE_C_EXTENSION=1
  2. vllm serve Qwen/Qwen3-Coder-Next --tensor-parallel-size 4 --max-num-batched-tokens 16384 --max-num-seqs 2048 --served-model-name qwencf --port 9011
  3. vllm bench serve probabilistically triggers an out-of-shared-memory exception, especially at concurrency >= 128; the same workload runs without error on the plain CUDA path.
  4. FlagEval benchmarks trigger the out-of-shared-memory exception consistently, but at random points in the run.
    The original error log is:
    The origin error log is:
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 774, in run
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824]     launch_metadata = kernel.launch_metadata(grid, stream, *bound_args.values())
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 490, in launch_metadata
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824]     self._init_handles()
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 464, in _init_handles
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824]     raise_(OutOfResources(self.metadata.shared, max_shared, "shared memory"))
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 456, in raise_
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824]     raise err
(Worker_TP0 pid=46727) ERROR 02-09 08:08:26 [multiproc_executor.py:824] triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 294936, Hardware limit: 232448. Reducing block sizes or `num_stages` may help.
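For reference, the "Hardware limit" in the traceback appears to match the per-block opt-in shared-memory cap on Hopper-class SMs (227 KiB), which the H20 shares with the H100. A quick check of the two byte counts reported by Triton:

```python
# Sanity-check the numbers from the OutOfResources traceback above.
# Only `required` and `hw_limit` come from the log; the 227 KiB figure
# is the Hopper per-block opt-in shared-memory cap.

required = 294936   # "Required" (bytes), from the error log
hw_limit = 232448   # "Hardware limit" (bytes), from the error log

assert hw_limit == 227 * 1024   # matches Hopper's per-block opt-in cap

deficit = required - hw_limit
print(deficit)                  # 62488 bytes over budget
```

So the kernel overshoots the cap by roughly 61 KiB, which is why Triton suggests shrinking block sizes or `num_stages`.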

FlagGems with Triton 3.5.0 triggers a similar problem. What could be the cause?
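One possible mitigation, assuming the offending kernel is autotuned: Triton's `@triton.autotune` accepts a `prune_configs_by` hook, so candidate configs whose estimated shared-memory footprint exceeds the device cap can be dropped before compilation. Below is a minimal plain-Python sketch of such a filter; the block shapes and the double-buffered-tile cost model are illustrative, not the actual Qwen3-Next kernel parameters:

```python
# Hedged sketch of an early-config-prune filter of the kind Triton's
# autotuner accepts via prune_configs_by={"early_config_prune": fn}.
# Configs are modeled as plain dicts here; the shared-memory estimate
# assumes one A tile (BLOCK_M x BLOCK_K) and one B tile (BLOCK_K x
# BLOCK_N) buffered per pipeline stage, in fp16/bf16 (2 bytes).

SMEM_LIMIT = 232448  # per-kernel shared-memory cap from the error log (bytes)

def estimate_smem(cfg, dtype_size=2):
    """Rough footprint: per-stage tile bytes times the stage count."""
    kw = cfg["kwargs"]
    per_stage = (kw["BLOCK_M"] * kw["BLOCK_K"]
                 + kw["BLOCK_K"] * kw["BLOCK_N"]) * dtype_size
    return per_stage * cfg["num_stages"]

def prune_oversized(configs, limit=SMEM_LIMIT):
    """Keep only configs whose estimated footprint fits the cap."""
    return [c for c in configs if estimate_smem(c) <= limit]

configs = [
    {"kwargs": {"BLOCK_M": 128, "BLOCK_N": 256, "BLOCK_K": 64}, "num_stages": 6},
    {"kwargs": {"BLOCK_M": 128, "BLOCK_N": 256, "BLOCK_K": 64}, "num_stages": 4},
    {"kwargs": {"BLOCK_M": 64,  "BLOCK_N": 128, "BLOCK_K": 64}, "num_stages": 4},
]
kept = prune_oversized(configs)
print(len(kept))  # the 6-stage config (~288 KiB) is pruned; 2 remain
```

Notably, the pruned 6-stage config in this toy model lands within a few bytes of the 294936 in the log, consistent with a `num_stages` that is one stage too deep for the H20's cap.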

Environment details

  1. 4 × NVIDIA H20
  2. Base image nvcr.io/nvidia/pytorch:25.12-py3, with the following libraries installed on top, one by one:
  3. vllm==0.13.0
  4. flag-gems 0574a3db0929, compiled with the C++ wrapper (libtriton_jit)
  5. flagtree==0.4.0+3.5
  6. vllm-plugin-fl 37db70d1061b99

Metadata

Labels

bug (Something isn't working)
