Description
Motivation.
Mamba, SSM, and hybrid transformer models are an important path toward models that scale linearly with sequence length. vLLM currently supports many models of this class (Jamba, Mamba, Codestral Mamba, Falcon Mamba, Bamba, Zamba2, MinimaxText01, Plamo2) and should continue to maintain excellent support for them.
The Problem
SSM models are generally less well supported than transformers in vLLM and have several deficiencies.
This RFC proposes several improvements to SSM model support (some already in progress) and will also serve as an issue tracker.
The major issue is that SSM models are not supported in vLLM V1; they should be supported before V0 is deprecated.
In addition:
- SSM state management is somewhat hacky: the state is allocated and managed by the model definition itself rather than by the block manager (see the conceptual sketch after this list).
- Since the SSM state is not managed by the block manager, SSM models are incompatible with prefix caching, KV cache offloading, and prefill-decode disaggregation.
- There are major performance issues with chunked prefill.
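To make the first two points concrete, here is a minimal conceptual sketch of the difference between a KV cache that a central block manager carves into blocks and SSM state that the model definition allocates for itself. Only the former is visible to features like prefix caching, offloading, and prefill-decode disaggregation. The class names are hypothetical, not actual vLLM APIs.

```python
# Conceptual sketch only: hypothetical classes, not actual vLLM code.
import torch


class BlockManagedKVCache:
    """KV cache carved into blocks by a central block manager.

    Because allocation is centralized, the scheduler can reason about which
    blocks hold which token ranges, enabling prefix caching, offloading,
    and prefill-decode disaggregation.
    """

    def __init__(self, num_blocks: int, block_size: int, head_dim: int):
        self.blocks = torch.zeros(num_blocks, block_size, head_dim)
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        # Hand out a free block to the scheduler.
        return self.free_blocks.pop()


class ModelManagedSSMState:
    """SSM state held directly by the model definition.

    The block manager has no visibility into this tensor, so it cannot
    cache, offload, or transfer it between prefill and decode workers.
    """

    def __init__(self, max_seqs: int, d_model: int, d_state: int):
        self.state = torch.zeros(max_seqs, d_model, d_state)

    def read(self, seq_slot: int) -> torch.Tensor:
        return self.state[seq_slot]
```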
Proposed Change.
Blockers for SSM and hybrid model support in vLLM V1
- Hybrid Allocator: [RFC]: Hybrid Memory Allocator #11382 (initial work is targeted towards sliding-window attention)
- Once the hybrid allocator is landed, extend it to support SSM and hybrid models (a conceptual sketch follows this list)
- torch.compile support (needed for piecewise CUDA graphs)
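As an illustration only (this is not the design in RFC #11382, and every name and number below is made up), one way a hybrid allocator could account for both cache types is to reserve the fixed per-sequence SSM state first and carve the remaining budget into attention KV blocks:

```python
# Conceptual sketch only: hypothetical layer specs and a toy budgeting rule.
from dataclasses import dataclass
from typing import List


@dataclass
class AttentionSpec:
    bytes_per_token: int   # KV-cache bytes needed per cached token


@dataclass
class MambaSpec:
    bytes_per_seq: int     # fixed-size SSM + conv state per sequence


def split_cache_budget(total_bytes: int,
                       attn_specs: List[AttentionSpec],
                       mamba_specs: List[MambaSpec],
                       max_seqs: int,
                       block_size: int) -> dict:
    """Reserve fixed per-sequence SSM state, then carve the rest into KV blocks."""
    ssm_bytes = sum(s.bytes_per_seq for s in mamba_specs) * max_seqs
    kv_bytes = max(total_bytes - ssm_bytes, 0)
    bytes_per_block = sum(s.bytes_per_token for s in attn_specs) * block_size
    num_kv_blocks = kv_bytes // bytes_per_block if bytes_per_block else 0
    return {"ssm_state_bytes": ssm_bytes, "num_kv_blocks": num_kv_blocks}


# Example with made-up sizes: 32 attention layers + 8 Mamba layers
# sharing an 8 GiB cache budget.
print(split_cache_budget(
    total_bytes=8 * 2**30,
    attn_specs=[AttentionSpec(bytes_per_token=2 * 128 * 8 * 2)] * 32,
    mamba_specs=[MambaSpec(bytes_per_seq=256 * 1024)] * 8,
    max_seqs=256,
    block_size=16,
))
```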
Other improvements
- Extend Mamba support beyond CUDA GPUs
- Improve performance for chunked prefill (see the toy state-passing sketch after this list)
- Support quantization + tensor parallel in Mamba2 [Bug]: Quantization In MambaMixer2 Not Supported when Tensor Parallel is enabled #14618
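For the chunked-prefill item, the sketch below shows the state-passing pattern with a toy diagonal SSM recurrence: processing the prompt in chunks while carrying the recurrent state across chunk boundaries must reproduce the single-pass result, and doing this efficiently is where the performance issues noted above come in. This is a toy illustration, not the Mamba kernel.

```python
# Conceptual sketch only: a toy diagonal SSM, h_t = A*h_{t-1} + B*x_t, y_t = C.h_t,
# run over the full prompt vs. in chunks with the state carried between chunks.
import torch

torch.manual_seed(0)
d_state, seq_len, chunk = 4, 16, 5
A = torch.rand(d_state) * 0.9          # diagonal state transition
B = torch.randn(d_state)
C = torch.randn(d_state)
x = torch.randn(seq_len)


def scan(xs: torch.Tensor, h: torch.Tensor):
    """Run the recurrence over xs starting from state h; return (outputs, final state)."""
    ys = []
    for x_t in xs:
        h = A * h + B * x_t
        ys.append(torch.dot(C, h))
    return torch.stack(ys), h


# Single pass over the whole prompt.
y_full, _ = scan(x, torch.zeros(d_state))

# Chunked prefill: same recurrence, state carried across chunk boundaries.
h = torch.zeros(d_state)
outs = []
for start in range(0, seq_len, chunk):
    y_chunk, h = scan(x[start:start + chunk], h)
    outs.append(y_chunk)
y_chunked = torch.cat(outs)

assert torch.allclose(y_full, y_chunked, atol=1e-5)
```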
Feedback Period.
No response
CC List.
@fabianlim @cyang49 @mzusman @yury-tokpanov
Any Other Things.
No response