Description
Motivation.
IBM recently announced its Spyre AI accelerator at Hot Chips 2024. This accelerator has been designed, in collaboration with IBM Research, to scale up enterprise AI workloads running on IBM's mainframe systems (IBM Z), as well as on IBM's Power platform. Since we are building our inference stack on top of vLLM, we would like to enable support for IBM Spyre within the vLLM framework.
Spyre has been designed to fit seamlessly into the PyTorch ecosystem via torch.compile. Specifically, IBM Research has developed a new backend for torch.compile that compiles torch FX graphs for execution on the Spyre hardware. In this sense, we envision that Spyre support in vLLM can work in a similar way to how the TPU support works today (e.g., see here).
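For context, a custom torch.compile backend is simply a callable that receives the captured FX `GraphModule` together with example inputs and returns a compiled callable. The snippet below is a minimal, hypothetical sketch of that mechanism (the `toy_spyre_backend` name and its body are placeholders, not IBM Research's actual backend):

```python
from typing import Callable, List

import torch
from torch import fx


def toy_spyre_backend(gm: fx.GraphModule, example_inputs: List[torch.Tensor]) -> Callable:
    # A real Spyre backend would lower the FX graph to the accelerator here.
    # This placeholder just prints the captured graph and runs it unchanged.
    gm.graph.print_tabular()
    return gm.forward


model = torch.nn.Linear(16, 16)
compiled_model = torch.compile(model, backend=toy_spyre_backend)
out = compiled_model(torch.randn(4, 16))
```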
Today, there are two key limitations that affect this integration and need to be worked around. Specifically:
- Spyre only supports execution of the modeling code from IBM's open-source Foundation Model Stack (`fms`). This is only a temporary limitation; the end goal is that, via torch.compile, we can also run the native vLLM modeling code on Spyre. We hope that recent efforts to make vLLM models torch-compilable will significantly accelerate this effort.
- Spyre does not currently support paged attention or continuous batching. This means that, after prefilling a batch, we need to keep decoding until all of the sequences within that batch are finished. Our team is working to remove this limitation in the near future.
Proposed Change.
In this RFC, we propose the following sequence of PRs to enable IBM Spyre support in vLLM:
- P1: Add support for a single Spyre card via a new `SpyreExecutor` class.
- P2: Changes to the scheduling algorithm to disable continuous batching.
- P3: Enable TP execution across multiple Spyre cards via a `MultiprocessingSpyreExecutor` class.
- P4: Enable paged attention and continuous batching for Spyre.
- P5: Enable vLLM modeling code to run on Spyre.
While much of the work here (P1, P2, P3) has already been completed in a private fork, we plan to upstream the changes as a sequence of smaller PRs to make them easier to review. Below we discuss the planned changes for each PR.
P1: Add support for a single Spyre card via a new `SpyreExecutor` class
We will introduce a set of classes, inheriting from the core vLLM classes, that will enable execution on a single Spyre device. Architecturally, this will look very similar to the equivalent classes that were introduced for running on AWS Inferentia (e.g., `NeuronExecutor`, `NeuronWorker`, `NeuronModelRunner`, `NeuronCausalLM`). In the same way that the `NeuronModelRunner` uses the modeling code from the `transformers_neuronx` package, these new classes will execute the modeling code from IBM's `fms` package.
In the diagram below, we compare the proposed Spyre classes with the corresponding classes that already exist for the AWS Inferentia support:

Since Spyre works via torch.compile, we need to ensure that all compilation is triggered at init time, so that it does not occur on the critical path (i.e., while serving user requests). This PR will therefore also introduce a routine for warming up the inference server when using Spyre, triggering compilation of all required shapes (e.g., prompt length, number of output tokens, batch size). We will add code to ensure that batches get padded to one of the compiled shapes before execution. This behaviour is akin to what happens today in vLLM for CUDA graphs, and presumably something like this warmup will also be needed once vLLM starts using torch.compile more extensively. This could be one area to explore commonality with other parts of the codebase.
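As a rough, hypothetical sketch of what such a warmup and shape-padding routine could look like (the shape list, function names, and the `run_prefill`/`run_decode` callables are illustrative placeholders, not the final implementation):

```python
from typing import Callable, List, Tuple

# Hypothetical warmup shapes (prompt_length, max_new_tokens, batch_size);
# in practice these would come from the server configuration.
WARMUP_SHAPES: List[Tuple[int, int, int]] = [
    (64, 20, 4),
    (128, 20, 4),
    (2048, 256, 1),
]


def warm_up(run_prefill: Callable[[int, int], None],
            run_decode: Callable[[int], None]) -> None:
    """Run a dummy prefill and a few decode steps for every shape we intend to
    serve, so torch.compile is triggered at init time, not on the request path."""
    for prompt_len, new_tokens, batch_size in WARMUP_SHAPES:
        run_prefill(prompt_len, batch_size)
        for _ in range(min(new_tokens, 2)):
            run_decode(batch_size)


def pad_to_compiled_shape(prompt_len: int, batch_size: int) -> Tuple[int, int]:
    """Pick the smallest warmed-up (prompt_length, batch_size) that fits the
    incoming batch; inputs are then padded up to that shape before execution."""
    candidates = [(p, b) for (p, _, b) in WARMUP_SHAPES
                  if p >= prompt_len and b >= batch_size]
    if not candidates:
        raise ValueError("Batch does not fit any compiled shape")
    return min(candidates)
```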
Testing: While testing on the real hardware can only be performed internally for now, we can test the vast majority of the integration on the CPU by either (a) running in eager mode or (b) using torch.compile with the `inductor` backend. Thus, in this PR we will also add a set of unit and integration tests to verify that everything behaves as expected. The tests will focus on the offline mode, since changes to the scheduling algorithm are needed to support the online mode (see P2). We will also add a `Dockerfile.spyre` containing all necessary dependencies (e.g., FMS) in which the tests can be executed. Whether we could have these tests running as part of vLLM's CI/CD is something we would like to discuss.
P2: Changes to the scheduling algorithm to disable continuous batching.
We need to introduce a few changes to the scheduling algorithm to work around the lack of continuous batching support. Specifically:
- We must not schedule another prefill until all decodes in the running batch are finished (a one-line change).
- We need to introduce some logic to decide how to batch requests together based on their prompt lengths and maximum output tokens, in order to best fit the shapes that have been compiled on Spyre during the warmup phase.
These changes must be conditional and must not affect the behaviour of the scheduler on existing supported devices. They could either be applied within the scheduler itself (e.g., by checking `is_spyre()`), or we could try to "plug in" an alternate scheduler design. This is one of the design choices we would like feedback on.
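For illustration, here is a minimal sketch of the first option, assuming a hypothetical `is_spyre` flag and a heavily simplified view of the scheduler state (neither reflects the actual vLLM scheduler internals):

```python
class SchedulerSketch:
    """Simplified stand-in for the vLLM scheduler (illustrative only)."""

    def __init__(self, is_spyre: bool):
        self.is_spyre = is_spyre  # would be a platform check in practice
        self.running = []         # sequences currently being decoded

    def can_schedule_new_prefill(self) -> bool:
        # Without continuous batching, a new prefill may only start once the
        # current running batch has fully finished decoding.
        if self.is_spyre and self.running:
            return False
        return True
```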
Testing: As part of this PR, we will also introduce tests to cover the integration with the `MQLLMEngine` and online operation.
P3: Enable TP execution across multiple Spyre cards via a `MultiprocessingSpyreExecutor` class.
We have found that the `MultiprocessingGPUExecutor` can be easily adapted into a `MultiprocessingSpyreExecutor` to enable TP execution across multiple Spyre devices in parallel. However, to reduce code duplication, we propose refactoring the code common to `MultiprocessingGPUExecutor` and `MultiprocessingSpyreExecutor` into a shared parent class, `MultiprocessingExecutor`. By inheriting from `MultiprocessingExecutor` and a corresponding mixin class (e.g., `GPUExecutor` or `SpyreExecutor`), it should be possible to achieve the desired behaviour with very little device-specific code. Note that something along these lines already exists for the `MultiprocessingXPUExecutor` (e.g., see here), but the design proposed below would give more flexibility for device-specific specialization and would also make it easy to create multiprocessing executors for all supported devices if we want.
The architecture would look something like this:

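As a rough sketch of the proposed composition (the class responsibilities are as described above, but the method names below are placeholders and the interfaces are heavily simplified):

```python
class MultiprocessingExecutor:
    """Device-agnostic logic: spawning worker processes, broadcasting
    requests to them, and collecting results."""

    def _run_workers(self, method: str, *args):
        ...  # common multiprocessing machinery shared by all devices


class GPUExecutor:
    """GPU-specific pieces, e.g. how to create and initialize a GPU worker."""

    def _create_worker(self, rank: int):
        ...  # return a GPU worker bound to device `rank`


class SpyreExecutor:
    """Spyre-specific pieces, e.g. how to create and initialize a Spyre worker."""

    def _create_worker(self, rank: int):
        ...  # return a Spyre worker bound to card `rank`


class MultiprocessingGPUExecutor(MultiprocessingExecutor, GPUExecutor):
    pass


class MultiprocessingSpyreExecutor(MultiprocessingExecutor, SpyreExecutor):
    pass
```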
Testing: We will add tests to verify that the `MultiprocessingSpyreExecutor` behaves as expected for tensor parallel execution when running in eager mode or using the `inductor` backend on CPU. Internally, we will run these tests against the real hardware.
P4: Enable paged attention and continuous batching for Spyre.
TBD
P5: Enable vLLM modeling code to run on Spyre.
TBD
Feedback Period.
2 weeks
CC List.
@njhill @simon-mo @youkaichao @zhuohan123 @comaniac @WoosukKwon @Yard1
Please cc anyone else as you see fit!
Any Other Things.
No response