Description
Motivation.
IBM recently announced its Spyre AI accelerator at Hot Chips 2024. This accelerator has been designed, in collaboration with IBM Research, to scale up enterprise AI workloads running on IBM's mainframe systems (IBM Z), as well as on IBM's Power platform. Since we are building our inference stack on top of vLLM, we would like to enable support for IBM Spyre within the vLLM framework.
Spyre has been designed to fit seamlessly into the PyTorch ecosystem via torch.compile. Specifically, IBM Research has developed a new backend for torch.compile that compiles torch FX graphs for execution on the Spyre hardware. In this sense, we envision that Spyre support in vLLM can work in a similar way to how the TPU support works today (e.g., see here).
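For context, a custom torch.compile backend is simply a callable that receives the captured FX `GraphModule` together with example inputs and returns a compiled callable. The snippet below is a minimal, hypothetical sketch of that mechanism (the `toy_spyre_backend` name and its body are placeholders, not IBM Research's actual backend):

```python
from typing import Callable, List

import torch
from torch import fx


def toy_spyre_backend(gm: fx.GraphModule, example_inputs: List[torch.Tensor]) -> Callable:
    # A real Spyre backend would lower the FX graph to the accelerator here.
    # This placeholder just prints the captured graph and runs it unchanged.
    gm.graph.print_tabular()
    return gm.forward


model = torch.nn.Linear(16, 16)
compiled_model = torch.compile(model, backend=toy_spyre_backend)
out = compiled_model(torch.randn(4, 16))
```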
Today, there are two key limitations that affect this integration and need to be worked around. Specifically:
- Spyre only supports execution of the modeling code from IBM's open-source Foundation Model Stack (`fms`). This is only a temporary limitation; the end goal is that, via torch.compile, we can also run the native vLLM modeling code on Spyre. We hope that recent efforts to make vLLM models torch-compilable will significantly accelerate this effort.
- Spyre does not currently support paged attention or continuous batching. This means that, after prefilling a batch, we need to keep decoding until all of the sequences within that batch are finished. Our team is working to remove this limitation in the near future.
Proposed Change.
In this RFC, we propose the following sequence of PRs to enable IBM Spyre support in vLLM:
- P1: Add support for a single Spyre card via a new `SpyreExecutor` class.
- P2: Changes to the scheduling algorithm to disable continuous batching.
- P3: Enable TP execution across multiple Spyre cards via a `MultiprocessingSpyreExecutor` class.
- P4: Enable paged attention and continuous batching for Spyre.
- P5: Enable vLLM modeling code to run on Spyre.
While much of the work here (P1, P2, P3) has already been completed in a private fork, we plan to upstream the changes as a sequence of smaller PRs to make them easier to review. Below we discuss the planned changes for each PR.
P1: Add support for a single Spyre card via a new `SpyreExecutor` class
We will introduce a set of classes, inheriting from the core vLLM classes, that will enable execution on a single Spyre device. Architecturally, this will look very similar to the equivalent classes that were introduced for running on AWS Inferentia (e.g., `NeuronExecutor`, `NeuronWorker`, `NeuronModelRunner`, `NeuronCausalLM`). In the same way that the `NeuronModelRunner` uses the modeling code from the `transformers_neuronx` package, these new classes will execute the modeling code from IBM's `fms` package.
In the diagram below, we compare the proposed Spyre classes with the corresponding classes that already exist for the AWS Inferentia support:

Since Spyre works via torch.compile, we need to ensure that all compilation is triggered at init time, so that it does not occur on the critical path (i.e., while serving user requests). This PR will therefore also introduce a routine for warming up the inference server when using Spyre, triggering compilation of all required shapes (e.g., prompt length, number of output tokens, batch size). We will add code to ensure that batches get padded to one of the compiled shapes before execution. This behaviour is akin to what happens today in vLLM for CUDA graphs, and presumably something like this warmup will also be needed once vLLM starts using torch.compile more extensively. This could be one area to explore commonality with other parts of the codebase.
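As a rough, hypothetical sketch of what such a warmup and shape-padding routine could look like (the shape list, function names, and the `run_prefill`/`run_decode` callables are illustrative placeholders, not the final implementation):

```python
from typing import Callable, List, Tuple

# Hypothetical warmup shapes (prompt_length, max_new_tokens, batch_size);
# in practice these would come from the server configuration.
WARMUP_SHAPES: List[Tuple[int, int, int]] = [
    (64, 20, 4),
    (128, 20, 4),
    (2048, 256, 1),
]


def warm_up(run_prefill: Callable[[int, int], None],
            run_decode: Callable[[int], None]) -> None:
    """Run a dummy prefill and a few decode steps for every shape we intend to
    serve, so torch.compile is triggered at init time, not on the request path."""
    for prompt_len, new_tokens, batch_size in WARMUP_SHAPES:
        run_prefill(prompt_len, batch_size)
        for _ in range(min(new_tokens, 2)):
            run_decode(batch_size)


def pad_to_compiled_shape(prompt_len: int, batch_size: int) -> Tuple[int, int]:
    """Pick the smallest warmed-up (prompt_length, batch_size) that fits the
    incoming batch; inputs are then padded up to that shape before execution."""
    candidates = [(p, b) for (p, _, b) in WARMUP_SHAPES
                  if p >= prompt_len and b >= batch_size]
    if not candidates:
        raise ValueError("Batch does not fit any compiled shape")
    return min(candidates)
```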
Testing: While testing on the real hardware can only be performed internally for now, we can test the vast majority of the integration on the CPU by either (a) running in eager mode or (b) using torch.compile with the `inductor` backend. Thus, in this PR we will also add a set of unit and integration tests to verify that everything behaves as expected. The tests will focus on the offline mode, since changes to the scheduling algorithm are needed to support the online mode (see P2). We will also add a `Dockerfile.spyre` containing all necessary dependencies (e.g., FMS) in which the tests can be executed. Whether we could have these tests running as part of vLLM's CI/CD is something we would like to discuss.
P2: Changes to the scheduling algorithm to disable continuous batching.
We need to introduce a few changes to the scheduling algorithm to work around the lack of continuous batching support. Specifically:
- We must not schedule another prefill until all decodes in the running batch are finished (a one-line change).
- We need to introduce some logic to decide how to batch requests together based on their prompt lengths and maximum output tokens, in order to best fit the shapes that have been compiled on Spyre during the warmup phase.
These changes must be conditional and must not affect the behaviour of the scheduler on existing supported devices. They could either be applied within the scheduler itself (e.g., by checking `is_spyre()`), or we could try to "plug in" an alternate scheduler design. This is one of the design choices we would like feedback on.
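For illustration, here is a minimal sketch of the first option, assuming a hypothetical `is_spyre` flag and a heavily simplified view of the scheduler state (neither reflects the actual vLLM scheduler internals):

```python
class SchedulerSketch:
    """Simplified stand-in for the vLLM scheduler (illustrative only)."""

    def __init__(self, is_spyre: bool):
        self.is_spyre = is_spyre  # would be a platform check in practice
        self.running = []         # sequences currently being decoded

    def can_schedule_new_prefill(self) -> bool:
        # Without continuous batching, a new prefill may only start once the
        # current running batch has fully finished decoding.
        if self.is_spyre and self.running:
            return False
        return True
```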
Testing: As part of this PR, we will also introduce tests to cover the integration with the `MQLLMEngine` and online operation.
P3: Enable TP execution across multiple Spyre cards via a `MultiprocessingSpyreExecutor` class.
We have found that the `MultiprocessingGPUExecutor` can be easily adapted into a `MultiprocessingSpyreExecutor` to enable TP execution across multiple Spyre devices in parallel. However, to reduce code duplication, we propose refactoring the code common to `MultiprocessingGPUExecutor` and `MultiprocessingSpyreExecutor` into a shared parent class, `MultiprocessingExecutor`. By inheriting from `MultiprocessingExecutor` and a corresponding mixin class (e.g., `GPUExecutor` or `SpyreExecutor`), it should be possible to achieve the desired behaviour with very little device-specific code. Note that something along these lines already exists for the `MultiprocessingXPUExecutor` (e.g., see here), but the design proposed below would give more flexibility for device-specific specialization and would also make it easy to create multiprocessing executors for all supported devices if we want.
The architecture would look something like this:

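As a rough sketch of the proposed composition (the class responsibilities are as described above, but the method names below are placeholders and the interfaces are heavily simplified):

```python
class MultiprocessingExecutor:
    """Device-agnostic logic: spawning worker processes, broadcasting
    requests to them, and collecting results."""

    def _run_workers(self, method: str, *args):
        ...  # common multiprocessing machinery shared by all devices


class GPUExecutor:
    """GPU-specific pieces, e.g. how to create and initialize a GPU worker."""

    def _create_worker(self, rank: int):
        ...  # return a GPU worker bound to device `rank`


class SpyreExecutor:
    """Spyre-specific pieces, e.g. how to create and initialize a Spyre worker."""

    def _create_worker(self, rank: int):
        ...  # return a Spyre worker bound to card `rank`


class MultiprocessingGPUExecutor(MultiprocessingExecutor, GPUExecutor):
    pass


class MultiprocessingSpyreExecutor(MultiprocessingExecutor, SpyreExecutor):
    pass
```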
Testing: We will add tests to verify that the `MultiprocessingSpyreExecutor` behaves as expected for tensor parallel execution when running in eager mode or using the `inductor` backend on CPU. Internally, we will run these tests against the real hardware.
P4: Enable paged attention and continuous batching for Spyre.
TBD
P5: Enable vLLM modeling code to run on Spyre.
TBD
Feedback Period.
2 weeks
CC List.
@njhill @simon-mo @youkaichao @zhuohan123 @comaniac @WoosukKwon @Yard1
Please cc anyone else as you see fit!
Any Other Things.
No response