## Progress
- Integrate CPU executor to support the basic model inference (BF16/FP32) without TP.
- [Hardware][Intel] Add CPU inference backend #3634
- [Hardware][Intel] Isolate CPUModelRunner and ModelRunner for better maintenance #3824
- [CI/BUILD] enable intel queue for longer CPU tests #4113
- [Hardware][Intel] Optimize CPU backend and add more performance tips #4971
- [Hardware][Intel] Support CPU inference with AVX2 ISA #5452
- [Hardware][Intel] Generate custom activation ops using torch.compile for CPU backend. #5446
- Support FP16 model inference.
- Support TP inference for multiple CPU sockets inside the same node.
- Support model and KV cache quantization.
## Features
The CPU executor is planned to support the following features:
- Basic models of vLLM with FP16/BF16/FP32, except MoE models
- Tensor-parallel model inference based on Ray
- AWQ quantization, 8-bit KV cache quantization
- Others
## Design
Our target is to port vLLM to CPU devices seamlessly while sharing most of the vLLM core components (e.g., scheduler, cache management, model definitions, Megatron-style model partitioning, ...).
The CPU executor will depend on PyTorch CPU and leverage optimized kernels and features from intel-extension-for-pytorch.
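As a rough sketch of that dependency, the snippet below shows one way a model could be prepared for CPU inference with intel-extension-for-pytorch; `ipex.optimize` is the library's public entry point, while the toy model and shapes are purely illustrative.

```python
import torch
import intel_extension_for_pytorch as ipex

# Toy stand-in for a loaded model; in vLLM this would come from the model runner.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
).eval()

# ipex.optimize applies CPU-oriented optimizations such as operator fusion
# and BF16 weight prepacking.
model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(torch.randn(8, 4096))
```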
The main changes to vLLM include:
### Torch APIs Adaptation
The CPU device is supported in PyTorch by default, which allows the CPU executor to share the same model definitions with the GPU executor. Thanks to recent code refactoring, many hardcoded `cuda` device flags have been removed, and Torch APIs are dispatched based on the device flag from `DeviceConfig`. For the CPU executor, a new `cpu` device flag is added.
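A minimal sketch of that dispatch pattern, using a simplified stand-in for `DeviceConfig` (the allocation helper here is hypothetical):

```python
import torch

class DeviceConfig:
    # Simplified stand-in for vLLM's DeviceConfig: it carries the device
    # flag ("cpu" or "cuda") that all Torch calls dispatch on.
    def __init__(self, device: str = "cpu") -> None:
        self.device = torch.device(device)

def allocate_kv_block(device_config: DeviceConfig, shape,
                      dtype=torch.bfloat16) -> torch.Tensor:
    # No hardcoded "cuda" here: the same code path serves both executors.
    return torch.empty(shape, dtype=dtype, device=device_config.device)

cpu_block = allocate_kv_block(DeviceConfig("cpu"), (16, 128))
```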
Sharing the same model definitions and Torch APIs also allows the CPU executor to easily support new models and features in vLLM (e.g., `torch.compile`).
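PR #5446 applies this idea by generating custom activation ops with `torch.compile`; a minimal sketch of the pattern follows (the gated-SiLU activation is a common LLM MLP activation, and the shapes are illustrative, not vLLM's actual op):

```python
import torch
import torch.nn.functional as F

@torch.compile  # Inductor can generate a fused CPU kernel for this op
def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # Gated activation used in many LLM MLP blocks: SiLU(x1) * x2,
    # where x packs x1 and x2 along the last dimension.
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

y = silu_and_mul(torch.randn(4, 2 * 11008))
```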
### Custom Ops Adaptation
vLLM implements many efficient CUDA kernels, packaged as the `_C` library via pybind. These kernels are ported to CPU using C++ and OpenMP, with the same function signatures, so they can replace the CUDA kernels directly. The CPU custom-kernel build procedure is integrated into the vLLM CMake build system as a CMake module.
Currently, all of the CPU kernels require `AVX512` ISA support.
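A hedged runtime guard for that requirement might look like the following; `torch.backends.cpu.get_cpu_capability()` is a public PyTorch API, while the guard itself is an illustration rather than vLLM's actual check:

```python
import torch

# Fail fast if PyTorch did not detect AVX512 on this machine.
# (Illustrative guard, not vLLM's actual check.)
if "AVX512" not in torch.backends.cpu.get_cpu_capability():
    raise RuntimeError("The CPU kernels currently require the AVX512 ISA.")
```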
### Python APIs Adaptation
New `CPUExecutor` and `CPUWorker` classes are added to initialize the environment and the model runner. `CPUModelRunner` is derived from `ModelRunner` of the GPU code path, because most of the code can be shared. Even though this carries potential risks from changes in the GPU code path, `CPUModelRunner` can easily address them by rewriting configurations or overloading member functions.
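An illustrative shape of that inheritance (the class names come from the text above; the constructor fields and overridden method are hypothetical):

```python
class ModelRunner:
    # Simplified stand-in for the GPU-path ModelRunner.
    def __init__(self, model_config, device_config):
        self.model_config = model_config
        self.device_config = device_config

    def load_model(self):
        ...  # shared loading logic

class CPUModelRunner(ModelRunner):
    def __init__(self, model_config, device_config):
        # Rewrite configurations that do not apply on CPU before the
        # shared initialization runs (hypothetical example).
        device_config.device = "cpu"
        super().__init__(model_config, device_config)

    def load_model(self):
        # Overload only the member functions whose GPU behavior
        # does not fit the CPU path.
        ...
```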
In particular, unlike the GPU executor, which profiles the available KV cache memory, the cache memory in the CPU executor is specified by the `swap_space` parameter, because CPU memory management is more complex than on GPU (e.g., NUMA).
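A back-of-the-envelope sketch of how a user-specified `swap_space` (in GiB) could translate into a number of KV cache blocks; all model shapes and the block size below are example values, not vLLM's actual configuration:

```python
def num_cpu_cache_blocks(swap_space_gib: float,
                         block_size: int = 16,    # tokens per block
                         num_layers: int = 32,    # example model shape
                         num_kv_heads: int = 32,
                         head_size: int = 128,
                         dtype_bytes: int = 2) -> int:  # BF16/FP16
    # Each block holds K and V for block_size tokens across all layers.
    bytes_per_block = (2 * num_layers * num_kv_heads * head_size
                       * block_size * dtype_bytes)
    return int(swap_space_gib * 1024**3) // bytes_per_block

print(num_cpu_cache_blocks(4.0))  # blocks that fit in a 4 GiB swap_space
```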