[RFC] Initial Support for CPUs #3654

Closed
@bigPYJ1151

Progress

Features

The CPU executor plans to support the following features:

  • Basic models of vLLM with FP16/BF16/FP32, except MoE models
  • Tensor-parallel model inference based on Ray
  • AWQ quantization, 8-bit KVCache Quantization
  • Others

Design

Our goal is to port vLLM to CPU devices seamlessly while sharing most of the vLLM core components (e.g., scheduler, cache management, model definitions, Megatron-style model partitioning, ...).

The CPU executor will depend on PyTorch CPU and leverage optimized kernels and features from intel-extension-for-pytorch.

The main changes to vLLM include:

Torch APIs Adaption

The CPU device is supported in PyTorch by default, which allows the CPU executor to share the same model definitions with the GPU executor. Thanks to recent code refactors, many hardcoded cuda device flags have been removed, and Torch APIs are now dispatched based on the device flag from DeviceConfig. For the CPU executor, a new cpu device flag is added.

Sharing the same model definitions and Torch APIs also allows the CPU executor to easily support new models and features in vLLM (e.g., torch.compile).
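
As a rough illustration of dispatching on a device flag rather than a hardcoded `"cuda"` string, here is a minimal sketch. The `DeviceConfig` class below is a hypothetical stand-in written for this example, not vLLM's actual class, and `tensor_kwargs` is an invented helper name.

```python
from dataclasses import dataclass


@dataclass
class DeviceConfig:
    """Hypothetical stand-in for vLLM's DeviceConfig (the real class may differ)."""

    device: str = "cuda"

    def __post_init__(self):
        # Only the two device flags discussed in this RFC are accepted here.
        if self.device not in ("cuda", "cpu"):
            raise ValueError(f"Unsupported device: {self.device!r}")


def tensor_kwargs(config: DeviceConfig) -> dict:
    """Build keyword arguments for tensor factory calls from the device flag,
    so call sites never hardcode "cuda"."""
    return {"device": config.device}


gpu_kwargs = tensor_kwargs(DeviceConfig("cuda"))
cpu_kwargs = tensor_kwargs(DeviceConfig("cpu"))
print(gpu_kwargs, cpu_kwargs)
```

With PyTorch installed, such keyword arguments could be passed to a factory call like `torch.empty(shape, **cpu_kwargs)`, so the same model code runs on either device.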

Custom Ops Adaption

vLLM implements many efficient CUDA kernels and packages them as the _C library via pybind. These kernels are ported to CPU using C++ and OpenMP, keeping the same function signatures so that they can replace the CUDA kernels directly. The CPU custom kernel build procedure is integrated into the vLLM CMake build system as a CMake module.

Currently, all CPU kernels require AVX512 ISA support.
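
To check up front whether a Linux machine meets the AVX512 requirement, one option is to inspect the `flags` line of `/proc/cpuinfo`. The sketch below is not part of the RFC. It only tests for the AVX512 Foundation flag (`avx512f`), which is an assumption, since the RFC does not list which AVX512 subsets the kernels need. The helper names are invented for this example.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the ISA flags listed in Linux /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_avx512f(cpuinfo_text: str) -> bool:
    """Return True if the AVX512 Foundation flag is present (assumed minimum)."""
    return "avx512f" in cpu_flags(cpuinfo_text)


# Sample text in the same shape as /proc/cpuinfo, used instead of the real file
# so the example behaves the same on any machine.
sample = "processor : 0\nflags : fpu sse2 avx2 avx512f avx512bw\n"
supported = has_avx512f(sample)
print(supported)
```

On a real Linux host, the text would come from reading `/proc/cpuinfo`, for example `open("/proc/cpuinfo").read()`.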

Python APIs Adaption

A new CPUExecutor and CPUWorker are added to initialize the environment and the model runner. The CPUModelRunner is derived from the ModelRunner of the GPU code path, because most of the code can be shared. Changes in the GPU code path could break the CPU path, but CPUModelRunner can handle such cases easily by rewriting configurations or overriding member functions.
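
The inheritance approach can be sketched roughly as follows. Both classes here are heavily simplified, hypothetical stand-ins. The attribute `use_cuda_graph` and the methods `prepare_inputs` and `capture_model` are illustrative names chosen for this example and are not taken from the RFC.

```python
class ModelRunner:
    """Hypothetical, simplified stand-in for the GPU-path model runner."""

    def __init__(self, device: str = "cuda"):
        self.device = device
        self.use_cuda_graph = True  # GPU-only optimization in this sketch

    def prepare_inputs(self, token_ids):
        # Shared logic: identical on both code paths.
        return {"token_ids": list(token_ids), "device": self.device}

    def capture_model(self):
        # GPU-specific step that makes no sense on CPU.
        return f"captured graphs on {self.device}"


class CPUModelRunner(ModelRunner):
    """Reuses the shared logic and adjusts only what differs on CPU."""

    def __init__(self):
        super().__init__(device="cpu")
        self.use_cuda_graph = False  # rewrite a configuration value

    def capture_model(self):
        # Override a member function: nothing to capture on CPU.
        return None


runner = CPUModelRunner()
print(runner.prepare_inputs([1, 2, 3]), runner.capture_model())
```

The trade-off is the one described above: the subclass inherits future GPU-path changes automatically, so any new GPU-only behavior has to be neutralized with a further override.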

In particular, unlike the GPU executor, which profiles the available KV cache memory, the CPU executor takes its cache memory size from the swap_space parameter, because memory management on CPU is more complex than on GPU (e.g., NUMA).
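
To make the sizing concrete, the sketch below shows how a fixed swap_space budget could translate into a number of KV cache blocks. The per-block formula (key plus value, across all layers and heads) and the example model dimensions are assumptions made for illustration. They are not stated in this RFC, and the function names are invented.

```python
GiB = 1 << 30


def kv_block_bytes(block_size: int, num_layers: int, num_kv_heads: int,
                   head_size: int, dtype_bytes: int) -> int:
    """Bytes needed by one cache block, assuming it stores keys and values
    (factor 2) for block_size tokens across every layer and KV head."""
    return 2 * block_size * num_layers * num_kv_heads * head_size * dtype_bytes


def num_cpu_blocks(swap_space_gib: float, block_bytes: int) -> int:
    """Number of whole blocks that fit into the swap_space budget."""
    return int(swap_space_gib * GiB) // block_bytes


# Assumed example: 16-token blocks, 32 layers, 32 KV heads of size 128, BF16.
block_bytes = kv_block_bytes(block_size=16, num_layers=32, num_kv_heads=32,
                             head_size=128, dtype_bytes=2)
blocks = num_cpu_blocks(4, block_bytes)
print(block_bytes, blocks)  # 8388608 bytes per block, 512 blocks in 4 GiB
```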

Labels: RFC, unstale (Received activity after being labelled stale), x86-cpu (Related to Intel & AMD CPU)