[RFC] Initial Support for CPUs #3654
Comments
Thanks for your excellent work! Looking forward to supporting inference on ARM CPUs. It would also be great to have support for Ray distributed computing.
could you give me a cpu inference example?
But I got an error. Are there any engine arguments that need to be added here?
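For reference, a minimal sketch of offline CPU inference, assuming vLLM was installed with the CPU backend built (on an AVX512 machine); the model name is only a placeholder and parameter names may differ across vLLM versions:

```python
from vllm import LLM, SamplingParams

# swap_space sizes the CPU cache memory (in GiB) for the CPU backend, per the RFC below.
llm = LLM(model="facebook/opt-125m", swap_space=4)

sampling_params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```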
@bigPYJ1151 Are you planning to support AVX/AVX2 to enable a broader range of Intel/x86 CPUs?
Hi @mgiessing, it is not in our plan right now, but we may add it after the basic features are finished.
Could you help with #4415? I was trying to compile it with the Intel compiler, but I had some issues and I think I almost have it working.
Hi @bigPYJ1151, I'd like to ask why the initial CPU support defines device-specific vector types in https://github.com/vllm-project/vllm/blob/main/csrc/cpu/cpu_types_x86.hpp? PyTorch contains a vector type, Vectorized, that appears to serve the same purpose while also being architecture-agnostic. Could the custom ops for CPU switch to using this PyTorch type to make the CPU backend architecture-agnostic (i.e., PowerPC, AArch64, etc.)?
Hi @hmellor Yes, PyTorch contains such vector structures, and it is feasible to use them in the CPU backend. I wasn't aware of them before, so I defined the custom types 🤣. vLLM is adopting torch.compile, and some custom ops will be generated by JIT, so the number of custom types will be very limited after we clean them up. Then we can try to replace them with the PyTorch vectors.
That's great to hear! Is it just #7110 that we're waiting for, or are there other PRs?
Yes, after #7110 I think we can do some code refactoring.
Progress
Features
The CPU executor plans to support the following features:
Design
Our target is to seamlessly port vLLM to CPU devices and share most of vLLM's core components (e.g., scheduler, cache management, model definitions, Megatron-style model partitioning, ...).
The CPU executor will depend on PyTorch CPU and leverage optimized kernels and features from intel-extension-for-pytorch.
The main changes to vLLM include:
Torch APIs Adaptation
The CPU device is supported in PyTorch by default, which allows the CPU executor to share the same model definitions with the GPU executor. Thanks to recent code refactors, many hardcoded `cuda` device flags have been removed, and Torch APIs are dispatched based on the device flag from `DeviceConfig`. For the CPU executor, a new `cpu` device flag is added. Sharing the same model definitions and Torch APIs also allows the CPU executor to easily support new models and features in vLLM (e.g., `torch.compile`).
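As a rough illustration of this pattern (using a simplified stand-in for vLLM's `DeviceConfig`, not the actual class), model code can stay device-agnostic by always taking the device from the config:

```python
import torch

class DeviceConfig:
    """Simplified stand-in for vLLM's DeviceConfig (illustrative only)."""
    def __init__(self, device: str = "cpu"):
        self.device = torch.device(device)

device_config = DeviceConfig("cpu")  # would be "cuda" on the GPU code path

# The same model code works on either device because it never hardcodes "cuda".
layer = torch.nn.Linear(8, 8).to(device_config.device)
x = torch.randn(4, 8, device=device_config.device)
print(layer(x).device)  # -> cpu
```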
Custom Ops Adaptation
vLLM implements many efficient CUDA kernels, packaged as the `_C` library via pybind. These kernels are ported to CPU using C++ and OpenMP, with the same function signatures, so they can replace the CUDA kernels directly. The CPU custom kernel build procedure is integrated into the vLLM CMake build system as a CMake module. Currently, all CPU kernels require `AVX512` ISA support.
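Since the AVX512 requirement is a common stumbling block, here is a small illustrative helper (not part of vLLM) for checking AVX-512 Foundation support on Linux before attempting a CPU build:

```python
# Reads /proc/cpuinfo and looks for the avx512f flag; Linux only.
def has_avx512f(cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                return "avx512f" in line.split()
    return False

if __name__ == "__main__":
    print("AVX512F supported:", has_avx512f())
```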
Python APIs Adaptation
New `CPUExecutor` and `CPUWorker` classes are added to initialize the environment and the model runner. The `CPUModelRunner` is derived from the `ModelRunner` of the GPU code path, because most of the code can be shared. Even though this carries some risk from changes in the GPU code path, `CPUModelRunner` can easily fix them by rewriting configurations or overloading member functions.
In particular, unlike the GPU executor, which profiles the available KV cache memory, the cache memory in the CPU executor is specified by the `swap_space` parameter, because CPU memory management is more complex than on GPU (e.g., NUMA).