Currently vLLM's build uses PyTorch's extension builders (torch.utils.cpp_extension), which call Ninja under the hood. This works okay but has the following issues (a minimal sketch of the current pattern is shown after this list):
- Only supports NVIDIA and AMD GPUs.
- Slow sequential builds. This is amplified by adding quantization kernels and LoRA kernels.
- No caching or incremental builds.
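For context, the current approach roughly follows PyTorch's standard extension pattern. The sketch below is illustrative only (the file names and package name are placeholders, not vLLM's actual layout): a single `CUDAExtension` compiled through `BuildExtension`, with `MAX_JOBS` and `NVCC_THREADS` used to keep compiler memory in check.

```python
# Hedged sketch of the current pattern, not vLLM's actual setup.py.
# File names and package name below are placeholders.
import os
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

# PyTorch's builder reads MAX_JOBS to cap parallel compile jobs; passing
# --threads to nvcc trades memory for per-file parallelism.
os.environ.setdefault("MAX_JOBS", "4")
nvcc_threads = int(os.environ.get("NVCC_THREADS", "1"))

setup(
    name="toy_ext",  # placeholder
    ext_modules=[
        CUDAExtension(
            name="toy_ext._C",
            sources=["csrc/ops.cpp", "csrc/kernels.cu"],  # illustrative paths
            extra_compile_args={
                "cxx": ["-O3"],
                "nvcc": ["-O3", f"--threads={nvcc_threads}"],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```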
We would like to ask for the community's help in recommending a technology, prototyping it, and implementing it. Ideally something like CMake or Bazel would work, but it requires some careful thinking (a rough sketch of one possible CMake-wrapper shape is included at the end of this issue).
The requirements:
- Must support multiple hardware architectures (NVIDIA, AMD, Intel, etc.).
- Must support incremental builds, which also implies caching.
- Must support parallel builds.
- Good to have: editor support (by generating a compilation database).
- Ideally it would not OOM like the current setup. Currently, due to the rigid structure, we have to carefully set `MAX_JOBS` and `NVCC_THREADS` to work around the compiler running out of memory. I think this is because nvcc spawns threads for each SM architecture we are compiling for.
- Vaguely, "future proof".
Currently, the "build system" is all in here https://github.com/vllm-project/vllm/blob/main/setup.py