
Call for Help: Proper Build System (CMake, Bazel, etc.) #2654

Closed
@simon-mo

Description


Currently, vLLM's compilation tool uses PyTorch's extension builders, which call Ninja under the hood. This works, but it has the following issues:

  • Only supports NVIDIA and AMD GPUs.
  • Slow, sequential builds. This gets worse as more quantization and LoRA kernels are added.
  • No caching or incremental builds.
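To make the caching and parallelism gaps concrete, here is a minimal sketch of a build driver that keys each compiled object on a hash of its source contents and compiler flags, skips cache hits, and compiles the misses in a thread pool. `cache_key`, `build_all`, and the `compile_fn` callback are hypothetical names for illustration, not part of vLLM or PyTorch. Real tools (ccache, sccache, CMake, Bazel) also track included headers, which this sketch deliberately ignores.

```python
from __future__ import annotations

import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def cache_key(source: Path, flags: list[str]) -> str:
    """Hash the source contents together with the compiler flags.

    Assumption: only the source file itself is hashed; headers it includes
    are NOT tracked, so editing a header would not invalidate the cache.
    """
    h = hashlib.sha256()
    h.update(source.read_bytes())
    h.update("\0".join(flags).encode())
    return h.hexdigest()


def build_all(sources, flags, cache_dir: Path, compile_fn, max_jobs: int = 4):
    """Compile sources in parallel, reusing cached objects whose key matches.

    compile_fn(source, flags) -> bytes is a hypothetical stand-in for running
    the real compiler. Returns the sources that were actually (re)compiled.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)

    def build_one(src: Path):
        obj = cache_dir / (cache_key(src, flags) + ".o")
        if obj.exists():
            return None  # cache hit: same source and same flags as before
        obj.write_bytes(compile_fn(src, flags))
        return src

    # Independent translation units can be compiled concurrently.
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        results = list(pool.map(build_one, sources))
    return [s for s in results if s is not None]
```

Rebuilding with unchanged sources and flags compiles nothing; editing one file recompiles only that file, and changing the flags recompiles everything.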

We would like to ask for the community's help in recommending a technology, then prototyping and implementing it. Ideally something like CMake or Bazel would work, but it requires some careful thinking.

The requirements:

  • Must support multiple hardware architectures (NVIDIA, AMD, Intel, etc.).
  • Must support incremental builds, which also implies caching.
  • Must support parallel builds.
  • Nice to have: editor support (by generating a compilation database).
  • Ideally it would not OOM like the current setup. Because of the rigid structure, we currently have to set MAX_JOBS and NVCC_THREADS carefully to keep the compiler from running out of memory. I think this is because nvcc spawns a thread for each SM architecture we compile for.
  • Vaguely, "future-proof".
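The OOM point can be made concrete with a rough estimate. Assuming each of the MAX_JOBS parallel jobs runs an nvcc process that compiles up to NVCC_THREADS target architectures at once, peak memory grows roughly with MAX_JOBS × min(NVCC_THREADS, number of architectures). The hypothetical helper below picks the largest MAX_JOBS that fits a memory budget; the per-compilation memory is an input you would have to measure, not a known value.

```python
def max_safe_jobs(
    available_gib: float,
    gib_per_compile: float,
    nvcc_threads: int,
    num_archs: int,
    cpu_count: int,
) -> int:
    """Largest MAX_JOBS whose estimated peak memory stays under budget.

    Assumption: each nvcc process runs min(nvcc_threads, num_archs)
    per-architecture compilations concurrently, and each one needs roughly
    gib_per_compile GiB. This is an estimate, not a guarantee.
    """
    concurrent_per_job = max(1, min(nvcc_threads, num_archs))
    peak_per_job = concurrent_per_job * gib_per_compile
    jobs = int(available_gib // peak_per_job)
    # Always allow at least one job, and never exceed the number of CPUs.
    return max(1, min(jobs, cpu_count))
```

For example, `max_safe_jobs(64, 4, 8, 4, 32)` returns 4, while dropping NVCC_THREADS to 1 (`max_safe_jobs(64, 4, 1, 4, 32)`) allows 16. The numbers are placeholders, not measurements. A build system with resource-aware job pools (such as Ninja's `pool` with a `depth`) could enforce this kind of limit per rule instead of through one global setting.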

Currently, the entire "build system" lives in https://github.com/vllm-project/vllm/blob/main/setup.py
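For the editor-support requirement, a compilation database is just a `compile_commands.json` file listing, for each translation unit, the working directory, the file, and the compiler arguments (CMake can emit one via `CMAKE_EXPORT_COMPILE_COMMANDS`, and tools such as clangd read it). As a hedged sketch of what that would take with the current setup.py-based flow, the hypothetical helper below writes such a file from a list of sources and flags; the compiler name and flags in the usage are placeholders.

```python
import json
from pathlib import Path


def write_compile_commands(build_dir: Path, sources, compiler: str, flags) -> Path:
    """Write a Clang-style compile_commands.json for the given sources.

    Each entry uses the "arguments" (list) form plus "directory", "file",
    and the optional "output" field of the JSON Compilation Database format.
    """
    build_dir = Path(build_dir).resolve()
    build_dir.mkdir(parents=True, exist_ok=True)
    entries = []
    for src in sources:
        src = Path(src).resolve()
        obj = build_dir / (src.stem + ".o")  # hypothetical object naming
        entries.append({
            "directory": str(build_dir),
            "file": str(src),
            "arguments": [compiler, *flags, "-c", str(src), "-o", str(obj)],
            "output": str(obj),
        })
    out = build_dir / "compile_commands.json"
    out.write_text(json.dumps(entries, indent=2))
    return out
```

A call such as `write_compile_commands(Path("build"), ["csrc/example.cu"], "nvcc", ["-O3"])` (hypothetical path) would produce one entry per source.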

Metadata


Labels: help wanted (Extra attention is needed)
