
Cmake based build system #2830

Merged: 76 commits into vllm-project:main from the cmake branch on Mar 18, 2024

Conversation

@bnellnm (Contributor) commented Feb 10, 2024

This PR replaces the setup.py extension build system with a cmake build system. I've tried to preserve and move all the existing logic about flags and supported architectures into the CMakeLists.txt file so that it is standalone, i.e. you should be able to build the C++/CUDA artifacts without going thru python or setup.py. All the new setup.py does is select which extensions to build and the threading level for the nvcc compiler. The cmake build should also honor most of the same environment variables as setup.py. Some notable exceptions are NVCC_THREADS and VLLM_ENABLE_PUNICA_KERNELS which are replaced by command line cmake defines, e.g. cmake ... -DNVCC_THREADS=8.

The CMakeLists.txt defines three targets: _C, _moe_C and _punica_C, which correspond to the names of the object files that were previously generated by the old setup.py.

There's also a default target which triggers the proper combination of _C, _moe_C and _punica_C based on the platform, architecture and VLLM_ENABLE_PUNICA_KERNELS variable. This makes standalone use of cmake easier since the user doesn't need to know which specific extension targets need to be built for the current platform, e.g.

mkdir build && cd build
cmake -G Ninja -DVLLM_PYTHON_EXECUTABLE=$(which python3) -DVLLM_INSTALL_PUNICA_KERNELS=1 -DCMAKE_LIBRARY_OUTPUT_DIRECTORY=../vllm ..
cmake --build . --target default  # will build _C, _moe_C and _punica_C if appropriate.

The extension-selection logic in CMakeLists.txt duplicates the corresponding logic in setup.py, and the two should be kept in sync.

Some notable cmake variables for controlling versions (a short sketch follows this list):

  • PYTHON_SUPPORTED_VERSIONS - lists the python versions supported by vllm. If a new version of python is needed (or an old one removed), this variable should be updated.
  • CUDA_SUPPORTED_ARCHS - a semicolon-separated list (i.e. a cmake list) of supported NVIDIA architectures. (this used to be NVIDIA_SUPPORTED_ARCHS in setup.py)
  • HIP_SUPPORTED_ARCHS - a cmake list of supported AMD/ROCM architectures. (this used to be ROCM_SUPPORTED_ARCHS in setup.py)
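
Roughly, they look like this in CMakeLists.txt (the values below are illustrative, not the authoritative lists):

# Illustrative values only -- consult CMakeLists.txt for the real lists.
set(PYTHON_SUPPORTED_VERSIONS "3.8" "3.9" "3.10" "3.11")
set(CUDA_SUPPORTED_ARCHS "7.0;7.5;8.0;8.6;8.9;9.0")
set(HIP_SUPPORTED_ARCHS "gfx90a;gfx942")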

The source files for each extension are controlled by the following variables, which must be updated if the list of files changes (note that globs are not used since this is not recommended practice for cmake); a short sketch of the pattern follows the list:

  • VLLM_EXT_SRC for the _C extension.
  • VLLM_MOE_EXT_SRC for the _moe_C extension.
  • VLLM_PUNICA_EXT_SRC for the _punica_C extension.
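
As a sketch of the pattern (the file names below are examples, not the complete lists):

# Example entries only; the real lists enumerate every kernel source explicitly.
set(VLLM_EXT_SRC
  "csrc/cache_kernels.cu"
  "csrc/pybind.cpp")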

Incremental builds should be supported "out of the box", and ccache will also be used if it is present. The build can also be done with traditional makefiles if ninja isn't installed. I've tried to balance the build parallelism with the number of nvcc threads, but this is a heuristic that might need to be tweaked.
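
For reference, the conventional way to pick up ccache in cmake is via compiler launchers; a minimal sketch of that mechanism (not necessarily the exact wiring used here):

# Use ccache as a compiler launcher when it is available.
find_program(CCACHE_FOUND ccache)
if(CCACHE_FOUND)
  set(CMAKE_CXX_COMPILER_LAUNCHER ccache)
  set(CMAKE_CUDA_COMPILER_LAUNCHER ccache)
endif()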

The build flavor is controlled by the CMAKE_BUILD_TYPE environment variable with the default being RelWithDebInfo. To change the default, edit the cmake_build_ext class in setup.py.
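
For example, assuming CMAKE_BUILD_TYPE is read from the environment by setup.py as described:

CMAKE_BUILD_TYPE=Release pip install -e .       # python-driven build
cmake -DCMAKE_BUILD_TYPE=Release -G Ninja ..    # standalone cmake configure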

None of the build systems I looked at (cmake, meson, bazel) had any specific support for other GPU architectures, although the popularity of cmake probably makes it the most likely to get support for new platforms.

I've tried to comment the CMakeLists.txt file with a description of the configuration process. There were a few quirky bits that needed special handling. In particular, setting the CUDA architecture flags required a somewhat convoluted process due to how torch/cmake interact.

For a ROCm build, there is an additional "hipify" preprocessor step that is run on the CUDA source files. The hipify.py script is a command line wrapper around pytorch's hipify function that is used by CMakeLists.txt. For uniformity, I've hooked the preprocessor up for all targets (not just _C) even though _moe_C and _punica_C currently aren't compiled (or supported?) for ROCm.

resolves #2654

@bnellnm bnellnm force-pushed the cmake branch 2 times, most recently from 4cb0650 to e9ae52d on February 11, 2024 03:02
@bnellnm bnellnm changed the title from Cmake to [WIP] Cmake on Feb 11, 2024
@robertgshaw2-neuralmagic (Collaborator) commented Feb 12, 2024

@bnellnm I have access to some AMD GPUs (mi-210/mi-250) if we need to do any testing there, just lmk

@bnellnm bnellnm force-pushed the cmake branch 2 times, most recently from 4212328 to 1f1d8c3 on February 22, 2024 06:30
@bnellnm bnellnm changed the title from [WIP] Cmake to Cmake based build system on Feb 22, 2024
@bnellnm bnellnm marked this pull request as ready for review February 22, 2024 19:14
@lroberts7 commented

@bnellnm is there a build command for nvidia gpu with this new setup?

I'd like to try the build on our a100 to confirm I can still build once this is merged so I can continue to test and provide comments on open PRs.

I'm currently able to build on tip of main. Do I need to set any specific environment variables or should something like the usual:

mkdir build && cd build
cmake ..

work?

@bnellnm (Contributor, Author) commented Feb 22, 2024

(Quoting @lroberts7's question above about whether the usual mkdir build && cd build; cmake .. flow works for an NVIDIA GPU.)

You should be able to use cmake directly w/o going thru python. The cmake build should honor the same pytorch architecture environment variables, e.g. TORCH_CUDA_ARCH_LIST. I just verified that the following works:

mkdir build && cd build
cmake -G Ninja ..
cmake --build . --target _C

The only caveat is that cmake will try to build all the extensions (_C, _moe_C, _punica_C) by default, so if you only want to build one, e.g. _C you'll have to specify that as a target.

@bnellnm bnellnm force-pushed the cmake branch 2 times, most recently from 85a76a5 to 97ca1cf on February 26, 2024 18:01
@simon-mo (Collaborator) left a comment

should we rename supported to default?

(Resolved review threads on setup.py and CMakeLists.txt.)

On an outdated CMakeLists.txt snippet:

set(VLLM_PUNICA_GPU_ARCHES)

# Process each `gencode` flag.
foreach(ARCH ${VLLM_CUDA_ARCH_FLAGS})

A Collaborator commented: Thank you so much for supporting the existing workflow of VLLM_CUDA_ARCH_FLAGS. Greatly appreciated!

@simon-mo (Collaborator) commented

This looks great to me overall. @pcmoritz please do a detail pass.

(Additional resolved review threads on CMakeLists.txt and Dockerfile.)

@bnellnm (Contributor, Author) commented Feb 29, 2024

@pcmoritz, @simon-mo I had a question about the punica kernels setup. It looks like the existing setup.py is checking the physical CUDA devices to see if they are supported (>= 8.0) rather than just depending on whatever TORCH_CUDA_ARCH_LIST was set to. This seems wrong if we are cross compiling? It seems to me that checking VLLM_INSTALL_PUNICA_KERNELS and making sure that the target arch list for punica only contains arches >= 8.0 would be more correct?
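
A rough sketch of that filtering approach (variable and macro names here are illustrative, not the exact CMakeLists.txt code):

# Keep only arches that can run punica (>= 8.0) from the target list,
# instead of probing the physical devices.
set(VLLM_PUNICA_GPU_ARCHES)
foreach(ARCH ${VLLM_CUDA_ARCHES})
  string_to_ver(ARCH_VER ${ARCH})
  if(ARCH_VER VERSION_GREATER_EQUAL 8.0)
    list(APPEND VLLM_PUNICA_GPU_ARCHES ${ARCH})
  endif()
endforeach()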

I was trying to replicate the logic from setup.py in the CMakeLists.txt and was having some trouble, and noticed that in the buildkite docker there don't seem to be any GPUs, e.g. from a recent log (i.e. No CUDA devices found):

#24 5.315 -- CUDA supported arches: 7.0;7.5;8.0;8.6;8.9;9.0
#24 5.315 -- CUDA target arches: 70;75;80;86;89;90;90-virtual
#24 7.002 -- Punica target arches: 80;86;89;90;90-virtual
#24 7.002 -- Enabling C extension.
#24 7.002 -- Enabling moe extension.
#24 7.002 -- arches: OFF
#24 7.002 -- native arches: No CUDA devices found.-real
#24 7.002 -- ARCH_VER_STR:
#24 7.002 -- Unable to add punica extension due to device version  < 8.0.
#24 7.002 CMake Error at CMakeLists.txt:275 (string_to_ver):
#24 7.002   string_to_ver Macro invoked with incorrect arguments for macro named:
#24 7.002   string_to_ver

The loop in setup.py that checks the device capabilities never gets run in this situation either.

@pcmoritz (Collaborator) commented Mar 4, 2024

@bnellnm Thanks for all the updates, I have one more question: Have you looked into whether the hipification can be done without duplicating the logic from pytorch? How does the pytorch based CMake build system do it? If we can do it the same way without poking through the abstractions, it will very likely be more maintainable going forward.

On the punica question, I don't exactly know -- your logic seems simpler. I think we need to make sure that these two things keep working:

  • Punica will work for the architectures that it is supported on for the release wheels
  • For developers / people who compile vllm themselves, if they switch on the VLLM_INSTALL_PUNICA_KERNELS flag and their hardware supports punica, it will just work

@bnellnm (Contributor, Author) commented Mar 4, 2024

(Quoting @pcmoritz's questions above about hipification and the punica kernels.)

@pcmoritz Thanks for all the good feedback.

The hipify.py script I added is a CLI wrapper around the "hipification" process used by pytorch's extension system. When building through python it is invoked from torch.utils.cpp_extension.CUDAExtension directly and not through cmake. The code in pytorch appears to be a copy (or modified copy) of https://github.com/ROCm/hipify_torch. I decided to go thru pytorch's version of hipify since I didn't want to add an extra external dependency and I didn't know if there had been any customization or special dependency on the existing version of the code.
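
Very roughly, the wrapper amounts to something like the following (a sketch only; the argument names are illustrative and may not match hipify.py exactly):

import argparse
from torch.utils.hipify.hipify_python import hipify

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-p", "--project_dir", required=True)
    parser.add_argument("-o", "--output_dir", required=True)
    parser.add_argument("sources", nargs="*")
    args = parser.parse_args()

    # Translate the CUDA sources to HIP, writing results under output_dir.
    hipify(project_directory=args.project_dir,
           output_directory=args.output_dir,
           extra_files=args.sources,
           show_detailed=True,
           is_pytorch_extension=True)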

As for the punica kernels, I'm pretty sure they will work as expected since the build now follows the same process as the other kernels, with the exception of filtering out additional unsupported architectures. And manually setting TORCH_CUDA_ARCH_LIST will allow compiling the punica kernels no matter what the version is for the underlying hardware. If it turns out that there are no supported architectures in TORCH_CUDA_ARCH_LIST or in the detected architectures, then the build will fail.
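
For example, with the standalone cmake flow from the description, pinning TORCH_CUDA_ARCH_LIST forces the punica arch list regardless of the local GPU (a sketch; the flags mirror the earlier example):

mkdir build && cd build
TORCH_CUDA_ARCH_LIST="8.0;8.6;8.9;9.0" cmake -G Ninja -DVLLM_INSTALL_PUNICA_KERNELS=1 ..
cmake --build . --target _punica_C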

@pcmoritz (Collaborator) commented Mar 6, 2024

Here are some benchmarks -- these all look very reasonable:

On this branch, time pip install -e . takes

real    3m55.800s
user    11m1.568s
sys     1m15.713s

On master, time pip install -e . takes

real    3m33.912s
user    10m48.451s
sys     0m54.333s

Incremental compilation on this branch (touch csrc/cache_kernels.cu && time python setup.py develop):

real    0m58.882s
user    0m57.193s
sys     0m27.270s

and incremental compilation with cmake (touch csrc/cache_kernels.cu && time cmake --build . --target default):

real    0m37.769s
user    0m34.976s
sys     0m2.763s

On master (touch csrc/cache_kernels.cu && time python setup.py develop):

real    0m46.619s
user    0m42.561s
sys     0m12.813s

@simon-mo simon-mo enabled auto-merge (squash) March 18, 2024 20:35
@simon-mo simon-mo disabled auto-merge March 18, 2024 21:15
@simon-mo simon-mo merged commit 9fdf3de into vllm-project:main Mar 18, 2024
31 checks passed
@hliuca (Contributor) commented Mar 18, 2024

The changes in setup.py messed up ROCm/HIP support.

@simon-mo (Collaborator) commented

Hi @hliuca, this PR compiled on an AMD machine. Can you elaborate on the issue?

@hliuca (Contributor) commented Mar 18, 2024

hon-310/vllm/model_executor/layers/fused_moe/configs
running build_ext
Traceback (most recent call last):
  File "/workdir/vllm-mlperf/setup.py", line 338, in <module>
    setup(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/__init__.py", line 103, in setup
    return distutils.core.setup(**attrs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
    return run_commands(dist)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
    dist.run_commands()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
    self.run_command(cmd)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/dist.py", line 963, in run_command
    super().run_command(command)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
    cmd_obj.run()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/command/install.py", line 85, in run
    self.do_egg_install()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/command/install.py", line 137, in do_egg_install
    self.run_command('bdist_egg')
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/dist.py", line 963, in run_command
    super().run_command(command)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
    cmd_obj.run()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/command/bdist_egg.py", line 167, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/command/bdist_egg.py", line 153, in call_command
    self.run_command(cmdname)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/dist.py", line 963, in run_command
    super().run_command(command)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
    cmd_obj.run()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/_distutils/command/install_lib.py", line 111, in build
    self.run_command('build_ext')
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/dist.py", line 963, in run_command
    super().run_command(command)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
    cmd_obj.run()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 89, in run
    _build_ext.run(self)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 345, in run
    self.build_extensions()
  File "/workdir/vllm-mlperf/setup.py", line 157, in build_extensions
    self.configure(ext)
  File "/workdir/vllm-mlperf/setup.py", line 125, in configure
    num_jobs, nvcc_threads = self.compute_num_jobs()
  File "/workdir/vllm-mlperf/setup.py", line 64, in compute_num_jobs
    nvcc_cuda_version = get_nvcc_cuda_version()
  File "/workdir/vllm-mlperf/setup.py", line 237, in get_nvcc_cuda_version
    nvcc_output = subprocess.check_output([CUDA_HOME + "/bin/nvcc", "-V"],
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

@hliuca (Contributor) commented Mar 18, 2024

def _is_cuda() -> bool:
    return (torch.version.cuda is not None) and not _is_neuron()

was changed to

def _is_cuda() -> bool:
    return torch.version.cuda is not None

@hliuca (Contributor) commented Mar 18, 2024

the following function checks nvcc, and it doesn't seem to check HIP.

class cmake_build_ext(build_ext):
    # A dict of extension directories that have been configured.
    did_config = {}

    #
    # Determine number of compilation jobs and optionally nvcc compile threads.
    #
    def compute_num_jobs(self):
        try:
            # os.sched_getaffinity() isn't universally available, so fall back
            # to os.cpu_count() if we get an error here.
            num_jobs = len(os.sched_getaffinity(0))
        except AttributeError:
            num_jobs = os.cpu_count()

        nvcc_cuda_version = get_nvcc_cuda_version()
        if nvcc_cuda_version >= Version("11.2"):
            nvcc_threads = int(os.getenv("NVCC_THREADS", 8))
            num_jobs = max(1, round(num_jobs / (nvcc_threads / 4)))
        else:
            nvcc_threads = None

        return num_jobs, nvcc_threads

@bnellnm (Contributor, Author) commented Mar 18, 2024

(Quoting @hliuca's comment above about compute_num_jobs not checking for HIP.)

I think a check for _is_cuda() in compute_num_jobs should fix the problem.
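
A rough sketch of that guard (an illustration, not the exact patch that later landed):

def compute_num_jobs(self):
    try:
        # os.sched_getaffinity() isn't universally available, so fall back
        # to os.cpu_count() if we get an error here.
        num_jobs = len(os.sched_getaffinity(0))
    except AttributeError:
        num_jobs = os.cpu_count()

    nvcc_threads = None
    if _is_cuda():
        # Only query nvcc when building for CUDA; ROCm/HIP builds skip this.
        nvcc_cuda_version = get_nvcc_cuda_version()
        if nvcc_cuda_version >= Version("11.2"):
            nvcc_threads = int(os.getenv("NVCC_THREADS", 8))
            num_jobs = max(1, round(num_jobs / (nvcc_threads / 4)))

    return num_jobs, nvcc_threads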

@hliuca (Contributor) commented Mar 18, 2024

(Quoting the exchange above.)

Yes. Also, _is_cuda should be restored... the new version doesn't work :-)

def _is_cuda() -> bool:
    return (torch.version.cuda is not None) and not _is_neuron()

@bnellnm (Contributor, Author) commented Mar 18, 2024

(Quoting the exchange above.)

This PR didn't change the definition of _is_cuda(); that change was from a prior PR, #2671.

@bnellnm (Contributor, Author) commented Mar 18, 2024

I think PR #3481 will fix the problem, cc @simon-mo

Closes #2654: Call for Help: Proper Build System (CMake, Bazel, etc).