
[Kernel] vLLM Windows CUDA support #14891


Closed

Conversation


@SystemPanic commented Mar 16, 2025

This PR fixes #2309 #2242 #669 #5086 #5631 #1685 #179 and includes:

  • vLLM CUDA support for Windows (with updated Python code, install setup and compiled Kernels)
  • Add compatibility with PyTorch nightly / source-compilation builds (torch version detection changed to major/minor; see the sketch after this list)
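
As a rough illustration of the major/minor check (a minimal sketch with my own naming, not the PR's actual code), the point is that a nightly or source build still satisfies a stable requirement:

from packaging.version import Version
import torch

def torch_version_matches(required: str) -> bool:
    # Compare only (major, minor), so nightly / source builds still match.
    installed = Version(torch.__version__).release[:2]
    return installed == Version(required).release[:2]

# A nightly such as 2.7.0.devYYYYMMDD+cu126 would still satisfy "2.7.0".
print(torch_version_matches("2.7.0"))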

A FlashInfer build for Windows will be added today in a PR to their repository.

Kernel changes are the minimum needed to make the MSVC compiler happy and don't change any kernel functionality. Special mention goes to GPTQ_Marlin, where the excessive else-if clauses inside a single function have been split into smaller functions to avoid a C1061 error.

Instructions for Windows build:

Visual Studio 2019 or newer is required to launch the x64 compiler environment. The installation path is referred to in the instructions as VISUAL_STUDIO_INSTALL_PATH.

The CUDA path will be found automatically if you have the bin folder in your PATH, or have the CUDA installation path set in well-known environment variables like CUDA_ROOT, CUDA_HOME or CUDA_PATH (see the sketch below).

If none of these are present, make sure to set the environment variable before starting the build:
set CUDA_ROOT=CUDA_INSTALLATION_PATH
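
For reference, the lookup described above amounts to roughly the following (a simplified sketch of the detection order, not the exact setup code; the helper name is mine):

import os
import shutil
from pathlib import Path

def find_cuda_root() -> Path | None:
    # 1) Well-known environment variables pointing at the CUDA install dir.
    for var in ("CUDA_ROOT", "CUDA_HOME", "CUDA_PATH"):
        value = os.environ.get(var)
        if value and Path(value).is_dir():
            return Path(value)
    # 2) Fall back to nvcc found via PATH (i.e. <CUDA>\bin is in PATH).
    nvcc = shutil.which("nvcc")
    return Path(nvcc).parent.parent if nvcc else None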

  1. Open a Command Line (cmd.exe)
  2. Clone the vLLM repository (for example, to C:\vllm)
  3. Execute (in cmd) VISUAL_STUDIO_INSTALL_PATH\VC\Auxiliary\Build\vcvarsall.bat x64
  4. Change the working directory to the cloned repository path, for example: cd C:\vllm
  5. Set the following variables:
set DISTUTILS_USE_SDK=1
set VLLM_TARGET_DEVICE=cuda
set MAX_JOBS=10 (or your desired number to speed up compilation)

#Optional variables:

#To include cuDSS (only if you have cuDSS installed)
set USE_CUDSS=1
set CUDSS_LIBRARY_PATH=PATH_TO_CUDSS_INSTALL_DIR\lib\12
set CUDSS_INCLUDE_PATH=PATH_TO_CUDSS_INSTALL_DIR\include

#To include cuSPARSELt (only if you have cuSPARSELt installed)
set CUSPARSELT_INCLUDE_PATH=PATH_TO_CUSPARSELT_INSTALL_DIR\include 
set USE_CUSPARSELT=1

#To include cuDNN:
set USE_CUDNN=1

#Flash Attention v3 build has been disabled on WSL2 and Windows due to the compiler being killed on WSL2 and extremely long compile times on Windows. Hopper is not available on Windows, so FA3 makes no sense there anyway (the gating is sketched below).
#The build can be forcefully enabled using the following environment variable:
set VLLM_FORCE_FA3_WINDOWS_BUILD=1
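
The gating described above boils down to something like the following (a sketch of the idea only; detecting WSL2 via /proc/version is my assumption, not necessarily how the PR does it):

import os
import platform

def fa3_build_enabled() -> bool:
    # Forcing the build always wins.
    if os.environ.get("VLLM_FORCE_FA3_WINDOWS_BUILD") == "1":
        return True
    # Skipped on native Windows.
    if platform.system() == "Windows":
        return False
    # Assumption: detect WSL2 from the kernel string in /proc/version.
    try:
        with open("/proc/version") as f:
            return "microsoft" not in f.read().lower()
    except OSError:
        return True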

The next steps are the same as building from source with GPU support at section "Full build (with compilation)": https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#full-build-with-compilation

As a note, some sm100 kernels in Cutlass v3.8.0 have compilation errors on Windows. A fix has been submitted to Nvidia (see NVIDIA/cutlass#2167).

Until Nvidia accepts that PR, and only for Windows environments, FetchContent_Declare will clone Cutlass v3.8.0 from a branch with the fix. Feel free to remove that part once the Nvidia PR has been merged (keep the rest of the changes for cuBLAS and VLLM_GPU_FLAGS).


FIX #2309
FIX #2242
FIX #669
FIX #5086
FIX #5631
FIX #1685
FIX #179

Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Collaborator

@houseroad left a comment


Having Windows support is nice. Do we have a plan to introduce Windows CI? Otherwise, it's easy for it to get broken again.

Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
@SystemPanic
Author

Having Windows support is nice. Do we have a plan to introduce Windows CI? Otherwise, it's easy for it to get broken again.

Yes, I agree. I don't expect an implementation to be difficult using Buildkite; it supports Windows self-hosted agents for private servers and AWS EC2 instances.

Another option is using GitHub-hosted runners only for Windows. An example of an optimized workflow with CUDA 12.4: karpathy/llm.c#401

Collaborator

@simon-mo left a comment


I would like to block this PR for an RFC discussion first, mainly to answer the questions of maintenance burden and feature popularity. Do we expect users to use vLLM on Windows? How heavy is the maintenance and porting cost?

@houseroad
Collaborator

The key is some commitment to Windows support for the long run. Otherwise, it's very easy to end up in a broken state.

@SystemPanic
Author

I would like to block this PR for an RFC discussion first, mainly to answer the questions of maintenance burden and feature popularity. Do we expect users to use vLLM on Windows? How heavy is the maintenance and porting cost?

Hi @simon-mo

Done.

I'm quite busy and don't have the time right now to discuss the RFC, so I'll let the community debate and show its interest in this.

Collaborator

@Isotr0py left a comment


I'm glad to see vLLM support Windows, but I'm afraid that modifying kernels to support the MSVC toolchain will increase the cost of porting new kernels.

I wonder if the Windows GNU toolchain can compile the kernels without these kernel modifications?

Comment on lines +342 to +345
if platform.system() == "Windows":
    parent_process.send_signal(signal.SIGTERM)
else:
    parent_process.send_signal(signal.SIGUSR1)
Collaborator


I think we should have a helper function that sends a terminate signal compatible with both Windows and Unix, instead of having if platform.system() == "Windows": ... else: ... everywhere.
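
Something along these lines would do it (a sketch of the suggested helper; the name is just illustrative):

import platform
import signal

def send_terminate_signal(process) -> None:
    # SIGUSR1 does not exist on Windows, so fall back to SIGTERM there.
    if platform.system() == "Windows":
        process.send_signal(signal.SIGTERM)
    else:
        process.send_signal(signal.SIGUSR1)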

Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
@SystemPanic
Author

While the PR / RFC is being discussed, I will release Windows wheels from time to time in the project fork.

The project fork doesn't mean official support, as I don't have the time to keep all the changes up to date for new wheels.

To show interest in vLLM on Windows, join the discussion at the RFC.

@Panchovix

Panchovix commented Mar 22, 2025

While the PR / RFC is being discussed, I will release Windows wheels from time to time in the project fork.

The project fork doesn't mean official support, as I don't have the time to keep all the changes up to date for new wheels.

To show interest in vLLM on Windows, join the discussion at the RFC.

This is amazing, many thanks! I will try to build it on my system (Blackwell 2.0 + Ada Lovelace + Ampere) on Windows and will comment on how it goes.

@yurii-sio2

My vote for supporting the Windows version.


mergify bot commented Apr 8, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @SystemPanic.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 8, 2025
@AlpinDale
Contributor

I tried doing this a while ago at aphrodite-engine/aphrodite-engine#790. It was too much work, and the maintenance burden was too great to keep it working for future releases. I generally don't recommend it. Users have better options (e.g. llamacpp or one of its forks) for Windows. I also noticed that WSL was faster than native Windows execution, so that makes this effort even more redundant, imo.

@SystemPanic
Author

@AlpinDale

I disagree.

No one is requesting support for each commit done to the repo, only for releases.

I maintain the Windows fork, and fixing incompatibilities when the vLLM team publishes a new release takes me, depending on the quantity of changes, 1-2 hours.

It would be significantly less, around 30 minutes per release, or no work at all, if the fork were merged and only build errors had to be fixed, instead of dealing with git conflicts on barely touched files for each release, over and over again.

Considering 4 releases per month, that is less than 2 hours of work per month.

I understand that the vLLM team has limited resources and needs to focus on new features and bug fixes to stay competitive, considering how fast the AI field evolves and the craziness of new models released each week.

I also noticed that WSL was faster than native Windows execution

That's not true, not for the Windows vLLM fork.

I will close this PR, as it is outdated and the discussion has moved to an RFC.
