
[Kernel] vLLM Windows CUDA support #14891


Closed

Conversation


@SystemPanic commented Mar 16, 2025

This PR fixes #2309 #2242 #669 #5086 #5631 #1685 #179 and includes:

  • vLLM CUDA support for Windows (with updated Python code, install setup and compiled Kernels)
  • Add compatibility with PyTorch nightly / source-compilation builds (torch version detection changed to major/minor; see the sketch after this list)
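
As a rough illustration of the major/minor check (a minimal sketch with my own naming, not the PR's actual code), the point is that a nightly or source build still satisfies a stable requirement:

from packaging.version import Version
import torch

def torch_version_matches(required: str) -> bool:
    # Compare only (major, minor), so nightly / source builds still match.
    installed = Version(torch.__version__).release[:2]
    return installed == Version(required).release[:2]

# A nightly such as 2.7.0.devYYYYMMDD+cu126 would still satisfy "2.7.0".
print(torch_version_matches("2.7.0"))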

A FlashInfer build for Windows will be added today in a PR to their repository.

Kernel changes are the minimum needed to make the MSVC compiler happy and don't change any kernel functionality. Special mention goes to GPTQ_Marlin, where the excessive else-if clauses inside a single function have been split into smaller functions to avoid a C1061 error.

Instructions for Windows build:

Visual Studio 2019 or newer is required to launch the x64 compiler environment. The installation path is referred to in the instructions as VISUAL_STUDIO_INSTALL_PATH.

The CUDA path will be found automatically if you have the bin folder in your PATH, or have the CUDA installation path set in well-known environment variables like CUDA_ROOT, CUDA_HOME or CUDA_PATH (see the sketch below).

If none of these are present, make sure to set the environment variable before starting the build:
set CUDA_ROOT=CUDA_INSTALLATION_PATH
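
For reference, the lookup described above amounts to roughly the following (a simplified sketch of the detection order, not the exact setup code; the helper name is mine):

import os
import shutil
from pathlib import Path

def find_cuda_root() -> Path | None:
    # 1) Well-known environment variables pointing at the CUDA install dir.
    for var in ("CUDA_ROOT", "CUDA_HOME", "CUDA_PATH"):
        value = os.environ.get(var)
        if value and Path(value).is_dir():
            return Path(value)
    # 2) Fall back to nvcc found via PATH (i.e. <CUDA>\bin is in PATH).
    nvcc = shutil.which("nvcc")
    return Path(nvcc).parent.parent if nvcc else None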

  1. Open a Command Line (cmd.exe)
  2. Clone the vLLM repository (for example, to C:\vllm)
  3. Execute (in cmd) VISUAL_STUDIO_INSTALL_PATH\VC\Auxiliary\Build\vcvarsall.bat x64
  4. Change the working directory to the cloned repository path, for example: cd C:\vllm
  5. Set the following variables:
set DISTUTILS_USE_SDK=1
set VLLM_TARGET_DEVICE=cuda
set MAX_JOBS=10 (or your desired number to speed up compilation)

#Optional variables:

#To include cuDSS (only if you have cuDSS installed)
set USE_CUDSS=1
set CUDSS_LIBRARY_PATH=PATH_TO_CUDSS_INSTALL_DIR\lib\12
set CUDSS_INCLUDE_PATH=PATH_TO_CUDSS_INSTALL_DIR\include

#To include cuSPARSELt (only if you have cuSPARSELt installed)
set CUSPARSELT_INCLUDE_PATH=PATH_TO_CUSPARSELT_INSTALL_DIR\include 
set USE_CUSPARSELT=1

#To include cuDNN:
set USE_CUDNN=1

#Flash Attention v3 build has been disabled on WSL2 and Windows due to the compiler being killed on WSL2 and extremely long compile times on Windows. Hopper is not available on Windows, so FA3 makes no sense there anyway (the gating is sketched below).
#The build can be forcefully enabled using the following environment variable:
set VLLM_FORCE_FA3_WINDOWS_BUILD=1
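
The gating described above boils down to something like the following (a sketch of the idea only; detecting WSL2 via /proc/version is my assumption, not necessarily how the PR does it):

import os
import platform

def fa3_build_enabled() -> bool:
    # Forcing the build always wins.
    if os.environ.get("VLLM_FORCE_FA3_WINDOWS_BUILD") == "1":
        return True
    # Skipped on native Windows.
    if platform.system() == "Windows":
        return False
    # Assumption: detect WSL2 from the kernel string in /proc/version.
    try:
        with open("/proc/version") as f:
            return "microsoft" not in f.read().lower()
    except OSError:
        return True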

The next steps are the same as building from source with GPU support at section "Full build (with compilation)": https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#full-build-with-compilation

As a note, some sm100 kernels in Cutlass v3.8.0 have compilation errors on Windows. A fix has been submitted to Nvidia (see NVIDIA/cutlass#2167).

Until Nvidia accepts that PR, and only for Windows environments, FetchContent_Declare will clone Cutlass v3.8.0 from a branch with the fix. Feel free to remove that part once the Nvidia PR has been merged (keep the rest of the changes for cuBLAS and VLLM_GPU_FLAGS).


FIX #2309
FIX #2242
FIX #669
FIX #5086
FIX #5631
FIX #1685
FIX #179

Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Collaborator

@houseroad left a comment


Having Windows support is nice. Do we have a plan to introduce Windows CI? Otherwise, it's easy for it to get broken again.

Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
@SystemPanic
Author

Having Windows support is nice. Do we have a plan to introduce Windows CI? Otherwise, it's easy for it to get broken again.

Yes, I agree. I don't expect an implementation to be difficult using Buildkite; it supports Windows self-hosted agents for private servers and AWS EC2 instances.

Another option is using GitHub-hosted runners only for Windows. An example of an optimized workflow with CUDA 12.4: karpathy/llm.c#401

Collaborator

@simon-mo left a comment


I would like to block this PR for an RFC discussion first, mainly to answer the questions of maintenance burden and feature popularity. Do we expect users to use vLLM on Windows? How heavy is the maintenance and porting cost?

@houseroad
Collaborator

The key is some commitment to Windows support for the long run. Otherwise, it's very easy to end up in a broken state.

@SystemPanic
Author

I would like to block this PR for an RFC discussion first, mainly to answer the questions of maintenance burden and feature popularity. Do we expect users to use vLLM on Windows? How heavy is the maintenance and porting cost?

Hi @simon-mo

Done.

I'm quite busy and don't have the time right now to discuss the RFC, so I'll let the community debate and show its interest in this.

Collaborator

@Isotr0py left a comment


I'm glad to see vLLM support Windows, but I'm afraid that modifying kernels to support the MSVC toolchain will increase the cost of porting new kernels.

I wonder if the Windows GNU toolchain can compile the kernels without these kernel modifications?

Comment on lines +342 to +345
if platform.system() == "Windows":
    parent_process.send_signal(signal.SIGTERM)
else:
    parent_process.send_signal(signal.SIGUSR1)
Collaborator


I think we should have a helper function that sends a terminate signal compatible with both Windows and Unix, instead of having if platform.system() == "Windows": ... else: ... everywhere.
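
Something along these lines would do it (a sketch of the suggested helper; the name is just illustrative):

import platform
import signal

def send_terminate_signal(process) -> None:
    # SIGUSR1 does not exist on Windows, so fall back to SIGTERM there.
    if platform.system() == "Windows":
        process.send_signal(signal.SIGTERM)
    else:
        process.send_signal(signal.SIGUSR1)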

Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
@SystemPanic
Author

While the PR / RFC is being discussed, I will release Windows wheels from time to time in the project fork.

The project fork doesn't mean official support, as I don't have the time to keep all the changes up to date for new wheels.

To show interest in vLLM on Windows, join the discussion at the RFC.

@Panchovix

Panchovix commented Mar 22, 2025

While the PR / RFC is being discussed, I will release Windows wheels from time to time in the project fork.

The project fork doesn't mean official support, as I don't have the time to keep all the changes up to date for new wheels.

To show interest in vLLM on Windows, join the discussion at the RFC.

This is amazing, many thanks! I will try to build it on my system (Blackwell 2.0 + Ada Lovelace + Ampere) on Windows and will comment on how it goes.

@yurii-sio2

My vote for supporting the Windows version.


mergify bot commented Apr 8, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @SystemPanic.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 8, 2025
@AlpinDale
Contributor

I tried doing this a while ago at aphrodite-engine/aphrodite-engine#790. It was too much work, and the maintenance burden was too great to keep it working for future releases. I generally don't recommend it. Users have better options (e.g. llamacpp or one of its forks) for Windows. I also noticed that WSL was faster than native Windows execution, so that makes this effort even more redundant, imo.

@SystemPanic
Author

@AlpinDale

I disagree.

No one is requesting support for each commit done to the repo, only for releases.

I maintain the Windows fork, and fixing incompatibilities when the vLLM team publishes a new release takes me, depending on the quantity of changes, 1-2 hours.

It would be significantly less, around 30 minutes per release, or no work at all, if the fork were merged and only build errors had to be fixed, instead of dealing with git conflicts on barely touched files for each release, over and over again.

Considering 4 releases per month, that is less than 2 hours of work per month.

I understand that the vLLM team has limited resources and needs to focus on new features and bug fixes to stay competitive, considering how fast the AI field evolves and the craziness of new models released each week.

I also noticed that WSL was faster than native Windows execution

That's not true, not for the Windows vLLM fork.

I will close this PR, as it is outdated and the discussion has moved to an RFC.
