[Kernel] vLLM Windows CUDA support #14891
Conversation
Having Windows support is nice. Do we have a plan to introduce Windows CI? Otherwise, it's easy for it to get broken again.
Yes, I agree. I don't expect an implementation to be difficult using Buildkite; it has support for Windows self-hosted agents on private servers and AWS EC2 instances. Another way is using GitHub-hosted runners only for Windows. An example of an optimized workflow with CUDA 12.4: karpathy/llm.c#401
I would like to block this PR for an RFC discussion first, mainly to answer the questions of maintenance burden and feature popularity. Do we expect users to use vLLM on Windows? How heavy is the maintenance and porting cost?
The key is some commitment to Windows support for the long run. Otherwise, it's very easy to end up in a broken state.
Hi @simon-mo, done. I'm quite busy and don't have time right now to discuss the RFC, so I'll let the community debate and show its interest in this.
I'm glad to see vLLM support Windows, but I'm afraid that modifying the kernels to support the MSVC toolchain will increase the cost of porting new kernels.
I wonder if the Windows GNU toolchain could compile the kernels without these kernel modifications?
if platform.system() == "Windows":
    parent_process.send_signal(signal.SIGTERM)
else:
    parent_process.send_signal(signal.SIGUSR1)
I think we should have a helper function to send a terminate signal that works on both Windows and Unix, instead of having if platform.system() == "Windows": ... else: ... everywhere.
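A minimal sketch of what such a helper could look like (the name send_terminate_signal is hypothetical, not something this PR defines):

import platform
import signal

def send_terminate_signal(process) -> None:
    # Windows has no SIGUSR1, so fall back to SIGTERM there; on Unix,
    # keep the SIGUSR1 convention the engine code already uses.
    if platform.system() == "Windows":
        process.send_signal(signal.SIGTERM)
    else:
        process.send_signal(signal.SIGUSR1)

Callers could then replace the inline platform check with a single send_terminate_signal(parent_process) call.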
While the PR / RFC is discussed, I will release Windows wheels from time to time in the project fork. The project fork doesn't mean official support, as I don't have the time to keep all the changes up to date for new wheels. To show your interest in vLLM on Windows, join the discussion at the RFC.
This is amazing, many thanks! I will try to build with my system (Blackwell 2.0 + Ada Lovelace + Ampere) on Windows and will comment on how it goes.
My vote is for supporting the Windows version.
This pull request has merge conflicts that must be resolved before it can be merged.
I tried doing this a while ago at aphrodite-engine/aphrodite-engine#790. It was too much work, and the maintenance burden was too great to keep it working for future releases. I generally don't recommend it. Users have better options (e.g. llama.cpp or one of its forks) for Windows. I also noticed that WSL was faster than native Windows execution, which makes this effort even more redundant, imo.
I disagree. No one is requesting support for every commit to the repo, only for releases. I maintain the Windows fork, and fixing incompatibilities when the vLLM team publishes a new release takes me, depending on the quantity of changes, 1-2 hours. It would be significantly less, around 30 minutes per release, or no work at all, if the fork were merged and only build errors had to be fixed, instead of dealing with git conflicts on barely touched files for each release, over and over again. Considering 4 releases per month, that is less than 2 hours of work per month. I understand that the vLLM team has limited resources and needs to focus on new features and bug fixes to stay competitive, considering how fast the AI field evolves and the craziness of new models released each week.
That's not true, not for the Windows vLLM fork. I will close this PR, as it is outdated and the discussion has moved to an RFC.
This PR fixes #2309 #2242 #669 #5086 #5631 #1685 #179 and includes:
A FlashInfer build for Windows will be added today via a PR in its repository.
Kernel changes are the minimum needed to make the MSVC compiler happy and don't change any kernel functionality. Special mention goes to GPTQ_Marlin, where the excessive else-if clauses inside a single function have been split into smaller functions to avoid a C1061 error.
Instructions for Windows build:
Visual Studio 2019 or newer is required to launch the x64 compiler environment. The installation path is referred to in the instructions as VISUAL_STUDIO_INSTALL_PATH.
The CUDA path will be found automatically if you have the bin folder in your PATH, or have the CUDA installation path set in a well-known environment variable such as CUDA_ROOT, CUDA_HOME or CUDA_PATH.
If none of these are present, make sure to set the environment variable before starting the build:
set CUDA_ROOT=CUDA_INSTALLATION_PATH
The next steps are the same as building from source with GPU support, described in the section "Full build (with compilation)": https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#full-build-with-compilation
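Not the PR's actual build code, but a minimal sketch of the CUDA lookup order described above (the helper name find_cuda_root is hypothetical):

import os
import shutil
from pathlib import Path
from typing import Optional

def find_cuda_root() -> Optional[Path]:
    # Prefer the well-known environment variables, if any of them is set.
    for var in ("CUDA_ROOT", "CUDA_HOME", "CUDA_PATH"):
        value = os.environ.get(var)
        if value and Path(value).is_dir():
            return Path(value)
    # Otherwise, derive the root from nvcc when the CUDA bin folder is on PATH.
    nvcc = shutil.which("nvcc")
    if nvcc:
        return Path(nvcc).resolve().parent.parent
    return None

If neither the environment variables nor nvcc on PATH are available, a build script would typically stop with a message asking for CUDA_ROOT to be set, as in the set command above.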
As a note, some sm100 kernels in Cutlass v3.8.0 have compilation errors on Windows. A fix has been submitted to Nvidia (see NVIDIA/cutlass#2167).
Until Nvidia accepts that PR, and only for Windows environments, FetchContent_Declare will clone Cutlass v3.8.0 from a branch with the fix. Feel free to remove that part when the Nvidia PR has been merged (keep the rest of the changes for cuBLAS and VLLM_GPU_FLAGS).
FIX #2309
FIX #2242
FIX #669
FIX #5086
FIX #5631
FIX #1685
FIX #179