Description
Motivation.
Use cases:
- AI open source community: Currently, llama.cpp and ollama are the main backends for serving open-source AI models to non-developer users, and they are what Oobabooga, LM Studio and others build on. Non-dev AI users are overwhelmingly on Windows.
One thing I hate about current AI, from a normal user's point of view, is the constant reinvention of the wheel.
We can't have 25 server types, 5,000 kernels with different optimizations depending on which engine and server you choose, and 10 quantization types. Such fragmentation causes fatigue, and non-dev users get lost when trying to get into the AI world and end up giving up on it.
"Do you want to try Phi-4? Install this project, and his 400 dependencies.
Do you want to try Deepseek? Install that project and his 500 dependencies.
Do you want to try Llama Coder? Nice, install that another one project with 800 dependencies.
Oh wait, that last one requires Cuda 11.8, but you have Cuda 12.4 that is required by the first one. Oh! The second one is broken because need Cuda 12.6 newest features."
What a mess.
vLLM is the best server for serving AI models, and I think it's a must for it to be adopted by those UI projects, so they can stop reinventing the wheel and focus on shipping AI features and good UX instead of spending all day replicating serving techniques, updating kernels, adapting their code to new models, etc.
Just make vLLM the standard. That will attract more developers to the project.
- Developers working with Windows: This is my case. Due to requirements from one of my companies, I'm stuck supporting and working on Windows, and I can't switch between Mac/Ubuntu and Windows all day.
For AI development, WSL2 is not a realistic option. PyCharm and Visual Studio Code don't play well with it (especially the debugger), and it's only useful for running Linux server tests before deploying code to pre-production environments.
- Windows servers: It's pretty common for public-sector orgs to run Windows Server and Azure. Sooner or later there will be demand for this; IBM knows it, that's its business.
- Students: Windows is the most used OS in schools. Do we want new generations to experiment and become AI users, or not?
Maintenance and porting costs:
All the modifications have been done in the PR; contributors just need to check how it's done there to keep supporting Windows.
For example: when using process signaling, take care of which signals are available on Windows; know which DLL names have to be loaded as a library; keep the Windows modifications of some kernels when developing new ones; and mind some basics such as using `static constexpr` instead of `constexpr` when variables are used inside lambda functions, etc. (a couple of these are sketched below).
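As an illustration only (a minimal sketch, not code from the PR; `load_cuda_driver` and `scale_logits` are hypothetical helpers), two of those points look roughly like this in C++:

```cpp
#include <algorithm>
#include <vector>

#ifdef _WIN32
  #include <windows.h>
#else
  #include <dlfcn.h>
#endif

// DLL names and the loading API differ per platform: the CUDA driver ships as
// nvcuda.dll on Windows and libcuda.so.1 on Linux.
void* load_cuda_driver() {
#ifdef _WIN32
  return reinterpret_cast<void*>(LoadLibraryA("nvcuda.dll"));
#else
  return dlopen("libcuda.so.1", RTLD_NOW | RTLD_GLOBAL);
#endif
}

// MSVC can reject a plain constexpr local that is read inside a lambda without
// an explicit capture, while GCC/Clang accept it. Giving the constant static
// storage duration avoids the capture entirely and compiles everywhere.
void scale_logits(std::vector<float>& logits) {
  // constexpr float kScale = 0.125f;      // may trigger MSVC capture errors
  static constexpr float kScale = 0.125f;  // portable across MSVC and GCC/Clang
  std::for_each(logits.begin(), logits.end(),
                [](float& v) { v *= kScale; });
}
```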
For CI build costs, the cheapest option, given the number of PRs in the project, is a bare-metal Windows Server with a self-hosted Buildkite agent and Docker, running a Windows container image that installs the whole environment with choco (CUDA + Python) and pip (Torch and the rest of the dependencies).
However, for simplicity, I suggest a GitHub-hosted runner (it's quite cheap), as it doesn't require much up-front knowledge or developer time to integrate into the current CI: just write a very similar ci.yml with choco install python312, the environment variables, and the pip dependencies.
Maybe it could run only in the full build check after review, instead of in the fast check.
Proposed Change.
Windows CUDA support is done for v0.7.4.
Waiting for approval and merge of PR #14891.
Feedback Period.
No response
CC List.
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.