Motivation.

vLLM provides an easy-to-use backend access mechanism, and many backends have already been integrated through it. As shown in #6368, #6728, and #6066, many users want to run vLLM on Ascend NPU. The main purpose of this RFC is to follow the existing backend access mechanism and make Ascend NPU available for vLLM.

Proposed Change.

We introduce Ascend Executor/Worker(s), based on the GPU Executor/Worker(s), to handle Ascend runtime management and per-device work on NPU. We also add an Ascend Backend as a replacement for the attention layer; the Paged Attention/Flash Attention ops are implemented there. Because torch_npu has natively supported torch since version 2.1.0, we will keep the implementation consistent with the GPU code and make as few code changes as possible.

Feedback Period.

A month

CC List.

@mgoin
@WoosukKwon

Any Other Things.

Background

Ascend NPU is a range of AI processors built around a Neural Processing Unit. It efficiently handles matrix-matrix multiplication, dot products, and scalar operations. Many projects already support Ascend NPU, such as onnxruntime, deepspeed, and llama.cpp.

MindIE is the Ascend inference engine: a high-performance deep learning inference framework designed for Ascend hardware.

RoadMap

The initial version will include the following:

Ascend Executor
Ascend Worker
Ascend Model Runner
Ascend MindIE Backend
Ascend SingleOps Backend
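To make the proposed split concrete, here is a minimal sketch of how an Ascend executor/worker pair could mirror the existing GPU ones while swapping in an NPU attention backend. This is a hypothetical illustration only: the class names, backend identifiers, and selection function below are assumptions for this RFC, not vLLM's actual API, and the real implementation would delegate device setup to torch_npu.

```python
# Hypothetical sketch: all names here are illustrative assumptions, not
# vLLM's real classes. It mirrors the executor/worker split proposed
# above, with the attention backend chosen per device type.

def select_attn_backend(device_type: str) -> str:
    # The Ascend backend would replace the CUDA attention path; the
    # Paged Attention / Flash Attention ops live behind it.
    backends = {
        "cuda": "FLASH_ATTN",       # existing GPU path
        "npu": "ASCEND_SINGLEOPS",  # proposed Ascend SingleOps backend
    }
    try:
        return backends[device_type]
    except KeyError:
        raise ValueError(f"no attention backend for {device_type!r}")

class WorkerBase:
    """Stand-in for the worker abstraction shared with the GPU path."""
    def __init__(self, device_type: str):
        self.device_type = device_type
        self.attn_backend = select_attn_backend(device_type)

class AscendWorker(WorkerBase):
    """Runs the model on one NPU; real code would use torch_npu here."""
    def __init__(self):
        super().__init__("npu")

class AscendExecutor:
    """Owns the workers, mirroring the GPU executor's responsibilities."""
    def __init__(self, num_workers: int = 1):
        self.workers = [AscendWorker() for _ in range(num_workers)]

executor = AscendExecutor(num_workers=2)
print(executor.workers[0].attn_backend)  # ASCEND_SINGLEOPS
```

Keeping the selection logic in one small function is what lets the rest of the model code stay identical to the GPU path, which is the "least code changes" goal stated above.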