[Hardware][Ascend] Add Ascend NPU backend #8054
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
🚀 |
Is there any document on how to use it? |
This work is not ready yet; if you want to develop this together, follow this:
|
Thank you very much, I'll try it. |
@wyzanski There is a fatal error from git; I think you may need to recheck your git config. |
Looking forward to support for domestic hardware!
Co-authored-by: MengqingCao <cmq0113@163.com>
force-pushed from 6f89d38 to 6ae737e
Thanks for supporting domestic hardware! |
* pad slot indices
* use parameter passing instead of a global variable to control whether pad length is calculated in the sampling
TODO:
|
Thanks for supporting domestic hardware! Looking forward to the results on the Ascend series; an efficient inference engine has been sorely missing. |
Is online inference supported? |
Do you mean starting an OpenAI-compatible API server? The latest code already supports this, like this:
# start server
vllm serve facebook/opt-125m
# request
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
"model": "facebook/opt-125m",
"prompt": "San Francisco is a",
"max_tokens": 20,
"temperature": 0
}'
# output
{"id":"cmpl-862bb9206aa84004a55c625b75e6dfea","object":"text_completion","created":1726649591,"model":"facebook/opt-125m","choices":[{"index":0,"text":" great place to live. I've lived in San Francisco for a few years now and I've","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":25,"completion_tokens":20}} |
What Ascend NPU devices are currently supported? |
Is the Qwen series of LLMs supported? |
Hi @XYZliang, the 910A is not supported now; we will work on supporting more types of devices. |
@WangxuP We have not checked model correctness yet; here is a simple offline result:
|
Should we install MindIE first? |
Is there a Dockerfile for NPU to build an image? |
Is Qwen2-VL supported on NPU now? Is there a corresponding PR to refer to? |
Supporting the 300I Duo is on our to-do list, but it's not a high priority at the moment. |
force-pushed from 4abc281 to 26429a5
Not supported currently. |
Do you have contact information? Could you share an email address? I'd like to discuss this with you; our company would like to commit two people to this development. Or use xiyuanmak@gmail.com |
@ccly1996 You can refer to vllm/examples/offline_inference_neuron.py Line 29 in cbc2ef5
|
Thanks. Can a model deployed on NPU be accessed through the OpenAI API now? Also, is vLLM's FLASH_ATTN already supported? |
Yes, you can use the OpenAI API server on Ascend NPU now.
Flash attention is supported by operators in |
Is just running vllm serve model enough? Currently, starting with these steps raises the error "cannot import name PoolingParams from vllm" |
You can start a server by running a command like |
@wangshuai09 I got an error when running offline inference from the example; could you give me some advice?
|
I got a warning, is this a problem? |
This does not affect existing functionality; it only prompts you to use the |
The 310P is not supported currently. The args passed into the FA operators are a little different on the 310P, which may cause the incorrect inference results. |
Awesome! Can the 910B/310P be used now? |
I've tested it and it works, but the performance is still somewhat worse than MindIE.
|
I'm also a colleague within Huawei and would like to ask you some questions. How can I contact you? My name is wangtongyu |
@ccly1996 How did you get it running on the 310P? The inference results on my side are still wrong |
I ran it on the 910B.
|
Will vLLM with the Ascend NPU backend become a competitor to MindIE? |
This pull request has merge conflicts that must be resolved before it can be merged. |
I don't think it's competition, because the Ascend NPU backend runs in single-op mode, which uses the attention ops in |
I have some machines with 8 x 910B and 4 x 310P (300I Duo). Anyone who wants to help develop vLLM support for the Ascend NPU backend can contact me or email me; I can offer you the bare-metal machines to use. Thanks again to all of you for developing this project |
As mentioned in #7692, this PR makes the Ascend NPU backend available in vLLM.
Roadmap:
Supported Devices
Install
VLLM_TARGET_DEVICE=npu pip install -e .
to install vllm, then run python examples/offline_inference_npu.py to try offline inference.
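For orientation, here is a minimal sketch of what such an offline-inference script typically looks like, assuming the NPU build installed above; the actual examples/offline_inference_npu.py may differ, and the model name is only an example.
from vllm import LLM, SamplingParams

# A couple of sample prompts and deterministic sampling settings.
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.0, max_tokens=20)

# Load the model and generate completions for all prompts.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)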
Using Dockerfile.npu
Modify --device /dev/davinci0 according to your device.
Collaborators
@MengqingCao @dgy516 @hi-liuyifeng @Lin-Qingyang-Alec @liujie92 @JiasenTian @weiwei567 @JuntongMa @xiangjie
@zhangxy1234 @ldh2020 @Eviannn @agoodnoob @rumoralot
This work is still in the WIP stage.