[Hardware][Ascend] Add Ascend NPU backend #8054
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
🚀 |
Is there any document on how to use it? |
This work is not ready yet; if you want to develop this together, follow this:
|
Thank you very much, I'll try it. |
@wyzanski There is a fatal error from git; I think you may need to recheck your git config. |
Looking forward to support for domestic hardware!
Co-authored-by: MengqingCao <cmq0113@163.com>
force-pushed from 6f89d38 to 6ae737e
Thanks for supporting domestic hardware! |
* pad slot indices
* use parameter passing instead of a global variable to control whether pad length is calculated in the sampling
TODO:
|
Thanks for supporting domestic hardware! Looking forward to the results on the Ascend series; an efficient inference engine has been sorely missing. |
Is online inference supported? |
Do you mean starting an OpenAI-compatible API server? The latest code already supports this, like this:
# start server
vllm serve facebook/opt-125m
# request
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
"model": "facebook/opt-125m",
"prompt": "San Francisco is a",
"max_tokens": 20,
"temperature": 0
}'
# output
{"id":"cmpl-862bb9206aa84004a55c625b75e6dfea","object":"text_completion","created":1726649591,"model":"facebook/opt-125m","choices":[{"index":0,"text":" great place to live. I've lived in San Francisco for a few years now and I've","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":25,"completion_tokens":20}} |
What Ascend NPU devices are currently supported? |
Is the Qwen series of LLMs supported? |
Hi @XYZliang, the 910A is not supported now; we will work on supporting more types of devices. |
@WangxuP We have not checked model correctness yet; here is a simple offline result:
|
Should we install MindIE first? |
Is there a Dockerfile for NPU to build an image? |
Is Qwen2-VL supported on NPU now? Is there a corresponding PR to refer to? |
Supporting the 300I Duo is on our to-do list, but it's not a high priority at the moment. |
force-pushed from 4abc281 to 26429a5
Not supported currently. |
Do you have contact information? Could you share an email address? I'd like to discuss this with you; our company would like to commit two people to this development. Or use xiyuanmak@gmail.com |
@ccly1996 You can refer to vllm/examples/offline_inference_neuron.py Line 29 in cbc2ef5
|
Thanks. Can a model deployed on NPU be accessed through the OpenAI API now? Also, is vLLM's FLASH_ATTN already supported? |
Yes, you can use the OpenAI API server on Ascend NPU now.
Flash attention is supported by operators in |
Is just running vllm serve model enough? Currently, starting with these steps raises the error "cannot import name PoolingParams from vllm" |
You can start a server by running a command like |
@wangshuai09 I got an error when running offline inference from the example; could you give me some advice?
|
I got a warning, is this a problem? |
This does not affect existing functionality; it only prompts you to use the |
The 310P is not supported currently. The args passed into the FA operators are a little different on the 310P, which may cause the incorrect inference results. |
Awesome! Can the 910B/310P be used now? |
I've tested it and it works, but the performance is still somewhat worse than MindIE.
|
I'm also a colleague within Huawei and would like to ask you some questions. How can I contact you? My name is wangtongyu |
@ccly1996 How did you get it running on the 310P? The inference results on my side are still wrong |
I ran it on the 910B.
|
Will vLLM with the Ascend NPU backend become a competitor to MindIE? |
This pull request has merge conflicts that must be resolved before it can be merged. |
I don't think it's competition, because the Ascend NPU backend runs in single-op mode, which uses the attention ops in |
I have some machines with 8 x 910B and 4 x 310P (300I Duo). Anyone who wants to help develop vLLM support for the Ascend NPU backend can contact me or email me; I can offer you the bare-metal machines to use. Thanks again to all of you for developing this project |
As mentioned in #7692, this PR makes the Ascend NPU backend available in vLLM.
Roadmap:
Supported Devices
Install
VLLM_TARGET_DEVICE=npu pip install -e .
to install vllm, then run python examples/offline_inference_npu.py to try offline inference.
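For orientation, here is a minimal sketch of what such an offline-inference script typically looks like, assuming the NPU build installed above; the actual examples/offline_inference_npu.py may differ, and the model name is only an example.
from vllm import LLM, SamplingParams

# A couple of sample prompts and deterministic sampling settings.
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.0, max_tokens=20)

# Load the model and generate completions for all prompts.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)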
Using Dockerfile.npu
Modify --device /dev/davinci0 according to your device.
Collaborators
@MengqingCao @dgy516 @hi-liuyifeng @Lin-Qingyang-Alec @liujie92 @JiasenTian @weiwei567 @JuntongMa @xiangjie
@zhangxy1234 @ldh2020 @Eviannn @agoodnoob @rumoralot
This work is still in the WIP stage.