CANN: Update several operators to support FP16 data format #16251
Conversation
Which model size is this speedup for?
Performance improved by 8%–10%. This result is based on our testing with the Qwen2.5 0.5B model using llama-parallel under 10 concurrent requests (we recently had a business case involving the 0.5B model). We also tested Qwen2.5 7B, Qwen3-MoE, and DeepSeek-V2-Lite, where the gains were smaller.

On Ascend, operators such as flash attention (FA) and matrix multiplication (MUL_MAT) are computed in FP16. In llama.cpp, however, intermediate results default to FP32, which introduces a nontrivial casting overhead; using FP16 for intermediate results reduces that cost. We also tried computing these operators directly in FP32, but the higher compute cost made that slower than the cast+FP16 approach.

This PR only modifies the operators so that they support both FP32 and FP16 data types. Fully adopting FP16 as the intermediate type requires further changes in other parts of the code; I will open an issue and a draft PR today to start a discussion on this. #16271
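To make the casting overhead concrete, here is a minimal, self-contained C++ sketch (not the actual CANN backend code; every type and function below is hypothetical) that counts the cast kernels a chain of FP16-only operators would launch with FP32 versus FP16 intermediates:

```cpp
// Hypothetical illustration of the casting overhead discussed above.
// This is NOT the ggml CANN backend API; all names are invented to show
// why FP16 intermediates avoid two casts per operator.
#include <cstddef>
#include <cstdio>

enum class dtype { f32, f16 };

struct tensor {
    dtype  type;
    size_t n_elements;
};

static int n_casts = 0; // count simulated cast kernels launched

static tensor cast_to(const tensor & src, dtype to) {
    ++n_casts;
    return tensor{to, src.n_elements};
}

// Pretend the device kernel only accepts FP16 input and produces FP16 output.
static tensor device_kernel_fp16(const tensor & src) {
    return tensor{dtype::f16, src.n_elements};
}

// Run one FP16-only operator, inserting casts only when required.
static tensor run_op(const tensor & src, dtype intermediate_type) {
    tensor in  = (src.type == dtype::f16) ? src : cast_to(src, dtype::f16);
    tensor out = device_kernel_fp16(in);
    // Cast back only if the graph keeps its intermediates in FP32.
    return (intermediate_type == dtype::f16) ? out : cast_to(out, dtype::f32);
}

int main() {
    const tensor x{dtype::f32, 1024};

    // FP32 intermediates: every operator pays one cast in and one cast out.
    n_casts = 0;
    tensor t = x;
    for (int i = 0; i < 4; ++i) t = run_op(t, dtype::f32);
    std::printf("FP32 intermediates: %d casts\n", n_casts); // prints 8

    // FP16 intermediates: only the first operator casts the FP32 input.
    n_casts = 0;
    t = x;
    for (int i = 0; i < 4; ++i) t = run_op(t, dtype::f16);
    std::printf("FP16 intermediates: %d casts\n", n_casts); // prints 1

    return 0;
}
```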
Tests pass for the modified operators.
Many Ascend operators internally use FP16 precision for computation. If the input data is FP32, it must first be cast to FP16 before computation and then cast back to FP32 afterwards, which introduces unnecessary cast operations. Moreover, FP16 computation requires significantly less work than FP32, leading to noticeable efficiency improvements.

In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended to support multiple data types. Validation on the Qwen2 0.5B model shows correct accuracy and about a 10% performance gain in concurrent scenarios. See also #16270.

Co-authored-by: noemotiovon <757486878@qq.com>
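As a rough illustration of what "extended to support multiple data types" means at the operator level, here is a hedged, self-contained sketch. The real implementation lives in the ggml CANN backend and calls CANN/aclnn kernels, which are not reproduced here; all names below are invented for illustration:

```cpp
// Hypothetical, simplified sketch of an operator choosing its path from the
// tensor data type instead of assuming FP32 input. Not the actual backend code.
#include <cstdio>

enum class tensor_type { F32, F16 };

// An operator such as get_rows / rms_norm / flash_attn_ext dispatches on the
// input type: FP32 still works (with casts), FP16 now runs natively.
static bool run_op(const char * op_name, tensor_type src_type) {
    switch (src_type) {
        case tensor_type::F32:
            // FP32 input: cast to FP16 on device, compute, cast the result back.
            std::printf("%s: FP32 path (cast in, compute in FP16, cast out)\n", op_name);
            return true;
        case tensor_type::F16:
            // FP16 input: compute directly, no casts needed.
            std::printf("%s: native FP16 path (no casts)\n", op_name);
            return true;
    }
    // Unsupported type: a backend would reject the node (e.g. in its
    // supports_op check) and let a fallback handle it.
    return false;
}

int main() {
    run_op("get_rows",       tensor_type::F16);
    run_op("rms_norm",       tensor_type::F16);
    run_op("flash_attn_ext", tensor_type::F32);
    return 0;
}
```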