Integrate kleidiAI release v0.3.0 into MNN 2.9.6 #2995

xhzheng1895 · 2024-08-16T01:06:46Z

The KleidiAI files are placed in source/backend/cpu/arm/kleidiAI/kai. They were downloaded from Arm's GitLab and are unchanged. They may later be removed from the tree and downloaded at build time instead.

MNNKleidiAI.cpp is the interface between MNN and KleidiAI.

Functions in class DenseConvInt8TiledExecutor (in ConvInt8TiledExecutor.cpp) are rewritten to call KleidiAI functions. A separate execution may be implemented later.

The changes to GeometryConvUtils.cpp and ShapeTensorConvert.cpp make the input and output of DenseConvInt8TiledExecutor NCHW rather than NC4HW4, which avoids redundant pack/unpack and improves performance.
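For readers unfamiliar with the two layouts, here is a minimal sketch of the kind of repacking that keeping NCHW avoids. It assumes NC4HW4 stores channels in zero-padded blocks of 4, laid out as [N, ceil(C/4), H, W, 4]. The helper name `packNC4HW4` is hypothetical and this is not MNN's actual packing code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not MNN code): repack an NCHW float tensor into
// NC4HW4, i.e. [N, ceil(C/4), H, W, 4], zero-padding the last channel block.
std::vector<float> packNC4HW4(const std::vector<float>& src,
                              std::size_t n, std::size_t c,
                              std::size_t h, std::size_t w) {
    assert(src.size() == n * c * h * w);
    const std::size_t c4 = (c + 3) / 4;   // number of 4-channel blocks
    const std::size_t area = h * w;       // spatial size per channel
    std::vector<float> dst(n * c4 * area * 4, 0.0f);
    for (std::size_t b = 0; b < n; ++b) {
        for (std::size_t ch = 0; ch < c; ++ch) {
            for (std::size_t i = 0; i < area; ++i) {
                const std::size_t srcIdx = (b * c + ch) * area + i;
                const std::size_t dstIdx =
                    ((b * c4 + ch / 4) * area + i) * 4 + ch % 4;
                dst[dstIdx] = src[srcIdx];
            }
        }
    }
    return dst;
}
```

For a 1x5x1x2 input, the packed buffer holds 16 elements, 6 of which are padding. Every element is copied once on the way in, and a matching unpack would be needed on the way out.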

CLAassistant · 2024-08-16T01:06:52Z

All committers have signed the CLA.

wangzhaode · 2024-08-20T03:19:04Z

I tested the following two models on an M3 chip, and the results are incorrect:

https://modelscope.cn/models/zhaode/Qwen2-7B-Instruct-MNN
https://modelscope.cn/models/zhaode/Qwen2-1.5B-Instruct-MNN

xhzheng1895 · 2024-08-21T01:28:23Z

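Hi, KleidiAI currently supports only symmetrically quantized models.
Asymmetrically quantized models fall through to some of DenseConvInt8TiledExecutor's original functions. In that case the KAI_CONV_NCHW_IN_OUT macro must be turned off; otherwise the input/output format will not match what the native DenseConvInt8TiledExecutor functions expect.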
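As a rough illustration of the macro's effect, the sketch below treats KAI_CONV_NCHW_IN_OUT as a plain compile-time switch. Only the macro name comes from this thread. Where MNN defines it, how it is consumed, and the helper `denseConvInt8IoFormat` are assumptions made for illustration.

```cpp
#include <string>

// Assumption: the macro is a simple compile-time switch. Leave it undefined
// (as here) when running asymmetrically quantized models.
// #define KAI_CONV_NCHW_IN_OUT

// Hypothetical helper (not MNN code): reports which input/output layout the
// executor would use under the current build configuration.
inline std::string denseConvInt8IoFormat() {
#ifdef KAI_CONV_NCHW_IN_OUT
    return "NCHW";    // layout used by the KleidiAI path
#else
    return "NC4HW4";  // layout expected by the native functions
#endif
}
```

With the macro left undefined, the helper reports NC4HW4, the format the native DenseConvInt8TiledExecutor functions work with.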

wangzhaode · 2024-08-21T08:53:23Z

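OK, I tested a symmetrically quantized model and there were no problems. Decode is faster than with MNN's original implementation.
Testing Qwen2-1.5B-int4 on an M3 Pro with 4 CPU threads gave the following speeds: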

| Implementation | prefill | decode |
| --- | --- | --- |
| MNN | 330 | 75 |
| KleidiAI | 295 | 85 |
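Relative to the MNN baseline, these figures work out to roughly -10.6% for prefill and +13.3% for decode, assuming higher numbers mean faster (the comment does not state the unit). A minimal sketch of that arithmetic, with hypothetical helper names:

```cpp
#include <cmath>

// Hypothetical helper: percentage change of `candidate` relative to `baseline`.
double relativeChangePercent(double baseline, double candidate) {
    return (candidate - baseline) / baseline * 100.0;
}

// Hypothetical helper: round to one decimal place for reporting.
double roundTo1(double x) {
    return std::round(x * 10.0) / 10.0;
}
```

Calling `roundTo1(relativeChangePercent(330, 295))` gives the prefill change and `roundTo1(relativeChangePercent(75, 85))` gives the decode change.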

yiyangfan01 · 2024-08-24T09:32:05Z

Here is the perf data I collected with the same model as @wangzhaode, on a Redmi K60 Ultra (MTK D9300, 16 GB RAM) with 4 threads.
Prefill improves by 57% and decode by 28%.

xhzheng1895 closed this Aug 21, 2024

xhzheng1895 reopened this Aug 21, 2024

xhzheng1895 and others added 4 commits October 22, 2024 14:29

Bugfix of thread workload.

95a6e41

Update mnn_kleidiai interface.

644f22f

Update MNN to latest version

6f5be72

xhzheng1895 force-pushed the mnn_kai branch from 781ea88 to 6f5be72 Compare October 22, 2024 06:48

xhzheng1895 changed the title ~~Integrate kleidiAI release v0.1.0 into MNN 2.9.3~~ Integrate kleidiAI release v0.3.0 into MNN 2.9.6 Oct 22, 2024

Refine some code

39dadd0

xhzheng1895 marked this pull request as ready for review October 22, 2024 07:16

xhzheng1895 added 2 commits October 24, 2024 08:23

Refine CmakeList.txt

a1cbdf1

Refine rhs pack

2811524

wangzhaode self-assigned this Oct 28, 2024

add acthalf and blockwise condition in canAccelerate.

8f6a123

wangzhaode merged commit 630d593 into alibaba:master Oct 28, 2024
6 checks passed
