NEGEMMLowpMatrixMultiplyCore: performance issue int8 vs fp16 #1131

Open
eshoguli opened this issue Aug 1, 2024 · 2 comments

eshoguli commented Aug 1, 2024

Issues:

  1. The low-precision kernel selected by default is not optimal for the platform described below.
  2. In multithreaded mode, the low-precision kernel is only about 30% faster than fp16. Can you confirm whether these are the results you expect? We expected a 2x performance gain.

Platform:

system_profiler SPHardwareDataType
   Hardware Overview:

      Model Name: MacBook Pro
      Model Identifier: Mac15,6
      Chip: Apple M3 Pro
      Total Number of Cores: 12 (6 performance and 6 efficiency)
      Memory: 18 GB

Operating System:

ProductName:		macOS
ProductVersion:		14.2.1
BuildVersion:		23C71

Command line

scons arch=arm64-v8.2-a neon=1 opencl=0 openmp=0 cppthreads=0 os=macos data_layout_support=all  build=native asserts=1 --jobs=8 --silent os=macos build=native fixed_format_kernels=True validation_tests=1 examples=1 debug=0

Single thread: cppthreads=0
Multithread: cppthreads=1
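
The results in this report are median times in microseconds. The harness that produced them is not shown in the issue, so the following is only a minimal sketch of how such a median could be collected. The names `median_us` and `time_median_us` are hypothetical and are not part of ACL.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Median of a set of timing samples, in microseconds.
// The vector is taken by value because std::nth_element reorders it.
inline double median_us(std::vector<double> samples)
{
    assert(!samples.empty());
    const std::size_t mid = samples.size() / 2;
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
    const double upper = samples[mid];
    if (samples.size() % 2 == 1)
    {
        return upper;
    }
    // Even count: everything before `mid` is <= upper, so the largest of
    // those elements is the lower of the two middle values.
    const double lower = *std::max_element(samples.begin(), samples.begin() + mid);
    return 0.5 * (lower + upper);
}

// Hypothetical harness: calls `fn` `iterations` times and returns the
// median wall-clock time of one call, in microseconds.
template <typename Fn>
double time_median_us(Fn &&fn, int iterations)
{
    assert(iterations > 0);
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(iterations));
    for (int i = 0; i < iterations; ++i)
    {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    return median_us(std::move(samples));
}
```

In a real measurement, `fn` would wrap a single run of the already configured GEMM function, ideally after a few untimed warm-up runs.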

Results fp16, default kernel: a64_hybrid_fp16_mla_6x32
Single thread, shapes: 4096x128 * 128x4096

fp16 time median time = 17373 microsecs

Multithread, shapes: 4096x128 * 128x4096

fp16 time median time = 2919 microsecs

Results int8, default selected kernel: a64_hybrid_s8s32_mmla_6x16
Single thread, shapes: 4096x128 * 128x4096:

int8 time median time = 12573 microsecs

Multithread, shapes: 4096x128 * 128x4096:

int8 time median time = 3595 microsecs

Results int8, manually selected kernel: a64_interleaved_s8s32_mmla_8x12
Single thread, shapes: 4096x128 * 128x4096:

int8 time median time = 12598 microsecs

Multithread, shapes: 4096x128 * 128x4096:

int8 time median time = 2113 microsecs
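
To make the comparison explicit, the medians reported above can be turned into speedup ratios. The sketch below uses only the numbers from this issue. The struct and function names are illustrative, not from ACL.

```cpp
#include <cassert>
#include <cstdio>

// Median times in microseconds, copied from the results in this issue
// (shapes: 4096x128 * 128x4096).
struct Medians
{
    double single_us; // cppthreads=0
    double multi_us;  // cppthreads=1
};

constexpr Medians kFp16{17373.0, 2919.0};        // a64_hybrid_fp16_mla_6x32
constexpr Medians kInt8Default{12573.0, 3595.0}; // a64_hybrid_s8s32_mmla_6x16
constexpr Medians kInt8Manual{12598.0, 2113.0};  // a64_interleaved_s8s32_mmla_8x12

// Speedup of `candidate` over `baseline`: values above 1 mean the candidate is faster.
inline double speedup(double baseline_us, double candidate_us)
{
    assert(candidate_us > 0.0);
    return baseline_us / candidate_us;
}

// Prints int8-vs-fp16 speedups and the single-to-multithread scaling of each kernel.
inline void print_summary()
{
    std::printf("int8 default vs fp16, single: %.2fx\n", speedup(kFp16.single_us, kInt8Default.single_us));
    std::printf("int8 default vs fp16, multi:  %.2fx\n", speedup(kFp16.multi_us, kInt8Default.multi_us));
    std::printf("int8 manual  vs fp16, single: %.2fx\n", speedup(kFp16.single_us, kInt8Manual.single_us));
    std::printf("int8 manual  vs fp16, multi:  %.2fx\n", speedup(kFp16.multi_us, kInt8Manual.multi_us));
    std::printf("thread scaling fp16:          %.2fx\n", speedup(kFp16.single_us, kFp16.multi_us));
    std::printf("thread scaling int8 default:  %.2fx\n", speedup(kInt8Default.single_us, kInt8Default.multi_us));
    std::printf("thread scaling int8 manual:   %.2fx\n", speedup(kInt8Manual.single_us, kInt8Manual.multi_us));
}
```

With these numbers, the default int8 kernel is about 1.38x faster than fp16 in single-thread mode but slower than fp16 in multithreaded mode (about 0.81x), because it scales only about 3.5x across threads while fp16 scales almost 6x. The manually selected int8 kernel scales like fp16 and is about 1.38x faster in both modes, still short of the expected 2x.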
morgolock changed the title from "NEGEMMLowpMatrixMultiplyCore: performance issue" to "NEGEMMLowpMatrixMultiplyCore: performance issue int8 vs fp16" on Aug 4, 2024

@morgolock

Hi @eshoguli

What is the data layout used for these workloads when calling into ACL? It would help if you could build ACL with logging=1 so that we can know more details about these workloads.

@eshoguli
Author

You can use the eshoguli:es/neon_gemm_s8s8s32_perf_default branch to easily reproduce the issue.
