NEGEMMLowpMatrixMultiplyCore: performance issue int8 vs fp16 #1131

eshoguli · 2024-08-01T13:30:09Z

Issues:

The low-precision kernel selected by default is not optimal for the platform described below.
We see only a 30% performance gain for the low-precision kernel vs. fp16 in multithreaded mode. Can you please confirm whether these are the results you expect? We expected a 2x performance gain.

Platform:

system_profiler SPHardwareDataType
   Hardware Overview:

      Model Name: MacBook Pro
      Model Identifier: Mac15,6
      Chip: Apple M3 Pro
      Total Number of Cores: 12 (6 performance and 6 efficiency)
      Memory: 18 GB

Operating System:

ProductName:		macOS
ProductVersion:		14.2.1
BuildVersion:		23C71

Command line

scons arch=arm64-v8.2-a neon=1 opencl=0 openmp=0 cppthreads=0 os=macos data_layout_support=all  build=native asserts=1 --jobs=8 --silent os=macos build=native fixed_format_kernels=True validation_tests=1 examples=1 debug=0

Single thread: cppthreads=0
Multithread: cppthreads=1

Results fp16, default kernel: a64_hybrid_fp16_mla_6x32
Single thread, shapes: 4096x128 * 128x4096:

fp16 median time = 17373 microseconds

Multithread, shapes: 4096x128 * 128x4096:

fp16 median time = 2919 microseconds

Results int8, default selected kernel: a64_hybrid_s8s32_mmla_6x16
Single thread, shapes: 4096x128 * 128x4096:

int8 median time = 12573 microseconds

Multithread, shapes: 4096x128 * 128x4096:

int8 median time = 3595 microseconds

Results int8, manually selected kernel: a64_interleaved_s8s32_mmla_8x12
Single thread, shapes: 4096x128 * 128x4096:

int8 median time = 12598 microseconds

Multithread, shapes: 4096x128 * 128x4096:

int8 median time = 2113 microseconds
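
For easier comparison, the timings reported above can be turned into speedup ratios relative to fp16. The following is only a small sketch of that arithmetic: the dictionary layout and the helper names (`speedup_vs_fp16`, `thread_scaling`) are made up for illustration and are not part of ACL. The numbers are the median times copied from this report.

```python
# Median times in microseconds, copied from the results in this report.
# The dictionary layout and the helper names are illustrative only.
FP16_KERNEL = "a64_hybrid_fp16_mla_6x32"

MEDIAN_US = {
    "a64_hybrid_fp16_mla_6x32": {"single": 17373, "multi": 2919},
    "a64_hybrid_s8s32_mmla_6x16": {"single": 12573, "multi": 3595},
    "a64_interleaved_s8s32_mmla_8x12": {"single": 12598, "multi": 2113},
}


def speedup_vs_fp16(kernel: str, mode: str) -> float:
    """Return fp16 time divided by the kernel's time (>1 means faster than fp16)."""
    return MEDIAN_US[FP16_KERNEL][mode] / MEDIAN_US[kernel][mode]


def thread_scaling(kernel: str) -> float:
    """Return single-thread time divided by multithread time for one kernel."""
    times = MEDIAN_US[kernel]
    return times["single"] / times["multi"]


for kernel in MEDIAN_US:
    print(
        f"{kernel}: "
        f"single x{speedup_vs_fp16(kernel, 'single'):.2f}, "
        f"multi x{speedup_vs_fp16(kernel, 'multi'):.2f} vs fp16; "
        f"thread scaling x{thread_scaling(kernel):.2f}"
    )
```

With these numbers, the default int8 kernel comes out at roughly 0.81x of fp16 in multithreaded mode (that is, slower), while the manually selected kernel is roughly 1.38x faster than fp16 in both modes.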


morgolock · 2024-08-12T10:12:04Z

Hi @eshoguli

What is the data layout used for these workloads when calling into ACL? It would help if you could build ACL with logging=1 so that we can know more details about these workloads.

eshoguli · 2024-08-14T12:06:50Z

You can use the eshoguli:es/neon_gemm_s8s8s32_perf_default branch to reproduce the issue easily.

morgolock added the Question label Aug 4, 2024

morgolock changed the title ~~NEGEMMLowpMatrixMultiplyCore: performance issue~~ NEGEMMLowpMatrixMultiplyCore: performance issue int8 vs fp16 Aug 4, 2024
