@copybara-service copybara-service bot commented Nov 17, 2025

Copybara import of the project:

--
c69ccdb by Gian Marco Iodice gianmarco.iodice@arm.com:

Prototype: Add support for fp16 iGEMM with SME2

  • Initial prototype to enable fp16 iGEMM with SME2 in conv2d

Signed-off-by: Gian Marco Iodice gianmarco.iodice@arm.com

--
a3537a1 by Gian Marco Iodice gianmarco.iodice@arm.com:

Include missing files

Signed-off-by: Gian Marco Iodice gianmarco.iodice@arm.com

--
232826c by Gian Marco Iodice gianmarco.iodice@arm.com:

Update FP16 iGEMM based on review comments

Signed-off-by: Gian Marco Iodice gianmarco.iodice@arm.com

--
03bccaa by Jonathan Clohessy jonathan.clohessy@arm.com:

Updated FP16 iGemm Review with Fixes

Signed-off-by: Jonathan Clohessy jonathan.clohessy@arm.com

--
9cd6e88 by Jonathan Clohessy Jonathan.Clohessy@arm.com:

Fix rebase issues

Signed-off-by: Jonathan Clohessy Jonathan.Clohessy@arm.com

--
7eb618d by Misha Gutman aelphy@google.com:

Added multiple_of to handle all multiples in reductions simply.

No significant performance loss:

bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1] 1.720µ ± 0% 1.719µ ± 17% ~ (p=0.485 n=6)
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1] 1.744µ ± 3% 1.753µ ± 14% ~ (p=0.310 n=6)
bench/sum_uint8_int32_4x64_avx512bw/real_time [256x1x256x1] 1.218µ ± 1% 1.216µ ± 17% ~ (p=0.818 n=6)
bench/sum_int8_int32_4x64_avx512bw/real_time [256x1x256x1] 1.217µ ± 0% 1.216µ ± 15% ~ (p=0.699 n=6)
bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1] 2.263µ ± 1% 2.268µ ± 0% ~ (p=0.394 n=6)
bench/sum_fp32_4x8_avx2/real_time [256x1x256x1] 4.342µ ± 0% 4.357µ ± 0% ~ (p=0.065 n=6)
bench/sum_uint8_int32_4x32_avx2/real_time [256x1x256x1] 2.221µ ± 0% 2.285µ ± 8% ~ (p=0.065 n=6)
bench/sum_int8_int32_4x32_avx2/real_time [256x1x256x1] 2.219µ ± 1% 2.279µ ± 2% +2.70% (p=0.002 n=6)
bench/sum_fp16_fp32_4x16_f16c/real_time [256x1x256x1] 2.344µ ± 0% 2.345µ ± 7% ~ (p=0.485 n=6)
bench/sum_uint8_int32_4x16_sse41/real_time [256x1x256x1] 4.318µ ± 0% 4.328µ ± 0% +0.22% (p=0.015 n=6)
bench/sum_int8_int32_4x16_sse41/real_time [256x1x256x1] 4.319µ ± 0% 4.325µ ± 1% ~ (p=0.394 n=6)
bench/sum_fp32_4x4_sse2/real_time [256x1x256x1] 8.790µ ± 0% 8.795µ ± 0% ~ (p=0.394 n=6)
bench/sum_uint8_int32_4x16_sse2/real_time [256x1x256x1] 3.966µ ± 0% 3.995µ ± 0% +0.73% (p=0.002 n=6)
bench/sum_int8_int32_4x16_sse2/real_time [256x1x256x1] 5.382µ ± 1% 5.410µ ± 1% +0.52% (p=0.041 n=6)
bench/sum_uint8_int32_4x16_ssse3/real_time [256x1x256x1] 3.977µ ± 0% 3.994µ ± 1% +0.44% (p=0.004 n=6)
bench/sum_int8_int32_4x16_ssse3/real_time [256x1x256x1] 5.373µ ± 0% 5.412µ ± 2% +0.72% (p=0.002 n=6)

PiperOrigin-RevId: 821549068

--
e5cb8c0 by Misha Gutman aelphy@google.com:

Changed the K1_1 strategy for f32 to use a single accumulator and the
maximally long multiple; this significantly improved performance.
Because the tiles for the contiguous case now differ from those for the
discontiguous case, changed the naming to omit tile information.

bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1] 2.259µ ± 1%
bench/sum_fp32_4x8_avx2/real_time [256x1x256x1] 4.339µ ± 0%
bench/sum_fp32_4x4_sse2/real_time [256x1x256x1] 8.787µ ± 1%
bench/sum_fp32/real_time [256x1x256x1] 3.255µ ± 7%
bench/sum_fp32_avx512f/real_time [256x1x256x1] 1.441µ ± 17%
bench/sum_fp32_avx2/real_time [256x1x256x1] 1.761µ ± 14%
bench/sum_fp32_sse2/real_time [256x1x256x1] 3.435µ ± 13%
bench/sum_fp32/real_time [256x1x256x1] 3.261µ ± 13%

bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1] 1.722µ ± 1%
bench/sum_bf16_fp32_avx512bf16/real_time [256x1x256x1] 1.703µ ± 1%
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1] 1.749µ ± 0%
bench/sum_fp16_fp32_avx512fp16/real_time [256x1x256x1] 1.744µ ± 0%
bench/sum_fp16_fp32_4x16_f16c/real_time [256x1x256x1] 2.341µ ± 1%
bench/sum_fp16_fp32_f16c/real_time [256x1x256x1] 1.652µ ± 7%

PiperOrigin-RevId: 821556723
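
The kernel itself is not shown in the commit message above. As a rough illustration of what a single-accumulator sum reduction can look like, here is a minimal scalar sketch. The name `sum_single_accumulator` and the lane count `kLanes` are hypothetical, and the fixed-size array only stands in for one SIMD register; it is not the actual kernel.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical vector width in floats (stands in for one SIMD register).
constexpr std::size_t kLanes = 8;

// Sums n floats using ONE lane-wise accumulator instead of several
// independent partial-sum vectors that get combined at the end.
float sum_single_accumulator(const float* x, std::size_t n) {
  assert(x != nullptr || n == 0);
  std::array<float, kLanes> acc{};  // the single accumulator, zero-initialized
  std::size_t i = 0;
  // Main loop: consume one full "vector" of kLanes elements per iteration.
  for (; i + kLanes <= n; i += kLanes) {
    for (std::size_t l = 0; l < kLanes; ++l) acc[l] += x[i + l];
  }
  // Horizontal reduction of the accumulator lanes.
  float total = 0.0f;
  for (float v : acc) total += v;
  // Scalar tail for the elements that did not fill a whole vector.
  for (; i < n; ++i) total += x[i];
  return total;
}
```

Whether one accumulator beats several depends on the target and the loop shape; the benchmark numbers above are the only evidence the commit gives.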

--
aeeca5d by Dillon Sharlet dsharlet@google.com:

Remove threadpool library and just build threadpool.cc as part of subgraph

PiperOrigin-RevId: 821566586

--
7304027 by Dillon Sharlet dsharlet@google.com:

Disable SME when msan is enabled

PiperOrigin-RevId: 821694771
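
The commit above does not say where the switch lives (it may well be in the build configuration). As a sketch of one way to express the same condition in source, the guard below uses Clang's `__has_feature(memory_sanitizer)` and the ACLE macro `__ARM_FEATURE_SME`. The `EXAMPLE_*` macros and the two helper functions are hypothetical names, not identifiers from the project.

```cpp
#include <cassert>

// Clang reports sanitizer state through __has_feature. Compilers that do not
// provide __has_feature skip the inner check entirely.
#if defined(__has_feature)
#if __has_feature(memory_sanitizer)
#define EXAMPLE_MSAN_ENABLED 1
#endif
#endif
#ifndef EXAMPLE_MSAN_ENABLED
#define EXAMPLE_MSAN_ENABLED 0
#endif

// Hypothetical switch: compile SME code paths only when the target supports
// SME and MemorySanitizer is off.
#if defined(__ARM_FEATURE_SME) && !EXAMPLE_MSAN_ENABLED
#define EXAMPLE_ENABLE_SME 1
#else
#define EXAMPLE_ENABLE_SME 0
#endif

constexpr bool example_msan_enabled() { return EXAMPLE_MSAN_ENABLED != 0; }
constexpr bool example_sme_enabled() { return EXAMPLE_ENABLE_SME != 0; }

// The two must never be on together under this scheme.
static_assert(!(example_msan_enabled() && example_sme_enabled()),
              "SME must be disabled under MSan");
```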

--
89a72e3 by Dillon Sharlet dsharlet@google.com:

Don't bother disabling KleidiAI if using YNNPACK

Disabling it causes builds to fail, and leaving it enabled is harmless.

PiperOrigin-RevId: 821704594

--
0c5edfc by Dillon Sharlet dsharlet@google.com:

Disable SME on older Apple compilers

PiperOrigin-RevId: 821708108

--
9b29972 by Dillon Sharlet dsharlet@google.com:

Fix usage of sv{ld,st}1_hor_vnum_za32

According to the ACLE documentation, these intrinsics increment both the slice and the pointer by vnum vectors. The previous usage treated them as if they incremented only the pointer being read from or written to by 1 vector, leaving the slice unchanged.

This is interesting because the code worked on QEMU but failed on real (Apple M4) hardware. I think this indicates a bug in QEMU's implementation of these instructions.

PiperOrigin-RevId: 821730217
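
The semantics described in the commit above can be modelled in plain C++ without SME hardware. This is a toy model only: `ToyTile`, `vnum_access`, `toy_ld1_hor_vnum`, `load_whole_tile`, and the tile dimension `kDim` are hypothetical names, and the real intrinsics also take a tile index and a predicate that are omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical tile dimension: slices per tile and 32-bit lanes per slice.
constexpr std::size_t kDim = 4;

// Toy stand-in for one ZA32 tile, stored as horizontal slices.
struct ToyTile {
  std::uint32_t rows[kDim][kDim] = {};
};

struct Access {
  std::size_t slice;          // slice actually touched
  std::size_t vector_offset;  // pointer offset, in whole vectors
};

// Models the documented behaviour: vnum advances BOTH the slice and the pointer.
Access vnum_access(std::uint32_t slice, std::int64_t vnum) {
  assert(vnum >= 0);  // the toy model only covers non-negative vnum
  const std::size_t v = static_cast<std::size_t>(vnum);
  return Access{slice + v, v};
}

// Toy version of a horizontal load with vnum.
void toy_ld1_hor_vnum(ToyTile& za, std::uint32_t slice,
                      const std::uint32_t* ptr, std::int64_t vnum) {
  const Access a = vnum_access(slice, vnum);
  assert(a.slice < kDim);  // keep the toy model in range
  const std::uint32_t* src = ptr + a.vector_offset * kDim;
  for (std::size_t lane = 0; lane < kDim; ++lane) {
    za.rows[a.slice][lane] = src[lane];
  }
}

// Fills every slice from consecutive vectors: the base slice stays 0 and
// vnum alone selects both the destination slice and the source vector.
void load_whole_tile(ToyTile& za, const std::uint32_t* base) {
  for (std::int64_t i = 0; i < static_cast<std::int64_t>(kDim); ++i) {
    toy_ld1_hor_vnum(za, /*slice=*/0, base, /*vnum=*/i);
  }
}
```

Under this model, a caller that passes both `slice = i` and `vnum = i`, expecting `vnum` to move only the pointer, ends up addressing slice `2 * i` instead of slice `i`.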

--
0d3dc09 by Dillon Sharlet dsharlet@google.com:

Fix correctness of dot benchmarks for transpose_a kernels

PiperOrigin-RevId: 821808685

--
4b73eb1 by Pedro Gonnet gonnet@google.com:

Update pthreadpool dependency.

PiperOrigin-RevId: 821857188

--
66d084b by Dillon Sharlet dsharlet@google.com:

Fix flaky quantize tests

PiperOrigin-RevId: 821867761

--
6fc5696 by Quentin Khan qkhan@google.com:

Add missing gemm_config .element_size initializations.

PiperOrigin-RevId: 821984759

--
923b7f9 by Jonathan Clohessy Jonathan.Clohessy@arm.com:

Fix build issues and guard against sme2 specific path

Signed-off-by: Jonathan Clohessy Jonathan.Clohessy@arm.com

--
06a44d2 by Jonathan Clohessy Jonathan.Clohessy@arm.com:

Refactor Convolution to new structure and fix build failures

Signed-off-by: Jonathan Clohessy Jonathan.Clohessy@arm.com

--
175903d by Jonathan Clohessy jonathan.clohessy@arm.com:

Remove unused gemm config structure init

Signed-off-by: Jonathan Clohessy jonathan.clohessy@arm.com

--
999f4e3 by Jonathan Clohessy jonathan.clohessy@arm.com:

Updated code with sme variants of kernels and fixed tests

Signed-off-by: Jonathan Clohessy jonathan.clohessy@arm.com

--
a2bd7aa by Jonathan Clohessy jonathan.clohessy@arm.com:

Updated ifdef guards and yml file

Signed-off-by: Jonathan Clohessy jonathan.clohessy@arm.com

--
551cfde by Jonathan Clohessy jonathan.clohessy@arm.com:

Add new test case and fix issue with LHS pack

Signed-off-by: Jonathan Clohessy jonathan.clohessy@arm.com

--
bcc62a0 by Jonathan Clohessy jonathan.clohessy@arm.com:

Removed ForceInlineLhsPackingPf16OnLastConv and use runtime flags instead

Signed-off-by: Jonathan Clohessy jonathan.clohessy@arm.com
FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm bcc62a0

@copybara-service copybara-service bot force-pushed the test_833326167 branch 2 times, most recently from 2597292 to 9b4b12a Compare November 18, 2025 19:56
@copybara-service copybara-service bot closed this Nov 18, 2025
@copybara-service copybara-service bot deleted the test_833326167 branch November 18, 2025 20:19