Copybara import of the project: #9163
Closed
Copybara import of the project:
--
c69ccdb by Gian Marco Iodice gianmarco.iodice@arm.com:
Prototype: Add support for fp16 iGEMM with SME2
Signed-off-by: Gian Marco Iodice gianmarco.iodice@arm.com
--
a3537a1 by Gian Marco Iodice gianmarco.iodice@arm.com:
Include missing files
Signed-off-by: Gian Marco Iodice gianmarco.iodice@arm.com
--
232826c by Gian Marco Iodice gianmarco.iodice@arm.com:
Update FP16 iGEMM based on review comments
Signed-off-by: Gian Marco Iodice gianmarco.iodice@arm.com
--
03bccaa by Jonathan Clohessy jonathan.clohessy@arm.com:
Updated FP16 iGemm Review with Fixes
Signed-off-by: Jonathan Clohessy jonathan.clohessy@arm.com
--
9cd6e88 by Jonathan Clohessy Jonathan.Clohessy@arm.com:
Fix rebase issues
Signed-off-by: Jonathan Clohessy Jonathan.Clohessy@arm.com
--
7eb618d by Misha Gutman aelphy@google.com:
Added multiple_of to handle all multiples in reductions simply.
No significant performance loss:
bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1] 1.720µ ± 0% 1.719µ ± 17% ~ (p=0.485 n=6)
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1] 1.744µ ± 3% 1.753µ ± 14% ~ (p=0.310 n=6)
bench/sum_uint8_int32_4x64_avx512bw/real_time [256x1x256x1] 1.218µ ± 1% 1.216µ ± 17% ~ (p=0.818 n=6)
bench/sum_int8_int32_4x64_avx512bw/real_time [256x1x256x1] 1.217µ ± 0% 1.216µ ± 15% ~ (p=0.699 n=6)
bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1] 2.263µ ± 1% 2.268µ ± 0% ~ (p=0.394 n=6)
bench/sum_fp32_4x8_avx2/real_time [256x1x256x1] 4.342µ ± 0% 4.357µ ± 0% ~ (p=0.065 n=6)
bench/sum_uint8_int32_4x32_avx2/real_time [256x1x256x1] 2.221µ ± 0% 2.285µ ± 8% ~ (p=0.065 n=6)
bench/sum_int8_int32_4x32_avx2/real_time [256x1x256x1] 2.219µ ± 1% 2.279µ ± 2% +2.70% (p=0.002 n=6)
bench/sum_fp16_fp32_4x16_f16c/real_time [256x1x256x1] 2.344µ ± 0% 2.345µ ± 7% ~ (p=0.485 n=6)
bench/sum_uint8_int32_4x16_sse41/real_time [256x1x256x1] 4.318µ ± 0% 4.328µ ± 0% +0.22% (p=0.015 n=6)
bench/sum_int8_int32_4x16_sse41/real_time [256x1x256x1] 4.319µ ± 0% 4.325µ ± 1% ~ (p=0.394 n=6)
bench/sum_fp32_4x4_sse2/real_time [256x1x256x1] 8.790µ ± 0% 8.795µ ± 0% ~ (p=0.394 n=6)
bench/sum_uint8_int32_4x16_sse2/real_time [256x1x256x1] 3.966µ ± 0% 3.995µ ± 0% +0.73% (p=0.002 n=6)
bench/sum_int8_int32_4x16_sse2/real_time [256x1x256x1] 5.382µ ± 1% 5.410µ ± 1% +0.52% (p=0.041 n=6)
bench/sum_uint8_int32_4x16_ssse3/real_time [256x1x256x1] 3.977µ ± 0% 3.994µ ± 1% +0.44% (p=0.004 n=6)
bench/sum_int8_int32_4x16_ssse3/real_time [256x1x256x1] 5.373µ ± 0% 5.412µ ± 2% +0.72% (p=0.002 n=6)
PiperOrigin-RevId: 821549068
--
e5cb8c0 by Misha Gutman aelphy@google.com:
Changed the K1_1 strategy for f32 to use a single accumulator and a maximally
long multiple; this significantly improved performance.
Since the contiguous-case tiles now differ from the discontiguous ones, changed
the naming to not include tile information.
bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1] 2.259µ ± 1%
bench/sum_fp32_4x8_avx2/real_time [256x1x256x1] 4.339µ ± 0%
bench/sum_fp32_4x4_sse2/real_time [256x1x256x1] 8.787µ ± 1%
bench/sum_fp32/real_time [256x1x256x1] 3.255µ ± 7%
bench/sum_fp32_avx512f/real_time [256x1x256x1] 1.441µ ± 17%
bench/sum_fp32_avx2/real_time [256x1x256x1] 1.761µ ± 14%
bench/sum_fp32_sse2/real_time [256x1x256x1] 3.435µ ± 13%
bench/sum_fp32/real_time [256x1x256x1] 3.261µ ± 13%
bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1] 1.722µ ± 1%
bench/sum_bf16_fp32_avx512bf16/real_time [256x1x256x1] 1.703µ ± 1%
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1] 1.749µ ± 0%
bench/sum_fp16_fp32_avx512fp16/real_time [256x1x256x1] 1.744µ ± 0%
bench/sum_fp16_fp32_4x16_f16c/real_time [256x1x256x1] 2.341µ ± 1%
bench/sum_fp16_fp32_f16c/real_time [256x1x256x1] 1.652µ ± 7%
PiperOrigin-RevId: 821556723
--
aeeca5d by Dillon Sharlet dsharlet@google.com:
Remove threadpool library and just build threadpool.cc as part of subgraph
PiperOrigin-RevId: 821566586
--
7304027 by Dillon Sharlet dsharlet@google.com:
Disable SME when msan is enabled
PiperOrigin-RevId: 821694771
--
89a72e3 by Dillon Sharlet dsharlet@google.com:
Don't bother disabling KleidiAI if using YNNPACK
Disabling it causes builds to fail, and it's harmless to leave it enabled.
PiperOrigin-RevId: 821704594
--
0c5edfc by Dillon Sharlet dsharlet@google.com:
Disable SME on older Apple compilers
PiperOrigin-RevId: 821708108
--
9b29972 by Dillon Sharlet dsharlet@google.com:
Fix usage of `sv{ld,st}1_hor_vnum_za32`
According to the ACLE documentation, this increments both the slice and the pointer by `vnum` vectors. This usage of it treated it as if it only incremented the pointer to read from/write to by 1 vector (but did not change the slice).
This is interesting because this code worked on QEMU, but fails on real (Apple M4) hardware. I think this indicates there is a bug in the implementation of these instructions in QEMU.
PiperOrigin-RevId: 821730217
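To make the `vnum` behavior described in this commit concrete, here is a minimal scalar sketch in plain C++ (no SME intrinsics, so it runs anywhere). Everything in it is hypothetical and for illustration only: `ZaTileModel`, `model_ld1_hor_vnum`, and `model_ld1_hor_vnum_mistaken` are not part of ACLE or XNNPACK, and the tile is modeled as a square array of 32-bit elements whose row length stands in for one vector. It only contrasts the two readings of `vnum` mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of one 32-bit ZA tile: `slices` horizontal
// slices, each holding `slices` 32-bit elements (one "vector").
struct ZaTileModel {
  explicit ZaTileModel(size_t n) : slices(n), data(n * n, 0) {}
  size_t slices;
  std::vector<uint32_t> data;
  uint32_t* row(size_t i) { return &data[i * slices]; }
};

// Reading described in the commit message: `vnum` offsets BOTH the slice
// index and the memory pointer, each by `vnum` whole vectors.
void model_ld1_hor_vnum(ZaTileModel& za, uint32_t slice, const uint32_t* ptr,
                        int64_t vnum) {
  const int64_t target = static_cast<int64_t>(slice) + vnum;
  // Assumption of this sketch: the resulting slice must be in range.
  assert(target >= 0 && static_cast<size_t>(target) < za.slices);
  const uint32_t* src = ptr + vnum * static_cast<int64_t>(za.slices);
  uint32_t* dst = za.row(static_cast<size_t>(target));
  for (size_t i = 0; i < za.slices; ++i) dst[i] = src[i];
}

// The mistaken reading: `vnum` offsets only the pointer, and the slice
// index stays where the caller put it.
void model_ld1_hor_vnum_mistaken(ZaTileModel& za, uint32_t slice,
                                 const uint32_t* ptr, int64_t vnum) {
  assert(static_cast<size_t>(slice) < za.slices);
  const uint32_t* src = ptr + vnum * static_cast<int64_t>(za.slices);
  uint32_t* dst = za.row(slice);
  for (size_t i = 0; i < za.slices; ++i) dst[i] = src[i];
}
```

With `slice = 0` and `vnum = 2`, the first model writes the third vector of memory into slice 2, while the mistaken model writes the same data into slice 0. Code written against the mistaken reading therefore places data in the wrong slices wherever the first reading actually applies.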
--
0d3dc09 by Dillon Sharlet dsharlet@google.com:
Fix correctness of dot benchmarks for transpose_a kernels
PiperOrigin-RevId: 821808685
--
4b73eb1 by Pedro Gonnet gonnet@google.com:
Update `pthreadpool` dependency.
PiperOrigin-RevId: 821857188
--
66d084b by Dillon Sharlet dsharlet@google.com:
Fix flaky quantize tests
PiperOrigin-RevId: 821867761
--
6fc5696 by Quentin Khan qkhan@google.com:
Add missing `gemm_config.element_size` initializations.
PiperOrigin-RevId: 821984759
--
923b7f9 by Jonathan Clohessy Jonathan.Clohessy@arm.com:
Fix build issues and guard against sme2 specific path
Signed-off-by: Jonathan Clohessy Jonathan.Clohessy@arm.com
--
06a44d2 by Jonathan Clohessy Jonathan.Clohessy@arm.com:
Refactor Convolution to new structure and fix build failures
Signed-off-by: Jonathan Clohessy Jonathan.Clohessy@arm.com
--
175903d by Jonathan Clohessy jonathan.clohessy@arm.com:
Remove unused gemm config structure init
Signed-off-by: Jonathan Clohessy jonathan.clohessy@arm.com
--
999f4e3 by Jonathan Clohessy jonathan.clohessy@arm.com:
Updated code with sme variants of kernels and fixed tests
Signed-off-by: Jonathan Clohessy jonathan.clohessy@arm.com
--
a2bd7aa by Jonathan Clohessy jonathan.clohessy@arm.com:
Updated ifdef guards and yml file
Signed-off-by: Jonathan Clohessy jonathan.clohessy@arm.com
--
551cfde by Jonathan Clohessy jonathan.clohessy@arm.com:
Add new test case and fix issue with LHS pack
Signed-off-by: Jonathan Clohessy jonathan.clohessy@arm.com
--
bcc62a0 by Jonathan Clohessy jonathan.clohessy@arm.com:
Removed ForceInlineLhsPackingPf16OnLastConv and use runtime flags instead
Signed-off-by: Jonathan Clohessy jonathan.clohessy@arm.com
FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm bcc62a0