Copybara import of the project: #9163

bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1] 1.720µ ± 0% 1.719µ ± 17% ~ (p=0.485 n=6)
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1] 1.744µ ± 3% 1.753µ ± 14% ~ (p=0.310 n=6)
bench/sum_uint8_int32_4x64_avx512bw/real_time [256x1x256x1] 1.218µ ± 1% 1.216µ ± 17% ~ (p=0.818 n=6)
bench/sum_int8_int32_4x64_avx512bw/real_time [256x1x256x1] 1.217µ ± 0% 1.216µ ± 15% ~ (p=0.699 n=6)
bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1] 2.263µ ± 1% 2.268µ ± 0% ~ (p=0.394 n=6)
bench/sum_fp32_4x8_avx2/real_time [256x1x256x1] 4.342µ ± 0% 4.357µ ± 0% ~ (p=0.065 n=6)
bench/sum_uint8_int32_4x32_avx2/real_time [256x1x256x1] 2.221µ ± 0% 2.285µ ± 8% ~ (p=0.065 n=6)
bench/sum_int8_int32_4x32_avx2/real_time [256x1x256x1] 2.219µ ± 1% 2.279µ ± 2% +2.70% (p=0.002 n=6)
bench/sum_fp16_fp32_4x16_f16c/real_time [256x1x256x1] 2.344µ ± 0% 2.345µ ± 7% ~ (p=0.485 n=6)
bench/sum_uint8_int32_4x16_sse41/real_time [256x1x256x1] 4.318µ ± 0% 4.328µ ± 0% +0.22% (p=0.015 n=6)
bench/sum_int8_int32_4x16_sse41/real_time [256x1x256x1] 4.319µ ± 0% 4.325µ ± 1% ~ (p=0.394 n=6)
bench/sum_fp32_4x4_sse2/real_time [256x1x256x1] 8.790µ ± 0% 8.795µ ± 0% ~ (p=0.394 n=6)
bench/sum_uint8_int32_4x16_sse2/real_time [256x1x256x1] 3.966µ ± 0% 3.995µ ± 0% +0.73% (p=0.002 n=6)
bench/sum_int8_int32_4x16_sse2/real_time [256x1x256x1] 5.382µ ± 1% 5.410µ ± 1% +0.52% (p=0.041 n=6)
bench/sum_uint8_int32_4x16_ssse3/real_time [256x1x256x1] 3.977µ ± 0% 3.994µ ± 1% +0.44% (p=0.004 n=6)
bench/sum_int8_int32_4x16_ssse3/real_time [256x1x256x1] 5.373µ ± 0% 5.412µ ± 2% +0.72% (p=0.002 n=6)

PiperOrigin-RevId: 821549068

--
e5cb8c0 by Misha Gutman aelphy@google.com:

Changed the K1_1 strategy for f32 to use a single accumulator and a maximally
long multiple; this significantly improved performance.
Since the tiles for the contiguous case now differ from those for the
discontiguous case, changed the naming to not include tile information.

bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1] 2.259µ ± 1%
bench/sum_fp32_4x8_avx2/real_time [256x1x256x1] 4.339µ ± 0%
bench/sum_fp32_4x4_sse2/real_time [256x1x256x1] 8.787µ ± 1%
bench/sum_fp32/real_time [256x1x256x1] 3.255µ ± 7%
bench/sum_fp32_avx512f/real_time [256x1x256x1] 1.441µ ± 17%
bench/sum_fp32_avx2/real_time [256x1x256x1] 1.761µ ± 14%
bench/sum_fp32_sse2/real_time [256x1x256x1] 3.435µ ± 13%
bench/sum_fp32/real_time [256x1x256x1] 3.261µ ± 13%

bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1] 1.722µ ± 1%
bench/sum_bf16_fp32_avx512bf16/real_time [256x1x256x1] 1.703µ ± 1%
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1] 1.749µ ± 0%
bench/sum_fp16_fp32_avx512fp16/real_time [256x1x256x1] 1.744µ ± 0%
bench/sum_fp16_fp32_4x16_f16c/real_time [256x1x256x1] 2.341µ ± 1%
bench/sum_fp16_fp32_f16c/real_time [256x1x256x1] 1.652µ ± 7%

PiperOrigin-RevId: 821556723

--
aeeca5d by Dillon Sharlet dsharlet@google.com:

Remove threadpool library and just build threadpool.cc as part of subgraph

PiperOrigin-RevId: 821566586

--
7304027 by Dillon Sharlet dsharlet@google.com:

Disable SME when msan is enabled

PiperOrigin-RevId: 821694771

--
89a72e3 by Dillon Sharlet dsharlet@google.com:

Don't bother disabling KleidiAI if using YNNPACK

This causes builds to fail, and it's harmless to leave it enabled.

PiperOrigin-RevId: 821704594

--
0c5edfc by Dillon Sharlet dsharlet@google.com:

Disable SME on older Apple compilers

PiperOrigin-RevId: 821708108

--
9b29972 by Dillon Sharlet dsharlet@google.com:

Fix usage of sv{ld,st}1_hor_vnum_za32

According to the ACLE documentation, this increments both the slice and the pointer by vnum vectors. The previous usage treated it as if it only incremented the pointer being read from or written to by 1 vector, without changing the slice.

This is interesting because this code worked on QEMU, but fails on real (Apple M4) hardware. I think this indicates there is a bug in the implementation of these instructions in QEMU.
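
To illustrate the semantics described above, here is a minimal sketch in plain C++. It is a toy model, not the real ACLE intrinsic: `model_ld1_hor_vnum`, `load_tile`, `Tile`, `kLanes` and `kRows` are hypothetical names, and the vector length and tile size are made-up constants. The only behavior it models is that vnum offsets both the slice index and the address.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy model only: a tile of kRows horizontal slices, each holding one
// vector of kLanes 32-bit elements. Both sizes are hypothetical.
constexpr std::size_t kLanes = 4;
constexpr std::size_t kRows = 4;

using Tile = std::array<std::array<std::uint32_t, kLanes>, kRows>;

// Models the documented behavior: vnum offsets BOTH the slice index and
// the memory address, each by vnum whole vectors.
inline void model_ld1_hor_vnum(Tile& za, std::size_t slice,
                               const std::uint32_t* ptr, std::size_t vnum) {
  const std::size_t row = slice + vnum;
  assert(row < kRows);  // the toy model simply rejects out-of-range slices
  const std::uint32_t* src = ptr + vnum * kLanes;
  for (std::size_t i = 0; i < kLanes; ++i) za[row][i] = src[i];
}

// Loads kRows consecutive vectors from src into slices 0..kRows-1.
// The base slice stays fixed; vnum alone advances slice and pointer.
// Passing slice = v together with vnum = v (assuming vnum only moves the
// pointer) would target slice 2 * v instead, which is the kind of mistake
// the commit message describes.
inline Tile load_tile(const std::uint32_t* src) {
  Tile za{};
  for (std::size_t v = 0; v < kRows; ++v) {
    model_ld1_hor_vnum(za, /*slice=*/0, src, /*vnum=*/v);
  }
  return za;
}
```

In this model, calling `load_tile` on a buffer of `kRows * kLanes` consecutive values places vector v in slice v.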

PiperOrigin-RevId: 821730217

--
0d3dc09 by Dillon Sharlet dsharlet@google.com:

Fix correctness of dot benchmarks for transpose_a kernels

PiperOrigin-RevId: 821808685

--
4b73eb1 by Pedro Gonnet gonnet@google.com:

Update pthreadpool dependency.

PiperOrigin-RevId: 821857188

--
66d084b by Dillon Sharlet dsharlet@google.com:

Fix flaky quantize tests

PiperOrigin-RevId: 821867761

--
6fc5696 by Quentin Khan qkhan@google.com:

Add missing gemm_config .element_size initializations.