
Commit 5f9e97f

taronaeo and slaren authored and committed
ggml-cpu: enable IBM NNPA Vector Intrinsics (llama/14317)
* ggml-cpu: add nnpa compile flag (cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)
* ggml-cpu: add fp16->fp32 nnpa first (cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)
* ggml-cpu: add fp32->fp16 (cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)
* ggml-cpu: better variable names (cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)
* docs: update s390x docs (cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)
* ggml-cpu: add debugging prints to see if dlf16 is correct
* ggml-cpu: fix print vs printf
* ggml-cpu: fix float placeholder
* ggml-cpu: ensure fp16 and fp32 load and stores are called
* ggml-cpu: fp16 load ensured to hit
* ggml-cpu: remove sigint from fp16 store
  for some reason, the function is not getting a hit when debugged with gdb. we will need to investigate further
* ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32
* ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements
* ggml-cpu: nnpa switch to vec_xst test
* ggml-cpu: switch to vec_xst for 4 element loops also
* ggml-cpu: rework noop
* ggml-cpu: remove noop, general code cleanup
* ggml-cpu: clarify variable naming
* ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16
* ggml-cpu: add breakpoint for debugging
* ggml-cpu: test fix for conversion failure
* ggml-cpu: disable fp32->fp16 nnpa conversions for now
  there are some conversion failures in nnpa that requires the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change.
* ggml-cpu: switch to elif macro
* ggml-cpu: reattempt fp32->fp16
* ggml-cpu: fix typo
* ggml-cpu: reattempt fp32->fp16
* ggml-cpu: fix compiler types
* ggml-cpu: change to typedef vector types
* ggml-cpu: add 4 element loops for fp32->fp16
* ggml-cpu: clarified vector naming
* ggml-cpu: bring back fp32->fp16 store nnpa
* ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute
* ggml-cpu: add nnpa macro check in ggml-impl
* ggml-cpu: add missing __func__
* ggml-cpu: diagnose why __NNPA__ macro is not being defined
* ggml-cpu: import vecintrin.h to fix compiler errors
* ggml-cpu: update macro tests
* ggml-cpu: move s390x typedef to own header file
* Revert "ggml-cpu: move s390x typedef to own header file"
  This reverts commit 157f856c34589566151630e294563a420702db39.
* ggml-cpu: switch to importing ggml-cpu-impl instead
* ggml-cpu: fix macro declaration
* ggml-cpu: test more macros
* ggml-cpu: add debug prints
* ggml-cpu: bruteforce macro definitions
* ggml-cpu: move macro definitions
* ggml-cpu: add ggml-impl.h to cmakelists
* ggml-cpu: switch to private macros
* ggml-cpu: move s390x typedef to own header file (cherry picked from commit 157f856c34589566151630e294563a420702db39)
* ggml-cpu: move things around
* ggml-cpu: bring back compile macros
* ggml-cpu: switch to quotes for import
* ggml-cpu: add compiler error macro
* ggml-cpu: add s390x detection in ggml-src
* ggml-cpu: bring back compile definitions
* ggml-cpu: undo cmakelists work
* Revert "ggml-cpu: move s390x typedef to own header file"
  This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.
* ggml-cpu: remove typedefs.h
* ggml-cpu: remove typedef from cmakelists
* ggml-cpu: add ggml-impl.h future notes
* ggml-cpu: add todo comment for future reference
* ggml-cpu: clarify naming of dlf16
* ggml-cpu: remove unnecessary target compile definitions
* ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings
* ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu
* docs: update broken huggingface link for s390x
* ggml-cpu: fix duplicate func names during compile
* Revert "ggml-cpu: fix duplicate func names during compile"
  This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.
* Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu"
  This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.
* ggml: refactor fp16<->fp32 simd to ggml-cpu
* ggml-cpu: fix missing simd-mappings.h import in quants.c
* ggml-cpu: fix missing simd-mappings.h within repack
* ggml-cpu: fix amx mmq missing simd-mappings.h
* ggml-cpu: attempt at fixing loongarch failing build
* ggml-cpu: move nnpa together with other fp16<->fp32 simd
* ggml-cpu: fix wrong refactor of ggml-base
  ref: ggml-org/llama.cpp#14317 (comment)
* ggml: remove dependency on ggml-cpu from ggml-base
* ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu
  ref: ggml-org/llama.cpp#14317 (comment)
* ggml-cpu: remove mistaken fallback macro
  fallback logic was already implemented but i was too sleepy to realise
* ggml: move ggml_table_f32_f16 to ggml-cpu
  ref: ggml-org/llama.cpp#14317 (comment)
* ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures
* Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures"
  This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.
* Revert "ggml: move ggml_table_f32_f16 to ggml-cpu"
  This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.
* ggml: move ggml_table_f32_f16 to ggml-cpu
  ref: ggml-org/llama.cpp#14317 (comment)
  (cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)
* ggml: move ggml_table_f32_f16 to ggml-cpu.c
* ggml-cpu: extern c ggml_table_f32_f16 + chore docs
* ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h
  we rely on the variable declaration in ggml-cpu.c instead
* Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h"
  This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.
* ggml-cpu: bring back ggml_table_f32_f16
* Revert "ggml-cpu: bring back ggml_table_f32_f16"
  This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.
* fix ggml time initialization
* fix f32_f16 table init
* remove extra line

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: slaren <slarengh@gmail.com>
1 parent 40b8547 commit 5f9e97f

27 files changed: +963 additions, -841 deletions
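The thrust of the series: on NNPA-capable s390x machines, ggml's scalar fp16<->fp32 conversions are routed through the dlfloat16 hardware path in simd-mappings.h instead of the generic fallback. A minimal sketch of the shape of that path, assuming the <vecintrin.h> NNPA intrinsics (vec_convert_from_fp16, vec_extend_to_fp32_hi, vec_round_from_fp32, vec_convert_to_fp16); treat the helper names and exact structure as illustrative rather than the literal patch contents:

    // Sketch only: NNPA-style scalar fp16 <-> fp32 conversion on s390x.
    // Assumes <vecintrin.h> and a target with the NNPA facility so the
    // vector-enhancements intrinsics below are available.
    #include <vecintrin.h>
    #include <stdint.h>

    typedef __vector unsigned short uint16x8_t;
    typedef __vector float          float32x4_t;

    static inline float nnpa_fp16_to_fp32(uint16_t h) {
        uint16x8_t  v_h  = vec_splats(h);                  // broadcast the fp16 bits
        uint16x8_t  v_hd = vec_convert_from_fp16(v_h, 0);  // fp16 -> dlfloat16
        float32x4_t v_f  = vec_extend_to_fp32_hi(v_hd, 0); // dlfloat16 -> fp32
        return v_f[0];
    }

    static inline uint16_t nnpa_fp32_to_fp16(float f) {
        float32x4_t v_f    = vec_splats(f);
        float32x4_t v_zero = vec_splats(0.0f);
        uint16x8_t  v_hd   = vec_round_from_fp32(v_f, v_zero, 0); // fp32 -> dlfloat16
        uint16x8_t  v_h    = vec_convert_to_fp16(v_hd, 0);        // dlfloat16 -> fp16
        return vec_extract(v_h, 0);
    }

"dlf16" in the commit log refers to dlfloat16, the 16-bit format the NNPA facility operates on, which is why the conversions above take two steps (fp16 -> dlfloat16 -> fp32 and back). ggml_table_f32_f16, whose home the log shows being shuffled between ggml-base and ggml-cpu, is the lookup table used by the non-NNPA fallback conversion.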

ggml/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -131,6 +131,7 @@ option(GGML_RVV "ggml: enable rvv" ON)
 option(GGML_RV_ZFH "ggml: enable riscv zfh" OFF)
 option(GGML_XTHEADVECTOR "ggml: enable xtheadvector" OFF)
 option(GGML_VXE "ggml: enable vxe" ON)
+option(GGML_NNPA "ggml: enable nnpa" ON)

 option(GGML_CPU_ALL_VARIANTS "ggml: build all variants of the CPU backend (requires GGML_BACKEND_DL)" OFF)
 set(GGML_CPU_ARM_ARCH "" CACHE STRING "ggml: CPU architecture for ARM")
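With the option in place, NNPA support (on by default) can be toggled at configure time in the usual CMake way; standard usage, not part of this diff:

    cmake -B build -DGGML_NNPA=ON
    cmake --build build --config Release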

ggml/include/ggml-cpu.h

Lines changed: 1 addition & 0 deletions
@@ -101,6 +101,7 @@ extern "C" {
     GGML_BACKEND_API int ggml_cpu_has_riscv_v   (void);
     GGML_BACKEND_API int ggml_cpu_has_vsx       (void);
     GGML_BACKEND_API int ggml_cpu_has_vxe       (void);
+    GGML_BACKEND_API int ggml_cpu_has_nnpa      (void);
     GGML_BACKEND_API int ggml_cpu_has_wasm_simd (void);
     GGML_BACKEND_API int ggml_cpu_has_llamafile (void);
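Like the neighbouring probes, the new getter lets callers check NNPA availability at runtime. A small hypothetical usage example:

    #include <stdio.h>
    #include "ggml-cpu.h"

    int main(void) {
        // Both probes are declared in ggml-cpu.h; they return 0 or 1
        // depending on how the CPU backend was built.
        printf("VXE : %d\n", ggml_cpu_has_vxe());
        printf("NNPA: %d\n", ggml_cpu_has_nnpa());
        return 0;
    }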

ggml/src/ggml-cpu/CMakeLists.txt

Lines changed: 8 additions & 0 deletions
@@ -448,6 +448,7 @@ function(ggml_add_cpu_backend_variant_impl tag_name)

         # TODO: Separation to determine activation of VX/VXE/VXE2
         if (${S390X_M} MATCHES "8561|8562")
+            set(GGML_NNPA OFF)
             message(STATUS "z15 target")
             list(APPEND ARCH_FLAGS -march=z15)
         elseif (${S390X_M} MATCHES "3931")
@@ -464,7 +465,14 @@ function(ggml_add_cpu_backend_variant_impl tag_name)
         endif()

         if (GGML_VXE)
+            message(STATUS "VX/VXE/VXE2 enabled")
             list(APPEND ARCH_FLAGS -mvx -mzvector)
+            list(APPEND ARCH_DEFINITIONS GGML_VXE)
+        endif()
+
+        if (GGML_NNPA)
+            message(STATUS "NNPA enabled")
+            list(APPEND ARCH_DEFINITIONS GGML_NNPA)
         endif()
     elseif (CMAKE_SYSTEM_PROCESSOR MATCHES "wasm")
         message(STATUS "Wasm detected")

ggml/src/ggml-cpu/amx/mmq.cpp

Lines changed: 10 additions & 9 deletions
@@ -8,6 +8,7 @@
 #include "mmq.h"
 #include "ggml-impl.h"
 #include "ggml-cpu-impl.h"
+#include "simd-mappings.h"
 #include "quants.h"
 #include "ggml-quants.h"
 #include <algorithm>
@@ -453,7 +454,7 @@ void quantize_row_q8_K_vnni(const float * RESTRICT x, void * RESTRICT vy, int64_

         // Quantize these floats
         const float iscale = 127.f / amax;
-        y[i].d = GGML_FP32_TO_FP16(1 / iscale);
+        y[i].d = GGML_CPU_FP32_TO_FP16(1 / iscale);
         const float id = ( amax != 0.0f ) ? iscale : 0.f;
         const __m512 vscale = _mm512_set1_ps(id);

@@ -1090,7 +1091,7 @@ struct acc_C<block_q8_0, block_q4_0, is_acc> {
         const __m512 vd0 = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)((const char *)packed_B + offset)));

         for (int m = 0; m < nr; ++m) {
-            const __m512 vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[m * lda].d));
+            const __m512 vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[m * lda].d));
             const __m512 vtile = _mm512_cvtepi32_ps(_mm512_loadu_si512(tile + m * TILE_N));

             __m512 vsum;
@@ -1113,8 +1114,8 @@ struct acc_C<block_q8_1, block_q4_1, is_acc> {
         const __m512 vm0 = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)((const char *)packed_B + offset + TILE_N * sizeof(ggml_half))));

         for (int m = 0; m < nr; ++m) {
-            const __m512 vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[m * lda].d));
-            const __m512 vs1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[m * lda].s));
+            const __m512 vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[m * lda].d));
+            const __m512 vs1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[m * lda].s));
             const __m512 vtile = _mm512_cvtepi32_ps(_mm512_loadu_si512(tile + m * TILE_N));

             __m512 vsum;
@@ -1137,7 +1138,7 @@ struct acc_C<block_q8_0, block_q8_0, is_acc> {
         const __m512 vd0 = _mm512_cvtph_ps(_mm256_loadu_si256((const __m256i *)((const char *)packed_B + offset)));

         for (int m = 0; m < nr; ++m) {
-            const __m512 vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[m * lda].d));
+            const __m512 vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[m * lda].d));
             const __m512 vtile = _mm512_cvtepi32_ps(_mm512_loadu_si512(tile + m * TILE_N));

             __m512 vsum;
@@ -1437,7 +1438,7 @@ struct tinygemm_kernel_vnni<block_q8_0, block_q4_0, float, BLOCK_M, BLOCK_N, BLO
                 va[k] = _mm512_set1_epi32(a_ptr[k]);
                 vcomp = _mm512_dpbusd_epi32(vcomp, off, va[k]);
             }
-            vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[0 * KB + i].d));
+            vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[0 * KB + i].d));
         }

         // load b
@@ -1498,8 +1499,8 @@ struct tinygemm_kernel_vnni<block_q8_1, block_q4_1, float, 1, BLOCK_N, BLOCK_K>
             for (int k = 0; k < 8; ++k) {
                 va[k] = _mm512_set1_epi32(a_ptr[k]);
             }
-            vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[0 * KB + i].d));
-            vs1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[0 * KB + i].s));
+            vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[0 * KB + i].d));
+            vs1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[0 * KB + i].s));
         }

         // load b
@@ -1571,7 +1572,7 @@ struct tinygemm_kernel_vnni<block_q8_0, block_q8_0, float, BLOCK_M, BLOCK_N, BLO
                 va[k] = _mm512_set1_epi32(a_ptr[k]);
                 va[k] = _mm512_add_epi8(va[k], off);
             }
-            vd1 = _mm512_set1_ps(GGML_FP16_TO_FP32(A[0 * KB + i].d));
+            vd1 = _mm512_set1_ps(GGML_CPU_FP16_TO_FP32(A[0 * KB + i].d));
         }

         // load b
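The bulk of this file is mechanical: GGML_FP16_TO_FP32/GGML_FP32_TO_FP16 become GGML_CPU_-prefixed macros supplied by the newly included simd-mappings.h, which is what lets the CPU backend pick a per-arch conversion (F16C, NNPA, or a portable fallback) without ggml-base depending on it. A rough sketch of the dispatch shape; illustrative only, the actual definitions are in simd-mappings.h:

    // Illustrative: the shape of the per-arch dispatch in simd-mappings.h.
    #if defined(__F16C__)
        #include <immintrin.h>
        #define GGML_CPU_FP16_TO_FP32(x) _cvtsh_ss(x)          // x86 hardware convert
    #elif defined(__NNPA__)
        #define GGML_CPU_FP16_TO_FP32(x) nnpa_fp16_to_fp32(x)  // s390x NNPA path (sketched earlier)
    #else
        #define GGML_CPU_FP16_TO_FP32(x) GGML_FP16_TO_FP32(x)  // portable fallback from ggml-base
    #endif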
