chunying edited this page Dec 27, 2022 · 2 revisions

firefly3399

Theoretical peak of one A72 core: 1.8 GHz * 2 FLOPs per MLA (multiply + add) * 4 floats per NEON register = 14.4 GFlops
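The peak figure above is a product of three factors, which can be sketched as a small helper. The function and parameter names are illustrative assumptions, not part of Tengine or MegPeak.

```c
/* Theoretical single-core peak in GFLOPS.
 * Hypothetical helper: names are illustrative, not from Tengine or MegPeak.
 *   freq_ghz       - core clock in GHz
 *   flops_per_lane - FLOPs per lane per instruction (2 for a fused multiply-add)
 *   lanes          - float lanes per SIMD register (4 for a 128-bit NEON register)
 * Assumes one such instruction is issued per cycle. */
double peak_gflops(double freq_ghz, int flops_per_lane, int lanes)
{
    return freq_ghz * flops_per_lane * lanes;
}
```

With the values from this page, `peak_gflops(1.8, 2, 4)` evaluates to roughly 14.4.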

  • Measured with MegPeak: fmla_x2 throughput: 1.116263 ns (14.333539 GFlops), latency: 7.808628 ns
  • Measured with the Tengine 16x4 kernel:
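void sgemm_A16_B4(float *mid_A, float *B, float *mid_B, float *C, int m, int n, int k)
{
    // Step through m in blocks of 16 and n in blocks of 4;
    // each kernel call covers one block over the full k depth.
    for (int i = 0; i < m; i += 16) {
        for (int j = 0; j < n; j += 4) {
            tengine_4x16_kernel(C, mid_B + j * k, mid_A + i * k, k);
        }
    }
}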

firefly@chun:~/chun/Tengine_gemm_tutorial/step3$ taskset 0x10 ./test 
[m n k]:        512 512 512
[tengine 4x16]: 22.13 ms        , GFLOPS = 12.129880

12.12988/14.4 = 0.8423
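The GFLOPS figure and the ratio above can be reproduced from the matrix sizes and the measured time. This is a minimal sketch, assuming the usual 2*m*n*k FLOP count for GEMM; the function names are hypothetical and do not come from the tutorial's test program.

```c
/* Achieved GFLOPS of an m x n x k GEMM that took time_ms milliseconds.
 * Assumes 2*m*n*k FLOPs: one multiply and one add per inner-product term.
 * Hypothetical helper, not part of the tutorial's test program. */
double gemm_gflops(int m, int n, int k, double time_ms)
{
    double flops = 2.0 * (double)m * (double)n * (double)k;
    return flops / (time_ms * 1e-3) / 1e9;
}

/* Fraction of the theoretical peak that was reached. */
double peak_fraction(double measured_gflops, double peak)
{
    return measured_gflops / peak;
}
```

For m = n = k = 512 and 22.13 ms, `gemm_gflops` gives about 12.13, and `peak_fraction(12.12988, 14.4)` gives about 0.842. The small difference from the printed 12.129880 comes from the time being rounded to two decimals in the log.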

✔️ This kernel attains 84.23% of peak performance.
