Applying simd to AnnUpdateSgradient #18

mingodad · 2018-06-22T08:49:12Z

It seems that gcc can't optimise this, so using SIMD here gives us a good performance gain.

mingodad · 2018-06-22T09:01:12Z

There is this answer on Stack Overflow that could be worth a try: https://stackoverflow.com/questions/17761154/sse-reduction-of-float-vector

If the input array is potentially large, it's worth having a scalar loop at the start, too, that runs 0-3 times until the input is aligned on a 16B boundary for the SSE loop. Then you won't have loads that cross cache/page lines slowing down your loop. And it can use ADDPS with a memory operand, which can potentially micro-fuse, reducing overhead. Also, you could get 2 or 4 dependency chains going, by using multiple accumulators, so your loop could sustain 1 vector FP add per cycle, instead of 1 per (latency of ADDPS = 3). – Peter Cordes Jul 5 '15 at 14:57
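For reference, the advice in that comment could be sketched roughly as follows. This is only an illustration of the technique (scalar prologue until 16-byte alignment, aligned loads, two independent accumulators), not the code in this PR; the function name `sum_floats_sse` is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE intrinsics */

/* Hypothetical helper (not part of nn.c): returns the sum of n floats. */
static float sum_floats_sse(const float *a, size_t n) {
    assert(a != NULL || n == 0);
    size_t i = 0;
    float head = 0.0f;

    /* Scalar prologue: runs until a+i sits on a 16-byte boundary
     * (0-3 iterations when the floats are naturally 4-byte aligned). */
    while (i < n && ((uintptr_t)(a + i) & 15) != 0)
        head += a[i++];

    /* Two accumulators give two independent dependency chains, so one
     * add does not have to wait for the previous add to finish. */
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        acc0 = _mm_add_ps(acc0, _mm_load_ps(a + i));     /* aligned load */
        acc1 = _mm_add_ps(acc1, _mm_load_ps(a + i + 4)); /* aligned load */
    }
    if (i + 4 <= n) { /* one leftover full vector */
        acc0 = _mm_add_ps(acc0, _mm_load_ps(a + i));
        i += 4;
    }

    /* Horizontal sum of the four lanes. */
    __m128 acc = _mm_add_ps(acc0, acc1);
    __m128 hi = _mm_movehl_ps(acc, acc);  /* lanes 2,3 moved to 0,1 */
    __m128 sum2 = _mm_add_ps(acc, hi);    /* lane0 = 0+2, lane1 = 1+3 */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    float total = head + _mm_cvtss_f32(_mm_add_ss(sum2, lane1));

    /* Scalar tail for the last 0-3 elements. */
    for (; i < n; i++)
        total += a[i];
    return total;
}
```

Note that the vector version adds the elements in a different order than a plain scalar loop, so results can differ in the last bits for general floating-point input.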

mingodad · 2018-06-22T09:20:12Z

It seems there is still something left to do, because there is not much difference between using AVX (__m256) and AVX512 (__m512).

Here are the first 10 outputs of nn-benchmark using "-march=native" on this machine:

cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
stepping	: 4
microcode	: 0x1
cpu MHz		: 2693.672
cache size	: 33792 KB
physical id	: 0
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves pku ospke
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 5387.34
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

./nn-benchmark-generic
[0] Error: 100.000000% -- 0 ms per cycle
[1] Error: 100.000000% -- 568 ms per cycle
[2] Error: 86.800003% -- 570 ms per cycle
[3] Error: 84.500000% -- 633 ms per cycle
[4] Error: 86.300003% -- 620 ms per cycle
[5] Error: 85.900002% -- 697 ms per cycle
[6] Error: 85.400002% -- 732 ms per cycle
[7] Error: 84.699997% -- 774 ms per cycle
[8] Error: 86.000000% -- 754 ms per cycle
[9] Error: 86.900002% -- 734 ms per cycle

./nn-benchmark-sse
[0] Error: 100.000000% -- 0 ms per cycle
[1] Error: 100.000000% -- 257 ms per cycle
[2] Error: 100.000000% -- 258 ms per cycle
[3] Error: 100.000000% -- 259 ms per cycle
[4] Error: 100.000000% -- 259 ms per cycle
[5] Error: 100.000000% -- 259 ms per cycle
[6] Error: 100.000000% -- 259 ms per cycle
[7] Error: 100.000000% -- 259 ms per cycle
[8] Error: 100.000000% -- 260 ms per cycle
[9] Error: 100.000000% -- 260 ms per cycle

./nn-benchmark-avx
[0] Error: 100.000000% -- 0 ms per cycle
[1] Error: 100.000000% -- 198 ms per cycle
[2] Error: 100.000000% -- 197 ms per cycle
[3] Error: 100.000000% -- 197 ms per cycle
[4] Error: 100.000000% -- 197 ms per cycle
[5] Error: 100.000000% -- 197 ms per cycle
[6] Error: 100.000000% -- 197 ms per cycle
[7] Error: 100.000000% -- 197 ms per cycle
[8] Error: 100.000000% -- 198 ms per cycle
[9] Error: 100.000000% -- 198 ms per cycle

./nn-benchmark-avx512
[0] Error: 100.000000% -- 0 ms per cycle
[1] Error: 100.000000% -- 163 ms per cycle
[2] Error: 100.000000% -- 163 ms per cycle
[3] Error: 100.000000% -- 163 ms per cycle
[4] Error: 100.000000% -- 163 ms per cycle
[5] Error: 100.000000% -- 163 ms per cycle
[6] Error: 100.000000% -- 163 ms per cycle
[7] Error: 100.000000% -- 162 ms per cycle
[8] Error: 100.000000% -- 162 ms per cycle
[9] Error: 100.000000% -- 162 ms per cycle
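To put numbers on the differences, here is a small sketch that computes the speedup ratios from the timings above. It assumes cycle [1] of each run is representative (the generic build drifts from 568 to 774 ms, so its ratio is a lower bound); the helper name `speedup` is made up for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup of a variant over a baseline, both given in ms per cycle. */
static double speedup(double baseline_ms, double variant_ms) {
    assert(variant_ms > 0.0);
    return baseline_ms / variant_ms;
}

/* Prints the ratios for the cycle [1] timings reported above. */
static void print_speedups(void) {
    const double generic = 568.0, sse = 257.0, avx = 198.0, avx512 = 163.0;
    printf("sse    vs generic: %.2fx\n", speedup(generic, sse));
    printf("avx    vs sse:     %.2fx\n", speedup(sse, avx));
    printf("avx512 vs avx:     %.2fx\n", speedup(avx, avx512));
}
```

With these inputs SSE is about 2.2x faster than the generic build, AVX about 1.3x faster than SSE, and AVX512 only about 1.2x faster than AVX, far from the 2x that doubling the vector width might suggest.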

Makefile:

all: nn-test-1 nn-test-2 nn-benchmark-generic \
    nn-benchmark-sse nn-benchmark-sse-pg \
    nn-benchmark-avx nn-benchmark-avx512

nn-test-1: nn-test-1.c ../nn.c ../nn.h
	$(CC) nn-test-1.c ../nn.c -DUSE_SSE -march=native -Wall -W -O2 -o nn-test-1 -lm

nn-test-2: nn-test-2.c ../nn.c ../nn.h
	$(CC) nn-test-2.c ../nn.c -Wall -W -O2 -o nn-test-2 -lm

nn-benchmark-generic: nn-benchmark.c ../nn.c ../nn.h
	$(CC) nn-benchmark.c ../nn.c -march=native -Wall -W -O3 -o nn-benchmark-generic -lm

nn-benchmark-sse-pg: nn-benchmark.c ../nn.c ../nn.h
	$(CC) -DUSE_SSE -march=native nn-benchmark.c ../nn.c -Wall -W -O2 -pg -g -o nn-benchmark-sse-pg -lm

nn-benchmark-sse: nn-benchmark.c ../nn.c ../nn.h
	$(CC) -DUSE_SSE -march=native nn-benchmark.c ../nn.c -Wall -W -O3 -o nn-benchmark-sse -lm

nn-benchmark-avx: nn-benchmark.c ../nn.c ../nn.h
	$(CC) -DUSE_AVX -march=native nn-benchmark.c ../nn.c -Wall -W -O3 -o nn-benchmark-avx -lm

nn-benchmark-avx512: nn-benchmark.c ../nn.c ../nn.h
	$(CC) -DUSE_AVX512 -march=native nn-benchmark.c ../nn.c -Wall -W -O3 -o nn-benchmark-avx512 -lm

clean:
	rm -f nn-test-1 nn-test-2 nn-benchmark-generic \
    nn-benchmark-sse nn-benchmark-sse-pg nn-benchmark-avx nn-benchmark-avx512

mingodad · 2018-06-22T09:39:05Z

Here is the nn-benchmark output on a low-end CPU:

cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 28
model name	: Intel(R) Atom(TM) CPU N570   @ 1.66GHz
stepping	: 10
microcode	: 0x107
cpu MHz		: 1667.000
cache size	: 512 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm movbe lahf_lm retpoline kaiser tpr_shadow vnmi flexpriority dtherm
bugs		: cpu_meltdown spectre_v1 spectre_v2
bogomips	: 3333.17
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

./nn-benchmark-generic
[0] Error: 100.000000% -- 0 ms per cycle
[1] Error: 100.000000% -- 3465 ms per cycle
[2] Error: 86.800003% -- 3469 ms per cycle
[3] Error: 84.500000% -- 5628 ms per cycle
[4] Error: 86.300003% -- 5105 ms per cycle
[5] Error: 85.900002% -- 6924 ms per cycle
[6] Error: 85.400002% -- 7342 ms per cycle
[7] Error: 84.699997% -- 8383 ms per cycle
[8] Error: 86.000000% -- 8042 ms per cycle
[9] Error: 86.900002% -- 7864 ms per cycle

./nn-benchmark-sse
[0] Error: 100.000000% -- 0 ms per cycle
[1] Error: 100.000000% -- 2559 ms per cycle
[2] Error: 100.000000% -- 2551 ms per cycle
[3] Error: 100.000000% -- 2548 ms per cycle
[4] Error: 100.000000% -- 2547 ms per cycle
[5] Error: 100.000000% -- 2545 ms per cycle
[6] Error: 100.000000% -- 2546 ms per cycle
[7] Error: 100.000000% -- 2545 ms per cycle
[8] Error: 100.000000% -- 2545 ms per cycle
[9] Error: 100.000000% -- 2544 ms per cycle

mingodad · 2018-06-22T09:54:48Z

And on a Nexus5:

./nn-benchmark-generic
[0] Error: 100.000000% -- 0 ms per cycle
[1] Error: 100.000000% -- 4029 ms per cycle
...
./nn-benchmark-neon
[0] Error: 100.000000% -- 0 ms per cycle
[1] Error: 100.000000% -- 2028 ms per cycle

Commit 3fae82b: It seems that gcc can't optimize this so using simd here give us a good performance gain.
