Applying simd to AnnUpdateSgradient #18

Open
wants to merge 1 commit into
base: master

mingodad
Contributor

It seems that gcc can't optimise this, so using SIMD here gives us a good performance gain.

@mingodad
Contributor Author

There is this answer on Stack Overflow that could be worth trying: https://stackoverflow.com/questions/17761154/sse-reduction-of-float-vector

If the input array is potentially large, it's worth having a scalar loop at the start, too, that runs 0-3 times until the input is aligned on a 16B boundary for the SSE loop. Then you won't have loads that cross cache/page lines slowing down your loop. And it can use ADDPS with a memory operand, which can potentially micro-fuse, reducing overhead. Also, you could get 2 or 4 dependency chains going, by using multiple accumulators, so your loop could sustain 1 vector FP add per cycle, instead of 1 per (latency of ADDPS = 3). – Peter Cordes Jul 5 '15 at 14:57
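The two ideas in that comment (a scalar prologue that runs until the pointer is 16-byte aligned, and more than one accumulator so the adds form independent dependency chains) can be sketched roughly as below. This is only an illustration of the technique for a plain float sum, not code from this PR: `sum_floats` is a hypothetical helper name, and the scalar fallback is an assumption added so the sketch also builds where SSE is not available.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper (not part of nn.c): sum n floats starting at p. */
static float sum_floats(const float *p, size_t n)
{
    float total = 0.0f;
    size_t i = 0;
#if defined(__SSE__)
    /* Scalar prologue: runs 0-3 times until p + i is 16-byte aligned. */
    while (i < n && ((uintptr_t)(p + i) & 15u) != 0) {
        total += p[i++];
    }
    /* Two accumulators give two independent chains of vector adds. */
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        acc0 = _mm_add_ps(acc0, _mm_load_ps(p + i));     /* aligned load */
        acc1 = _mm_add_ps(acc1, _mm_load_ps(p + i + 4)); /* aligned load */
    }
    /* Horizontal sum of the combined accumulator (a0, a1, a2, a3). */
    __m128 acc = _mm_add_ps(acc0, acc1);
    __m128 hi = _mm_movehl_ps(acc, acc);  /* (a2, a3, a2, a3) */
    __m128 sum2 = _mm_add_ps(acc, hi);    /* lane0 = a0+a2, lane1 = a1+a3 */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    total += _mm_cvtss_f32(_mm_add_ss(sum2, lane1));
#endif
    /* Scalar tail (or the whole array when SSE is unavailable). */
    for (; i < n; i++) {
        total += p[i];
    }
    return total;
}
```

Calling `sum_floats(buf + 1, n - 1)` on a 16-byte aligned `buf` exercises the prologue. Note that reordering the additions this way can change the rounding of the result slightly compared with a strictly sequential scalar loop.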

@mingodad
Contributor Author

mingodad commented Jun 22, 2018

It seems there is still something to do, because there is not much difference between using AVX (__m256) and AVX512 (__m512).

Here are the first 10 outputs of nn-benchmark using "-march=native" on this machine:

cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
stepping	: 4
microcode	: 0x1
cpu MHz		: 2693.672
cache size	: 33792 KB
physical id	: 0
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves pku ospke
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 5387.34
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:
./nn-benchmark-generic
[0] Error: 100.000000% -- 0 ms per cycle
[1] Error: 100.000000% -- 568 ms per cycle
[2] Error: 86.800003% -- 570 ms per cycle
[3] Error: 84.500000% -- 633 ms per cycle
[4] Error: 86.300003% -- 620 ms per cycle
[5] Error: 85.900002% -- 697 ms per cycle
[6] Error: 85.400002% -- 732 ms per cycle
[7] Error: 84.699997% -- 774 ms per cycle
[8] Error: 86.000000% -- 754 ms per cycle
[9] Error: 86.900002% -- 734 ms per cycle

./nn-benchmark-sse
[0] Error: 100.000000% -- 0 ms per cycle
[1] Error: 100.000000% -- 257 ms per cycle
[2] Error: 100.000000% -- 258 ms per cycle
[3] Error: 100.000000% -- 259 ms per cycle
[4] Error: 100.000000% -- 259 ms per cycle
[5] Error: 100.000000% -- 259 ms per cycle
[6] Error: 100.000000% -- 259 ms per cycle
[7] Error: 100.000000% -- 259 ms per cycle
[8] Error: 100.000000% -- 260 ms per cycle
[9] Error: 100.000000% -- 260 ms per cycle

./nn-benchmark-avx
[0] Error: 100.000000% -- 0 ms per cycle
[1] Error: 100.000000% -- 198 ms per cycle
[2] Error: 100.000000% -- 197 ms per cycle
[3] Error: 100.000000% -- 197 ms per cycle
[4] Error: 100.000000% -- 197 ms per cycle
[5] Error: 100.000000% -- 197 ms per cycle
[6] Error: 100.000000% -- 197 ms per cycle
[7] Error: 100.000000% -- 197 ms per cycle
[8] Error: 100.000000% -- 198 ms per cycle
[9] Error: 100.000000% -- 198 ms per cycle

./nn-benchmark-avx512
[0] Error: 100.000000% -- 0 ms per cycle
[1] Error: 100.000000% -- 163 ms per cycle
[2] Error: 100.000000% -- 163 ms per cycle
[3] Error: 100.000000% -- 163 ms per cycle
[4] Error: 100.000000% -- 163 ms per cycle
[5] Error: 100.000000% -- 163 ms per cycle
[6] Error: 100.000000% -- 163 ms per cycle
[7] Error: 100.000000% -- 162 ms per cycle
[8] Error: 100.000000% -- 162 ms per cycle
[9] Error: 100.000000% -- 162 ms per cycle
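
In the runs above, the generic build's error drops to roughly 85-87% while the SSE, AVX and AVX512 builds report 100% on every cycle, so it may be worth checking that the SIMD path produces the same values as the scalar one before comparing timings. A minimal sketch of such a check follows. It assumes the update is an element-wise accumulate (`dst[i] += src[i]`), which is a guess and not taken from this PR, and all three function names are hypothetical.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Reference scalar update (assumed shape: dst[i] += src[i]). */
static void accumulate_scalar(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] += src[i];
    }
}

/* Same update, four floats at a time where SSE is available. */
static void accumulate_simd(float *dst, const float *src, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 d = _mm_loadu_ps(dst + i); /* unaligned loads: no alignment assumed */
        __m128 s = _mm_loadu_ps(src + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(d, s));
    }
#endif
    /* Scalar tail for the remaining 0-3 elements. */
    for (; i < n; i++) {
        dst[i] += src[i];
    }
}

/* Largest absolute element-wise difference between two arrays. */
static float max_abs_diff(const float *a, const float *b, size_t n)
{
    float worst = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = a[i] - b[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

Running both versions on copies of the same input and checking `max_abs_diff` against a small tolerance (exactly zero for a pure element-wise add like this one) would show whether the two paths diverge.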

Makefile:

all: nn-test-1 nn-test-2 nn-benchmark-generic \
    nn-benchmark-sse nn-benchmark-sse-pg \
    nn-benchmark-avx nn-benchmark-avx512

nn-test-1: nn-test-1.c ../nn.c ../nn.h
	$(CC) nn-test-1.c ../nn.c -DUSE_SSE -march=native -Wall -W -O2 -o nn-test-1 -lm

nn-test-2: nn-test-2.c ../nn.c ../nn.h
	$(CC) nn-test-2.c ../nn.c -Wall -W -O2 -o nn-test-2 -lm

nn-benchmark-generic: nn-benchmark.c ../nn.c ../nn.h
	$(CC) nn-benchmark.c ../nn.c -march=native -Wall -W -O3 -o nn-benchmark-generic -lm

nn-benchmark-sse-pg: nn-benchmark.c ../nn.c ../nn.h
	$(CC) -DUSE_SSE -march=native nn-benchmark.c ../nn.c -Wall -W -O2 -pg -g -o nn-benchmark-sse-pg -lm

nn-benchmark-sse: nn-benchmark.c ../nn.c ../nn.h
	$(CC) -DUSE_SSE -march=native nn-benchmark.c ../nn.c -Wall -W -O3 -o nn-benchmark-sse -lm

nn-benchmark-avx: nn-benchmark.c ../nn.c ../nn.h
	$(CC) -DUSE_AVX -march=native nn-benchmark.c ../nn.c -Wall -W -O3 -o nn-benchmark-avx -lm

nn-benchmark-avx512: nn-benchmark.c ../nn.c ../nn.h
	$(CC) -DUSE_AVX512 -march=native nn-benchmark.c ../nn.c -Wall -W -O3 -o nn-benchmark-avx512 -lm

clean:
	rm -f nn-test-1 nn-test-2 nn-benchmark-generic \
    nn-benchmark-sse nn-benchmark-sse-pg nn-benchmark-avx nn-benchmark-avx512

@mingodad
Contributor Author

Here is the nn-benchmark output on a low-end CPU:

cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 28
model name	: Intel(R) Atom(TM) CPU N570   @ 1.66GHz
stepping	: 10
microcode	: 0x107
cpu MHz		: 1667.000
cache size	: 512 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm movbe lahf_lm retpoline kaiser tpr_shadow vnmi flexpriority dtherm
bugs		: cpu_meltdown spectre_v1 spectre_v2
bogomips	: 3333.17
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

./nn-benchmark-generic
[0] Error: 100.000000% -- 0 ms per cycle
[1] Error: 100.000000% -- 3465 ms per cycle
[2] Error: 86.800003% -- 3469 ms per cycle
[3] Error: 84.500000% -- 5628 ms per cycle
[4] Error: 86.300003% -- 5105 ms per cycle
[5] Error: 85.900002% -- 6924 ms per cycle
[6] Error: 85.400002% -- 7342 ms per cycle
[7] Error: 84.699997% -- 8383 ms per cycle
[8] Error: 86.000000% -- 8042 ms per cycle
[9] Error: 86.900002% -- 7864 ms per cycle

./nn-benchmark-sse
[0] Error: 100.000000% -- 0 ms per cycle
[1] Error: 100.000000% -- 2559 ms per cycle
[2] Error: 100.000000% -- 2551 ms per cycle
[3] Error: 100.000000% -- 2548 ms per cycle
[4] Error: 100.000000% -- 2547 ms per cycle
[5] Error: 100.000000% -- 2545 ms per cycle
[6] Error: 100.000000% -- 2546 ms per cycle
[7] Error: 100.000000% -- 2545 ms per cycle
[8] Error: 100.000000% -- 2545 ms per cycle
[9] Error: 100.000000% -- 2544 ms per cycle


@mingodad
Contributor Author

And on a Nexus5:

./nn-benchmark-generic
[0] Error: 100.000000% -- 0 ms per cycle
[1] Error: 100.000000% -- 4029 ms per cycle
...
./nn-benchmark-neon
[0] Error: 100.000000% -- 0 ms per cycle
[1] Error: 100.000000% -- 2028 ms per cycle
