Applying simd to AnnUpdateSgradient #18
Conversation
…od performance gain.
There is an answer on Stack Overflow that could be worth trying: https://stackoverflow.com/questions/17761154/sse-reduction-of-float-vector

"If the input array is potentially large, it's worth having a scalar loop at the start, too, that runs 0-3 times until the input is aligned on a 16B boundary for the SSE loop. Then you won't have loads that cross cache/page lines slowing down your loop. And it can use ADDPS with a memory operand, which can potentially micro-fuse, reducing overhead. Also, you could get 2 or 4 dependency chains going, by using multiple accumulators, so your loop could sustain 1 vector FP add per cycle, instead of 1 per (latency of ADDPS = 3)." – Peter Cordes, Jul 5 '15 at 14:57
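To make that advice concrete, here is a minimal sketch of such a reduction with a scalar alignment prologue and four independent accumulators. The function name sse_sum and its standalone shape are illustrative assumptions, not code from this project:

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE intrinsics */

/* Sum a float array: the scalar prologue runs 0-3 times until x is
 * 16-byte aligned, then the SSE loop keeps four independent
 * accumulators so it is not serialized on ADDPS latency. */
float sse_sum(const float *x, size_t n) {
    float sum = 0.0f;
    /* Scalar prologue until x sits on a 16B boundary. */
    while (((uintptr_t)x & 15) && n) {
        sum += *x++;
        n--;
    }
    __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps(), acc3 = _mm_setzero_ps();
    /* Main loop: 16 floats per iteration, 4 dependency chains. */
    while (n >= 16) {
        acc0 = _mm_add_ps(acc0, _mm_load_ps(x));
        acc1 = _mm_add_ps(acc1, _mm_load_ps(x + 4));
        acc2 = _mm_add_ps(acc2, _mm_load_ps(x + 8));
        acc3 = _mm_add_ps(acc3, _mm_load_ps(x + 12));
        x += 16;
        n -= 16;
    }
    acc0 = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
    /* Horizontal sum of the four lanes of acc0. */
    __m128 shuf = _mm_shuffle_ps(acc0, acc0, _MM_SHUFFLE(2, 3, 0, 1));
    acc0 = _mm_add_ps(acc0, shuf);
    shuf = _mm_movehl_ps(shuf, acc0);
    acc0 = _mm_add_ss(acc0, shuf);
    sum += _mm_cvtss_f32(acc0);
    /* Scalar epilogue for the remaining 0-15 elements. */
    while (n--) sum += *x++;
    return sum;
}
```

Note that the multiple accumulators change the order of the floating-point additions, so the result can differ in the last bits from a plain scalar sum.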
It seems there is still room for improvement, because there is not much difference between using AVX (__m256) and AVX-512 (__m512). Here are the first 10 outputs of nn-benchmark using "-march=native" on this machine:
Makefile:
Here is nn-benchmark on a low-end CPU (./nn-benchmark-generic vs ./nn-benchmark-sse):
And on a Nexus 5:
It seems that gcc can't auto-vectorise this, so using SIMD here gives us a good performance gain.
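For reference, the loop shape involved is an accumulate-into-gradient pattern, which gcc often fails to vectorise on its own. Below is a hypothetical AVX sketch of such a loop; the name update_sgradient_avx and the sgrad/input/g parameters are assumptions for illustration, not the actual AnnUpdateSgradient signature:

```c
#include <stddef.h>
#include <immintrin.h> /* AVX intrinsics */

/* Hypothetical sketch: sgrad[i] += g * input[i], vectorised with AVX.
 * Names are illustrative, not this project's API. */
void update_sgradient_avx(float *sgrad, const float *input, float g, size_t n) {
    __m256 vg = _mm256_set1_ps(g); /* broadcast the scalar gradient */
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 in = _mm256_loadu_ps(input + i);
        __m256 sg = _mm256_loadu_ps(sgrad + i);
        _mm256_storeu_ps(sgrad + i, _mm256_add_ps(sg, _mm256_mul_ps(vg, in)));
    }
    for (; i < n; i++) /* scalar tail for the last 0-7 elements */
        sgrad[i] += g * input[i];
}
```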