Add another loop unroll and enable VectorCombine #48271
Conversation
It would be nice if the loop vectorizer could unroll these loops for us, but it seems it doesn't like doing that for things like memmove/memcpy.
@nanosoldier |
Could you update the newpm pipeline as well? |
I just want to do a bit more testing to make sure we aren't unrolling too much. On my example it does an 8x unroll where clang does a 4x, so we might want to limit the pass a bit.
We'll see what happens after nanosoldier returns, but I do recall that runtime unrolling on the newpm resulted in bad things happening on nanosoldier because it doesn't have AVX, iirc (some 128-element vector appeared in the union benchmark, I think).
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. |
I tried a couple of the tests locally and the regressions and improvements seem to be mostly accurate. So I guess it's kind of a tradeoff. I kind of wish we could limit how much this pass can unroll, basically making it so that it either does a 4x unroll or doesn't unroll at all.
Can we see the effect of just VectorCombine? That alone might give us the speedups we want.
Also, loop-unroll does have tuning options via command line flags or constructor arguments. |
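To illustrate the constructor-argument route (just a sketch, not the actual change): in the new PM the knobs go through `LoopUnrollOptions` from `llvm/Transforms/Scalar/LoopUnrollPass.h`, and the settings below are placeholders rather than a concrete proposal.

```cpp
// Sketch only: tuning the new-PM unroll pass via LoopUnrollOptions.
// The specific settings are placeholders, not a concrete proposal.
#include "llvm/IR/PassManager.h"
#include "llvm/Transforms/Scalar/LoopUnrollPass.h"

using namespace llvm;

void addTunedUnroll(FunctionPassManager &FPM, unsigned OptLevel) {
  LoopUnrollOptions Opts(OptLevel);
  Opts.setRuntime(true);  // allow unrolling loops with non-constant trip counts
  Opts.setPartial(true);  // allow partial unrolling
  Opts.setPeeling(false); // disable peeling, just as an example of another knob
  FPM.addPass(LoopUnrollPass(Opts));
}
```

For quick experiments there are also opt flags along the lines of `-unroll-runtime`, `-unroll-count=<n>` and `-unroll-max-count=<n>`, if I remember the names right.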
I did check, and some of the speedups, for example in SIMD, are due to the unrolling; the regressions in SIMD are also due to it.
The regression in #48256 (comment) is only partially fixed by #48256. The first PR does enable the use of SIMD instructions via memmove, but it does not unroll the loop as it did before, and that is for two reasons.

This PR adds another LoopUnroll pass, with runtime unrolling enabled, after SLP vectorization, similar to what clang does.
Also enable VectorCombine, because it does some nice things and has been a TODO for a couple of years.
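Roughly, the shape of the change is something like the sketch below (not the actual pipeline code; the helper name `addPostVectorizePasses` is made up for illustration):

```cpp
// Sketch of the idea, not the real diff: after SLP vectorization, run
// VectorCombine and then a second LoopUnroll with runtime unrolling enabled.
#include "llvm/IR/PassManager.h"
#include "llvm/Transforms/Scalar/LoopUnrollPass.h"
#include "llvm/Transforms/Vectorize/SLPVectorizer.h"
#include "llvm/Transforms/Vectorize/VectorCombine.h"

using namespace llvm;

void addPostVectorizePasses(FunctionPassManager &FPM, unsigned OptLevel) {
  FPM.addPass(SLPVectorizerPass());
  FPM.addPass(VectorCombinePass());   // the pass that has been a TODO for a while
  FPM.addPass(LoopUnrollPass(         // second unroll, with runtime unrolling enabled
      LoopUnrollOptions(OptLevel).setRuntime(true)));
}
```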
We may also want to wait until we get LLVM 15, because a second unrolling pass might come with a compile-time hit that was fixed in LLVM 15: https://discourse.llvm.org/t/loop-unrolling-in-large-functions-and-compile-time/67164