
Conversation

@gbaraldi
Member

@gbaraldi gbaraldi commented Jan 13, 2023

The regression in #48256 (comment) is only partially fixed by #48256. The first PR does enable the use of SIMD instructions via memmove, but it no longer unrolls the loop as it did before, for two reasons:

  • First, we run LoopUnroll quite early in the pipeline, so it sees more complicated code.
  • Second, LLVM treats our loops as having runtime trip counts and does not unroll them unless we specifically allow runtime unrolling.

This PR adds another LoopUnroll pass, with runtime unrolling enabled, after SLP vectorization, similar to what clang does.

It also enables VectorCombine, which does some nice things and has been a TODO for a couple of years.
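
For reference, here is a rough sketch of what the change could look like in an LLVM new-pass-manager function pipeline. This is not the exact diff from this PR; the helper name, the surrounding passes, and the option values are assumptions for illustration.

```cpp
#include "llvm/IR/PassManager.h"
#include "llvm/Transforms/Scalar/LoopUnrollPass.h"
#include "llvm/Transforms/Vectorize/SLPVectorizer.h"
#include "llvm/Transforms/Vectorize/VectorCombine.h"

using namespace llvm;

// Sketch (not the actual Julia pipeline code): late vectorization section
// of a function pass pipeline.
static void addLateVectorizationPasses(FunctionPassManager &FPM, int OptLevel) {
    // Clean up and combine vector code ahead of SLP vectorization.
    FPM.addPass(VectorCombinePass());
    FPM.addPass(SLPVectorizerPass());
    // A second LoopUnroll run after SLP vectorize, this time with runtime
    // unrolling explicitly allowed (the key difference from the earlier run).
    FPM.addPass(LoopUnrollPass(LoopUnrollOptions(OptLevel).setRuntime(true)));
}
```

The legacy pass manager has equivalent entry points (createVectorCombinePass() and createLoopUnrollPass(), the latter taking a Runtime argument), so the same idea applies to both pipelines.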

We may also want to wait until we get LLVM 15, because a second unrolling pass might come with a compile-time hit that was fixed in 15: https://discourse.llvm.org/t/loop-unrolling-in-large-functions-and-compile-time/67164

@gbaraldi gbaraldi requested review from vchuravy and vtjnash January 13, 2023 20:22
@gbaraldi
Member Author

It would be nice if loop vectorize could unroll these loops for us, but it seems it doesn't like doing it for things like memmove/memcpy.

@gbaraldi
Member Author

@nanosoldier runbenchmarks(ALL, vs=":master")

@pchintalapudi
Member

Could you update the newpm pipeline as well?

@gbaraldi
Member Author

gbaraldi commented Jan 13, 2023

I just want to do a bit more testing to see whether we aren't unrolling too much. On my example it does an 8x unroll where clang does 4x, so we might want to limit the pass a bit.

@pchintalapudi
Member

We'll see what happens after nanosoldier returns, but I do recall that runtime unrolling on the newpm pipeline resulted in bad things happening on nanosoldier because it doesn't have AVX, IIRC (some 128-element vector appeared in the union benchmark, I think).

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@gbaraldi
Member Author

I tried a couple of the tests locally, and the regressions and improvements seem to be mostly real. So I guess it's kind of a tradeoff. I kinda wish we could limit how much this pass can unroll, basically making it either do a 4x unroll or not unroll at all.

@pchintalapudi
Member

Can we see the effect of just VectorCombine? That alone might give the speedups we want.

@pchintalapudi
Member

Also, loop-unroll does have tuning options via command line flags or constructor arguments.
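
For example (an illustrative sketch, not the configuration this PR ships; the function name and the specific option values are assumptions), LoopUnrollOptions can be used to keep runtime unrolling while turning off the other unrolling modes:

```cpp
#include "llvm/Transforms/Scalar/LoopUnrollPass.h"

using namespace llvm;

// Sketch of a more conservative configuration for the second unroll run.
// The particular combination of options here is illustrative only.
static LoopUnrollOptions conservativeUnrollOpts(int OptLevel) {
    return LoopUnrollOptions(OptLevel)
        .setRuntime(true)      // still allow loops with runtime trip counts
        .setPartial(false)     // but no partial unrolling
        .setUpperBound(false)  // no upper-bound unrolling
        .setPeeling(false);    // and no loop peeling
}
```

On the flag side, the unroller also has cl::opt knobs such as -unroll-runtime and -unroll-max-count that are handy for experimenting with a cap like 4x before hard-coding anything.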

@gbaraldi
Member Author

I did check, and some of the speedups, for example in SIMD, are due to the unrolling; the regressions in SIMD are also due to it.
Yeah, I was playing with the settings.

@gbaraldi gbaraldi closed this Jul 10, 2023
