Add strength reduction benchmarks #4317
Merged
This adds strength reduction benchmarks for arrays of a few different element sizes, motivated by the differences in codegen. Each element size changes how the address of an element is computed before it is loaded. For x64, the current codegen looks like this (element size in bytes, followed by the instructions used per access):
2: load
3: lea + load
4: load
8: load
12: lea + load
16: shl + load
29: imul + load
Each size has three benchmark variants: an array version, a span version, and a manually, fully strength-reduced version. The JIT is expected to be able to transform the array version into the strength-reduced version soon. The span version will also be transformed, but not quite all the way (strength reduction will not be able to fold in the base byref of the span).
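To make the difference between the indexed and the manually strength-reduced variants concrete, here is a minimal sketch in C. It is only an analogue of the idea, not the benchmark code from this PR: the struct `S12` and the functions `sum_indexed` and `sum_reduced` are hypothetical names, and the 12-byte element stands in for the size-12 case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 12-byte element, standing in for the size-12 case. */
struct S12 { int32_t a, b, c; };
static_assert(sizeof(struct S12) == 12, "element must be 12 bytes");

/* Indexed form: each access conceptually computes base + i * 12,
   which is where the extra address arithmetic comes from. */
int64_t sum_indexed(const struct S12 *arr, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += arr[i].a;
    return sum;
}

/* Manually strength-reduced form: the per-iteration multiply is
   replaced by advancing a running pointer by one element (12 bytes). */
int64_t sum_reduced(const struct S12 *arr, size_t n)
{
    int64_t sum = 0;
    const struct S12 *end = arr + n;
    for (const struct S12 *p = arr; p != end; p++)
        sum += p->a;
    return sum;
}
```

Both functions return the same sum; only the way the element address is derived differs. An optimizing C compiler may already turn the first form into the second, which is the transformation the JIT is expected to learn for the array variant.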
There is one current annoyance to work around in the JIT: we do not align the strength-reduced versions of the loops because they end up being "too small", meaning that they still fit within a single cache line. However, it turns out alignment is still beneficial in these cases, and this skews the results compared to the non-strength-reduced versions. I have opened dotnet/runtime#104665 about this. To work around the problem in these benchmarks, I have added a superfluous bitwise OR operation to the body of every loop.
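The shape of that workaround can be sketched as follows, again as a C analogue rather than the actual benchmark code. The function `sum_reduced_padded` and its `zero` parameter are hypothetical; the assumption is that the extra operand is always 0, so the OR changes nothing about the result and only adds an instruction to the loop body.

```c
#include <stddef.h>
#include <stdint.h>

/* Strength-reduced loop with a superfluous OR in the body.
   `zero` is assumed to be 0 at every call site, so `*p | zero`
   equals `*p`; the operation exists only to enlarge the loop. */
int64_t sum_reduced_padded(const int32_t *arr, size_t n, int32_t zero)
{
    int64_t sum = 0;
    const int32_t *end = arr + n;
    for (const int32_t *p = arr; p != end; p++)
        sum += *p | zero;
    return sum;
}
```

Adding the same operation to every variant keeps the comparison fair, since all loops pay for it equally.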
On my Intel CPU the current results are: