Description
Probably, inefficient code generation on amd64.
Probably a slight regression in the performance of the resulting code beetween g.14.6 and master (8696ae8).
Preface
Examining the assembly code listing, I found out that the compiler generates sub-optimal code for some part of the code.
I will not describe the whole journey. I found a code (code from benchmark game) illustrating the problem and will use it for an example.
All source code: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/nbody-go-3.html
Let's pay attention to these lines (60, 61, 62), it's "advance" func:
dx := sys.s[i].x - sys.s[j].x
dy := sys.s[i].y - sys.s[j].y
dz := sys.s[i].z - sys.s[j].z
Let's take a look at the result of the code generation:
main.go:60 LEAQ 0(CX)(CX*2), DI // DI = i, BX = j
main.go:60 LEAQ main.sys+160(SB), R8
main.go:60 MOVSD_XMM 0(R8)(DI*8), X4 // load x +0x0
main.go:60 LEAQ 0(BX)(BX*2), R9 // calculate offset for .x
main.go:60 LEAQ 0(R8)(R9*8), R10 // sys.s[j] + offset
main.go:60 SUBSD 0(R10), X4
main.go:61 MOVSD_XMM 0x8(R8)(DI*8), X5 // load y +0x8
main.go:61 LEAQ 0(R8)(R9*8), R10
main.go:61 LEAQ 0x8(R10), R10
main.go:61 SUBSD 0(R10), X5
main.go:62 MOVSD_XMM 0x10(R8)(DI*8), X6 // load z +0x10
main.go:62 LEAQ 0(R8)(R9*8), DI
main.go:62 LEAQ 0x10(DI), DI
main.go:62 SUBSD 0(DI), X6
We need one instruction to load the value of the left operand and two to calculate the address of the right one.
I rolled back to version go1.14 and found that the codegen there is different. For the same lines, the following code will be generated:
main.go:60 0x49f77d LEAQ 0(CX)(CX*2), DI // DI = 0
main.go:60 0x49f781 LEAQ main.sys+160(SB), R8 // R8 = Positions addr
main.go:60 0x49f788 MOVSD_XMM 0(R8)(DI*8), X4 // load sys.s[i].x
main.go:60 0x49f78e LEAQ 0(BX)(BX*2), R9
main.go:60 0x49f792 MOVSD_XMM 0(R8)(R9*8), X5 // load sys.s[j].x
main.go:60 0x49f798 SUBSD X5, X4
main.go:61 0x49f79c MOVSD_XMM 0x8(R8)(DI*8), X5
main.go:61 0x49f7a3 MOVSD_XMM 0x8(R8)(R9*8), X6 // dy := sys.s[i].y - sys.s[j].y
main.go:61 0x49f7aa SUBSD X6, X5
main.go:62 0x49f7ae MOVSD_XMM 0x10(R8)(DI*8), X6
main.go:62 0x49f7b5 MOVSD_XMM 0x10(R8)(R9*8), X7 // dz := sys.s[i].z - sys.s[j].z
main.go:62 0x49f7bc SUBSD X7, X6
It uses one instruction to load both the left and right side of the expression.
How does it affect code performance?
It depends on how often such a code generation pattern is used in the code and in how hot the pieces of code are.
But we can write benchmark:
package main
import (
"testing"
)
func BenchmarkT(b *testing.B) {
offsetMomentum()
c := 500000
b.ResetTimer()
for n := 0; n < b.N; n++ {
for i := 0; i < c; i++ {
advance(0.01)
}
}
_ = energy()
}
Results:
$ bin for i in {1..20}; do go.1.14 test -bench=. >> old.txt; done
$ bin for i in {1..20}; do go.master test -bench=. >> new.txt; done
$ bin benchstat old.txt new.txt
name old time/op new time/op delta
T-4 48.9ms ± 5% 54.5ms ±10% +11.46% (p=0.000 n=19+19)
When did the issue start?
git bisect report beetween master(8696ae8) and go1.14:
... skip all steps
98cb76799c3053e779c4e1b61bb50705d25dd77f is the first bad commit
commit 98cb76799c3053e779c4e1b61bb50705d25dd77f
Author: Keith Randall <khr@golang.org>
Date: Thu Jan 30 10:17:01 2020 -0800
cmd/compile: insert complicated x86 addressing modes as a separate pass
Use a separate compiler pass to introduce complicated x86 addressing
modes. Loads in the normal architecture rules (for x86 and all other
platforms) can have constant offsets (AuxInt values) and symbols (Aux
values), but no more.
The complex addressing modes (x+y, x+2*y, etc.) are introduced in a
separate pass that combines loads with LEAQx ops.
Organizing rewrites this way simplifies the number of rewrites
required, as there are lots of different rule orderings that have to
be specified to ensure these complex addressing modes are always found
if they are possible.
Update #36468
Change-Id: I5b4bf7b03a1e731d6dfeb9ef19b376175f3b4b44
Reviewed-on: https://go-review.googlesource.com/c/go/+/217097
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
src/cmd/compile/internal/ssa/addressingmodes.go | 154 +
src/cmd/compile/internal/ssa/compile.go | 1 +
src/cmd/compile/internal/ssa/gen/AMD64.rules | 699 +-
src/cmd/compile/internal/ssa/rewrite.go | 40 +
src/cmd/compile/internal/ssa/rewriteAMD64.go | 10427 ++++++----------------
test/codegen/memops.go | 88 +
6 files changed, 3233 insertions(+), 8176 deletions(-)
create mode 100644 src/cmd/compile/internal/ssa/addressingmodes.go
Issue linked with changes - #36468
/cc @randall77