cmd/compile: inefficient code generation on amd64

### Probably, inefficient code generation on amd64.

Probably a slight regression in the performance of the resulting code beetween g.14.6 and master (8696ae82c9).

#### Preface

Examining the assembly code listing, I found out that the compiler generates sub-optimal code for some part of the code.
I will not describe the whole journey. I found a code (code from benchmark game) illustrating the problem and will use it for an example.

All source code: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/nbody-go-3.html

Let's pay attention to these lines (60, 61, 62), it's "advance" func:

```go
dx := sys.s[i].x - sys.s[j].x
dy := sys.s[i].y - sys.s[j].y
dz := sys.s[i].z - sys.s[j].z
```

Let's take a look at the result of the code generation:

```
  main.go:60    LEAQ 0(CX)(CX*2), DI         // DI = i, BX = j
  main.go:60    LEAQ main.sys+160(SB), R8

  main.go:60    MOVSD_XMM 0(R8)(DI*8), X4    // load x +0x0
  main.go:60    LEAQ 0(BX)(BX*2), R9         // calculate offset for .x
  main.go:60    LEAQ 0(R8)(R9*8), R10        // sys.s[j] + offset
  main.go:60    SUBSD 0(R10), X4

  main.go:61    MOVSD_XMM 0x8(R8)(DI*8), X5  // load y +0x8
  main.go:61    LEAQ 0(R8)(R9*8), R10
  main.go:61    LEAQ 0x8(R10), R10
  main.go:61    SUBSD 0(R10), X5

  main.go:62    MOVSD_XMM 0x10(R8)(DI*8), X6 // load z +0x10
  main.go:62    LEAQ 0(R8)(R9*8), DI
  main.go:62    LEAQ 0x10(DI), DI
  main.go:62    SUBSD 0(DI), X6
```

We need one instruction to load the value of the left operand and two to calculate the address of the right one.

I rolled back to version go1.14 and found that the codegen there is different. For the same lines, the following code will be generated:

```
  main.go:60    0x49f77d    LEAQ 0(CX)(CX*2), DI              // DI = 0
  main.go:60    0x49f781    LEAQ main.sys+160(SB), R8         // R8 = Positions addr

  main.go:60    0x49f788    MOVSD_XMM 0(R8)(DI*8), X4         // load sys.s[i].x
  main.go:60    0x49f78e    LEAQ 0(BX)(BX*2), R9
  main.go:60    0x49f792    MOVSD_XMM 0(R8)(R9*8), X5         // load sys.s[j].x
  main.go:60    0x49f798    SUBSD X5, X4

  main.go:61    0x49f79c    MOVSD_XMM 0x8(R8)(DI*8), X5
  main.go:61    0x49f7a3    MOVSD_XMM 0x8(R8)(R9*8), X6       // dy := sys.s[i].y - sys.s[j].y
  main.go:61    0x49f7aa    SUBSD X6, X5

  main.go:62    0x49f7ae    MOVSD_XMM 0x10(R8)(DI*8), X6
  main.go:62    0x49f7b5    MOVSD_XMM 0x10(R8)(R9*8), X7      // dz := sys.s[i].z - sys.s[j].z
  main.go:62    0x49f7bc    SUBSD X7, X6
```

It uses one instruction to load both the left and right side of the expression.

#### How does it affect code performance?

It depends on how often such a code generation pattern is used in the code and in how hot the pieces of code are.
But we can write benchmark:

```go
package main

import (
    "testing"
)

func BenchmarkT(b *testing.B) {
    offsetMomentum()
    c := 500000

    b.ResetTimer()
    for n := 0; n < b.N; n++ {
        for i := 0; i < c; i++ {
            advance(0.01)
        }
    }

    _ = energy()
}
```

Results:

```
$ bin for i in {1..20}; do go.1.14 test -bench=. >> old.txt; done
$ bin for i in {1..20}; do go.master test -bench=. >> new.txt; done
$ bin benchstat old.txt new.txt
name  old time/op  new time/op  delta
T-4   48.9ms ± 5%  54.5ms ±10%  +11.46%  (p=0.000 n=19+19)
```

#### When did the issue start?

git bisect report beetween master(8696ae82c9) and go1.14:

```shell script
... skip all steps

98cb76799c3053e779c4e1b61bb50705d25dd77f is the first bad commit
commit 98cb76799c3053e779c4e1b61bb50705d25dd77f
Author: Keith Randall <khr@golang.org>
Date:   Thu Jan 30 10:17:01 2020 -0800

    cmd/compile: insert complicated x86 addressing modes as a separate pass

    Use a separate compiler pass to introduce complicated x86 addressing
    modes.  Loads in the normal architecture rules (for x86 and all other
    platforms) can have constant offsets (AuxInt values) and symbols (Aux
    values), but no more.

    The complex addressing modes (x+y, x+2*y, etc.) are introduced in a
    separate pass that combines loads with LEAQx ops.

    Organizing rewrites this way simplifies the number of rewrites
    required, as there are lots of different rule orderings that have to
    be specified to ensure these complex addressing modes are always found
    if they are possible.

    Update #36468

    Change-Id: I5b4bf7b03a1e731d6dfeb9ef19b376175f3b4b44
    Reviewed-on: https://go-review.googlesource.com/c/go/+/217097
    Run-TryBot: Keith Randall <khr@golang.org>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>

 src/cmd/compile/internal/ssa/addressingmodes.go |   154 +
 src/cmd/compile/internal/ssa/compile.go         |     1 +
 src/cmd/compile/internal/ssa/gen/AMD64.rules    |   699 +-
 src/cmd/compile/internal/ssa/rewrite.go         |    40 +
 src/cmd/compile/internal/ssa/rewriteAMD64.go    | 10427 ++++++----------------
 test/codegen/memops.go                          |    88 +
 6 files changed, 3233 insertions(+), 8176 deletions(-)
 create mode 100644 src/cmd/compile/internal/ssa/addressingmodes.go
```

Issue linked with changes - https://github.com/golang/go/issues/36468

/cc @randall77 


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

cmd/compile: inefficient code generation on amd64 #40426

Probably, inefficient code generation on amd64.

Preface

How does it affect code performance?

When did the issue start?

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

cmd/compile: inefficient code generation on amd64 #40426

Description

Probably, inefficient code generation on amd64.

Preface

How does it affect code performance?

When did the issue start?

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions