
cmd/compile: amd64 carry flag spilling uses SBBQ + NEGQ instead of SETCS #68961

Open

bremac opened this issue Aug 20, 2024 · 7 comments

Labels: compiler/runtime (Issues related to the Go compiler and/or runtime), help wanted, NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one), Performance
Milestone: Backlog

Comments
bremac commented Aug 20, 2024

Go version

go version go1.23.0 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/bremac/.cache/go-build'
GOENV='/home/bremac/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/bremac/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/bremac/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/lib/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/lib/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.23.0'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/home/bremac/.config/go/telemetry'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3831806201=/tmp/go-build -gno-record-gcc-switches'

What did you do?

This code is a simplified form of a longer unrolled loop, with the non-carry-related logic removed:

func example(carryIn uint, x, y, result []uint) uint {
	// Check lengths up-front to simplify the code generated for the loop
	if len(x) != len(y) || len(x) != len(result) {
		panic("length mismatch")
	}
	for i := 0; i < len(x); i++ {
		result[i], carryIn = bits.Add(x[i], y[i], carryIn)
	}
	return carryIn
}

https://go.dev/play/p/gGVkiLN6qbV
https://go.godbolt.org/z/W313f1EYG
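For anyone reproducing the measurement, a minimal benchmark sketch is below. The slice length and input values are arbitrary choices for illustration, and it assumes example lives in a regular (non-main) package with math/bits imported; the package name here is made up.

package carry // assumed package name, not from the original report

import "testing"

var sink uint

func BenchmarkExample(b *testing.B) {
	const n = 1024 // arbitrary length, chosen for illustration
	x := make([]uint, n)
	y := make([]uint, n)
	res := make([]uint, n)
	for i := range x {
		x[i] = ^uint(0) // all-ones, so every element generates a carry
		y[i] = 1
	}
	b.ResetTimer()
	var carry uint
	for i := 0; i < b.N; i++ {
		carry = example(carry, x, y, res)
	}
	sink = carry // keep the final carry live
}

Dividing ns/op by n gives the per-element cost, which should track the cycles-per-iteration figure discussed below.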

What did you see happen?

On amd64, the compiled loop has a throughput of one iteration every four cycles:

main_example_pc48:
        MOVQ    (BX)(DI*8), R8
        MOVQ    (SI)(DI*8), R9
        LEAQ    1(DI), R10
        NEGL    AX
        ADCQ    R8, R9
        MOVQ    R9, (DX)(DI*8)
        SBBQ    AX, AX
        NEGQ    AX
        MOVQ    R10, DI
main_example_pc78:
        CMPQ    CX, DI
        JGT     main_example_pc48

The bottleneck is the loop-carried NEGL -> ADCQ -> SBBQ -> NEGQ dependency chain: four dependent instructions, which accounts for the four cycles per iteration.

What did you expect to see?

The SBBQ / NEGQ pair should be replaced with a single SETCS, e.g.

main_example_pc48:
        MOVQ    (BX)(DI*8), R8
        MOVQ    (SI)(DI*8), R9
        LEAQ    1(DI), R10
        NEGL    AX
        ADCQ    R8, R9
        MOVQ    R9, (DX)(DI*8)
        SETCS   AX
        MOVQ    R10, DI
main_example_pc78:
        CMPQ    CX, DI
        JGT     main_example_pc48

This shortens the loop-carried dependency chain to three instructions (NEGL -> ADCQ -> SETCS).

gopherbot added the compiler/runtime label Aug 20, 2024
cherrymui added the Performance and NeedsInvestigation labels Aug 23, 2024
cherrymui (Member) commented:

cc @golang/compiler

cherrymui added this to the Backlog milestone Aug 23, 2024
randall77 (Contributor) commented:

SETCS only sets the low byte of the target register; the upper 56 bits are unmodified. Because the full 64 bits are returned, we have to zero the upper 56 bits at some point.

If we knew that the input was only 0 or 1 then the zeroing would be unnecessary. Maybe we can infer that because it is passed as an arg to bits.Add (which has that as a requirement in its spec)?

Maybe we could use only the low byte within the loop and extend it just on return, but that's tricky. Calling example with carryIn==0xaaaa and len(x)==0 should return 0xaaaa. So unconditional zero extending after the loop isn't right.
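To make that edge case concrete, here is a minimal check; it assumes the example function from the issue body is compiled into the same main package, and the wrapper is only for illustration.

package main

import "fmt"

func main() {
	// len(x) == 0, so the loop body never runs and carryIn must pass
	// through unchanged, high bits and all.
	fmt.Printf("%#x\n", example(0xaaaa, nil, nil, nil)) // prints 0xaaaa
}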

We could use MOVL $0, AX; SETCS AX instead of SBBQ/NEGQ. Not sure that helps much.

randall77 (Contributor) commented:

Another option is to represent the carry as 0/-1 instead of 0/1. Then we don't need the NEGQ after the SBBQ.
That ship has already sailed for bits.Add, but we could have one NEGQ each before and after the loop to convert from/to the 0/1 the caller expects.

bremac (Author) commented Aug 27, 2024

That's a great point about the high bits — it would violate bits.Add's guarantee that carryOut is 0 or 1 if the high bits were not cleared.

In any case, MOVL $0, AX; SETCS AX or XOR AX, AX; SETCS AX would still be an improvement over SBBQ / NEGQ. It's the same number of instructions, but the zeroing idiom would break the data dependency on AX and cut the dependency chain to three instructions.

randall77 (Contributor) commented:

XOR AX, AX; SETCS AX

XOR clears the carry flag, so that one won't work. Unless we push the XOR earlier than the carry-generating instruction, which is difficult to do unless we unroll the loop once.

the zeroing idiom would break the data dependency on AX

I believe SBBQ AX, AX is like XORQ AX, AX in that the chip knows there is no dependency on AX. At least for some chips... And in any case, AX is available; we just used it earlier (both PC-wise and dataflow-wise).

cut the dependency chain to three instructions.

That would be nice. I think my 0/-1 idea does that also. (But really, unrolling is probably the better solution here. See math/big/arith_amd64.s:addVV; it has a one-instruction dependency chain within the loop.)

bremac (Author) commented Aug 28, 2024

XOR clears the carry flag, that one won't work. Unless we push the XOR earlier than the carry-generating instruction, which is difficult to do unless we unroll the loop once.

🤦 Of course, sorry.

I believe SBBQ AX, AX is like XORQ AX, AX in that the chip knows there is no dependency on AX. At least for some chips... And in any case, AX is available, we just used it earlier (both PC-wise and dataflow-wise).

It does on the Zen microarchitectures, but not on any of the Intel chips that I know of. In any case, SBBQ AX, AX still depends on the carry flag from ADCQ, so it remains part of the dependency chain.

I think my 0/-1 idea does that also. (But really unrolling is probably the better solution here. See math/big/arith_amd64.s:addVV. It's a 1 instruction dependency chain within the loop.)

Unrolling helps, though I'm not sure it's an either-or situation. Outside of pure bignum arithmetic, the amount you can usefully unroll a loop involving carries on amd64 is limited. The example I gave is cut down from a more complex loop I was converting from C. That loop was unrolled with a stride of 2; other bitwise logic in the original loop caused either register or carry-flag spilling if it was unrolled further. For that loop, the 0/-1 representation or mov+setcs should still speed it up by 20% after unrolling.
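For reference, stride-2 unrolling of the cut-down example looks roughly like the sketch below at the Go source level. It is an illustration of the shape of loop being described, not the original code, and it assumes the same package and math/bits import as the example in the issue body.

func exampleUnrolled(carryIn uint, x, y, result []uint) uint {
	if len(x) != len(y) || len(x) != len(result) {
		panic("length mismatch")
	}
	i := 0
	for ; i+1 < len(x); i += 2 {
		// The carry output of the first bits.Add feeds the second one
		// directly, so it can stay in the carry flag between the two
		// adds; only the loop boundary needs a spill to a register.
		result[i], carryIn = bits.Add(x[i], y[i], carryIn)
		result[i+1], carryIn = bits.Add(x[i+1], y[i+1], carryIn)
	}
	for ; i < len(x); i++ { // tail element when len(x) is odd
		result[i], carryIn = bits.Add(x[i], y[i], carryIn)
	}
	return carryIn
}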
