
cmd/compile: amd64 carry flag spilling uses SBBQ + NEGQ instead of SETCS #68961

Open

bremac opened this issue Aug 20, 2024 · 7 comments

Labels: compiler/runtime (Issues related to the Go compiler and/or runtime), help wanted, NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one), Performance
Milestone: Backlog

Comments
bremac commented Aug 20, 2024

Go version

go version go1.23.0 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/bremac/.cache/go-build'
GOENV='/home/bremac/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/bremac/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/bremac/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/lib/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/lib/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.23.0'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/home/bremac/.config/go/telemetry'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3831806201=/tmp/go-build -gno-record-gcc-switches'

What did you do?

This code is a simplified form of a longer unrolled loop, with the non-carry-related logic removed:

func example(carryIn uint, x, y, result []uint) uint {
	// Check lengths up-front to simplify the code generated for the loop
	if len(x) != len(y) || len(x) != len(result) {
		panic("length mismatch")
	}
	for i := 0; i < len(x); i++ {
		result[i], carryIn = bits.Add(x[i], y[i], carryIn)
	}
	return carryIn
}

https://go.dev/play/p/gGVkiLN6qbV
https://go.godbolt.org/z/W313f1EYG
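For anyone reproducing the measurement, a minimal benchmark sketch is below. The slice length and input values are arbitrary choices for illustration, and it assumes example lives in a regular (non-main) package with math/bits imported; the package name here is made up.

package carry // assumed package name, not from the original report

import "testing"

var sink uint

func BenchmarkExample(b *testing.B) {
	const n = 1024 // arbitrary length, chosen for illustration
	x := make([]uint, n)
	y := make([]uint, n)
	res := make([]uint, n)
	for i := range x {
		x[i] = ^uint(0) // all-ones, so every element generates a carry
		y[i] = 1
	}
	b.ResetTimer()
	var carry uint
	for i := 0; i < b.N; i++ {
		carry = example(carry, x, y, res)
	}
	sink = carry // keep the final carry live
}

Dividing ns/op by n gives the per-element cost, which should track the cycles-per-iteration figure discussed below.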

What did you see happen?

On amd64, the compiled loop has a throughput of one iteration every four cycles:

main_example_pc48:
        MOVQ    (BX)(DI*8), R8
        MOVQ    (SI)(DI*8), R9
        LEAQ    1(DI), R10
        NEGL    AX
        ADCQ    R8, R9
        MOVQ    R9, (DX)(DI*8)
        SBBQ    AX, AX
        NEGQ    AX
        MOVQ    R10, DI
main_example_pc78:
        CMPQ    CX, DI
        JGT     main_example_pc48

The bottleneck is the loop-carried NEGL -> ADCQ -> SBBQ -> NEGQ dependency chain: four dependent instructions, which accounts for the four cycles per iteration.

What did you expect to see?

The SBBQ / NEGQ pair should be replaced with a single SETCS, e.g.

main_example_pc48:
        MOVQ    (BX)(DI*8), R8
        MOVQ    (SI)(DI*8), R9
        LEAQ    1(DI), R10
        NEGL    AX
        ADCQ    R8, R9
        MOVQ    R9, (DX)(DI*8)
        SETCS   AX
        MOVQ    R10, DI
main_example_pc78:
        CMPQ    CX, DI
        JGT     main_example_pc48

This shortens the loop-carried dependency chain to three instructions (NEGL -> ADCQ -> SETCS).

gopherbot added the compiler/runtime label Aug 20, 2024
cherrymui added the Performance and NeedsInvestigation labels Aug 23, 2024
cherrymui (Member) commented:

cc @golang/compiler

cherrymui added this to the Backlog milestone Aug 23, 2024
randall77 (Contributor) commented:

SETCS only sets the low byte of the target register; the upper 56 bits are unmodified. Because the full 64 bits are returned, we have to zero the upper 56 bits at some point.

If we knew that the input was only 0 or 1 then the zeroing would be unnecessary. Maybe we can infer that because it is passed as an arg to bits.Add (which has that as a requirement in its spec)?

Maybe we could use only the low byte within the loop and extend it just on return, but that's tricky. Calling example with carryIn==0xaaaa and len(x)==0 should return 0xaaaa. So unconditional zero extending after the loop isn't right.
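To make that edge case concrete, here is a minimal check; it assumes the example function from the issue body is compiled into the same main package, and the wrapper is only for illustration.

package main

import "fmt"

func main() {
	// len(x) == 0, so the loop body never runs and carryIn must pass
	// through unchanged, high bits and all.
	fmt.Printf("%#x\n", example(0xaaaa, nil, nil, nil)) // prints 0xaaaa
}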

We could use MOVL $0, AX; SETCS AX instead of SBBQ/NEGQ. Not sure that helps much.

randall77 (Contributor) commented:

Another option is to represent the carry as 0/-1 instead of 0/1. Then we don't need the NEGQ after the SBBQ.
That ship has already sailed for bits.Add, but we could have one NEGQ each before and after the loop to convert from/to the 0/1 the caller expects.

bremac (Author) commented Aug 27, 2024

That's a great point about the high bits — it would violate bits.Add's guarantee that carryOut is 0 or 1 if the high bits were not cleared.

In any case, MOVL $0, AX; SETCS AX or XOR AX, AX; SETCS AX would still be an improvement over SBBQ / NEGQ. It's the same number of instructions, but the zeroing idiom would break the data dependency on AX and cut the dependency chain to three instructions.

randall77 (Contributor) commented:

XOR AX, AX; SETCS AX

XOR clears the carry flag, so that one won't work. Unless we push the XOR earlier than the carry-generating instruction, which is difficult to do unless we unroll the loop once.

the zeroing idiom would break the data dependency on AX

I believe SBBQ AX, AX is like XORQ AX, AX in that the chip knows there is no dependency on AX. At least for some chips... And in any case, AX is available; we just used it earlier (both PC-wise and dataflow-wise).

cut the dependency chain to three instructions.

That would be nice. I think my 0/-1 idea does that also. (But really, unrolling is probably the better solution here. See math/big/arith_amd64.s:addVV; it has a one-instruction dependency chain within the loop.)

bremac (Author) commented Aug 28, 2024

XOR clears the carry flag, that one won't work. Unless we push the XOR earlier than the carry-generating instruction, which is difficult to do unless we unroll the loop once.

🤦 Of course, sorry.

I believe SBBQ AX, AX is like XORQ AX, AX in that the chip knows there is no dependency on AX. At least for some chips... And in any case, AX is available, we just used it earlier (both PC-wise and dataflow-wise).

It does on the Zen microarchitectures, but not on any of the Intel chips that I know of. In any case, SBBQ AX, AX still depends on the carry flag from ADCQ, so it remains part of the dependency chain.

I think my 0/-1 idea does that also. (But really unrolling is probably the better solution here. See math/big/arith_amd64.s:addVV. It's a 1 instruction dependency chain within the loop.)

Unrolling helps, though I'm not sure it's an either-or situation. Outside of pure bignum arithmetic, the amount you can usefully unroll a loop involving carries on amd64 is limited. The example I gave is cut down from a more complex loop I was converting from C. That loop was unrolled with a stride of 2; other bitwise logic in the original loop caused either register or carry-flag spilling if it was unrolled further. For that loop, the 0/-1 representation or mov+setcs should still speed it up by 20% after unrolling.
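For reference, stride-2 unrolling of the cut-down example looks roughly like the sketch below at the Go source level. It is an illustration of the shape of loop being described, not the original code, and it assumes the same package and math/bits import as the example in the issue body.

func exampleUnrolled(carryIn uint, x, y, result []uint) uint {
	if len(x) != len(y) || len(x) != len(result) {
		panic("length mismatch")
	}
	i := 0
	for ; i+1 < len(x); i += 2 {
		// The carry output of the first bits.Add feeds the second one
		// directly, so it can stay in the carry flag between the two
		// adds; only the loop boundary needs a spill to a register.
		result[i], carryIn = bits.Add(x[i], y[i], carryIn)
		result[i+1], carryIn = bits.Add(x[i+1], y[i+1], carryIn)
	}
	for ; i < len(x); i++ { // tail element when len(x) is odd
		result[i], carryIn = bits.Add(x[i], y[i], carryIn)
	}
	return carryIn
}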
