cmd/compile: implement more optimizations on loong64 #59120

Open

7 of 12 tasks

xen0n opened this issue Mar 19, 2023 · 16 comments

Labels

arch-loong64 compiler/runtime NeedsFix Performance

Milestone

Member

xen0n commented Mar 19, 2023 (edited)

xen0n added arch-loong64 compiler/runtime labels

xen0n added a commit to xen0n/go that referenced this issue


          cmd/asm, cmd/internal/obj/loong64: implement {8,16}-bit sign extensio…

…ns with EXTW{B,H}

Updates golang#59120

Change-Id: Ia7dd0dfe20c0ea3e64889e2b38c6b2118b50d56e

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up math/bits.TrailingZeros intrinsics for loong64

c5dc54f

Updates golang#59120

Change-Id: I6c90f727eb00e0add2a5f8575ac045b9e288af54

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up math/bits.Len intrinsics for loong64

33ca440

Updates golang#59120

Change-Id: Icc8f7d8e79c6168aae634f5c36f044f3fd034d89

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up Bswap/ReverseBytes intrinsics for loong64

3b61257

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up math/bits.Len intrinsics for loong64

Updates golang#59120

Change-Id: Icc8f7d8e79c6168aae634f5c36f044f3fd034d89

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up Bswap/ReverseBytes intrinsics for loong64

660df64

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up math/bits.Len intrinsics for loong64

db6a2ea

Updates golang#59120

Change-Id: Icc8f7d8e79c6168aae634f5c36f044f3fd034d89

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up Bswap/ReverseBytes intrinsics for loong64

9b2a719

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac

heschi added the NeedsFix label

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: WIP implement codegen of count-{leading,trailing}-ones o…

8d7219d

…n loong64

tests TODO

Updates golang#59120

Change-Id: Icde85d717999600954244c1105b7c55759d3469f

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up Bswap/ReverseBytes intrinsics for loong64

3e6d613

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac

cherrymui added this to the Unplanned milestone

cherrymui added the Performance label

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up math/bits.Len intrinsics for loong64

e20be18

Updates golang#59120

Change-Id: Icc8f7d8e79c6168aae634f5c36f044f3fd034d89

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: WIP implement codegen of count-{leading,trailing}-ones o…

54f6862

…n loong64

tests TODO

Updates golang#59120

Change-Id: Icde85d717999600954244c1105b7c55759d3469f

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up Bswap/ReverseBytes intrinsics for loong64

28c3b2a

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: WIP implement codegen of count-{leading,trailing}-ones o…

d3ae636

…n loong64

tests TODO

Updates golang#59120

Change-Id: Icde85d717999600954244c1105b7c55759d3469f

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up Bswap/ReverseBytes intrinsics for loong64

660c01e

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac

xen0n added a commit to xen0n/go that referenced this issue


          cmd/asm, cmd/internal/obj/loong64: implement {8,16}-bit sign extensio…

ede4688

…ns with EXTW{B,H}

Updates golang#59120

Change-Id: Ia7dd0dfe20c0ea3e64889e2b38c6b2118b50d56e

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up math/bits.TrailingZeros intrinsics for loong64

3a4f41c

Updates golang#59120

Change-Id: I6c90f727eb00e0add2a5f8575ac045b9e288af54

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up math/bits.Len intrinsics for loong64

578eaa8

Updates golang#59120

Change-Id: Icc8f7d8e79c6168aae634f5c36f044f3fd034d89

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: WIP implement codegen of count-{leading,trailing}-ones o…

0c6146c

…n loong64

tests TODO

Updates golang#59120

Change-Id: Icde85d717999600954244c1105b7c55759d3469f

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up Bswap/ReverseBytes intrinsics for loong64

3026d3f

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac

xen0n added a commit to xen0n/go that referenced this issue


          cmd/asm: use single-instruction forms for all loong64 sign and zero e…

525b62a

…xtensions

Updates golang#59120

Change-Id: Ia7dd0dfe20c0ea3e64889e2b38c6b2118b50d56e

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up math/bits.TrailingZeros intrinsics for loong64

18aa217

Updates golang#59120

Change-Id: I6c90f727eb00e0add2a5f8575ac045b9e288af54

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up math/bits.Len intrinsics for loong64

870eec7

Updates golang#59120

Change-Id: Icc8f7d8e79c6168aae634f5c36f044f3fd034d89

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: WIP implement codegen of count-{leading,trailing}-ones o…

cedcc29

…n loong64

tests TODO

Updates golang#59120

Change-Id: Icde85d717999600954244c1105b7c55759d3469f

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up Bswap/ReverseBytes intrinsics for loong64

76fc130

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up Bswap/ReverseBytes intrinsics for loong64

9012f2a

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
               │    before    │                after                 │
               │    sec/op    │    sec/op     vs base                │
ReverseBytes     3.0130n ± 0%   0.6517n ± 2%  -78.37% (p=0.000 n=10)
ReverseBytes16   0.9027n ± 0%   0.6526n ± 2%  -27.71% (p=0.000 n=10)
ReverseBytes32   1.7040n ± 0%   0.6511n ± 1%  -61.79% (p=0.000 n=10)
ReverseBytes64   2.7080n ± 0%   0.6499n ± 1%  -76.00% (p=0.000 n=10)
geomean           1.882n        0.6513n       -65.40%

Go1 benchmark results indicate no meaningful change except for
micro-architecture-related fluctuations.
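
The geomean row of the table above can be re-derived from the four per-benchmark medians. The following is a minimal sketch, not part of the commit; the helper names `geomean` and `percentChange` are made up for illustration, and the inputs are simply the sec/op values copied from the table.

```go
package main

import (
	"fmt"
	"math"
)

// geomean returns the geometric mean of xs, computed via logarithms.
// (Hypothetical helper, not taken from benchstat.)
func geomean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += math.Log(x)
	}
	return math.Exp(sum / float64(len(xs)))
}

// percentChange returns the relative change from before to after, in percent.
func percentChange(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// sec/op medians (in ns) for ReverseBytes, ReverseBytes16,
	// ReverseBytes32 and ReverseBytes64, copied from the table.
	before := []float64{3.0130, 0.9027, 1.7040, 2.7080}
	after := []float64{0.6517, 0.6526, 0.6511, 0.6499}

	gb, ga := geomean(before), geomean(after)
	fmt.Printf("geomean before: %.3fn\n", gb)
	fmt.Printf("geomean after:  %.4fn\n", ga)
	fmt.Printf("change:         %+.2f%%\n", percentChange(gb, ga))
}
```

Because the geometric mean of the ratios equals the ratio of the geometric means, the change computed this way should agree with the "vs base" figure in the geomean row.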

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac

xen0n added a commit to xen0n/go that referenced this issue


          cmd/asm: use single-instruction forms for all loong64 sign and zero e…

d14a9a4

…xtensions

8- and 16-bit sign extensions and 32-bit zero extensions were realized
with left and right shifts before this change. We now support assembling
EXTWB, EXTWH and BSTRPICKV, so each of the three can be done with a
single instruction.
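
For readers unfamiliar with the shift-based idiom, here is a rough Go-level sketch of what the two-shift sequences compute, compared against plain integer conversions. The function names are hypothetical, and this only illustrates the semantics; it is not the assembler change itself and makes no claim about the exact instructions emitted.

```go
package main

import "fmt"

// signExtend8Shifts sign-extends the low 8 bits of x by moving them to the
// top of a 64-bit value and arithmetic-shifting back down.
func signExtend8Shifts(x uint64) int64 { return int64(x<<56) >> 56 }

// signExtend16Shifts does the same for the low 16 bits.
func signExtend16Shifts(x uint64) int64 { return int64(x<<48) >> 48 }

// zeroExtend32Shifts clears the upper 32 bits with a left shift followed by
// a logical right shift.
func zeroExtend32Shifts(x uint64) uint64 { return x << 32 >> 32 }

func main() {
	for _, x := range []uint64{0x7f, 0x80, 0x8000, 0xffff, 0x1ffffffff} {
		// Each comparison checks the shift pair against the equivalent
		// Go conversion, which is what a single extension instruction
		// would compute.
		fmt.Printf("%#x: sext8 ok=%v sext16 ok=%v zext32 ok=%v\n", x,
			signExtend8Shifts(x) == int64(int8(x)),
			signExtend16Shifts(x) == int64(int16(x)),
			zeroExtend32Shifts(x) == uint64(uint32(x)))
	}
}
```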

Benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │  CL 479495  │              this CL               │
                      │   sec/op    │   sec/op     vs base               │
BinaryTree17             14.12 ± 1%    14.06 ± 1%       ~ (p=0.393 n=10)
Fannkuch11               3.420 ± 0%    3.421 ± 0%  +0.04% (p=0.001 n=10)
FmtFprintfEmpty         94.72n ± 0%   94.97n ± 0%  +0.26% (p=0.000 n=10)
FmtFprintfString        152.6n ± 0%   155.3n ± 0%  +1.77% (p=0.000 n=10)
FmtFprintfInt           154.5n ± 0%   154.5n ± 0%       ~ (p=0.263 n=10)
FmtFprintfIntInt        237.7n ± 0%   237.1n ± 0%  -0.21% (p=0.000 n=10)
FmtFprintfPrefixedInt   313.1n ± 0%   313.0n ± 0%  -0.03% (p=0.000 n=10)
FmtFprintfFloat         394.1n ± 0%   392.8n ± 0%  -0.32% (p=0.000 n=10)
FmtManyArgs             934.3n ± 0%   912.6n ± 0%  -2.32% (p=0.000 n=10)
GobDecode               15.29m ± 1%   15.23m ± 1%       ~ (p=0.280 n=10)
GobEncode               17.76m ± 0%   17.66m ± 0%  -0.60% (p=0.000 n=10)
Gzip                    416.0m ± 0%   404.4m ± 0%  -2.79% (p=0.000 n=10)
Gunzip                  83.20m ± 0%   80.88m ± 0%  -2.79% (p=0.000 n=10)
HTTPClientServer        87.82µ ± 1%   87.09µ ± 1%  -0.83% (p=0.000 n=10)
JSONEncode              18.56m ± 0%   18.54m ± 0%       ~ (p=0.123 n=10)
JSONDecode              76.53m ± 0%   78.22m ± 1%  +2.21% (p=0.000 n=10)
Mandelbrot200           7.217m ± 0%   7.215m ± 0%       ~ (p=0.143 n=10)
GoParse                 7.587m ± 1%   7.520m ± 1%       ~ (p=0.165 n=10)
RegexpMatchEasy0_32     134.2n ± 0%   134.5n ± 0%  +0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K     1.366µ ± 0%   1.364µ ± 0%  -0.15% (p=0.000 n=10)
RegexpMatchEasy1_32     163.0n ± 0%   164.0n ± 0%  +0.61% (p=0.000 n=10)
RegexpMatchEasy1_1K     1.497µ ± 0%   1.492µ ± 0%  -0.33% (p=0.000 n=10)
RegexpMatchMedium_32    1.415µ ± 0%   1.403µ ± 0%  -0.85% (p=0.000 n=10)
RegexpMatchMedium_1K    41.61µ ± 0%   41.05µ ± 0%  -1.36% (p=0.000 n=10)
RegexpMatchHard_32      2.121µ ± 0%   2.070µ ± 0%  -2.43% (p=0.000 n=10)
RegexpMatchHard_1K      62.64µ ± 0%   60.87µ ± 0%  -2.83% (p=0.000 n=10)
Revcomp                  1.204 ± 0%    1.210 ± 0%  +0.51% (p=0.000 n=10)
Template                118.0m ± 0%   115.2m ± 1%  -2.31% (p=0.000 n=10)
TimeParse               414.8n ± 0%   410.6n ± 0%  -1.01% (p=0.000 n=10)
TimeFormat              510.7n ± 0%   508.2n ± 0%  -0.48% (p=0.000 n=10)
geomean                 102.3µ        101.7µ       -0.60%

                     │  CL 479495   │               this CL               │
                     │     B/s      │     B/s       vs base               │
GobDecode              47.88Mi ± 1%   48.05Mi ± 1%       ~ (p=0.280 n=10)
GobEncode              41.20Mi ± 0%   41.45Mi ± 0%  +0.60% (p=0.000 n=10)
Gzip                   44.49Mi ± 0%   45.77Mi ± 0%  +2.87% (p=0.000 n=10)
Gunzip                 222.4Mi ± 0%   228.8Mi ± 0%  +2.87% (p=0.000 n=10)
JSONEncode             99.69Mi ± 0%   99.82Mi ± 0%       ~ (p=0.118 n=10)
JSONDecode             24.19Mi ± 0%   23.66Mi ± 1%  -2.19% (p=0.000 n=10)
GoParse                7.281Mi ± 2%   7.343Mi ± 1%       ~ (p=0.187 n=10)
RegexpMatchEasy0_32    227.4Mi ± 0%   226.9Mi ± 0%  -0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K    715.0Mi ± 0%   716.0Mi ± 0%  +0.13% (p=0.000 n=10)
RegexpMatchEasy1_32    187.3Mi ± 0%   186.1Mi ± 0%  -0.62% (p=0.000 n=10)
RegexpMatchEasy1_1K    652.3Mi ± 0%   654.5Mi ± 0%  +0.34% (p=0.000 n=10)
RegexpMatchMedium_32   21.57Mi ± 0%   21.74Mi ± 0%  +0.80% (p=0.000 n=10)
RegexpMatchMedium_1K   23.47Mi ± 0%   23.79Mi ± 0%  +1.38% (p=0.000 n=10)
RegexpMatchHard_32     14.39Mi ± 0%   14.74Mi ± 0%  +2.45% (p=0.000 n=10)
RegexpMatchHard_1K     15.59Mi ± 0%   16.04Mi ± 0%  +2.87% (p=0.000 n=10)
Revcomp                201.3Mi ± 0%   200.3Mi ± 0%  -0.51% (p=0.000 n=10)
Template               15.69Mi ± 0%   16.06Mi ± 1%  +2.37% (p=0.000 n=10)
geomean                61.31Mi        61.82Mi       +0.84%

The test binaries were pre-compiled with `go test -c`, and the test runs
were wrapped with `perf stat record` for recording dynamic instruction
counts. The instruction count, IPC and branch misprediction rate did not
meaningfully change.

As for the JSONDecode regression, `perf stat` was used to check
micro-architectural details:

$ sudo perf stat <test executable> -test.timeout=30m -test.run='^$' \
    -test.cpu=1 -test.bench='JSONDecode' -test.count=1 -test.benchtime=50x

Before:

          4,256.10 msec task-clock               #    1.061 CPUs utilized
            61,431      context-switches         #   14.434 K/sec
                 3      cpu-migrations           #    0.705 /sec
             3,297      page-faults              #  774.652 /sec
    10,364,990,422      cycles                   #    2.435 GHz
    19,640,571,817      instructions             #    1.89  insn per cycle
     4,267,623,324      branches                 #    1.003 G/sec
        44,164,375      branch-misses            #    1.03% of all branches

After:

          4,343.17 msec task-clock               #    1.061 CPUs utilized
            62,742      context-switches         #   14.446 K/sec
                 5      cpu-migrations           #    1.151 /sec
             3,044      page-faults              #  700.871 /sec
    10,577,322,342      cycles                   #    2.435 GHz
    19,582,895,547      instructions             #    1.85  insn per cycle
     4,266,051,537      branches                 #  982.244 M/sec
        46,298,286      branch-misses            #    1.09% of all branches

Instruction count decreased by 0.29% but cycle count went up by 2.05%,
and the branch misprediction rate rose too. This is likely caused by the
micro-architecture's sensitivity to changed code layout; the
optimization implemented here should be a net win otherwise.
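
The percentages quoted above follow directly from the raw `perf stat` counters. As a sanity check, the arithmetic can be sketched as below; the `counters` type and its methods are invented for this illustration, and the numbers are copied from the two `perf stat` outputs.

```go
package main

import "fmt"

// counters holds the subset of `perf stat` counters quoted above.
// (Hypothetical type, for illustration only.)
type counters struct {
	cycles, instructions, branches, branchMisses float64
}

// ipc returns instructions per cycle.
func (c counters) ipc() float64 { return c.instructions / c.cycles }

// missRate returns the branch misprediction rate in percent.
func (c counters) missRate() float64 { return c.branchMisses / c.branches * 100 }

// percentChange returns the relative change from before to after, in percent.
func percentChange(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	before := counters{
		cycles: 10364990422, instructions: 19640571817,
		branches: 4267623324, branchMisses: 44164375,
	}
	after := counters{
		cycles: 10577322342, instructions: 19582895547,
		branches: 4266051537, branchMisses: 46298286,
	}

	fmt.Printf("instructions: %+.2f%%\n", percentChange(before.instructions, after.instructions))
	fmt.Printf("cycles:       %+.2f%%\n", percentChange(before.cycles, after.cycles))
	fmt.Printf("IPC:          %.2f -> %.2f\n", before.ipc(), after.ipc())
	fmt.Printf("branch-miss:  %.2f%% -> %.2f%%\n", before.missRate(), after.missRate())
}
```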

Updates golang#59120

Change-Id: Ia7dd0dfe20c0ea3e64889e2b38c6b2118b50d56e

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up math/bits.TrailingZeros intrinsics for loong64

The runtime malloc implementation makes use of these, among others.

Some generic strength reduction rules for Ctz ops have also been added,
though they are only enabled for loong64 for now. They are necessary to
make the optimization profitable at all, as the LA464 micro-architecture
apparently handles the `TrailingZeros64(x) < 64` part in
runtime.nextFreeFast very badly if the compiled branch is no longer a
simple BEQZ (which used to be the case, when the compiler was able to
peek into the pure Go implementation of TrailingZeros). Without the
generic rules this change would be a big perf hit (as bad as 7~10% in
select go1 benchmark cases).
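
The kind of strength reduction described here rests on a simple equivalence: `bits.TrailingZeros64(x)` returns 64 only for `x == 0`, so the comparison against 64 can be replaced by a zero test. The sketch below only demonstrates that equivalence at the Go level; the function names are hypothetical, and the actual rewrite rules in the CL are not shown in this thread.

```go
package main

import (
	"fmt"
	"math/bits"
)

// viaCtz is the shape of the check as written with TrailingZeros64.
func viaCtz(x uint64) bool { return bits.TrailingZeros64(x) < 64 }

// viaNonzero is the strength-reduced form: a plain test against zero,
// which can compile down to a single branch-if-zero.
func viaNonzero(x uint64) bool { return x != 0 }

func main() {
	for _, x := range []uint64{0, 1, 2, 0x8000, 1 << 63, ^uint64(0)} {
		fmt.Printf("%#x: ctz<64=%v nonzero=%v\n", x, viaCtz(x), viaNonzero(x))
	}
}
```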

The generic changes were benchmarked on linux/amd64 (Threadripper 3990X)
and darwin/arm64 (Apple M1 Pro) too, but the results were either mixed
(amd64) or even a net loss (arm64). So, for now those rules are guarded
with a predicate that only enables them for loong64.

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
                │   before    │                after                │
                │   sec/op    │   sec/op     vs base                │
TrailingZeros     2.758n ± 0%   1.004n ± 0%  -63.60% (p=0.000 n=10)
TrailingZeros8    1.508n ± 0%   1.219n ± 0%  -19.20% (p=0.000 n=10)
TrailingZeros16   3.526n ± 0%   1.437n ± 0%  -59.25% (p=0.000 n=10)
TrailingZeros32   3.161n ± 0%   1.004n ± 0%  -68.23% (p=0.000 n=10)
TrailingZeros64   2.759n ± 0%   1.003n ± 0%  -63.65% (p=0.000 n=10)
geomean           2.638n        1.121n       -57.51%

Go1 benchmark results on the same machine:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │ CL 479496 v8 │              this CL               │
                      │    sec/op    │   sec/op     vs base               │
BinaryTree17              14.10 ± 1%    13.64 ± 1%  -3.28% (p=0.000 n=10)
Fannkuch11                3.421 ± 0%    3.421 ± 0%       ~ (p=0.075 n=10)
FmtFprintfEmpty          94.78n ± 0%   94.50n ± 0%  -0.30% (p=0.000 n=10)
FmtFprintfString         155.0n ± 0%   154.1n ± 1%       ~ (p=1.000 n=10)
FmtFprintfInt            157.2n ± 0%   155.2n ± 1%  -1.27% (p=0.000 n=10)
FmtFprintfIntInt         242.1n ± 0%   238.0n ± 1%  -1.73% (p=0.000 n=10)
FmtFprintfPrefixedInt    337.6n ± 0%   334.6n ± 0%  -0.89% (p=0.000 n=10)
FmtFprintfFloat          399.0n ± 0%   396.4n ± 0%  -0.65% (p=0.000 n=10)
FmtManyArgs              959.8n ± 0%   923.4n ± 0%  -3.79% (p=0.000 n=10)
GobDecode                15.63m ± 3%   15.17m ± 1%  -2.90% (p=0.001 n=10)
GobEncode                18.43m ± 3%   17.62m ± 0%  -4.38% (p=0.000 n=10)
Gzip                     405.1m ± 0%   405.4m ± 0%  +0.06% (p=0.035 n=10)
Gunzip                   86.84m ± 0%   87.20m ± 0%  +0.41% (p=0.000 n=10)
HTTPClientServer         88.47µ ± 0%   86.92µ ± 1%  -1.75% (p=0.000 n=10)
JSONEncode               18.84m ± 0%   18.66m ± 0%  -0.95% (p=0.000 n=10)
JSONDecode               79.35m ± 0%   75.77m ± 1%  -4.51% (p=0.000 n=10)
Mandelbrot200            7.215m ± 0%   7.215m ± 0%       ~ (p=0.315 n=10)
GoParse                  7.591m ± 1%   7.407m ± 1%  -2.43% (p=0.000 n=10)
RegexpMatchEasy0_32      133.8n ± 0%   134.3n ± 0%  +0.37% (p=0.000 n=10)
RegexpMatchEasy0_1K      1.540µ ± 0%   1.544µ ± 0%  +0.26% (p=0.000 n=10)
RegexpMatchEasy1_32      164.1n ± 0%   165.4n ± 0%  +0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K      1.626µ ± 0%   1.629µ ± 0%  +0.18% (p=0.000 n=10)
RegexpMatchMedium_32     1.403µ ± 0%   1.413µ ± 0%  +0.71% (p=0.000 n=10)
RegexpMatchMedium_1K     41.22µ ± 0%   41.59µ ± 0%  +0.90% (p=0.000 n=10)
RegexpMatchHard_32       2.071µ ± 0%   2.060µ ± 0%  -0.53% (p=0.000 n=10)
RegexpMatchHard_1K       61.05µ ± 0%   61.30µ ± 0%  +0.41% (p=0.001 n=10)
Revcomp                   1.351 ± 0%    1.357 ± 0%  +0.42% (p=0.000 n=10)
Template                 117.3m ± 1%   110.6m ± 2%  -5.71% (p=0.000 n=10)
TimeParse                411.9n ± 0%   411.7n ± 0%       ~ (p=0.117 n=10)
TimeFormat               514.2n ± 0%   499.9n ± 0%  -2.77% (p=0.000 n=10)
geomean                  104.2µ        103.0µ       -1.15%

                     │ CL 479496 v8 │               this CL               │
                     │     B/s      │     B/s       vs base               │
GobDecode              46.84Mi ± 3%   48.24Mi ± 1%  +2.98% (p=0.001 n=10)
GobEncode              39.72Mi ± 4%   41.53Mi ± 0%  +4.57% (p=0.000 n=10)
Gzip                   45.68Mi ± 0%   45.65Mi ± 0%  -0.05% (p=0.029 n=10)
Gunzip                 213.1Mi ± 0%   212.2Mi ± 0%  -0.41% (p=0.000 n=10)
JSONEncode             98.23Mi ± 0%   99.18Mi ± 0%  +0.97% (p=0.000 n=10)
JSONDecode             23.32Mi ± 0%   24.42Mi ± 1%  +4.72% (p=0.000 n=10)
GoParse                7.277Mi ± 1%   7.458Mi ± 1%  +2.49% (p=0.000 n=10)
RegexpMatchEasy0_32    228.1Mi ± 0%   227.3Mi ± 0%  -0.36% (p=0.000 n=10)
RegexpMatchEasy0_1K    634.2Mi ± 0%   632.5Mi ± 0%  -0.27% (p=0.000 n=10)
RegexpMatchEasy1_32    186.0Mi ± 0%   184.5Mi ± 0%  -0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K    600.4Mi ± 0%   599.4Mi ± 0%  -0.17% (p=0.000 n=10)
RegexpMatchMedium_32   21.75Mi ± 0%   21.60Mi ± 0%  -0.70% (p=0.000 n=10)
RegexpMatchMedium_1K   23.69Mi ± 0%   23.48Mi ± 0%  -0.89% (p=0.000 n=10)
RegexpMatchHard_32     14.73Mi ± 0%   14.81Mi ± 0%  +0.52% (p=0.000 n=10)
RegexpMatchHard_1K     15.99Mi ± 0%   15.93Mi ± 0%  -0.42% (p=0.000 n=10)
Revcomp                179.4Mi ± 0%   178.6Mi ± 0%  -0.42% (p=0.000 n=10)
Template               15.78Mi ± 1%   16.73Mi ± 2%  +6.04% (p=0.000 n=10)
geomean                59.97Mi        60.58Mi       +1.02%

The change should be a net win, as all it does is pattern-match Ctz ops
and replace them with the respective native instructions, so any
performance regression is likely also micro-architecture related, as
observed in CL 479496's results. (Indeed, some of the more drastic
improvements may well also be coincidental, but the point is that there
is at least a small deterministic improvement anyway.)

Updates golang#59120

Change-Id: I6c90f727eb00e0add2a5f8575ac045b9e288af54

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: implement FMA codegen for loong64

c3b38ff

Benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │ CL 479498 v11 │               this CL               │
                      │    sec/op     │   sec/op     vs base                │
BinaryTree17               13.64 ± 1%    13.75 ± 2%        ~ (p=0.579 n=10)
Fannkuch11                 3.421 ± 0%    3.650 ± 0%   +6.70% (p=0.000 n=10)
FmtFprintfEmpty           94.50n ± 0%   94.45n ± 0%   -0.05% (p=0.000 n=10)
FmtFprintfString          154.1n ± 1%   155.2n ± 0%        ~ (p=0.689 n=10)
FmtFprintfInt             155.2n ± 1%   154.4n ± 0%        ~ (p=0.785 n=10)
FmtFprintfIntInt          238.0n ± 1%   237.1n ± 0%        ~ (p=0.721 n=10)
FmtFprintfPrefixedInt     334.6n ± 0%   312.8n ± 0%   -6.52% (p=0.000 n=10)
FmtFprintfFloat           396.4n ± 0%   390.5n ± 0%   -1.49% (p=0.000 n=10)
FmtManyArgs               923.4n ± 0%   905.0n ± 0%   -2.00% (p=0.000 n=10)
GobDecode                 15.17m ± 1%   14.93m ± 1%   -1.59% (p=0.000 n=10)
GobEncode                 17.62m ± 0%   17.33m ± 0%   -1.65% (p=0.001 n=10)
Gzip                      405.4m ± 0%   404.3m ± 0%   -0.26% (p=0.000 n=10)
Gunzip                    87.20m ± 0%   80.92m ± 0%   -7.20% (p=0.000 n=10)
HTTPClientServer          86.92µ ± 1%   86.14µ ± 0%   -0.90% (p=0.000 n=10)
JSONEncode                18.66m ± 0%   18.49m ± 0%   -0.91% (p=0.000 n=10)
JSONDecode                75.77m ± 1%   77.34m ± 1%   +2.07% (p=0.000 n=10)
Mandelbrot200             7.215m ± 0%   6.521m ± 0%   -9.62% (p=0.000 n=10)
GoParse                   7.407m ± 1%   7.324m ± 1%   -1.12% (p=0.003 n=10)
RegexpMatchEasy0_32       134.3n ± 0%   134.6n ± 0%   +0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K       1.544µ ± 0%   1.365µ ± 0%  -11.63% (p=0.000 n=10)
RegexpMatchEasy1_32       165.4n ± 0%   164.1n ± 0%   -0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K       1.629µ ± 0%   1.492µ ± 0%   -8.41% (p=0.000 n=10)
RegexpMatchMedium_32      1.413µ ± 0%   1.404µ ± 0%   -0.64% (p=0.000 n=10)
RegexpMatchMedium_1K      41.59µ ± 0%   41.05µ ± 0%   -1.28% (p=0.000 n=10)
RegexpMatchHard_32        2.060µ ± 0%   2.072µ ± 0%   +0.58% (p=0.000 n=10)
RegexpMatchHard_1K        61.30µ ± 0%   60.89µ ± 0%   -0.68% (p=0.000 n=10)
Revcomp                    1.357 ± 0%    1.199 ± 1%  -11.64% (p=0.000 n=10)
Template                  110.6m ± 2%   112.3m ± 2%        ~ (p=0.105 n=10)
TimeParse                 411.7n ± 0%   414.2n ± 1%   +0.60% (p=0.000 n=10)
TimeFormat                499.9n ± 0%   496.9n ± 0%   -0.60% (p=0.000 n=10)
geomean                   103.0µ        101.0µ        -1.98%

                     │ CL 479498 v11 │                this CL                │
                     │      B/s      │      B/s       vs base                │
GobDecode               48.24Mi ± 1%    49.02Mi ± 1%   +1.62% (p=0.000 n=10)
GobEncode               41.53Mi ± 0%    42.23Mi ± 0%   +1.69% (p=0.001 n=10)
Gzip                    45.65Mi ± 0%    45.77Mi ± 0%   +0.25% (p=0.000 n=10)
Gunzip                  212.2Mi ± 0%    228.7Mi ± 0%   +7.76% (p=0.000 n=10)
JSONEncode              99.18Mi ± 0%   100.08Mi ± 0%   +0.91% (p=0.000 n=10)
JSONDecode              24.42Mi ± 1%    23.93Mi ± 1%   -2.03% (p=0.000 n=10)
GoParse                 7.458Mi ± 1%    7.544Mi ± 1%   +1.15% (p=0.001 n=10)
RegexpMatchEasy0_32     227.3Mi ± 0%    226.8Mi ± 0%   -0.21% (p=0.000 n=10)
RegexpMatchEasy0_1K     632.5Mi ± 0%    715.7Mi ± 0%  +13.15% (p=0.000 n=10)
RegexpMatchEasy1_32     184.5Mi ± 0%    186.0Mi ± 0%   +0.81% (p=0.000 n=10)
RegexpMatchEasy1_1K     599.4Mi ± 0%    654.3Mi ± 0%   +9.17% (p=0.000 n=10)
RegexpMatchMedium_32    21.60Mi ± 0%    21.74Mi ± 0%   +0.64% (p=0.000 n=10)
RegexpMatchMedium_1K    23.48Mi ± 0%    23.78Mi ± 0%   +1.30% (p=0.000 n=10)
RegexpMatchHard_32      14.81Mi ± 0%    14.72Mi ± 0%   -0.58% (p=0.000 n=10)
RegexpMatchHard_1K      15.93Mi ± 0%    16.04Mi ± 0%   +0.72% (p=0.000 n=10)
Revcomp                 178.6Mi ± 0%    202.2Mi ± 1%  +13.18% (p=0.000 n=10)
Template                16.73Mi ± 2%    16.48Mi ± 2%        ~ (p=0.093 n=10)
geomean                 60.58Mi         62.23Mi        +2.72%

The only significant regression is the Fannkuch11 case; perf records were
manually inspected, and the hottest part of the code is virtually
unchanged except for the alignment of two instructions, which seem to sit
on different sides of a 32- or even 64-byte boundary. So again, the
regression is likely due to micro-architecture quirks, and the change is
in fact a win across the board.

Updates golang#59120

Change-Id: Ibbf64988c9d06f7c1d359480a1d6aecfa2c25b65

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up math/bits.Len intrinsics for loong64

d44b9f2

For the SubFromLen64 codegen test case to work as intended, we need
to fold c-(-(x-d)) into x+(c-d).
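
The fold is plain algebra: c-(-(x-d)) = c+(x-d) = x+(c-d), and it also holds under Go's wrapping integer arithmetic. The sketch below checks the identity and shows the related fact that `64 - bits.Len64(n)` equals `bits.LeadingZeros64(n)`. The function names are made up, and the body of `subFromLen64` is an assumption about what the codegen test computes, not a copy of it.

```go
package main

import (
	"fmt"
	"math/bits"
)

// unfolded is the expression shape before the rewrite.
func unfolded(x, c, d int64) int64 { return c - (-(x - d)) }

// folded is the shape after the rewrite; c-d can be constant-folded
// when c and d are both constants.
func folded(x, c, d int64) int64 { return x + (c - d) }

// subFromLen64 is assumed (not confirmed here) to mirror the codegen
// test of the same name.
func subFromLen64(n uint64) int { return 64 - bits.Len64(n) }

func main() {
	minInt := int64(-1) << 63 // exercises wraparound
	for _, x := range []int64{0, 5, -7, minInt} {
		fmt.Println(x, unfolded(x, 64, 3) == folded(x, 64, 3))
	}
	for _, n := range []uint64{0, 1, 0xff, 1 << 63} {
		fmt.Println(n, subFromLen64(n) == bits.LeadingZeros64(n))
	}
}
```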

Still, some instances of LeadingZeros are not optimized into single
CLZ instructions right now (actually, the LeadingZeros micro-benchmarks
are currently still compiled with redundant adds/subs of 64, due to
interference of loop optimizations before lowering), but perf numbers
indicate it's not that bad after all.

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
               │   before    │                after                │
               │   sec/op    │   sec/op     vs base                │
LeadingZeros     3.675n ± 0%   1.545n ± 1%  -57.96% (p=0.000 n=10)
LeadingZeros8    2.001n ± 0%   1.868n ± 0%   -6.62% (p=0.000 n=10)
LeadingZeros16   3.144n ± 0%   1.864n ± 1%  -40.71% (p=0.000 n=10)
LeadingZeros32   4.265n ± 1%   1.653n ± 1%  -61.24% (p=0.000 n=10)
LeadingZeros64   3.962n ± 0%   1.539n ± 0%  -61.16% (p=0.000 n=10)
geomean          3.299n        1.688n       -48.84%

go1 benchmark results on the same box:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │  CL 483355  │              this CL               │
                      │   sec/op    │   sec/op     vs base               │
BinaryTree17             13.75 ± 2%    13.70 ± 2%       ~ (p=0.579 n=10)
Fannkuch11               3.650 ± 0%    3.415 ± 0%  -6.46% (p=0.000 n=10)
FmtFprintfEmpty         94.45n ± 0%   94.98n ± 0%  +0.56% (p=0.000 n=10)
FmtFprintfString        155.2n ± 0%   151.1n ± 0%  -2.61% (p=0.000 n=10)
FmtFprintfInt           154.4n ± 0%   153.6n ± 0%  -0.52% (p=0.000 n=10)
FmtFprintfIntInt        237.1n ± 0%   234.7n ± 0%  -0.99% (p=0.000 n=10)
FmtFprintfPrefixedInt   312.8n ± 0%   314.2n ± 0%  +0.45% (p=0.000 n=10)
FmtFprintfFloat         390.5n ± 0%   402.1n ± 0%  +2.97% (p=0.000 n=10)
FmtManyArgs             905.0n ± 0%   918.6n ± 0%  +1.51% (p=0.000 n=10)
GobDecode               14.93m ± 1%   14.98m ± 1%  +0.33% (p=0.015 n=10)
GobEncode               17.33m ± 0%   17.26m ± 1%  -0.39% (p=0.023 n=10)
Gzip                    404.3m ± 0%   404.6m ± 0%  +0.08% (p=0.000 n=10)
Gunzip                  80.92m ± 0%   80.97m ± 0%  +0.06% (p=0.000 n=10)
HTTPClientServer        86.14µ ± 0%   84.39µ ± 0%  -2.03% (p=0.000 n=10)
JSONEncode              18.49m ± 0%   18.50m ± 0%       ~ (p=0.436 n=10)
JSONDecode              77.34m ± 1%   76.26m ± 1%  -1.40% (p=0.000 n=10)
Mandelbrot200           6.521m ± 0%   6.508m ± 0%       ~ (p=0.138 n=10)
GoParse                 7.324m ± 1%   7.413m ± 1%  +1.22% (p=0.005 n=10)
RegexpMatchEasy0_32     134.6n ± 0%   134.6n ± 0%       ~ (p=0.195 n=10)
RegexpMatchEasy0_1K     1.365µ ± 0%   1.366µ ± 0%  +0.07% (p=0.038 n=10)
RegexpMatchEasy1_32     164.1n ± 0%   164.1n ± 0%       ~ (p=0.230 n=10)
RegexpMatchEasy1_1K     1.492µ ± 0%   1.492µ ± 0%       ~ (p=0.211 n=10)
RegexpMatchMedium_32    1.404µ ± 0%   1.403µ ± 0%  -0.07% (p=0.000 n=10)
RegexpMatchMedium_1K    41.05µ ± 0%   41.04µ ± 0%  -0.04% (p=0.000 n=10)
RegexpMatchHard_32      2.072µ ± 0%   2.071µ ± 0%  -0.05% (p=0.000 n=10)
RegexpMatchHard_1K      60.89µ ± 0%   60.87µ ± 0%  -0.04% (p=0.000 n=10)
Revcomp                  1.199 ± 1%    1.200 ± 0%       ~ (p=0.481 n=10)
Template                112.3m ± 2%   112.9m ± 2%       ~ (p=0.353 n=10)
TimeParse               414.2n ± 1%   412.5n ± 0%  -0.40% (p=0.000 n=10)
TimeFormat              496.9n ± 0%   496.6n ± 0%       ~ (p=0.341 n=10)
geomean                 101.0µ        100.7µ       -0.26%

                     │  CL 483355   │                this CL                │
                     │     B/s      │     B/s       vs base                 │
GobDecode              49.02Mi ± 1%   48.87Mi ± 1%  -0.32% (p=0.014 n=10)
GobEncode              42.23Mi ± 0%   42.40Mi ± 1%  +0.40% (p=0.022 n=10)
Gzip                   45.77Mi ± 0%   45.73Mi ± 0%  -0.07% (p=0.000 n=10)
Gunzip                 228.7Mi ± 0%   228.6Mi ± 0%  -0.06% (p=0.000 n=10)
JSONEncode             100.1Mi ± 0%   100.0Mi ± 0%       ~ (p=0.470 n=10)
JSONDecode             23.93Mi ± 1%   24.27Mi ± 1%  +1.43% (p=0.000 n=10)
GoParse                7.544Mi ± 1%   7.448Mi ± 1%  -1.26% (p=0.005 n=10)
RegexpMatchEasy0_32    226.8Mi ± 0%   226.7Mi ± 0%  -0.06% (p=0.001 n=10)
RegexpMatchEasy0_1K    715.7Mi ± 0%   715.1Mi ± 0%  -0.08% (p=0.022 n=10)
RegexpMatchEasy1_32    186.0Mi ± 0%   186.0Mi ± 0%       ~ (p=0.493 n=10)
RegexpMatchEasy1_1K    654.3Mi ± 0%   654.6Mi ± 0%  +0.04% (p=0.000 n=10)
RegexpMatchMedium_32   21.74Mi ± 0%   21.74Mi ± 0%  +0.02% (p=0.022 n=10)
RegexpMatchMedium_1K   23.78Mi ± 0%   23.79Mi ± 0%  +0.04% (p=0.000 n=10)
RegexpMatchHard_32     14.72Mi ± 0%   14.73Mi ± 0%  +0.06% (p=0.000 n=10)
RegexpMatchHard_1K     16.04Mi ± 0%   16.04Mi ± 0%       ~ (p=1.000 n=10) ¹
Revcomp                202.2Mi ± 1%   202.0Mi ± 0%       ~ (p=0.469 n=10)
Template               16.48Mi ± 2%   16.38Mi ± 2%       ~ (p=0.342 n=10)
geomean                62.23Mi        62.21Mi       -0.04%
¹ all samples are equal

In this case though, all significant perf changes are likely due to
micro-architectural quirks.

Updates golang#59120

Change-Id: Icc8f7d8e79c6168aae634f5c36f044f3fd034d89

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up Bswap/ReverseBytes intrinsics for loong64

db6413f

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
               │    before    │                after                 │
               │    sec/op    │    sec/op     vs base                │
ReverseBytes     3.0130n ± 0%   0.6517n ± 2%  -78.37% (p=0.000 n=10)
ReverseBytes16   0.9027n ± 0%   0.6526n ± 2%  -27.71% (p=0.000 n=10)
ReverseBytes32   1.7040n ± 0%   0.6511n ± 1%  -61.79% (p=0.000 n=10)
ReverseBytes64   2.7080n ± 0%   0.6499n ± 1%  -76.00% (p=0.000 n=10)
geomean           1.882n        0.6513n       -65.40%

Go1 benchmark results indicate no meaningful change except for
micro-architecture-related fluctuations.

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up bits.Reverse intrinsics for loong64

9cfb8d5

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
          │    before    │                after                 │
          │    sec/op    │    sec/op     vs base                │
Reverse     4.2280n ± 0%   0.8029n ± 0%  -81.01% (p=0.000 n=10)
Reverse8    1.0050n ± 0%   0.8029n ± 0%  -20.11% (p=0.000 n=10)
Reverse16   1.9600n ± 0%   0.8029n ± 0%  -59.04% (p=0.000 n=10)
Reverse32   4.0205n ± 0%   0.8029n ± 0%  -80.03% (p=0.000 n=10)
Reverse64   4.0360n ± 0%   0.8029n ± 0%  -80.11% (p=0.000 n=10)
geomean      2.668n        0.8029n       -69.90%

The operation seems to be unused in the tree except in compress/flate,
where a very slight improvement (time geomean -0.16%, throughput
geomean +0.16%) was observed with the change applied.
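
To illustrate the kind of use compress/flate has for bit reversal, here is a small sketch that reverses the low bits of a Huffman-style code with `bits.Reverse16`. The helper is modeled on what that package does, but the name and signature are assumptions here; treat it as an example of the operation rather than the package's actual code.

```go
package main

import (
	"fmt"
	"math/bits"
)

// reverseBits reverses the low bitLength bits of number (hypothetical
// helper). Shifting the code to the top of the 16-bit word first means the
// full-width reversal leaves the reversed code in the low bits.
// bitLength is expected to be in the range 1..16.
func reverseBits(number uint16, bitLength byte) uint16 {
	return bits.Reverse16(number << (16 - bitLength))
}

func main() {
	fmt.Printf("%03b\n", reverseBits(0b110, 3))
	fmt.Printf("%05b\n", reverseBits(0b10100, 5))
}
```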

Updates golang#59120

Change-Id: Ie1b446386655e0bb6808e435257293c30420626e

Contributor

gopherbot commented Apr 11, 2023

Change https://go.dev/cl/483656 mentions this issue: cmd/compile: wire up bits.Reverse intrinsics for loong64

xen0n added a commit to xen0n/go that referenced this issue


          cmd/asm: use single-instruction forms for all loong64 sign and zero extensions

b487c05

Before this change, 8- and 16-bit sign extensions and 32-bit zero
extensions were realized with a left shift followed by a right shift. We
now support assembling EXTWB, EXTWH and BSTRPICKV, so each of the three
can be done with a single instruction.

Benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │  CL 479495  │              this CL               │
                      │   sec/op    │   sec/op     vs base               │
BinaryTree17             14.12 ± 1%    14.06 ± 1%       ~ (p=0.393 n=10)
Fannkuch11               3.420 ± 0%    3.421 ± 0%  +0.04% (p=0.001 n=10)
FmtFprintfEmpty         94.72n ± 0%   94.97n ± 0%  +0.26% (p=0.000 n=10)
FmtFprintfString        152.6n ± 0%   155.3n ± 0%  +1.77% (p=0.000 n=10)
FmtFprintfInt           154.5n ± 0%   154.5n ± 0%       ~ (p=0.263 n=10)
FmtFprintfIntInt        237.7n ± 0%   237.1n ± 0%  -0.21% (p=0.000 n=10)
FmtFprintfPrefixedInt   313.1n ± 0%   313.0n ± 0%  -0.03% (p=0.000 n=10)
FmtFprintfFloat         394.1n ± 0%   392.8n ± 0%  -0.32% (p=0.000 n=10)
FmtManyArgs             934.3n ± 0%   912.6n ± 0%  -2.32% (p=0.000 n=10)
GobDecode               15.29m ± 1%   15.23m ± 1%       ~ (p=0.280 n=10)
GobEncode               17.76m ± 0%   17.66m ± 0%  -0.60% (p=0.000 n=10)
Gzip                    416.0m ± 0%   404.4m ± 0%  -2.79% (p=0.000 n=10)
Gunzip                  83.20m ± 0%   80.88m ± 0%  -2.79% (p=0.000 n=10)
HTTPClientServer        87.82µ ± 1%   87.09µ ± 1%  -0.83% (p=0.000 n=10)
JSONEncode              18.56m ± 0%   18.54m ± 0%       ~ (p=0.123 n=10)
JSONDecode              76.53m ± 0%   78.22m ± 1%  +2.21% (p=0.000 n=10)
Mandelbrot200           7.217m ± 0%   7.215m ± 0%       ~ (p=0.143 n=10)
GoParse                 7.587m ± 1%   7.520m ± 1%       ~ (p=0.165 n=10)
RegexpMatchEasy0_32     134.2n ± 0%   134.5n ± 0%  +0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K     1.366µ ± 0%   1.364µ ± 0%  -0.15% (p=0.000 n=10)
RegexpMatchEasy1_32     163.0n ± 0%   164.0n ± 0%  +0.61% (p=0.000 n=10)
RegexpMatchEasy1_1K     1.497µ ± 0%   1.492µ ± 0%  -0.33% (p=0.000 n=10)
RegexpMatchMedium_32    1.415µ ± 0%   1.403µ ± 0%  -0.85% (p=0.000 n=10)
RegexpMatchMedium_1K    41.61µ ± 0%   41.05µ ± 0%  -1.36% (p=0.000 n=10)
RegexpMatchHard_32      2.121µ ± 0%   2.070µ ± 0%  -2.43% (p=0.000 n=10)
RegexpMatchHard_1K      62.64µ ± 0%   60.87µ ± 0%  -2.83% (p=0.000 n=10)
Revcomp                  1.204 ± 0%    1.210 ± 0%  +0.51% (p=0.000 n=10)
Template                118.0m ± 0%   115.2m ± 1%  -2.31% (p=0.000 n=10)
TimeParse               414.8n ± 0%   410.6n ± 0%  -1.01% (p=0.000 n=10)
TimeFormat              510.7n ± 0%   508.2n ± 0%  -0.48% (p=0.000 n=10)
geomean                 102.3µ        101.7µ       -0.60%

                     │  CL 479495   │               this CL               │
                     │     B/s      │     B/s       vs base               │
GobDecode              47.88Mi ± 1%   48.05Mi ± 1%       ~ (p=0.280 n=10)
GobEncode              41.20Mi ± 0%   41.45Mi ± 0%  +0.60% (p=0.000 n=10)
Gzip                   44.49Mi ± 0%   45.77Mi ± 0%  +2.87% (p=0.000 n=10)
Gunzip                 222.4Mi ± 0%   228.8Mi ± 0%  +2.87% (p=0.000 n=10)
JSONEncode             99.69Mi ± 0%   99.82Mi ± 0%       ~ (p=0.118 n=10)
JSONDecode             24.19Mi ± 0%   23.66Mi ± 1%  -2.19% (p=0.000 n=10)
GoParse                7.281Mi ± 2%   7.343Mi ± 1%       ~ (p=0.187 n=10)
RegexpMatchEasy0_32    227.4Mi ± 0%   226.9Mi ± 0%  -0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K    715.0Mi ± 0%   716.0Mi ± 0%  +0.13% (p=0.000 n=10)
RegexpMatchEasy1_32    187.3Mi ± 0%   186.1Mi ± 0%  -0.62% (p=0.000 n=10)
RegexpMatchEasy1_1K    652.3Mi ± 0%   654.5Mi ± 0%  +0.34% (p=0.000 n=10)
RegexpMatchMedium_32   21.57Mi ± 0%   21.74Mi ± 0%  +0.80% (p=0.000 n=10)
RegexpMatchMedium_1K   23.47Mi ± 0%   23.79Mi ± 0%  +1.38% (p=0.000 n=10)
RegexpMatchHard_32     14.39Mi ± 0%   14.74Mi ± 0%  +2.45% (p=0.000 n=10)
RegexpMatchHard_1K     15.59Mi ± 0%   16.04Mi ± 0%  +2.87% (p=0.000 n=10)
Revcomp                201.3Mi ± 0%   200.3Mi ± 0%  -0.51% (p=0.000 n=10)
Template               15.69Mi ± 0%   16.06Mi ± 1%  +2.37% (p=0.000 n=10)
geomean                61.31Mi        61.82Mi       +0.84%

The test binaries were pre-compiled with `go test -c`, and the test runs
were wrapped with `perf stat record` for recording dynamic instruction
counts. The instruction count, IPC and branch misprediction rate did not
meaningfully change.

To investigate the JSONDecode regression, `perf stat` was used to check
micro-architectural details:

$ sudo perf stat <test executable> -test.timeout=30m -test.run='^$' \
    -test.cpu=1 -test.bench='JSONDecode' -test.count=1 -test.benchtime=50x

Before:

          4,256.10 msec task-clock               #    1.061 CPUs utilized
            61,431      context-switches         #   14.434 K/sec
                 3      cpu-migrations           #    0.705 /sec
             3,297      page-faults              #  774.652 /sec
    10,364,990,422      cycles                   #    2.435 GHz
    19,640,571,817      instructions             #    1.89  insn per cycle
     4,267,623,324      branches                 #    1.003 G/sec
        44,164,375      branch-misses            #    1.03% of all branches

After:

          4,343.17 msec task-clock               #    1.061 CPUs utilized
            62,742      context-switches         #   14.446 K/sec
                 5      cpu-migrations           #    1.151 /sec
             3,044      page-faults              #  700.871 /sec
    10,577,322,342      cycles                   #    2.435 GHz
    19,582,895,547      instructions             #    1.85  insn per cycle
     4,266,051,537      branches                 #  982.244 M/sec
        46,298,286      branch-misses            #    1.09% of all branches

Instruction count decreased by 0.29% but cycle count went up by 2.05%,
and the branch misprediction rate rose too (1.03% to 1.09%). This is
likely caused by the micro-architecture's sensitivity to changed code
layout; the optimization implemented here should be a net win otherwise.

Updates golang#59120

Change-Id: Ia7dd0dfe20c0ea3e64889e2b38c6b2118b50d56e
(cherry picked from commit 6c2c3c8470a0a5d0e756e50cf45f140d553ef0b2)

xen0n added a commit to xen0n/go that referenced this issue


          [release-branch.go1.20] cmd/compile: wire up math/bits.TrailingZeros intrinsics for loong64

a023c49

The runtime malloc implementation makes use of these, among others.

Some generic strength reduction rules for Ctz ops have also been added,
though they are only enabled for loong64 for now. They are necessary to
make the optimization profitable at all, as the LA464 micro-architecture
apparently handles the `TrailingZeros64(x) < 64` part in
runtime.nextFreeFast very badly if the compiled branch is no longer a
simple BEQZ (which used to be the case, when the compiler was able to
peek into the pure Go implementation of TrailingZeros). Without the
generic rules this change would be a big perf hit (as bad as 7-10% in
select go1 benchmark cases).

The generic changes were benchmarked on linux/amd64 (Threadripper 3990X)
and darwin/arm64 (Apple M1 Pro) too, but the results are either mixed
(amd64) or even a net loss (arm64). So, for now, those rules are guarded
with a predicate that only enables them for loong64.

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
                │   before    │                after                │
                │   sec/op    │   sec/op     vs base                │
TrailingZeros     2.758n ± 0%   1.004n ± 0%  -63.60% (p=0.000 n=10)
TrailingZeros8    1.508n ± 0%   1.219n ± 0%  -19.20% (p=0.000 n=10)
TrailingZeros16   3.526n ± 0%   1.437n ± 0%  -59.25% (p=0.000 n=10)
TrailingZeros32   3.161n ± 0%   1.004n ± 0%  -68.23% (p=0.000 n=10)
TrailingZeros64   2.759n ± 0%   1.003n ± 0%  -63.65% (p=0.000 n=10)
geomean           2.638n        1.121n       -57.51%

Go1 benchmark results on the same machine:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │ CL 479496 v8 │              this CL               │
                      │    sec/op    │   sec/op     vs base               │
BinaryTree17              14.10 ± 1%    13.64 ± 1%  -3.28% (p=0.000 n=10)
Fannkuch11                3.421 ± 0%    3.421 ± 0%       ~ (p=0.075 n=10)
FmtFprintfEmpty          94.78n ± 0%   94.50n ± 0%  -0.30% (p=0.000 n=10)
FmtFprintfString         155.0n ± 0%   154.1n ± 1%       ~ (p=1.000 n=10)
FmtFprintfInt            157.2n ± 0%   155.2n ± 1%  -1.27% (p=0.000 n=10)
FmtFprintfIntInt         242.1n ± 0%   238.0n ± 1%  -1.73% (p=0.000 n=10)
FmtFprintfPrefixedInt    337.6n ± 0%   334.6n ± 0%  -0.89% (p=0.000 n=10)
FmtFprintfFloat          399.0n ± 0%   396.4n ± 0%  -0.65% (p=0.000 n=10)
FmtManyArgs              959.8n ± 0%   923.4n ± 0%  -3.79% (p=0.000 n=10)
GobDecode                15.63m ± 3%   15.17m ± 1%  -2.90% (p=0.001 n=10)
GobEncode                18.43m ± 3%   17.62m ± 0%  -4.38% (p=0.000 n=10)
Gzip                     405.1m ± 0%   405.4m ± 0%  +0.06% (p=0.035 n=10)
Gunzip                   86.84m ± 0%   87.20m ± 0%  +0.41% (p=0.000 n=10)
HTTPClientServer         88.47µ ± 0%   86.92µ ± 1%  -1.75% (p=0.000 n=10)
JSONEncode               18.84m ± 0%   18.66m ± 0%  -0.95% (p=0.000 n=10)
JSONDecode               79.35m ± 0%   75.77m ± 1%  -4.51% (p=0.000 n=10)
Mandelbrot200            7.215m ± 0%   7.215m ± 0%       ~ (p=0.315 n=10)
GoParse                  7.591m ± 1%   7.407m ± 1%  -2.43% (p=0.000 n=10)
RegexpMatchEasy0_32      133.8n ± 0%   134.3n ± 0%  +0.37% (p=0.000 n=10)
RegexpMatchEasy0_1K      1.540µ ± 0%   1.544µ ± 0%  +0.26% (p=0.000 n=10)
RegexpMatchEasy1_32      164.1n ± 0%   165.4n ± 0%  +0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K      1.626µ ± 0%   1.629µ ± 0%  +0.18% (p=0.000 n=10)
RegexpMatchMedium_32     1.403µ ± 0%   1.413µ ± 0%  +0.71% (p=0.000 n=10)
RegexpMatchMedium_1K     41.22µ ± 0%   41.59µ ± 0%  +0.90% (p=0.000 n=10)
RegexpMatchHard_32       2.071µ ± 0%   2.060µ ± 0%  -0.53% (p=0.000 n=10)
RegexpMatchHard_1K       61.05µ ± 0%   61.30µ ± 0%  +0.41% (p=0.001 n=10)
Revcomp                   1.351 ± 0%    1.357 ± 0%  +0.42% (p=0.000 n=10)
Template                 117.3m ± 1%   110.6m ± 2%  -5.71% (p=0.000 n=10)
TimeParse                411.9n ± 0%   411.7n ± 0%       ~ (p=0.117 n=10)
TimeFormat               514.2n ± 0%   499.9n ± 0%  -2.77% (p=0.000 n=10)
geomean                  104.2µ        103.0µ       -1.15%

                     │ CL 479496 v8 │               this CL               │
                     │     B/s      │     B/s       vs base               │
GobDecode              46.84Mi ± 3%   48.24Mi ± 1%  +2.98% (p=0.001 n=10)
GobEncode              39.72Mi ± 4%   41.53Mi ± 0%  +4.57% (p=0.000 n=10)
Gzip                   45.68Mi ± 0%   45.65Mi ± 0%  -0.05% (p=0.029 n=10)
Gunzip                 213.1Mi ± 0%   212.2Mi ± 0%  -0.41% (p=0.000 n=10)
JSONEncode             98.23Mi ± 0%   99.18Mi ± 0%  +0.97% (p=0.000 n=10)
JSONDecode             23.32Mi ± 0%   24.42Mi ± 1%  +4.72% (p=0.000 n=10)
GoParse                7.277Mi ± 1%   7.458Mi ± 1%  +2.49% (p=0.000 n=10)
RegexpMatchEasy0_32    228.1Mi ± 0%   227.3Mi ± 0%  -0.36% (p=0.000 n=10)
RegexpMatchEasy0_1K    634.2Mi ± 0%   632.5Mi ± 0%  -0.27% (p=0.000 n=10)
RegexpMatchEasy1_32    186.0Mi ± 0%   184.5Mi ± 0%  -0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K    600.4Mi ± 0%   599.4Mi ± 0%  -0.17% (p=0.000 n=10)
RegexpMatchMedium_32   21.75Mi ± 0%   21.60Mi ± 0%  -0.70% (p=0.000 n=10)
RegexpMatchMedium_1K   23.69Mi ± 0%   23.48Mi ± 0%  -0.89% (p=0.000 n=10)
RegexpMatchHard_32     14.73Mi ± 0%   14.81Mi ± 0%  +0.52% (p=0.000 n=10)
RegexpMatchHard_1K     15.99Mi ± 0%   15.93Mi ± 0%  -0.42% (p=0.000 n=10)
Revcomp                179.4Mi ± 0%   178.6Mi ± 0%  -0.42% (p=0.000 n=10)
Template               15.78Mi ± 1%   16.73Mi ± 2%  +6.04% (p=0.000 n=10)
geomean                59.97Mi        60.58Mi       +1.02%

The change should be a net win, as all it does is pattern-match Ctz ops
and replace them with the respective native instructions, so any
performance regression is likely also micro-architecture related, as
observed in CL 479496's results. (Indeed, some of the more drastic
improvements may well also be coincidental, but the point is that there
is at least a small amount of deterministic improvement anyway.)

Updates golang#59120

Change-Id: I6c90f727eb00e0add2a5f8575ac045b9e288af54
(cherry picked from commit ba1650c3c739434795465d953ef9a193a68c5024)

xen0n added a commit to xen0n/go that referenced this issue


          [release-branch.go1.20] cmd/compile: implement FMA codegen for loong64

e578587

Benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │ CL 479498 v11 │               this CL               │
                      │    sec/op     │   sec/op     vs base                │
BinaryTree17               13.64 ± 1%    13.75 ± 2%        ~ (p=0.579 n=10)
Fannkuch11                 3.421 ± 0%    3.650 ± 0%   +6.70% (p=0.000 n=10)
FmtFprintfEmpty           94.50n ± 0%   94.45n ± 0%   -0.05% (p=0.000 n=10)
FmtFprintfString          154.1n ± 1%   155.2n ± 0%        ~ (p=0.689 n=10)
FmtFprintfInt             155.2n ± 1%   154.4n ± 0%        ~ (p=0.785 n=10)
FmtFprintfIntInt          238.0n ± 1%   237.1n ± 0%        ~ (p=0.721 n=10)
FmtFprintfPrefixedInt     334.6n ± 0%   312.8n ± 0%   -6.52% (p=0.000 n=10)
FmtFprintfFloat           396.4n ± 0%   390.5n ± 0%   -1.49% (p=0.000 n=10)
FmtManyArgs               923.4n ± 0%   905.0n ± 0%   -2.00% (p=0.000 n=10)
GobDecode                 15.17m ± 1%   14.93m ± 1%   -1.59% (p=0.000 n=10)
GobEncode                 17.62m ± 0%   17.33m ± 0%   -1.65% (p=0.001 n=10)
Gzip                      405.4m ± 0%   404.3m ± 0%   -0.26% (p=0.000 n=10)
Gunzip                    87.20m ± 0%   80.92m ± 0%   -7.20% (p=0.000 n=10)
HTTPClientServer          86.92µ ± 1%   86.14µ ± 0%   -0.90% (p=0.000 n=10)
JSONEncode                18.66m ± 0%   18.49m ± 0%   -0.91% (p=0.000 n=10)
JSONDecode                75.77m ± 1%   77.34m ± 1%   +2.07% (p=0.000 n=10)
Mandelbrot200             7.215m ± 0%   6.521m ± 0%   -9.62% (p=0.000 n=10)
GoParse                   7.407m ± 1%   7.324m ± 1%   -1.12% (p=0.003 n=10)
RegexpMatchEasy0_32       134.3n ± 0%   134.6n ± 0%   +0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K       1.544µ ± 0%   1.365µ ± 0%  -11.63% (p=0.000 n=10)
RegexpMatchEasy1_32       165.4n ± 0%   164.1n ± 0%   -0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K       1.629µ ± 0%   1.492µ ± 0%   -8.41% (p=0.000 n=10)
RegexpMatchMedium_32      1.413µ ± 0%   1.404µ ± 0%   -0.64% (p=0.000 n=10)
RegexpMatchMedium_1K      41.59µ ± 0%   41.05µ ± 0%   -1.28% (p=0.000 n=10)
RegexpMatchHard_32        2.060µ ± 0%   2.072µ ± 0%   +0.58% (p=0.000 n=10)
RegexpMatchHard_1K        61.30µ ± 0%   60.89µ ± 0%   -0.68% (p=0.000 n=10)
Revcomp                    1.357 ± 0%    1.199 ± 1%  -11.64% (p=0.000 n=10)
Template                  110.6m ± 2%   112.3m ± 2%        ~ (p=0.105 n=10)
TimeParse                 411.7n ± 0%   414.2n ± 1%   +0.60% (p=0.000 n=10)
TimeFormat                499.9n ± 0%   496.9n ± 0%   -0.60% (p=0.000 n=10)
geomean                   103.0µ        101.0µ        -1.98%

                     │ CL 479498 v11 │                this CL                │
                     │      B/s      │      B/s       vs base                │
GobDecode               48.24Mi ± 1%    49.02Mi ± 1%   +1.62% (p=0.000 n=10)
GobEncode               41.53Mi ± 0%    42.23Mi ± 0%   +1.69% (p=0.001 n=10)
Gzip                    45.65Mi ± 0%    45.77Mi ± 0%   +0.25% (p=0.000 n=10)
Gunzip                  212.2Mi ± 0%    228.7Mi ± 0%   +7.76% (p=0.000 n=10)
JSONEncode              99.18Mi ± 0%   100.08Mi ± 0%   +0.91% (p=0.000 n=10)
JSONDecode              24.42Mi ± 1%    23.93Mi ± 1%   -2.03% (p=0.000 n=10)
GoParse                 7.458Mi ± 1%    7.544Mi ± 1%   +1.15% (p=0.001 n=10)
RegexpMatchEasy0_32     227.3Mi ± 0%    226.8Mi ± 0%   -0.21% (p=0.000 n=10)
RegexpMatchEasy0_1K     632.5Mi ± 0%    715.7Mi ± 0%  +13.15% (p=0.000 n=10)
RegexpMatchEasy1_32     184.5Mi ± 0%    186.0Mi ± 0%   +0.81% (p=0.000 n=10)
RegexpMatchEasy1_1K     599.4Mi ± 0%    654.3Mi ± 0%   +9.17% (p=0.000 n=10)
RegexpMatchMedium_32    21.60Mi ± 0%    21.74Mi ± 0%   +0.64% (p=0.000 n=10)
RegexpMatchMedium_1K    23.48Mi ± 0%    23.78Mi ± 0%   +1.30% (p=0.000 n=10)
RegexpMatchHard_32      14.81Mi ± 0%    14.72Mi ± 0%   -0.58% (p=0.000 n=10)
RegexpMatchHard_1K      15.93Mi ± 0%    16.04Mi ± 0%   +0.72% (p=0.000 n=10)
Revcomp                 178.6Mi ± 0%    202.2Mi ± 1%  +13.18% (p=0.000 n=10)
Template                16.73Mi ± 2%    16.48Mi ± 2%        ~ (p=0.093 n=10)
geomean                 60.58Mi         62.23Mi        +2.72%

The only significant regression is the Fannkuch11 case. Perf records
were manually inspected: the hottest part of the code is virtually
unchanged except for the alignment of two instructions, which seem to
sit on different sides of a 32- or even 64-byte boundary. So again, the
regression is likely due to micro-architecture quirks, and the change is
in fact a win across the board.

Updates golang#59120

Change-Id: Ibbf64988c9d06f7c1d359480a1d6aecfa2c25b65
(cherry picked from commit 03e1790d8d84c3955b0294992f1d7b6b7693ed3f)

xen0n added a commit to xen0n/go that referenced this issue


          [release-branch.go1.20] cmd/compile: wire up math/bits.Len intrinsics for loong64

da0d766

For the SubFromLen64 codegen test case to work as intended, we need
to fold c-(-(x-d)) into x+(c-d).

Still, some instances of LeadingZeros are not optimized into single
CLZ instructions right now (actually, the LeadingZeros micro-benchmarks
are currently still compiled with redundant adds/subs of 64, due to
interference of loop optimizations before lowering), but perf numbers
indicate it's not that bad after all.

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
               │   before    │                after                │
               │   sec/op    │   sec/op     vs base                │
LeadingZeros     3.675n ± 0%   1.545n ± 1%  -57.96% (p=0.000 n=10)
LeadingZeros8    2.001n ± 0%   1.868n ± 0%   -6.62% (p=0.000 n=10)
LeadingZeros16   3.144n ± 0%   1.864n ± 1%  -40.71% (p=0.000 n=10)
LeadingZeros32   4.265n ± 1%   1.653n ± 1%  -61.24% (p=0.000 n=10)
LeadingZeros64   3.962n ± 0%   1.539n ± 0%  -61.16% (p=0.000 n=10)
geomean          3.299n        1.688n       -48.84%

go1 benchmark results on the same box:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │  CL 483355  │              this CL               │
                      │   sec/op    │   sec/op     vs base               │
BinaryTree17             13.75 ± 2%    13.70 ± 2%       ~ (p=0.579 n=10)
Fannkuch11               3.650 ± 0%    3.415 ± 0%  -6.46% (p=0.000 n=10)
FmtFprintfEmpty         94.45n ± 0%   94.98n ± 0%  +0.56% (p=0.000 n=10)
FmtFprintfString        155.2n ± 0%   151.1n ± 0%  -2.61% (p=0.000 n=10)
FmtFprintfInt           154.4n ± 0%   153.6n ± 0%  -0.52% (p=0.000 n=10)
FmtFprintfIntInt        237.1n ± 0%   234.7n ± 0%  -0.99% (p=0.000 n=10)
FmtFprintfPrefixedInt   312.8n ± 0%   314.2n ± 0%  +0.45% (p=0.000 n=10)
FmtFprintfFloat         390.5n ± 0%   402.1n ± 0%  +2.97% (p=0.000 n=10)
FmtManyArgs             905.0n ± 0%   918.6n ± 0%  +1.51% (p=0.000 n=10)
GobDecode               14.93m ± 1%   14.98m ± 1%  +0.33% (p=0.015 n=10)
GobEncode               17.33m ± 0%   17.26m ± 1%  -0.39% (p=0.023 n=10)
Gzip                    404.3m ± 0%   404.6m ± 0%  +0.08% (p=0.000 n=10)
Gunzip                  80.92m ± 0%   80.97m ± 0%  +0.06% (p=0.000 n=10)
HTTPClientServer        86.14µ ± 0%   84.39µ ± 0%  -2.03% (p=0.000 n=10)
JSONEncode              18.49m ± 0%   18.50m ± 0%       ~ (p=0.436 n=10)
JSONDecode              77.34m ± 1%   76.26m ± 1%  -1.40% (p=0.000 n=10)
Mandelbrot200           6.521m ± 0%   6.508m ± 0%       ~ (p=0.138 n=10)
GoParse                 7.324m ± 1%   7.413m ± 1%  +1.22% (p=0.005 n=10)
RegexpMatchEasy0_32     134.6n ± 0%   134.6n ± 0%       ~ (p=0.195 n=10)
RegexpMatchEasy0_1K     1.365µ ± 0%   1.366µ ± 0%  +0.07% (p=0.038 n=10)
RegexpMatchEasy1_32     164.1n ± 0%   164.1n ± 0%       ~ (p=0.230 n=10)
RegexpMatchEasy1_1K     1.492µ ± 0%   1.492µ ± 0%       ~ (p=0.211 n=10)
RegexpMatchMedium_32    1.404µ ± 0%   1.403µ ± 0%  -0.07% (p=0.000 n=10)
RegexpMatchMedium_1K    41.05µ ± 0%   41.04µ ± 0%  -0.04% (p=0.000 n=10)
RegexpMatchHard_32      2.072µ ± 0%   2.071µ ± 0%  -0.05% (p=0.000 n=10)
RegexpMatchHard_1K      60.89µ ± 0%   60.87µ ± 0%  -0.04% (p=0.000 n=10)
Revcomp                  1.199 ± 1%    1.200 ± 0%       ~ (p=0.481 n=10)
Template                112.3m ± 2%   112.9m ± 2%       ~ (p=0.353 n=10)
TimeParse               414.2n ± 1%   412.5n ± 0%  -0.40% (p=0.000 n=10)
TimeFormat              496.9n ± 0%   496.6n ± 0%       ~ (p=0.341 n=10)
geomean                 101.0µ        100.7µ       -0.26%

                     │  CL 483355   │                this CL                │
                     │     B/s      │     B/s       vs base                 │
GobDecode              49.02Mi ± 1%   48.87Mi ± 1%  -0.32% (p=0.014 n=10)
GobEncode              42.23Mi ± 0%   42.40Mi ± 1%  +0.40% (p=0.022 n=10)
Gzip                   45.77Mi ± 0%   45.73Mi ± 0%  -0.07% (p=0.000 n=10)
Gunzip                 228.7Mi ± 0%   228.6Mi ± 0%  -0.06% (p=0.000 n=10)
JSONEncode             100.1Mi ± 0%   100.0Mi ± 0%       ~ (p=0.470 n=10)
JSONDecode             23.93Mi ± 1%   24.27Mi ± 1%  +1.43% (p=0.000 n=10)
GoParse                7.544Mi ± 1%   7.448Mi ± 1%  -1.26% (p=0.005 n=10)
RegexpMatchEasy0_32    226.8Mi ± 0%   226.7Mi ± 0%  -0.06% (p=0.001 n=10)
RegexpMatchEasy0_1K    715.7Mi ± 0%   715.1Mi ± 0%  -0.08% (p=0.022 n=10)
RegexpMatchEasy1_32    186.0Mi ± 0%   186.0Mi ± 0%       ~ (p=0.493 n=10)
RegexpMatchEasy1_1K    654.3Mi ± 0%   654.6Mi ± 0%  +0.04% (p=0.000 n=10)
RegexpMatchMedium_32   21.74Mi ± 0%   21.74Mi ± 0%  +0.02% (p=0.022 n=10)
RegexpMatchMedium_1K   23.78Mi ± 0%   23.79Mi ± 0%  +0.04% (p=0.000 n=10)
RegexpMatchHard_32     14.72Mi ± 0%   14.73Mi ± 0%  +0.06% (p=0.000 n=10)
RegexpMatchHard_1K     16.04Mi ± 0%   16.04Mi ± 0%       ~ (p=1.000 n=10) ¹
Revcomp                202.2Mi ± 1%   202.0Mi ± 0%       ~ (p=0.469 n=10)
Template               16.48Mi ± 2%   16.38Mi ± 2%       ~ (p=0.342 n=10)
geomean                62.23Mi        62.21Mi       -0.04%
¹ all samples are equal

In this case though, all significant perf changes are likely due to
micro-architectural quirks.

Updates golang#59120

Change-Id: Icc8f7d8e79c6168aae634f5c36f044f3fd034d89
(cherry picked from commit 80a298243a07e982573e14723d8133fc5be45065)

xen0n added a commit to xen0n/go that referenced this issue


          [release-branch.go1.20] cmd/compile: wire up Bswap/ReverseBytes intrinsics for loong64

67affb2

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
               │    before    │                after                 │
               │    sec/op    │    sec/op     vs base                │
ReverseBytes     3.0130n ± 0%   0.6517n ± 2%  -78.37% (p=0.000 n=10)
ReverseBytes16   0.9027n ± 0%   0.6526n ± 2%  -27.71% (p=0.000 n=10)
ReverseBytes32   1.7040n ± 0%   0.6511n ± 1%  -61.79% (p=0.000 n=10)
ReverseBytes64   2.7080n ± 0%   0.6499n ± 1%  -76.00% (p=0.000 n=10)
geomean           1.882n        0.6513n       -65.40%

Go1 benchmark results indicate no meaningful change except for
micro-architecture-related fluctuations.

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac
(cherry picked from commit 4e0bacc50e09ea7defbf1e769b6ee5467e82e881)

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up bits.Reverse intrinsics for loong64

89353f8

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
          │    before    │                after                 │
          │    sec/op    │    sec/op     vs base                │
Reverse     4.2280n ± 0%   0.8029n ± 0%  -81.01% (p=0.000 n=10)
Reverse8    1.0050n ± 0%   0.8029n ± 0%  -20.11% (p=0.000 n=10)
Reverse16   1.9600n ± 0%   0.8029n ± 0%  -59.04% (p=0.000 n=10)
Reverse32   4.0205n ± 0%   0.8029n ± 0%  -80.03% (p=0.000 n=10)
Reverse64   4.0360n ± 0%   0.8029n ± 0%  -80.11% (p=0.000 n=10)
geomean      2.668n        0.8029n       -69.90%

The operation appears to be unused elsewhere in the tree except in
compress/flate, where a very slight improvement (time geomean -0.16%,
throughput geomean +0.16%) was observed with the change applied.

Updates golang#59120

Change-Id: Ie1b446386655e0bb6808e435257293c30420626e
(cherry picked from commit 7e6c4dce73a400b8928207c66442eaf9fcd535fa)

xen0n added a commit to xen0n/go that referenced this issue


          [release-branch.go1.20] cmd/compile: wire up Bswap/ReverseBytes intrinsics for loong64

4cbacdc

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
               │    before    │                after                 │
               │    sec/op    │    sec/op     vs base                │
ReverseBytes     3.0130n ± 0%   0.6517n ± 2%  -78.37% (p=0.000 n=10)
ReverseBytes16   0.9027n ± 0%   0.6526n ± 2%  -27.71% (p=0.000 n=10)
ReverseBytes32   1.7040n ± 0%   0.6511n ± 1%  -61.79% (p=0.000 n=10)
ReverseBytes64   2.7080n ± 0%   0.6499n ± 1%  -76.00% (p=0.000 n=10)
geomean           1.882n        0.6513n       -65.40%

Go1 benchmark results indicate no meaningful change except for
micro-architecture-related fluctuations.

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac

[xen0n: removed Bswap16 because go1.20 doesn't support this op]
(cherry picked from commit 4e0bacc50e09ea7defbf1e769b6ee5467e82e881)

xen0n added a commit to xen0n/go that referenced this issue


          cmd/compile: wire up bits.Reverse intrinsics for loong64

90f2e71

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
          │    before    │                after                 │
          │    sec/op    │    sec/op     vs base                │
Reverse     4.2280n ± 0%   0.8029n ± 0%  -81.01% (p=0.000 n=10)
Reverse8    1.0050n ± 0%   0.8029n ± 0%  -20.11% (p=0.000 n=10)
Reverse16   1.9600n ± 0%   0.8029n ± 0%  -59.04% (p=0.000 n=10)
Reverse32   4.0205n ± 0%   0.8029n ± 0%  -80.03% (p=0.000 n=10)
Reverse64   4.0360n ± 0%   0.8029n ± 0%  -80.11% (p=0.000 n=10)
geomean      2.668n        0.8029n       -69.90%

The operation appears to be unused elsewhere in the tree except in
compress/flate, where a very slight improvement (time geomean -0.16%,
throughput geomean +0.16%) was observed with the change applied.

Updates golang#59120

Change-Id: Ie1b446386655e0bb6808e435257293c30420626e
(cherry picked from commit 7e6c4dce73a400b8928207c66442eaf9fcd535fa)

Contributor

gopherbot commented Apr 9, 2024

Change https://go.dev/cl/577515 mentions this issue: cmd/compile/internal: intrinsify publicationBarrier on loong64

Contributor

gopherbot commented Apr 20, 2024

Change https://go.dev/cl/580280 mentions this issue: cmd/compile, math: make math.Ceil/Floor/RoundToEven/Trunc/Abs/CopySign intrinsics on loong64

Contributor

gopherbot commented Apr 20, 2024

Change https://go.dev/cl/580283 mentions this issue: cmd/compile: intrinsics for math.min/max and implement float min/max in hardware on loong64

gopherbot pushed a commit that referenced this issue


          cmd/compile, math: improve implementation of math.{Max,Min} on loong64

ff14e08

Make math.{Min,Max} intrinsics and implement math.{archMax,archMin}
in hardware.

goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A6000 @ 2500.00MHz
         │  old.bench   │              new.bench              │
         │    sec/op    │   sec/op     vs base                │
Max         7.606n ± 0%   3.087n ± 0%  -59.41% (p=0.000 n=20)
Min         7.205n ± 0%   2.904n ± 0%  -59.69% (p=0.000 n=20)
MinFloat   37.220n ± 0%   4.802n ± 0%  -87.10% (p=0.000 n=20)
MaxFloat   33.620n ± 0%   4.802n ± 0%  -85.72% (p=0.000 n=20)
geomean     16.18n        3.792n       -76.57%

goos: linux
goarch: loong64
pkg: runtime
cpu: Loongson-3A5000 @ 2500.00MHz
         │  old.bench   │              new.bench              │
         │    sec/op    │   sec/op     vs base                │
Max        10.010n ± 0%   7.196n ± 0%  -28.11% (p=0.000 n=20)
Min         8.806n ± 0%   7.155n ± 0%  -18.75% (p=0.000 n=20)
MinFloat   60.010n ± 0%   7.976n ± 0%  -86.71% (p=0.000 n=20)
MaxFloat   56.410n ± 0%   7.980n ± 0%  -85.85% (p=0.000 n=20)
geomean     23.37n        7.566n       -67.63%

Updates #59120.

Change-Id: I6815d20bc304af3cbf5d6ca8fe0ca1c2ddebea2d
Reviewed-on: https://go-review.googlesource.com/c/go/+/580283
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>

gopherbot pushed a commit that referenced this issue


          cmd/compile, math: make math.{Abs,Copysign} intrinsics on loong64

e705a2d

goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A6000 @ 2500.00MHz
         │  old.bench   │              new.bench               │
         │    sec/op    │    sec/op     vs base                │
Copysign   1.9710n ± 0%   0.8006n ± 0%  -59.38% (p=0.000 n=10)
Abs        1.8745n ± 0%   0.8006n ± 0%  -57.29% (p=0.000 n=10)
geomean     1.922n        0.8006n       -58.35%

goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A5000 @ 2500.00MHz
         │  old.bench   │              new.bench               │
         │    sec/op    │    sec/op     vs base                │
Copysign   2.4020n ± 0%   0.9006n ± 0%  -62.51% (p=0.000 n=10)
Abs        2.4020n ± 0%   0.8005n ± 0%  -66.67% (p=0.000 n=10)
geomean     2.402n        0.8491n       -64.65%

Updates #59120.

Change-Id: Ic409e1f4d15ad15cb3568a5aaa100046e9302842
Reviewed-on: https://go-review.googlesource.com/c/go/+/580280
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>

Contributor

gopherbot commented Nov 2, 2024

Change https://go.dev/cl/624575 mentions this issue: cmd/compile: wire up math/bits.Len intrinsics for loong64

Contributor

gopherbot commented Nov 2, 2024

Change https://go.dev/cl/624576 mentions this issue: cmd/compile: wire up Bswap/ReverseBytes intrinsics for loong64

Contributor

gopherbot commented Nov 2, 2024

Change https://go.dev/cl/624276 mentions this issue: cmd/compile: wire up bits.Reverse intrinsics for loong64

gopherbot pushed a commit that referenced this issue


          cmd/compile: wire up math/bits.Len intrinsics for loong64

d98c518

For the SubFromLen64 codegen test case to work as intended, we need
to fold c-(-(x-d)) into x+(c-d).

Still, some instances of LeadingZeros are not optimized into single
CLZ instructions right now (actually, the LeadingZeros micro-benchmarks
are currently still compiled with redundant adds/subs of 64, due to
interference of loop optimizations before lowering), but perf numbers
indicate it's not that bad after all.

Micro-benchmark results on Loongson 3A5000 and 3A6000:

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A5000 @ 2500.00MHz
               |  bench.old  |              bench.new              |
               |   sec/op    |   sec/op     vs base                |
LeadingZeros     3.660n ± 0%   1.348n ± 0%  -63.17% (p=0.000 n=20)
LeadingZeros8    1.777n ± 0%   1.767n ± 0%   -0.56% (p=0.000 n=20)
LeadingZeros16   2.816n ± 0%   1.770n ± 0%  -37.14% (p=0.000 n=20)
LeadingZeros32   5.293n ± 1%   1.683n ± 0%  -68.21% (p=0.000 n=20)
LeadingZeros64   3.622n ± 0%   1.349n ± 0%  -62.76% (p=0.000 n=20)
geomean          3.229n        1.571n       -51.35%
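
For readers unfamiliar with the table layout, the "vs base" and geomean columns can be approximated from the rounded values shown. This is a sketch of the arithmetic only: benchstat works from the raw samples, and the helper names `delta` and `geomean` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// delta returns the relative change from before to after, in percent.
func delta(before, after float64) float64 { return (after - before) / before * 100 }

// geomean returns the geometric mean of xs; all values must be positive.
func geomean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += math.Log(x)
	}
	return math.Exp(sum / float64(len(xs)))
}

func main() {
	// sec/op columns of the 3A5000 table above, in nanoseconds.
	before := []float64{3.660, 1.777, 2.816, 5.293, 3.622}
	after := []float64{1.348, 1.767, 1.770, 1.683, 1.349}

	fmt.Printf("LeadingZeros: %+.2f%%\n", delta(before[0], after[0]))
	gb, ga := geomean(before), geomean(after)
	fmt.Printf("geomean: %.3fn -> %.3fn (%+.2f%%)\n", gb, ga, delta(gb, ga))
}
```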

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A6000 @ 2500.00MHz
               |  bench.old   |              bench.new               |
               |    sec/op    |    sec/op     vs base                |
LeadingZeros      2.410n ± 0%    1.103n ± 1%  -54.23% (p=0.000 n=20)
LeadingZeros8     1.236n ± 0%    1.501n ± 0%  +21.44% (p=0.000 n=20)
LeadingZeros16    2.106n ± 0%    1.501n ± 0%  -28.73% (p=0.000 n=20)
LeadingZeros32    2.860n ± 0%    1.324n ± 0%  -53.72% (p=0.000 n=20)
LeadingZeros64   2.6135n ± 0%   0.9509n ± 0%  -63.62% (p=0.000 n=20)
geomean           2.159n         1.256n       -41.81%

Updates #59120

This patch is a copy of CL 483356.
Co-authored-by: WANG Xuerui <git@xen0n.name>

Change-Id: Iee81a17f7da06d77a427e73dfcc016f2b15ae556
Reviewed-on: https://go-review.googlesource.com/c/go/+/624575
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>

gopherbot pushed a commit that referenced this issue


          cmd/compile: wire up Bswap/ReverseBytes intrinsics for loong64

d6fb0ab

Micro-benchmark results on Loongson 3A5000 and 3A6000:

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A6000 @ 2500.00MHz
               |  bench.old   |              bench.new               |
               |    sec/op    |    sec/op     vs base                |
ReverseBytes     2.0020n ± 0%   0.4040n ± 0%  -79.82% (p=0.000 n=20)
ReverseBytes16   0.8866n ± 1%   0.8007n ± 0%   -9.69% (p=0.000 n=20)
ReverseBytes32   1.2195n ± 0%   0.8007n ± 0%  -34.34% (p=0.000 n=20)
ReverseBytes64   2.0705n ± 0%   0.8008n ± 0%  -61.32% (p=0.000 n=20)
geomean           1.455n        0.6749n       -53.62%

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A5000 @ 2500.00MHz
               |  bench.old   |              bench.new               |
               |    sec/op    |    sec/op     vs base                |
ReverseBytes     2.8040n ± 0%   0.5205n ± 0%  -81.44% (p=0.000 n=20)
ReverseBytes16   0.7066n ± 0%   0.8011n ± 0%  +13.37% (p=0.000 n=20)
ReverseBytes32   1.5500n ± 0%   0.8010n ± 0%  -48.32% (p=0.000 n=20)
ReverseBytes64   2.7665n ± 0%   0.8010n ± 0%  -71.05% (p=0.000 n=20)
geomean           1.707n        0.7192n       -57.87%
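
As an illustration of the operation being intrinsified, the sketch below compares `bits.ReverseBytes64` against a byte-order round trip through `encoding/binary`. The helper names are hypothetical, and whether a given call site is actually lowered to a single byte-reversal instruction depends on the toolchain version and surrounding code:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// swapViaBits reverses the byte order with the math/bits function.
func swapViaBits(v uint64) uint64 { return bits.ReverseBytes64(v) }

// swapViaBinary does the same by storing little-endian and loading big-endian.
func swapViaBinary(v uint64) uint64 {
	var b [8]byte
	binary.LittleEndian.PutUint64(b[:], v)
	return binary.BigEndian.Uint64(b[:])
}

func main() {
	v := uint64(0x0102030405060708)
	fmt.Printf("%#x\n%#x\n", swapViaBits(v), swapViaBinary(v))
}
```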

Updates #59120

This patch is a copy of CL 483357.
Co-authored-by: WANG Xuerui <git@xen0n.name>

Change-Id: If355354cd031533df91991fcc3392e5a6c314295
Reviewed-on: https://go-review.googlesource.com/c/go/+/624576
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>

Contributor

gopherbot commented Nov 6, 2024

Change https://go.dev/cl/625335 mentions this issue: cmd/compile: implement FMA codegen for loong64

gopherbot pushed a commit that referenced this issue


          cmd/compile: implement FMA codegen for loong64

e6cc9d2

Benchmark results on Loongson 3A5000 and 3A6000:

goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A6000 @ 2500.00MHz
    |  bench.old   |              bench.new              |
    |    sec/op    |   sec/op     vs base                |
FMA   25.930n ± 0%   2.002n ± 0%  -92.28% (p=0.000 n=10)

goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A5000 @ 2500.00MHz
    |  bench.old   |              bench.new              |
    |    sec/op    |   sec/op     vs base                |
FMA   32.840n ± 0%   2.002n ± 0%  -93.90% (p=0.000 n=10)
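
For context on what `math.FMA` guarantees beyond speed, here is a minimal sketch of the single-rounding behavior. The inputs are chosen so that the exact product is 1 - 2^-54; the function names are hypothetical. The explicit `float64(...)` conversion is there because the Go spec allows a compiler to fuse `x*y + z` unless the product is explicitly converted:

```go
package main

import (
	"fmt"
	"math"
)

// unfused rounds the product to float64 before adding; the explicit
// conversion stops the compiler from fusing the two operations.
func unfused(x, y, z float64) float64 { return float64(x*y) + z }

// fused computes x*y+z with a single rounding at the end.
func fused(x, y, z float64) float64 { return math.FMA(x, y, z) }

// demo returns both results for inputs whose exact product is 1 - 2^-54.
func demo() (float64, float64) {
	eps := math.Ldexp(1, -27) // 2^-27
	x, y := 1+eps, 1-eps
	return unfused(x, y, -1), fused(x, y, -1)
}

func main() {
	u, f := demo()
	fmt.Println("unfused:", u)
	fmt.Println("fused:  ", f)
}
```

The unfused form loses the low bits when the product is rounded to 1, while the fused form keeps them.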

Updates #59120

This patch is a copy of CL 483355.
Co-authored-by: WANG Xuerui <git@xen0n.name>

Change-Id: I88b89d23f00864f9173a182a47ee135afec7ed6e
Reviewed-on: https://go-review.googlesource.com/c/go/+/625335
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>

gopherbot pushed a commit that referenced this issue


          cmd/compiler,internal/runtime/atomic: optimize xchg{32,64} on loong64

72a92ab

Use Loong64's atomic swap instructions AMSWAPDB{W,V} (full barrier)
to implement atomic.Xchg{32,64}.
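
To show the kind of code that depends on an atomic exchange, here is a minimal test-and-set lock sketch. `internal/runtime/atomic` cannot be imported from user code, so this uses `sync/atomic.SwapUint32` as the public counterpart. The `spinLock` type and `countWithLock` helper are hypothetical, and it is an assumption that a given toolchain lowers this call to the same instruction:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// spinLock is a hypothetical test-and-set lock built on atomic swap.
type spinLock struct{ state uint32 }

// Lock swaps in 1 until the previous value was 0, meaning the lock was free.
func (l *spinLock) Lock() {
	for atomic.SwapUint32(&l.state, 1) != 0 {
		runtime.Gosched() // yield instead of burning the CPU
	}
}

// Unlock releases the lock with an atomic store.
func (l *spinLock) Unlock() { atomic.StoreUint32(&l.state, 0) }

// countWithLock increments a shared counter from several goroutines.
func countWithLock(workers, iters int) int {
	var (
		l  spinLock
		wg sync.WaitGroup
		n  int
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				l.Lock()
				n++
				l.Unlock()
			}
		}()
	}
	wg.Wait()
	return n
}

func main() {
	fmt.Println(countWithLock(4, 1000))
}
```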

goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A5000 @ 2500.00MHz
           |  old.bench    |  new.bench                          |
           |  sec/op       |  sec/op        vs base              |
Xchg          26.44n ± 0%     12.01n ± 0%   -54.58% (p=0.000 n=20)
Xchg-2        30.10n ± 0%     25.58n ± 0%   -15.02% (p=0.000 n=20)
Xchg-4        30.06n ± 0%     24.82n ± 0%   -17.43% (p=0.000 n=20)
Xchg64        26.44n ± 0%     12.02n ± 0%   -54.54% (p=0.000 n=20)
Xchg64-2      30.10n ± 0%     25.57n ± 0%   -15.05% (p=0.000 n=20)
Xchg64-4      30.05n ± 0%     24.80n ± 0%   -17.47% (p=0.000 n=20)
geomean       28.81n          19.68n        -31.69%

goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A6000 @ 2500.00MHz
           |  old.bench    |  new.bench                          |
           |  sec/op       |  sec/op        vs base              |
Xchg          25.62n ± 0%     12.41n ± 0%  -51.56% (p=0.000 n=20)
Xchg-2        35.01n ± 0%     20.59n ± 0%  -41.19% (p=0.000 n=20)
Xchg-4        34.63n ± 0%     19.59n ± 0%  -43.42% (p=0.000 n=20)
Xchg64        25.62n ± 0%     12.41n ± 0%  -51.56% (p=0.000 n=20)
Xchg64-2      35.01n ± 0%     20.59n ± 0%  -41.19% (p=0.000 n=20)
Xchg64-4      34.67n ± 0%     19.59n ± 0%  -43.50% (p=0.000 n=20)
geomean       31.44n          17.11n       -45.59%

Updates #59120.

Change-Id: Ied74fc20338b63799c6d6eeb122c31b42cff0f7e
Reviewed-on: https://go-review.googlesource.com/c/go/+/481578
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: WANG Xuerui <git@xen0n.name>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>

gopherbot pushed a commit that referenced this issue


          cmd/compile: wire up bits.Reverse intrinsics for loong64

583d750

Micro-benchmark results on Loongson 3A5000 and 3A6000:

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A6000 @ 2500.00MHz
          |  CL 624576   |               this CL                |
          |    sec/op    |    sec/op     vs base                |
Reverse     2.8130n ± 0%   0.8008n ± 0%  -71.53% (p=0.000 n=20)
Reverse8    0.7014n ± 0%   0.4040n ± 0%  -42.40% (p=0.000 n=20)
Reverse16   1.2975n ± 0%   0.6632n ± 1%  -48.89% (p=0.000 n=20)
Reverse32   2.7520n ± 0%   0.4042n ± 0%  -85.31% (p=0.000 n=20)
Reverse64   2.8970n ± 0%   0.4041n ± 0%  -86.05% (p=0.000 n=20)
geomean      1.828n        0.5116n       -72.01%

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A5000 @ 2500.00MHz
          |  CL 624576   |               this CL                |
          |    sec/op    |    sec/op     vs base                |
Reverse     4.0050n ± 0%   0.8011n ± 0%  -80.00% (p=0.000 n=20)
Reverse8    0.8010n ± 0%   0.5210n ± 1%  -34.96% (p=0.000 n=20)
Reverse16   1.6160n ± 0%   0.6008n ± 0%  -62.82% (p=0.000 n=20)
Reverse32   3.8550n ± 0%   0.5179n ± 0%  -86.57% (p=0.000 n=20)
Reverse64   3.8050n ± 0%   0.5177n ± 0%  -86.40% (p=0.000 n=20)
geomean      2.378n        0.5828n       -75.49%
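
As a reference for what the intrinsic computes, the sketch below compares `bits.Reverse64` with a portable bit-by-bit loop. The loop is only an illustration with hypothetical helper names; it is not the generic implementation in `math/bits`, and the instruction sequence actually emitted on loong64 is not shown here:

```go
package main

import (
	"fmt"
	"math/bits"
)

// reverseLoop is a portable bit-by-bit reference implementation.
func reverseLoop(x uint64) uint64 {
	var r uint64
	for i := 0; i < 64; i++ {
		r = r<<1 | x&1 // shift the result up and append the lowest bit of x
		x >>= 1
	}
	return r
}

// reverseIntrinsic calls the math/bits function that the compiler may intrinsify.
func reverseIntrinsic(x uint64) uint64 { return bits.Reverse64(x) }

func main() {
	for _, x := range []uint64{1, 0xff, 0x0123456789abcdef} {
		fmt.Printf("%#x: %v\n", x, reverseLoop(x) == reverseIntrinsic(x))
	}
}
```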

Updates #59120

This patch is a copy of CL 483656.
Co-authored-by: WANG Xuerui <git@xen0n.name>

Change-Id: I98681091763279279c8404bd0295785f13ea1c8e
Reviewed-on: https://go-review.googlesource.com/c/go/+/624276
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
