cmd/compile: implement more optimizations on loong64 #59120
…ns with EXTW{B,H} Updates golang#59120 Change-Id: Ia7dd0dfe20c0ea3e64889e2b38c6b2118b50d56e
Updates golang#59120 Change-Id: I6c90f727eb00e0add2a5f8575ac045b9e288af54
Updates golang#59120 Change-Id: Icc8f7d8e79c6168aae634f5c36f044f3fd034d89
Updates golang#59120 Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac
…n loong64 tests TODO Updates golang#59120 Change-Id: Icde85d717999600954244c1105b7c55759d3469f
…xtensions Updates golang#59120 Change-Id: Ia7dd0dfe20c0ea3e64889e2b38c6b2118b50d56e
Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
                  │   before    │               after                │
                  │   sec/op    │   sec/op       vs base             │
ReverseBytes        3.0130n ± 0%   0.6517n ± 2%  -78.37% (p=0.000 n=10)
ReverseBytes16      0.9027n ± 0%   0.6526n ± 2%  -27.71% (p=0.000 n=10)
ReverseBytes32      1.7040n ± 0%   0.6511n ± 1%  -61.79% (p=0.000 n=10)
ReverseBytes64      2.7080n ± 0%   0.6499n ± 1%  -76.00% (p=0.000 n=10)
geomean             1.882n         0.6513n       -65.40%

Go1 benchmark results indicate no meaningful change except for micro-architecture-related fluctuations.

Updates golang#59120
Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac
…xtensions

8- and 16-bit sign extensions and 32-bit zero extensions were realized with left and right shifts before this change. We now support assembling EXTWB, EXTWH and BSTRPICKV, so each of the three can be done with a single instruction.

Benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: test/bench/go1
                       │  CL 479495  │              this CL               │
                       │   sec/op    │   sec/op       vs base             │
BinaryTree17           14.12 ± 1%     14.06 ± 1%           ~ (p=0.393 n=10)
Fannkuch11             3.420 ± 0%     3.421 ± 0%      +0.04% (p=0.001 n=10)
FmtFprintfEmpty        94.72n ± 0%    94.97n ± 0%     +0.26% (p=0.000 n=10)
FmtFprintfString       152.6n ± 0%    155.3n ± 0%     +1.77% (p=0.000 n=10)
FmtFprintfInt          154.5n ± 0%    154.5n ± 0%          ~ (p=0.263 n=10)
FmtFprintfIntInt       237.7n ± 0%    237.1n ± 0%     -0.21% (p=0.000 n=10)
FmtFprintfPrefixedInt  313.1n ± 0%    313.0n ± 0%     -0.03% (p=0.000 n=10)
FmtFprintfFloat        394.1n ± 0%    392.8n ± 0%     -0.32% (p=0.000 n=10)
FmtManyArgs            934.3n ± 0%    912.6n ± 0%     -2.32% (p=0.000 n=10)
GobDecode              15.29m ± 1%    15.23m ± 1%          ~ (p=0.280 n=10)
GobEncode              17.76m ± 0%    17.66m ± 0%     -0.60% (p=0.000 n=10)
Gzip                   416.0m ± 0%    404.4m ± 0%     -2.79% (p=0.000 n=10)
Gunzip                 83.20m ± 0%    80.88m ± 0%     -2.79% (p=0.000 n=10)
HTTPClientServer       87.82µ ± 1%    87.09µ ± 1%     -0.83% (p=0.000 n=10)
JSONEncode             18.56m ± 0%    18.54m ± 0%          ~ (p=0.123 n=10)
JSONDecode             76.53m ± 0%    78.22m ± 1%     +2.21% (p=0.000 n=10)
Mandelbrot200          7.217m ± 0%    7.215m ± 0%          ~ (p=0.143 n=10)
GoParse                7.587m ± 1%    7.520m ± 1%          ~ (p=0.165 n=10)
RegexpMatchEasy0_32    134.2n ± 0%    134.5n ± 0%     +0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K    1.366µ ± 0%    1.364µ ± 0%     -0.15% (p=0.000 n=10)
RegexpMatchEasy1_32    163.0n ± 0%    164.0n ± 0%     +0.61% (p=0.000 n=10)
RegexpMatchEasy1_1K    1.497µ ± 0%    1.492µ ± 0%     -0.33% (p=0.000 n=10)
RegexpMatchMedium_32   1.415µ ± 0%    1.403µ ± 0%     -0.85% (p=0.000 n=10)
RegexpMatchMedium_1K   41.61µ ± 0%    41.05µ ± 0%     -1.36% (p=0.000 n=10)
RegexpMatchHard_32     2.121µ ± 0%    2.070µ ± 0%     -2.43% (p=0.000 n=10)
RegexpMatchHard_1K     62.64µ ± 0%    60.87µ ± 0%     -2.83% (p=0.000 n=10)
Revcomp                1.204 ± 0%     1.210 ± 0%      +0.51% (p=0.000 n=10)
Template               118.0m ± 0%    115.2m ± 1%     -2.31% (p=0.000 n=10)
TimeParse              414.8n ± 0%    410.6n ± 0%     -1.01% (p=0.000 n=10)
TimeFormat             510.7n ± 0%    508.2n ± 0%     -0.48% (p=0.000 n=10)
geomean                102.3µ         101.7µ          -0.60%

                       │  CL 479495  │              this CL               │
                       │     B/s     │     B/s        vs base             │
GobDecode              47.88Mi ± 1%   48.05Mi ± 1%         ~ (p=0.280 n=10)
GobEncode              41.20Mi ± 0%   41.45Mi ± 0%    +0.60% (p=0.000 n=10)
Gzip                   44.49Mi ± 0%   45.77Mi ± 0%    +2.87% (p=0.000 n=10)
Gunzip                 222.4Mi ± 0%   228.8Mi ± 0%    +2.87% (p=0.000 n=10)
JSONEncode             99.69Mi ± 0%   99.82Mi ± 0%         ~ (p=0.118 n=10)
JSONDecode             24.19Mi ± 0%   23.66Mi ± 1%    -2.19% (p=0.000 n=10)
GoParse                7.281Mi ± 2%   7.343Mi ± 1%         ~ (p=0.187 n=10)
RegexpMatchEasy0_32    227.4Mi ± 0%   226.9Mi ± 0%    -0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K    715.0Mi ± 0%   716.0Mi ± 0%    +0.13% (p=0.000 n=10)
RegexpMatchEasy1_32    187.3Mi ± 0%   186.1Mi ± 0%    -0.62% (p=0.000 n=10)
RegexpMatchEasy1_1K    652.3Mi ± 0%   654.5Mi ± 0%    +0.34% (p=0.000 n=10)
RegexpMatchMedium_32   21.57Mi ± 0%   21.74Mi ± 0%    +0.80% (p=0.000 n=10)
RegexpMatchMedium_1K   23.47Mi ± 0%   23.79Mi ± 0%    +1.38% (p=0.000 n=10)
RegexpMatchHard_32     14.39Mi ± 0%   14.74Mi ± 0%    +2.45% (p=0.000 n=10)
RegexpMatchHard_1K     15.59Mi ± 0%   16.04Mi ± 0%    +2.87% (p=0.000 n=10)
Revcomp                201.3Mi ± 0%   200.3Mi ± 0%    -0.51% (p=0.000 n=10)
Template               15.69Mi ± 0%   16.06Mi ± 1%    +2.37% (p=0.000 n=10)
geomean                61.31Mi        61.82Mi         +0.84%

The test binaries were pre-compiled with `go test -c`, and the test runs were wrapped with `perf stat record` for recording dynamic instruction counts. The instruction count, IPC and branch misprediction rate did not meaningfully change.

As for the JSONDecode regression, `perf stat` was used to check micro-architectural details:

$ sudo perf stat <test executable> -test.timeout=30m -test.run='^$' \
    -test.cpu=1 -test.bench='JSONDecode' -test.count=1 -test.benchtime=50x

Before:

         4,256.10 msec task-clock        #    1.061 CPUs utilized
           61,431      context-switches  #   14.434 K/sec
                3      cpu-migrations    #    0.705 /sec
            3,297      page-faults       #  774.652 /sec
   10,364,990,422      cycles            #    2.435 GHz
   19,640,571,817      instructions      #    1.89  insn per cycle
    4,267,623,324      branches          #    1.003 G/sec
       44,164,375      branch-misses     #    1.03% of all branches

After:

         4,343.17 msec task-clock        #    1.061 CPUs utilized
           62,742      context-switches  #   14.446 K/sec
                5      cpu-migrations    #    1.151 /sec
            3,044      page-faults       #  700.871 /sec
   10,577,322,342      cycles            #    2.435 GHz
   19,582,895,547      instructions      #    1.85  insn per cycle
    4,266,051,537      branches          #  982.244 M/sec
       46,298,286      branch-misses     #    1.09% of all branches

The instruction count decreased by 0.29%, but the cycle count went up by 2.05% and the branch misprediction rate rose too. This is likely caused by the micro-architecture's sensitivity to the changed code layout; the optimization implemented here should be a net win otherwise.

Updates golang#59120
Change-Id: Ia7dd0dfe20c0ea3e64889e2b38c6b2118b50d56e
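The shift-based extensions that this commit message describes can be illustrated at the Go source level. The following is only a sketch: the helper names are hypothetical and do not come from the CL, and it shows the arithmetic equivalence between a shift pair and a plain integer conversion, not the compiler's actual lowering rules.

```go
package main

import "fmt"

// Hypothetical helpers, for illustration only.

// signExtend8Shifts sign-extends the low 8 bits of x using a left shift
// followed by an arithmetic right shift (two operations).
func signExtend8Shifts(x int64) int64 { return (x << 56) >> 56 }

// signExtend16Shifts does the same for the low 16 bits.
func signExtend16Shifts(x int64) int64 { return (x << 48) >> 48 }

// zeroExtend32Shifts zero-extends the low 32 bits of x using a left shift
// followed by a logical right shift.
func zeroExtend32Shifts(x uint64) uint64 { return (x << 32) >> 32 }

func main() {
	inputs := []int64{0, 0x7f, 0x80, 0xff, 0x7fff, 0x8000, -1, 0x123456789a}
	for _, x := range inputs {
		// Each shift pair must agree with the corresponding Go conversion,
		// which is the operation a single extension instruction can implement.
		ok8 := signExtend8Shifts(x) == int64(int8(x))
		ok16 := signExtend16Shifts(x) == int64(int16(x))
		ok32 := zeroExtend32Shifts(uint64(x)) == uint64(uint32(x))
		fmt.Printf("x=%#x int8 ok=%v int16 ok=%v uint32 ok=%v\n", x, ok8, ok16, ok32)
	}
}
```

Every line printed should report `ok=true` for all three checks, since the shift pairs and the conversions compute the same values.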
The runtime malloc implementation makes use of these, among others.

Some generic strength reduction rules for Ctz ops have also been added, though only enabled for loong64 for now. This is necessary to make the optimization profitable at all, as the LA464 architecture apparently handles the `TrailingZeros64(x) < 64` part in runtime.nextFreeFast very badly if the compiled branch is no longer a simple BEQZ (as used to be the case, when the compiler was able to peek into the pure Go implementation of TrailingZeros). Without the generic rules this change would be a big perf hit (as bad as 7~10% in select go1 benchmark cases).

The generic changes were benchmarked on linux/amd64 (Threadripper 3990X) and darwin/arm64 (Apple M1 Pro) too, but the results are either mixed (amd64) or even a net loss (arm64). So, for now those rules are guarded with a predicate that only enables them for loong64.

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
                   │   before    │               after                │
                   │   sec/op    │   sec/op       vs base             │
TrailingZeros        2.758n ± 0%    1.004n ± 0%   -63.60% (p=0.000 n=10)
TrailingZeros8       1.508n ± 0%    1.219n ± 0%   -19.20% (p=0.000 n=10)
TrailingZeros16      3.526n ± 0%    1.437n ± 0%   -59.25% (p=0.000 n=10)
TrailingZeros32      3.161n ± 0%    1.004n ± 0%   -68.23% (p=0.000 n=10)
TrailingZeros64      2.759n ± 0%    1.003n ± 0%   -63.65% (p=0.000 n=10)
geomean              2.638n         1.121n        -57.51%

Go1 benchmark results on the same machine:

goos: linux
goarch: loong64
pkg: test/bench/go1
                       │ CL 479496 v8 │             this CL               │
                       │    sec/op    │   sec/op      vs base             │
BinaryTree17           14.10 ± 1%     13.64 ± 1%      -3.28% (p=0.000 n=10)
Fannkuch11             3.421 ± 0%     3.421 ± 0%           ~ (p=0.075 n=10)
FmtFprintfEmpty        94.78n ± 0%    94.50n ± 0%     -0.30% (p=0.000 n=10)
FmtFprintfString       155.0n ± 0%    154.1n ± 1%          ~ (p=1.000 n=10)
FmtFprintfInt          157.2n ± 0%    155.2n ± 1%     -1.27% (p=0.000 n=10)
FmtFprintfIntInt       242.1n ± 0%    238.0n ± 1%     -1.73% (p=0.000 n=10)
FmtFprintfPrefixedInt  337.6n ± 0%    334.6n ± 0%     -0.89% (p=0.000 n=10)
FmtFprintfFloat        399.0n ± 0%    396.4n ± 0%     -0.65% (p=0.000 n=10)
FmtManyArgs            959.8n ± 0%    923.4n ± 0%     -3.79% (p=0.000 n=10)
GobDecode              15.63m ± 3%    15.17m ± 1%     -2.90% (p=0.001 n=10)
GobEncode              18.43m ± 3%    17.62m ± 0%     -4.38% (p=0.000 n=10)
Gzip                   405.1m ± 0%    405.4m ± 0%     +0.06% (p=0.035 n=10)
Gunzip                 86.84m ± 0%    87.20m ± 0%     +0.41% (p=0.000 n=10)
HTTPClientServer       88.47µ ± 0%    86.92µ ± 1%     -1.75% (p=0.000 n=10)
JSONEncode             18.84m ± 0%    18.66m ± 0%     -0.95% (p=0.000 n=10)
JSONDecode             79.35m ± 0%    75.77m ± 1%     -4.51% (p=0.000 n=10)
Mandelbrot200          7.215m ± 0%    7.215m ± 0%          ~ (p=0.315 n=10)
GoParse                7.591m ± 1%    7.407m ± 1%     -2.43% (p=0.000 n=10)
RegexpMatchEasy0_32    133.8n ± 0%    134.3n ± 0%     +0.37% (p=0.000 n=10)
RegexpMatchEasy0_1K    1.540µ ± 0%    1.544µ ± 0%     +0.26% (p=0.000 n=10)
RegexpMatchEasy1_32    164.1n ± 0%    165.4n ± 0%     +0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K    1.626µ ± 0%    1.629µ ± 0%     +0.18% (p=0.000 n=10)
RegexpMatchMedium_32   1.403µ ± 0%    1.413µ ± 0%     +0.71% (p=0.000 n=10)
RegexpMatchMedium_1K   41.22µ ± 0%    41.59µ ± 0%     +0.90% (p=0.000 n=10)
RegexpMatchHard_32     2.071µ ± 0%    2.060µ ± 0%     -0.53% (p=0.000 n=10)
RegexpMatchHard_1K     61.05µ ± 0%    61.30µ ± 0%     +0.41% (p=0.001 n=10)
Revcomp                1.351 ± 0%     1.357 ± 0%      +0.42% (p=0.000 n=10)
Template               117.3m ± 1%    110.6m ± 2%     -5.71% (p=0.000 n=10)
TimeParse              411.9n ± 0%    411.7n ± 0%          ~ (p=0.117 n=10)
TimeFormat             514.2n ± 0%    499.9n ± 0%     -2.77% (p=0.000 n=10)
geomean                104.2µ         103.0µ          -1.15%

                       │ CL 479496 v8 │             this CL               │
                       │     B/s      │     B/s       vs base             │
GobDecode              46.84Mi ± 3%   48.24Mi ± 1%    +2.98% (p=0.001 n=10)
GobEncode              39.72Mi ± 4%   41.53Mi ± 0%    +4.57% (p=0.000 n=10)
Gzip                   45.68Mi ± 0%   45.65Mi ± 0%    -0.05% (p=0.029 n=10)
Gunzip                 213.1Mi ± 0%   212.2Mi ± 0%    -0.41% (p=0.000 n=10)
JSONEncode             98.23Mi ± 0%   99.18Mi ± 0%    +0.97% (p=0.000 n=10)
JSONDecode             23.32Mi ± 0%   24.42Mi ± 1%    +4.72% (p=0.000 n=10)
GoParse                7.277Mi ± 1%   7.458Mi ± 1%    +2.49% (p=0.000 n=10)
RegexpMatchEasy0_32    228.1Mi ± 0%   227.3Mi ± 0%    -0.36% (p=0.000 n=10)
RegexpMatchEasy0_1K    634.2Mi ± 0%   632.5Mi ± 0%    -0.27% (p=0.000 n=10)
RegexpMatchEasy1_32    186.0Mi ± 0%   184.5Mi ± 0%    -0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K    600.4Mi ± 0%   599.4Mi ± 0%    -0.17% (p=0.000 n=10)
RegexpMatchMedium_32   21.75Mi ± 0%   21.60Mi ± 0%    -0.70% (p=0.000 n=10)
RegexpMatchMedium_1K   23.69Mi ± 0%   23.48Mi ± 0%    -0.89% (p=0.000 n=10)
RegexpMatchHard_32     14.73Mi ± 0%   14.81Mi ± 0%    +0.52% (p=0.000 n=10)
RegexpMatchHard_1K     15.99Mi ± 0%   15.93Mi ± 0%    -0.42% (p=0.000 n=10)
Revcomp                179.4Mi ± 0%   178.6Mi ± 0%    -0.42% (p=0.000 n=10)
Template               15.78Mi ± 1%   16.73Mi ± 2%    +6.04% (p=0.000 n=10)
geomean                59.97Mi        60.58Mi         +1.02%

The change should be a net win, as all it does is pattern-match Ctz ops and replace them with the respective native instructions, so any performance regression is likely also micro-architecture related, as observed in CL 479496's results. (Indeed, some of the more drastic improvements may well also be coincidental, but the point is that there is at least a small amount of deterministic improvement anyway.)

Updates golang#59120
Change-Id: I6c90f727eb00e0add2a5f8575ac045b9e288af54
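The `TrailingZeros64(x) < 64` check mentioned in this commit message is equivalent to a plain non-zero test, which is why it can compile down to a single branch on zero. The sketch below only demonstrates that equivalence in Go; the function names are hypothetical, and the actual strength reduction rules added by the CL are not reproduced here.

```go
package main

import (
	"fmt"
	"math/bits"
)

// nonZeroViaCtz is the form the commit message quotes from
// runtime.nextFreeFast: TrailingZeros64 returns 64 only for x == 0.
func nonZeroViaCtz(x uint64) bool { return bits.TrailingZeros64(x) < 64 }

// nonZeroDirect is the strength-reduced form: a plain comparison with zero.
func nonZeroDirect(x uint64) bool { return x != 0 }

func main() {
	for _, x := range []uint64{0, 1, 2, 1 << 63, ^uint64(0)} {
		fmt.Printf("%#x: ctz=%d viaCtz=%v direct=%v\n",
			x, bits.TrailingZeros64(x), nonZeroViaCtz(x), nonZeroDirect(x))
	}
}
```

The two predicates agree for every input because `bits.TrailingZeros64` is documented to return 64 exactly when its argument is zero.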
Benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: test/bench/go1
                       │ CL 479498 v11 │             this CL              │
                       │    sec/op     │   sec/op     vs base             │
BinaryTree17           13.64 ± 1%     13.75 ± 2%           ~ (p=0.579 n=10)
Fannkuch11             3.421 ± 0%     3.650 ± 0%      +6.70% (p=0.000 n=10)
FmtFprintfEmpty        94.50n ± 0%    94.45n ± 0%     -0.05% (p=0.000 n=10)
FmtFprintfString       154.1n ± 1%    155.2n ± 0%          ~ (p=0.689 n=10)
FmtFprintfInt          155.2n ± 1%    154.4n ± 0%          ~ (p=0.785 n=10)
FmtFprintfIntInt       238.0n ± 1%    237.1n ± 0%          ~ (p=0.721 n=10)
FmtFprintfPrefixedInt  334.6n ± 0%    312.8n ± 0%     -6.52% (p=0.000 n=10)
FmtFprintfFloat        396.4n ± 0%    390.5n ± 0%     -1.49% (p=0.000 n=10)
FmtManyArgs            923.4n ± 0%    905.0n ± 0%     -2.00% (p=0.000 n=10)
GobDecode              15.17m ± 1%    14.93m ± 1%     -1.59% (p=0.000 n=10)
GobEncode              17.62m ± 0%    17.33m ± 0%     -1.65% (p=0.001 n=10)
Gzip                   405.4m ± 0%    404.3m ± 0%     -0.26% (p=0.000 n=10)
Gunzip                 87.20m ± 0%    80.92m ± 0%     -7.20% (p=0.000 n=10)
HTTPClientServer       86.92µ ± 1%    86.14µ ± 0%     -0.90% (p=0.000 n=10)
JSONEncode             18.66m ± 0%    18.49m ± 0%     -0.91% (p=0.000 n=10)
JSONDecode             75.77m ± 1%    77.34m ± 1%     +2.07% (p=0.000 n=10)
Mandelbrot200          7.215m ± 0%    6.521m ± 0%     -9.62% (p=0.000 n=10)
GoParse                7.407m ± 1%    7.324m ± 1%     -1.12% (p=0.003 n=10)
RegexpMatchEasy0_32    134.3n ± 0%    134.6n ± 0%     +0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K    1.544µ ± 0%    1.365µ ± 0%    -11.63% (p=0.000 n=10)
RegexpMatchEasy1_32    165.4n ± 0%    164.1n ± 0%     -0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K    1.629µ ± 0%    1.492µ ± 0%     -8.41% (p=0.000 n=10)
RegexpMatchMedium_32   1.413µ ± 0%    1.404µ ± 0%     -0.64% (p=0.000 n=10)
RegexpMatchMedium_1K   41.59µ ± 0%    41.05µ ± 0%     -1.28% (p=0.000 n=10)
RegexpMatchHard_32     2.060µ ± 0%    2.072µ ± 0%     +0.58% (p=0.000 n=10)
RegexpMatchHard_1K     61.30µ ± 0%    60.89µ ± 0%     -0.68% (p=0.000 n=10)
Revcomp                1.357 ± 0%     1.199 ± 1%     -11.64% (p=0.000 n=10)
Template               110.6m ± 2%    112.3m ± 2%          ~ (p=0.105 n=10)
TimeParse              411.7n ± 0%    414.2n ± 1%     +0.60% (p=0.000 n=10)
TimeFormat             499.9n ± 0%    496.9n ± 0%     -0.60% (p=0.000 n=10)
geomean                103.0µ         101.0µ          -1.98%

                       │ CL 479498 v11 │             this CL              │
                       │      B/s      │     B/s      vs base             │
GobDecode              48.24Mi ± 1%   49.02Mi ± 1%    +1.62% (p=0.000 n=10)
GobEncode              41.53Mi ± 0%   42.23Mi ± 0%    +1.69% (p=0.001 n=10)
Gzip                   45.65Mi ± 0%   45.77Mi ± 0%    +0.25% (p=0.000 n=10)
Gunzip                 212.2Mi ± 0%   228.7Mi ± 0%    +7.76% (p=0.000 n=10)
JSONEncode             99.18Mi ± 0%   100.08Mi ± 0%   +0.91% (p=0.000 n=10)
JSONDecode             24.42Mi ± 1%   23.93Mi ± 1%    -2.03% (p=0.000 n=10)
GoParse                7.458Mi ± 1%   7.544Mi ± 1%    +1.15% (p=0.001 n=10)
RegexpMatchEasy0_32    227.3Mi ± 0%   226.8Mi ± 0%    -0.21% (p=0.000 n=10)
RegexpMatchEasy0_1K    632.5Mi ± 0%   715.7Mi ± 0%   +13.15% (p=0.000 n=10)
RegexpMatchEasy1_32    184.5Mi ± 0%   186.0Mi ± 0%    +0.81% (p=0.000 n=10)
RegexpMatchEasy1_1K    599.4Mi ± 0%   654.3Mi ± 0%    +9.17% (p=0.000 n=10)
RegexpMatchMedium_32   21.60Mi ± 0%   21.74Mi ± 0%    +0.64% (p=0.000 n=10)
RegexpMatchMedium_1K   23.48Mi ± 0%   23.78Mi ± 0%    +1.30% (p=0.000 n=10)
RegexpMatchHard_32     14.81Mi ± 0%   14.72Mi ± 0%    -0.58% (p=0.000 n=10)
RegexpMatchHard_1K     15.93Mi ± 0%   16.04Mi ± 0%    +0.72% (p=0.000 n=10)
Revcomp                178.6Mi ± 0%   202.2Mi ± 1%   +13.18% (p=0.000 n=10)
Template               16.73Mi ± 2%   16.48Mi ± 2%         ~ (p=0.093 n=10)
geomean                60.58Mi        62.23Mi         +2.72%

The only significant regression is the Fannkuch11 case. The perf records were manually inspected: the hottest part of the code is virtually unchanged except for the alignment of two instructions, which seem to sit on different sides of a 32- or even 64-byte boundary. So again, the regression is likely due to micro-architecture quirks, and the change is in fact a win across the board.

Updates golang#59120
Change-Id: Ibbf64988c9d06f7c1d359480a1d6aecfa2c25b65
For the SubFromLen64 codegen test case to work as intended, we need to fold c-(-(x-d)) into x+(c-d).

Still, some instances of LeadingZeros are not optimized into single CLZ instructions right now (actually, the LeadingZeros micro-benchmarks are currently still compiled with redundant adds/subs of 64, due to interference of loop optimizations before lowering), but perf numbers indicate it's not that bad after all.

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
                  │   before    │               after                │
                  │   sec/op    │   sec/op       vs base             │
LeadingZeros        3.675n ± 0%    1.545n ± 1%   -57.96% (p=0.000 n=10)
LeadingZeros8       2.001n ± 0%    1.868n ± 0%    -6.62% (p=0.000 n=10)
LeadingZeros16      3.144n ± 0%    1.864n ± 1%   -40.71% (p=0.000 n=10)
LeadingZeros32      4.265n ± 1%    1.653n ± 1%   -61.24% (p=0.000 n=10)
LeadingZeros64      3.962n ± 0%    1.539n ± 0%   -61.16% (p=0.000 n=10)
geomean             3.299n         1.688n        -48.84%

go1 benchmark results on the same box:

goos: linux
goarch: loong64
pkg: test/bench/go1
                       │  CL 483355  │              this CL               │
                       │   sec/op    │   sec/op       vs base             │
BinaryTree17           13.75 ± 2%     13.70 ± 2%           ~ (p=0.579 n=10)
Fannkuch11             3.650 ± 0%     3.415 ± 0%      -6.46% (p=0.000 n=10)
FmtFprintfEmpty        94.45n ± 0%    94.98n ± 0%     +0.56% (p=0.000 n=10)
FmtFprintfString       155.2n ± 0%    151.1n ± 0%     -2.61% (p=0.000 n=10)
FmtFprintfInt          154.4n ± 0%    153.6n ± 0%     -0.52% (p=0.000 n=10)
FmtFprintfIntInt       237.1n ± 0%    234.7n ± 0%     -0.99% (p=0.000 n=10)
FmtFprintfPrefixedInt  312.8n ± 0%    314.2n ± 0%     +0.45% (p=0.000 n=10)
FmtFprintfFloat        390.5n ± 0%    402.1n ± 0%     +2.97% (p=0.000 n=10)
FmtManyArgs            905.0n ± 0%    918.6n ± 0%     +1.51% (p=0.000 n=10)
GobDecode              14.93m ± 1%    14.98m ± 1%     +0.33% (p=0.015 n=10)
GobEncode              17.33m ± 0%    17.26m ± 1%     -0.39% (p=0.023 n=10)
Gzip                   404.3m ± 0%    404.6m ± 0%     +0.08% (p=0.000 n=10)
Gunzip                 80.92m ± 0%    80.97m ± 0%     +0.06% (p=0.000 n=10)
HTTPClientServer       86.14µ ± 0%    84.39µ ± 0%     -2.03% (p=0.000 n=10)
JSONEncode             18.49m ± 0%    18.50m ± 0%          ~ (p=0.436 n=10)
JSONDecode             77.34m ± 1%    76.26m ± 1%     -1.40% (p=0.000 n=10)
Mandelbrot200          6.521m ± 0%    6.508m ± 0%          ~ (p=0.138 n=10)
GoParse                7.324m ± 1%    7.413m ± 1%     +1.22% (p=0.005 n=10)
RegexpMatchEasy0_32    134.6n ± 0%    134.6n ± 0%          ~ (p=0.195 n=10)
RegexpMatchEasy0_1K    1.365µ ± 0%    1.366µ ± 0%     +0.07% (p=0.038 n=10)
RegexpMatchEasy1_32    164.1n ± 0%    164.1n ± 0%          ~ (p=0.230 n=10)
RegexpMatchEasy1_1K    1.492µ ± 0%    1.492µ ± 0%          ~ (p=0.211 n=10)
RegexpMatchMedium_32   1.404µ ± 0%    1.403µ ± 0%     -0.07% (p=0.000 n=10)
RegexpMatchMedium_1K   41.05µ ± 0%    41.04µ ± 0%     -0.04% (p=0.000 n=10)
RegexpMatchHard_32     2.072µ ± 0%    2.071µ ± 0%     -0.05% (p=0.000 n=10)
RegexpMatchHard_1K     60.89µ ± 0%    60.87µ ± 0%     -0.04% (p=0.000 n=10)
Revcomp                1.199 ± 1%     1.200 ± 0%           ~ (p=0.481 n=10)
Template               112.3m ± 2%    112.9m ± 2%          ~ (p=0.353 n=10)
TimeParse              414.2n ± 1%    412.5n ± 0%     -0.40% (p=0.000 n=10)
TimeFormat             496.9n ± 0%    496.6n ± 0%          ~ (p=0.341 n=10)
geomean                101.0µ         100.7µ          -0.26%

                       │  CL 483355  │              this CL               │
                       │     B/s     │     B/s        vs base             │
GobDecode              49.02Mi ± 1%   48.87Mi ± 1%    -0.32% (p=0.014 n=10)
GobEncode              42.23Mi ± 0%   42.40Mi ± 1%    +0.40% (p=0.022 n=10)
Gzip                   45.77Mi ± 0%   45.73Mi ± 0%    -0.07% (p=0.000 n=10)
Gunzip                 228.7Mi ± 0%   228.6Mi ± 0%    -0.06% (p=0.000 n=10)
JSONEncode             100.1Mi ± 0%   100.0Mi ± 0%         ~ (p=0.470 n=10)
JSONDecode             23.93Mi ± 1%   24.27Mi ± 1%    +1.43% (p=0.000 n=10)
GoParse                7.544Mi ± 1%   7.448Mi ± 1%    -1.26% (p=0.005 n=10)
RegexpMatchEasy0_32    226.8Mi ± 0%   226.7Mi ± 0%    -0.06% (p=0.001 n=10)
RegexpMatchEasy0_1K    715.7Mi ± 0%   715.1Mi ± 0%    -0.08% (p=0.022 n=10)
RegexpMatchEasy1_32    186.0Mi ± 0%   186.0Mi ± 0%         ~ (p=0.493 n=10)
RegexpMatchEasy1_1K    654.3Mi ± 0%   654.6Mi ± 0%    +0.04% (p=0.000 n=10)
RegexpMatchMedium_32   21.74Mi ± 0%   21.74Mi ± 0%    +0.02% (p=0.022 n=10)
RegexpMatchMedium_1K   23.78Mi ± 0%   23.79Mi ± 0%    +0.04% (p=0.000 n=10)
RegexpMatchHard_32     14.72Mi ± 0%   14.73Mi ± 0%    +0.06% (p=0.000 n=10)
RegexpMatchHard_1K     16.04Mi ± 0%   16.04Mi ± 0%         ~ (p=1.000 n=10) ¹
Revcomp                202.2Mi ± 1%   202.0Mi ± 0%         ~ (p=0.469 n=10)
Template               16.48Mi ± 2%   16.38Mi ± 2%         ~ (p=0.342 n=10)
geomean                62.23Mi        62.21Mi         -0.04%
¹ all samples are equal

In this case though, all significant perf changes are likely due to micro-architectural quirks.

Updates golang#59120
Change-Id: Icc8f7d8e79c6168aae634f5c36f044f3fd034d89
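The algebraic fold named in this commit message, c-(-(x-d)) into x+(c-d), can be checked with ordinary wrapping integer arithmetic. The sketch below is illustrative only: `unfolded`, `folded` and `subFromLen64` are hypothetical names, and `subFromLen64` merely assumes the codegen test has the shape `64 - bits.Len64(n)`, which is by definition equal to `bits.LeadingZeros64(n)`.

```go
package main

import (
	"fmt"
	"math/bits"
)

// unfolded is the expression shape before the fold: c - (-(x - d)).
func unfolded(x, c, d int64) int64 { return c - (-(x - d)) }

// folded is the shape after the fold: x + (c - d). When c and d are
// constants, (c - d) can be computed at compile time.
func folded(x, c, d int64) int64 { return x + (c - d) }

// subFromLen64 is an assumed stand-in for the SubFromLen64 test case.
func subFromLen64(n uint64) int { return 64 - bits.Len64(n) }

func main() {
	for _, x := range []int64{-7, 0, 5, 1 << 40} {
		fmt.Println(x, unfolded(x, 64, 64), folded(x, 64, 64))
	}
	for _, n := range []uint64{0, 1, 0xff, 1 << 63} {
		fmt.Println(n, subFromLen64(n), bits.LeadingZeros64(n))
	}
}
```

With c and d both equal to 64, the folded form reduces to x itself, which is the kind of cancellation that would leave a bare leading-zero count.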
Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
             │   before    │               after                │
             │   sec/op    │   sec/op       vs base             │
Reverse        4.2280n ± 0%   0.8029n ± 0%  -81.01% (p=0.000 n=10)
Reverse8       1.0050n ± 0%   0.8029n ± 0%  -20.11% (p=0.000 n=10)
Reverse16      1.9600n ± 0%   0.8029n ± 0%  -59.04% (p=0.000 n=10)
Reverse32      4.0205n ± 0%   0.8029n ± 0%  -80.03% (p=0.000 n=10)
Reverse64      4.0360n ± 0%   0.8029n ± 0%  -80.11% (p=0.000 n=10)
geomean        2.668n         0.8029n       -69.90%

The operation seems to be unused in the tree outside compress/flate, where a very slight improvement (time geomean -0.16%, throughput geomean +0.16%) was observed with the change applied.

Updates golang#59120
Change-Id: Ie1b446386655e0bb6808e435257293c30420626e
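As an illustration of how a compressor might use bit reversal, the sketch below reverses the low n bits of a code with `bits.Reverse16` and cross-checks the result against a plain loop. Both helper names are hypothetical; this is not claimed to be the code in compress/flate, only a minimal example of the kind of operation that an intrinsified `bits.Reverse16` would speed up.

```go
package main

import (
	"fmt"
	"math/bits"
)

// reverseLowBits reverses the low n bits of code (0 <= n <= 16) by moving
// them to the top of the word and reversing all 16 bits at once.
func reverseLowBits(code uint16, n uint) uint16 {
	return bits.Reverse16(code << (16 - n))
}

// reverseLowBitsLoop is a bit-by-bit reference implementation.
func reverseLowBitsLoop(code uint16, n uint) uint16 {
	var r uint16
	for i := uint(0); i < n; i++ {
		r = r<<1 | code&1 // append the current lowest bit of code to r
		code >>= 1
	}
	return r
}

func main() {
	for _, tc := range []struct {
		code uint16
		n    uint
	}{{0b011, 3}, {0b0001, 4}, {0b10110, 5}, {0xffff, 16}} {
		fmt.Printf("code=%#b n=%d fast=%#b loop=%#b\n",
			tc.code, tc.n, reverseLowBits(tc.code, tc.n), reverseLowBitsLoop(tc.code, tc.n))
	}
}
```

The single-call version does a constant amount of work regardless of n, whereas the loop takes n iterations.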
Change https://go.dev/cl/483656 mentions this issue: |
…xtensions Updates golang#59120 Change-Id: Ia7dd0dfe20c0ea3e64889e2b38c6b2118b50d56e (cherry picked from commit 6c2c3c8470a0a5d0e756e50cf45f140d553ef0b2)
…intrinsics for loong64 Updates golang#59120 Change-Id: I6c90f727eb00e0add2a5f8575ac045b9e288af54 (cherry picked from commit ba1650c3c739434795465d953ef9a193a68c5024)
Updates golang#59120 Change-Id: Ibbf64988c9d06f7c1d359480a1d6aecfa2c25b65 (cherry picked from commit 03e1790d8d84c3955b0294992f1d7b6b7693ed3f)
… for loong64

For the SubFromLen64 codegen test case to work as intended, we need to fold c-(-(x-d)) into x+(c-d).

Still, some instances of LeadingZeros are not optimized into single CLZ instructions right now (actually, the LeadingZeros micro-benchmarks are currently still compiled with redundant adds/subs of 64, due to interference of loop optimizations before lowering), but perf numbers indicate it's not that bad after all.

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
                 │   before    │                after                │
                 │   sec/op    │   sec/op     vs base                │
LeadingZeros       3.675n ± 0%   1.545n ± 1%  -57.96% (p=0.000 n=10)
LeadingZeros8      2.001n ± 0%   1.868n ± 0%   -6.62% (p=0.000 n=10)
LeadingZeros16     3.144n ± 0%   1.864n ± 1%  -40.71% (p=0.000 n=10)
LeadingZeros32     4.265n ± 1%   1.653n ± 1%  -61.24% (p=0.000 n=10)
LeadingZeros64     3.962n ± 0%   1.539n ± 0%  -61.16% (p=0.000 n=10)
geomean            3.299n        1.688n       -48.84%

go1 benchmark results on the same box:

goos: linux
goarch: loong64
pkg: test/bench/go1
                        │  CL 483355   │              this CL               │
                        │    sec/op    │    sec/op     vs base              │
BinaryTree17               13.75 ± 2%     13.70 ± 2%        ~ (p=0.579 n=10)
Fannkuch11                 3.650 ± 0%     3.415 ± 0%   -6.46% (p=0.000 n=10)
FmtFprintfEmpty           94.45n ± 0%    94.98n ± 0%   +0.56% (p=0.000 n=10)
FmtFprintfString          155.2n ± 0%    151.1n ± 0%   -2.61% (p=0.000 n=10)
FmtFprintfInt             154.4n ± 0%    153.6n ± 0%   -0.52% (p=0.000 n=10)
FmtFprintfIntInt          237.1n ± 0%    234.7n ± 0%   -0.99% (p=0.000 n=10)
FmtFprintfPrefixedInt     312.8n ± 0%    314.2n ± 0%   +0.45% (p=0.000 n=10)
FmtFprintfFloat           390.5n ± 0%    402.1n ± 0%   +2.97% (p=0.000 n=10)
FmtManyArgs               905.0n ± 0%    918.6n ± 0%   +1.51% (p=0.000 n=10)
GobDecode                 14.93m ± 1%    14.98m ± 1%   +0.33% (p=0.015 n=10)
GobEncode                 17.33m ± 0%    17.26m ± 1%   -0.39% (p=0.023 n=10)
Gzip                      404.3m ± 0%    404.6m ± 0%   +0.08% (p=0.000 n=10)
Gunzip                    80.92m ± 0%    80.97m ± 0%   +0.06% (p=0.000 n=10)
HTTPClientServer          86.14µ ± 0%    84.39µ ± 0%   -2.03% (p=0.000 n=10)
JSONEncode                18.49m ± 0%    18.50m ± 0%        ~ (p=0.436 n=10)
JSONDecode                77.34m ± 1%    76.26m ± 1%   -1.40% (p=0.000 n=10)
Mandelbrot200             6.521m ± 0%    6.508m ± 0%        ~ (p=0.138 n=10)
GoParse                   7.324m ± 1%    7.413m ± 1%   +1.22% (p=0.005 n=10)
RegexpMatchEasy0_32       134.6n ± 0%    134.6n ± 0%        ~ (p=0.195 n=10)
RegexpMatchEasy0_1K       1.365µ ± 0%    1.366µ ± 0%   +0.07% (p=0.038 n=10)
RegexpMatchEasy1_32       164.1n ± 0%    164.1n ± 0%        ~ (p=0.230 n=10)
RegexpMatchEasy1_1K       1.492µ ± 0%    1.492µ ± 0%        ~ (p=0.211 n=10)
RegexpMatchMedium_32      1.404µ ± 0%    1.403µ ± 0%   -0.07% (p=0.000 n=10)
RegexpMatchMedium_1K      41.05µ ± 0%    41.04µ ± 0%   -0.04% (p=0.000 n=10)
RegexpMatchHard_32        2.072µ ± 0%    2.071µ ± 0%   -0.05% (p=0.000 n=10)
RegexpMatchHard_1K        60.89µ ± 0%    60.87µ ± 0%   -0.04% (p=0.000 n=10)
Revcomp                    1.199 ± 1%     1.200 ± 0%        ~ (p=0.481 n=10)
Template                  112.3m ± 2%    112.9m ± 2%        ~ (p=0.353 n=10)
TimeParse                 414.2n ± 1%    412.5n ± 0%   -0.40% (p=0.000 n=10)
TimeFormat                496.9n ± 0%    496.6n ± 0%        ~ (p=0.341 n=10)
geomean                   101.0µ         100.7µ        -0.26%

                        │  CL 483355   │              this CL                │
                        │     B/s      │     B/s       vs base               │
GobDecode                49.02Mi ± 1%    48.87Mi ± 1%   -0.32% (p=0.014 n=10)
GobEncode                42.23Mi ± 0%    42.40Mi ± 1%   +0.40% (p=0.022 n=10)
Gzip                     45.77Mi ± 0%    45.73Mi ± 0%   -0.07% (p=0.000 n=10)
Gunzip                   228.7Mi ± 0%    228.6Mi ± 0%   -0.06% (p=0.000 n=10)
JSONEncode               100.1Mi ± 0%    100.0Mi ± 0%        ~ (p=0.470 n=10)
JSONDecode               23.93Mi ± 1%    24.27Mi ± 1%   +1.43% (p=0.000 n=10)
GoParse                  7.544Mi ± 1%    7.448Mi ± 1%   -1.26% (p=0.005 n=10)
RegexpMatchEasy0_32      226.8Mi ± 0%    226.7Mi ± 0%   -0.06% (p=0.001 n=10)
RegexpMatchEasy0_1K      715.7Mi ± 0%    715.1Mi ± 0%   -0.08% (p=0.022 n=10)
RegexpMatchEasy1_32      186.0Mi ± 0%    186.0Mi ± 0%        ~ (p=0.493 n=10)
RegexpMatchEasy1_1K      654.3Mi ± 0%    654.6Mi ± 0%   +0.04% (p=0.000 n=10)
RegexpMatchMedium_32     21.74Mi ± 0%    21.74Mi ± 0%   +0.02% (p=0.022 n=10)
RegexpMatchMedium_1K     23.78Mi ± 0%    23.79Mi ± 0%   +0.04% (p=0.000 n=10)
RegexpMatchHard_32       14.72Mi ± 0%    14.73Mi ± 0%   +0.06% (p=0.000 n=10)
RegexpMatchHard_1K       16.04Mi ± 0%    16.04Mi ± 0%        ~ (p=1.000 n=10) ¹
Revcomp                  202.2Mi ± 1%    202.0Mi ± 0%        ~ (p=0.469 n=10)
Template                 16.48Mi ± 2%    16.38Mi ± 2%        ~ (p=0.342 n=10)
geomean                  62.23Mi         62.21Mi        -0.04%
¹ all samples are equal

In this case though, all significant perf changes are likely due to micro-architectural quirks.

Updates golang#59120

Change-Id: Icc8f7d8e79c6168aae634f5c36f044f3fd034d89
(cherry picked from commit 80a298243a07e982573e14723d8133fc5be45065)
…nsics for loong64

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
                 │    before    │                after                 │
                 │    sec/op    │    sec/op     vs base                │
ReverseBytes       3.0130n ± 0%   0.6517n ± 2%  -78.37% (p=0.000 n=10)
ReverseBytes16     0.9027n ± 0%   0.6526n ± 2%  -27.71% (p=0.000 n=10)
ReverseBytes32     1.7040n ± 0%   0.6511n ± 1%  -61.79% (p=0.000 n=10)
ReverseBytes64     2.7080n ± 0%   0.6499n ± 1%  -76.00% (p=0.000 n=10)
geomean             1.882n        0.6513n       -65.40%

Go1 benchmark results indicate no meaningful change except for micro-architecture-related fluctuations.

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac
(cherry picked from commit 4e0bacc50e09ea7defbf1e769b6ee5467e82e881)
Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
            │    before    │                after                 │
            │    sec/op    │    sec/op     vs base                │
Reverse       4.2280n ± 0%   0.8029n ± 0%  -81.01% (p=0.000 n=10)
Reverse8      1.0050n ± 0%   0.8029n ± 0%  -20.11% (p=0.000 n=10)
Reverse16     1.9600n ± 0%   0.8029n ± 0%  -59.04% (p=0.000 n=10)
Reverse32     4.0205n ± 0%   0.8029n ± 0%  -80.03% (p=0.000 n=10)
Reverse64     4.0360n ± 0%   0.8029n ± 0%  -80.11% (p=0.000 n=10)
geomean        2.668n        0.8029n       -69.90%

The operation seems to be unused anywhere else in the tree except in compress/flate, for which a very slight improvement (time geomean -0.16%, throughput geomean +0.16%) was observed with the change applied.

Updates golang#59120

Change-Id: Ie1b446386655e0bb6808e435257293c30420626e
(cherry picked from commit 7e6c4dce73a400b8928207c66442eaf9fcd535fa)
…nsics for loong64

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
                 │    before    │                after                 │
                 │    sec/op    │    sec/op     vs base                │
ReverseBytes       3.0130n ± 0%   0.6517n ± 2%  -78.37% (p=0.000 n=10)
ReverseBytes16     0.9027n ± 0%   0.6526n ± 2%  -27.71% (p=0.000 n=10)
ReverseBytes32     1.7040n ± 0%   0.6511n ± 1%  -61.79% (p=0.000 n=10)
ReverseBytes64     2.7080n ± 0%   0.6499n ± 1%  -76.00% (p=0.000 n=10)
geomean             1.882n        0.6513n       -65.40%

Go1 benchmark results indicate no meaningful change except for micro-architecture-related fluctuations.

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac
[xen0n: removed Bswap16 because go1.20 doesn't support this op]
(cherry picked from commit 4e0bacc50e09ea7defbf1e769b6ee5467e82e881)
Change https://go.dev/cl/577515 mentions this issue.
Change https://go.dev/cl/580280 mentions this issue.
Change https://go.dev/cl/580283 mentions this issue.
Make math.{Min,Max} intrinsics and implement math.{archMax,archMin} in hardware.

goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A6000 @ 2500.00MHz
           │  old.bench   │              new.bench              │
           │    sec/op    │   sec/op     vs base                │
Max          7.606n ± 0%    3.087n ± 0%  -59.41% (p=0.000 n=20)
Min          7.205n ± 0%    2.904n ± 0%  -59.69% (p=0.000 n=20)
MinFloat    37.220n ± 0%    4.802n ± 0%  -87.10% (p=0.000 n=20)
MaxFloat    33.620n ± 0%    4.802n ± 0%  -85.72% (p=0.000 n=20)
geomean      16.18n         3.792n       -76.57%

goos: linux
goarch: loong64
pkg: runtime
cpu: Loongson-3A5000 @ 2500.00MHz
           │  old.bench   │              new.bench              │
           │    sec/op    │   sec/op     vs base                │
Max         10.010n ± 0%    7.196n ± 0%  -28.11% (p=0.000 n=20)
Min          8.806n ± 0%    7.155n ± 0%  -18.75% (p=0.000 n=20)
MinFloat    60.010n ± 0%    7.976n ± 0%  -86.71% (p=0.000 n=20)
MaxFloat    56.410n ± 0%    7.980n ± 0%  -85.85% (p=0.000 n=20)
geomean      23.37n         7.566n       -67.63%

Updates #59120.

Change-Id: I6815d20bc304af3cbf5d6ca8fe0ca1c2ddebea2d
Reviewed-on: https://go-review.googlesource.com/c/go/+/580283
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
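A hardware-backed math.Min/math.Max has to keep the special-case behavior documented for the math package, in particular NaN propagation and the ordering of signed zeros. The sketch below only exercises those documented cases through the public API; the helper names are hypothetical and nothing here is loong64-specific.

```go
package main

import (
	"fmt"
	"math"
)

// maxPropagatesNaN reports whether math.Max returns NaN when one operand is NaN.
func maxPropagatesNaN() bool { return math.IsNaN(math.Max(1, math.NaN())) }

// minPropagatesNaN reports the same for math.Min.
func minPropagatesNaN() bool { return math.IsNaN(math.Min(math.NaN(), 1)) }

// maxOfZerosIsPositive reports whether Max(-0, +0) is +0.
func maxOfZerosIsPositive() bool {
	negZero := math.Copysign(0, -1)
	return !math.Signbit(math.Max(negZero, 0))
}

// minOfZerosIsNegative reports whether Min(+0, -0) is -0.
func minOfZerosIsNegative() bool {
	negZero := math.Copysign(0, -1)
	return math.Signbit(math.Min(0, negZero))
}

func main() {
	fmt.Println("Max propagates NaN:", maxPropagatesNaN())
	fmt.Println("Min propagates NaN:", minPropagatesNaN())
	fmt.Println("Max(-0, +0) is +0: ", maxOfZerosIsPositive())
	fmt.Println("Min(+0, -0) is -0: ", minOfZerosIsNegative())
}
```

All four checks should report `true` on any conforming implementation, whether the functions are compiled to library code or to dedicated instructions.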
goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A6000 @ 2500.00MHz
           │  old.bench   │              new.bench               │
           │    sec/op    │    sec/op     vs base                │
Copysign     1.9710n ± 0%   0.8006n ± 0%  -59.38% (p=0.000 n=10)
Abs          1.8745n ± 0%   0.8006n ± 0%  -57.29% (p=0.000 n=10)
geomean       1.922n        0.8006n       -58.35%

goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A5000 @ 2500.00MHz
           │  old.bench   │              new.bench               │
           │    sec/op    │    sec/op     vs base                │
Copysign     2.4020n ± 0%   0.9006n ± 0%  -62.51% (p=0.000 n=10)
Abs          2.4020n ± 0%   0.8005n ± 0%  -66.67% (p=0.000 n=10)
geomean       2.402n        0.8491n       -64.65%

Updates #59120.

Change-Id: Ic409e1f4d15ad15cb3568a5aaa100046e9302842
Reviewed-on: https://go-review.googlesource.com/c/go/+/580280
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Change https://go.dev/cl/624575 mentions this issue.
Change https://go.dev/cl/624576 mentions this issue.
Change https://go.dev/cl/624276 mentions this issue.
For the SubFromLen64 codegen test case to work as intended, we need to fold c-(-(x-d)) into x+(c-d).

Still, some instances of LeadingZeros are not optimized into single CLZ instructions right now (actually, the LeadingZeros micro-benchmarks are currently still compiled with redundant adds/subs of 64, due to interference of loop optimizations before lowering), but perf numbers indicate it's not that bad after all.

Micro-benchmark results on Loongson 3A5000 and 3A6000:

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A5000 @ 2500.00MHz
                 |  bench.old  |              bench.new              |
                 |   sec/op    |   sec/op     vs base                |
LeadingZeros       3.660n ± 0%   1.348n ± 0%  -63.17% (p=0.000 n=20)
LeadingZeros8      1.777n ± 0%   1.767n ± 0%   -0.56% (p=0.000 n=20)
LeadingZeros16     2.816n ± 0%   1.770n ± 0%  -37.14% (p=0.000 n=20)
LeadingZeros32     5.293n ± 1%   1.683n ± 0%  -68.21% (p=0.000 n=20)
LeadingZeros64     3.622n ± 0%   1.349n ± 0%  -62.76% (p=0.000 n=20)
geomean            3.229n        1.571n       -51.35%

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A6000 @ 2500.00MHz
                 |  bench.old   |              bench.new               |
                 |    sec/op    |    sec/op     vs base                |
LeadingZeros        2.410n ± 0%    1.103n ± 1%  -54.23% (p=0.000 n=20)
LeadingZeros8       1.236n ± 0%    1.501n ± 0%  +21.44% (p=0.000 n=20)
LeadingZeros16      2.106n ± 0%    1.501n ± 0%  -28.73% (p=0.000 n=20)
LeadingZeros32      2.860n ± 0%    1.324n ± 0%  -53.72% (p=0.000 n=20)
LeadingZeros64     2.6135n ± 0%   0.9509n ± 0%  -63.62% (p=0.000 n=20)
geomean             2.159n         1.256n       -41.81%

Updates #59120

This patch is a copy of CL 483356.
Co-authored-by: WANG Xuerui <git@xen0n.name>

Change-Id: Iee81a17f7da06d77a427e73dfcc016f2b15ae556
Reviewed-on: https://go-review.googlesource.com/c/go/+/624575
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Micro-benchmark results on Loongson 3A5000 and 3A6000:

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A6000 @ 2500.00MHz
                 |  bench.old   |              bench.new               |
                 |    sec/op    |    sec/op     vs base                |
ReverseBytes       2.0020n ± 0%   0.4040n ± 0%  -79.82% (p=0.000 n=20)
ReverseBytes16     0.8866n ± 1%   0.8007n ± 0%   -9.69% (p=0.000 n=20)
ReverseBytes32     1.2195n ± 0%   0.8007n ± 0%  -34.34% (p=0.000 n=20)
ReverseBytes64     2.0705n ± 0%   0.8008n ± 0%  -61.32% (p=0.000 n=20)
geomean             1.455n        0.6749n       -53.62%

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A5000 @ 2500.00MHz
                 |  bench.old   |              bench.new               |
                 |    sec/op    |    sec/op     vs base                |
ReverseBytes       2.8040n ± 0%   0.5205n ± 0%  -81.44% (p=0.000 n=20)
ReverseBytes16     0.7066n ± 0%   0.8011n ± 0%  +13.37% (p=0.000 n=20)
ReverseBytes32     1.5500n ± 0%   0.8010n ± 0%  -48.32% (p=0.000 n=20)
ReverseBytes64     2.7665n ± 0%   0.8010n ± 0%  -71.05% (p=0.000 n=20)
geomean             1.707n        0.7192n       -57.87%

Updates #59120

This patch is a copy of CL 483357.
Co-authored-by: WANG Xuerui <git@xen0n.name>

Change-Id: If355354cd031533df91991fcc3392e5a6c314295
Reviewed-on: https://go-review.googlesource.com/c/go/+/624576
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Change https://go.dev/cl/625335 mentions this issue.
Benchmark results on Loongson 3A5000 and 3A6000:

goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A6000 @ 2500.00MHz
      |  bench.old   |              bench.new              |
      |    sec/op    |   sec/op     vs base                |
FMA     25.930n ± 0%   2.002n ± 0%  -92.28% (p=0.000 n=10)

goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A5000 @ 2500.00MHz
      |  bench.old   |              bench.new              |
      |    sec/op    |   sec/op     vs base                |
FMA     32.840n ± 0%   2.002n ± 0%  -93.90% (p=0.000 n=10)

Updates #59120

This patch is a copy of CL 483355.
Co-authored-by: WANG Xuerui <git@xen0n.name>

Change-Id: I88b89d23f00864f9173a182a47ee135afec7ed6e
Reviewed-on: https://go-review.googlesource.com/c/go/+/625335
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Use Loong64's atomic operation instruction AMSWAPDB{W,V} (full barrier) to implement atomic.Xchg{32,64}

goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A5000 @ 2500.00MHz
           |  old.bench  |             new.bench              |
           |   sec/op    |   sec/op     vs base               |
Xchg         26.44n ± 0%   12.01n ± 0%  -54.58% (p=0.000 n=20)
Xchg-2       30.10n ± 0%   25.58n ± 0%  -15.02% (p=0.000 n=20)
Xchg-4       30.06n ± 0%   24.82n ± 0%  -17.43% (p=0.000 n=20)
Xchg64       26.44n ± 0%   12.02n ± 0%  -54.54% (p=0.000 n=20)
Xchg64-2     30.10n ± 0%   25.57n ± 0%  -15.05% (p=0.000 n=20)
Xchg64-4     30.05n ± 0%   24.80n ± 0%  -17.47% (p=0.000 n=20)
geomean      28.81n        19.68n       -31.69%

goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A6000 @ 2500.00MHz
           |  old.bench  |             new.bench              |
           |   sec/op    |   sec/op     vs base               |
Xchg         25.62n ± 0%   12.41n ± 0%  -51.56% (p=0.000 n=20)
Xchg-2       35.01n ± 0%   20.59n ± 0%  -41.19% (p=0.000 n=20)
Xchg-4       34.63n ± 0%   19.59n ± 0%  -43.42% (p=0.000 n=20)
Xchg64       25.62n ± 0%   12.41n ± 0%  -51.56% (p=0.000 n=20)
Xchg64-2     35.01n ± 0%   20.59n ± 0%  -41.19% (p=0.000 n=20)
Xchg64-4     34.67n ± 0%   19.59n ± 0%  -43.50% (p=0.000 n=20)
geomean      31.44n        17.11n       -45.59%

Updates #59120.

Change-Id: Ied74fc20338b63799c6d6eeb122c31b42cff0f7e
Reviewed-on: https://go-review.googlesource.com/c/go/+/481578
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: WANG Xuerui <git@xen0n.name>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>
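The runtime-internal Xchg functions are not callable from user code, so the sketch below uses `sync/atomic.SwapUint64` as a stand-in, on the assumption that it is lowered to the same exchange operation. `swapAll` is a hypothetical helper. It relies on one property of an atomic exchange: every stored value is handed back to exactly one later swapper or remains in the slot.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// swapAll starts n goroutines that each atomically swap a distinct value
// (1..n) into a shared slot. It returns the sum of the values the swaps
// handed back and the value left in the slot at the end.
func swapAll(n int) (sumReturned, final uint64) {
	var slot uint64
	var wg sync.WaitGroup
	results := make([]uint64, n)

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Store i+1 and receive the previous contents in one atomic step.
			results[i] = atomic.SwapUint64(&slot, uint64(i+1))
		}(i)
	}
	wg.Wait()

	for _, r := range results {
		sumReturned += r
	}
	return sumReturned, atomic.LoadUint64(&slot)
}

func main() {
	const n = 1000
	sum, final := swapAll(n)
	// No stored value is lost or duplicated, so the total is 1+2+...+n.
	fmt.Println(sum+final == n*(n+1)/2)
}
```

If the exchange were not atomic, two goroutines could read the same previous value and the final check would fail intermittently.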
Micro-benchmark results on Loongson 3A5000 and 3A6000:

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A6000 @ 2500.00MHz
            |  CL 624576   |               this CL                |
            |    sec/op    |    sec/op     vs base                |
Reverse       2.8130n ± 0%   0.8008n ± 0%  -71.53% (p=0.000 n=20)
Reverse8      0.7014n ± 0%   0.4040n ± 0%  -42.40% (p=0.000 n=20)
Reverse16     1.2975n ± 0%   0.6632n ± 1%  -48.89% (p=0.000 n=20)
Reverse32     2.7520n ± 0%   0.4042n ± 0%  -85.31% (p=0.000 n=20)
Reverse64     2.8970n ± 0%   0.4041n ± 0%  -86.05% (p=0.000 n=20)
geomean        1.828n        0.5116n       -72.01%

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A5000 @ 2500.00MHz
            |  CL 624576   |               this CL                |
            |    sec/op    |    sec/op     vs base                |
Reverse       4.0050n ± 0%   0.8011n ± 0%  -80.00% (p=0.000 n=20)
Reverse8      0.8010n ± 0%   0.5210n ± 1%  -34.96% (p=0.000 n=20)
Reverse16     1.6160n ± 0%   0.6008n ± 0%  -62.82% (p=0.000 n=20)
Reverse32     3.8550n ± 0%   0.5179n ± 0%  -86.57% (p=0.000 n=20)
Reverse64     3.8050n ± 0%   0.5177n ± 0%  -86.40% (p=0.000 n=20)
geomean        2.378n        0.5828n       -75.49%

Updates #59120

This patch is a copy of CL 483656.
Co-authored-by: WANG Xuerui <git@xen0n.name>

Change-Id: I98681091763279279c8404bd0295785f13ea1c8e
Reviewed-on: https://go-review.googlesource.com/c/go/+/624276
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
This issue is mainly for tracking the implementation progress of various low-hanging fruit regarding `loong64` optimizations.

There are many missed optimization opportunities on `loong64`. A quick survey of SSA intrinsics uncovers:

- `runtime.publicationBarrier`
  - `dmb st` on `arm64`
  - `dbar 0` on LA64 v1.00
  - `dbar <TBD>` on next revision of LA64 (finer-grained barriers are to be supported)
- `runtime.Bswap{32,64}`
  - `revb.{2w,d}` on LA64 v1.00: CL 483357
- `runtime/internal/sys.Prefetch{,Streamed}`
  - `preld` on LA64 v1.00
- `runtime/internal/atomic.{And,Or}`
  - `am{and,or}.d` on LA64 v1.00: CL 482756
- not possible with LA64 v1.00
  - `math.{Trunc,Ceil,Floor,RoundToEven}`
    - `frint.[sd]` is not orthogonal: no fixed rounding mode variants (unlike e.g. `ftintr{m,p,z,ne}`).
- `math.Round`
  - `frint.[sd]` on LA64 v1.00 -- have to check if the rounding mode behavior is tolerable
- `math.Abs`
  - `fabs.[sd]` on LA64 v1.00
- `math.Copysign`
  - `fcopysign.[sd]` on LA64 v1.00
- `math.FMA`
  - `f{,n}m{add,sub}.[sd]` on LA64 v1.00: CL 483355
- `math/bits.TrailingZeros{64,32}` (`ssa.OpCtz{64,32}`)
  - `ctz.[wd]` on LA64 v1.00: CL 479498
- `math/bits.Len{64,32,}` (`ssa.OpBitLen{64,32}`)
  - `clz.[wd]` on LA64 v1.00: CL 483356
  - significant performance regression across the board, needs investigation
  - confirmed to be micro-architecture quirk, alleviated somewhat by various alignment tricks
- `math/bits.Reverse{64,32,8}` (`ssa.OpBitRev{64,32,8}`)
  - `bitrev.{d,w,4b}` on LA64 v1.00: CL 483656

We may want to implement (and preferably benchmark) all of the above.
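For the "preferably benchmark" part, a standalone measurement can be taken with `testing.Benchmark` outside of `go test`. The sketch below is a minimal harness under assumed names (`benchTrailingZeros64` and `sink` are hypothetical); it is not the harness used for the numbers in this thread, which come from the in-tree benchmarks compared with benchstat.

```go
package main

import (
	"fmt"
	"math/bits"
	"testing"
)

// sink keeps the compiler from discarding the benchmarked computation.
var sink int

// benchTrailingZeros64 measures bits.TrailingZeros64 with the standard
// benchmark driver and returns the collected result.
func benchTrailingZeros64() testing.BenchmarkResult {
	testing.Init() // registers testing flags; needed when not running under "go test"
	return testing.Benchmark(func(b *testing.B) {
		var s int
		for i := 0; i < b.N; i++ {
			// Vary the input so the call cannot be constant-folded.
			s += bits.TrailingZeros64(uint64(i) | 1<<40)
		}
		sink = s
	})
}

func main() {
	r := benchTrailingZeros64()
	fmt.Println("TrailingZeros64:", r.String())
}
```

Building this once with and once without a candidate change and comparing the reported ns/op gives a rough first signal; repeated runs and a statistical comparison are still needed before drawing conclusions, given the micro-architectural noise discussed above.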
cc @golang/loong64