Performance regression of power operation #58282

Closed
@giordano

On d46b665, the power operation with a non-literal exponent is slower than on Julia v1.11.5:

julia> using BenchmarkTools

julia> @benchmark b ^ e setup=(b=randn(Float32); e=3)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  2.416 ns … 25.083 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.500 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.517 ns ±  0.375 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▂             █              ▅             ▁ ▁
  ▆▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁█ █
  2.42 ns      Histogram: log(frequency) by time     2.58 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> code_native((Float32,); debuginfo=:none) do b b ^ (1 + 2) end
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 14, 0
	.globl	"_julia_#7_5040"                ; -- Begin function julia_#7_5040
	.p2align	2
"_julia_#7_5040":                       ; @"julia_#7_5040"
; Function Signature: var"#7"(Float32)
; %bb.0:                                ; %top
	;DEBUG_VALUE: #7:b <- $s0
	;DEBUG_VALUE: #7:b <- $s0
	fmul	s1, s0, s0
	fmul	s0, s1, s0
	ret
                                        ; -- End function
.subsections_via_symbols

julia> versioninfo()
Julia Version 1.11.5
Commit 760b2e5b739 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

vs

julia> using BenchmarkTools

julia> @benchmark b ^ e setup=(b=randn(Float32); e=3)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  3.333 ns … 19.875 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.458 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.454 ns ±  0.244 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▁           ▇          █           ▃            ▁
  ▄▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█ █
  3.33 ns      Histogram: log(frequency) by time     3.54 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> code_native((Float32,); debuginfo=:none) do b b ^ (1 + 2) end
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 14, 0
	.globl	"_julia_#14_6264"               ; -- Begin function julia_#14_6264
	.p2align	2
"_julia_#14_6264":                      ; @"julia_#14_6264"
; Function Signature: var"#14"(Float32)
; %bb.0:                                ; %top
	;DEBUG_VALUE: #14:b <- $s0
	;DEBUG_VALUE: #14:b <- $s0
	stp	x29, x30, [sp, #-16]!           ; 16-byte Folded Spill
	mov	x29, sp
	fmul	s1, s0, s0
	fmul	s0, s1, s0
	ldp	x29, x30, [sp], #16             ; 16-byte Folded Reload
	ret
                                        ; -- End function
.subsections_via_symbols

julia> versioninfo()
Julia Version 1.13.0-DEV.471
Commit d46b665067* (2025-04-29 18:02 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin23.4.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LLVM: libLLVM-19.1.7 (ORCJIT, apple-m1)
  GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 4 virtual cores)

Note that the native code now has extra frame-setup instructions (stp and ldp, plus the mov between them). The LLVM IR is the same for both versions, but I'm not sure this can be attributed just to LLVM: with the same LLVM IR I see the same native code in https://godbolt.org/z/oaW3nbrnb.
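For anyone who wants to repeat this comparison on another build, here is a minimal sketch that captures the LLVM IR and the native code as strings and reduces the assembly to its instruction lines, so the output of two Julia versions can be diffed. The names cube and instructions are hypothetical helpers introduced for illustration, and the filter is only a heuristic that assumes the layout seen in the listings above (indented instructions, directives starting with a dot, comment lines starting with ; or #).

```julia
using InteractiveUtils  # provides code_llvm and code_native outside the REPL

# Hypothetical helper: keep only the lines that look like instructions.
function instructions(asm::AbstractString)
    kept = String[]
    for line in split(asm, '\n')
        # Instructions are indented; labels and source annotations start in column 0.
        startswith(line, r"\s") || continue
        text = strip(line)
        # Skip blank lines, assembler directives (.xxx) and comment-only lines (; or #).
        (isempty(text) || first(text) in ('.', ';', '#')) && continue
        push!(kept, String(text))
    end
    return kept
end

# Same shape as the closure used in the report: the exponent is not a literal.
cube(b) = b ^ (1 + 2)

llvm_ir = sprint(io -> code_llvm(io, cube, (Float32,); debuginfo=:none))
native  = sprint(io -> code_native(io, cube, (Float32,); debuginfo=:none))

println("Julia ", VERSION)
foreach(println, instructions(native))
```

Running the script under each Julia version and diffing the printed lines (and the llvm_ir strings) separates differences introduced before code generation from differences introduced by the backend.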

I see a regression also on x86_64:

julia> @benchmark b ^ e setup=(b=randn(Float32); e=3)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  1.912 ns … 8.272 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.923 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.928 ns ± 0.140 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁ ▄                        ▄ █                          ▃ ▁
  █▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁█ █
  1.91 ns     Histogram: log(frequency) by time     1.93 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> code_native((Float32,)) do b b ^ (1 + 2) end
        .text
        .file   "#11"
        .globl  "julia_#11_5539"                # -- Begin function julia_#11_5539
        .p2align        4, 0x90
        .type   "julia_#11_5539",@function
"julia_#11_5539":                       # @"julia_#11_5539"
; Function Signature: var"#11"(Float32)
; ┌ @ REPL[13]:1 within `#11`
# %bb.0:                                # %top
; │ @ REPL[13] within `#11`
        #DEBUG_VALUE: #11:b <- $xmm0
        push    rbp
; │ @ REPL[13]:1 within `#11`
; │┌ @ math.jl:1230 within `^`
; ││┌ @ operators.jl:596 within `*` @ float.jl:493
        vmulss  xmm1, xmm0, xmm0
        mov     rbp, rsp
        vmulss  xmm0, xmm1, xmm0
; ││└
        pop     rbp
        ret
.Lfunc_end0:
        .size   "julia_#11_5539", .Lfunc_end0-"julia_#11_5539"
; └└
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

julia> versioninfo()
Julia Version 1.11.5
Commit 760b2e5b739 (2025-04-14 06:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 192 × AMD Ryzen Threadripper PRO 7995WX 96-Cores
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver4)
Threads: 1 default, 0 interactive, 1 GC (on 192 virtual cores)

vs

julia> @benchmark b ^ e setup=(b=randn(Float32); e=3)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
 Range (min … max):  2.123 ns … 7.170 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.213 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.217 ns ± 0.158 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                       █ ▂
  ▃▁▅▁▄▁▁▃▁▄▁▄▃▁▃▁▆▁█▅▁█▁█▁▁▅▁▃▁▂▂▁▂▁▂▁▂▂▁▂▁▂▁▁▂▁▂▁▁▂▁▂▁▄▁▂ ▃
  2.12 ns        Histogram: frequency by time       2.36 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> code_native((Float32,)) do b b ^ (1 + 2) end
        .text
        .file   "#23"
        .section        .ltext,"axl",@progbits
        .globl  "julia_#23_4919"                # -- Begin function julia_#23_4919
        .p2align        4, 0x90
        .type   "julia_#23_4919",@function
"julia_#23_4919":                       # @"julia_#23_4919"
; Function Signature: var"#23"(Float32)
; ┌ @ REPL[17]:1 within `#23`
# %bb.0:                                # %top
        #DEBUG_VALUE: #23:b <- $xmm0
        push    rbp
        mov     rbp, rsp
; │┌ @ math.jl:1268 within `^`
; ││┌ @ math.jl:1208 within `pow_body`
; │││┌ @ operators.jl:643 within `*` @ float.jl:494
        vmulss  xmm1, xmm0, xmm0
        vmulss  xmm0, xmm1, xmm0
; ││└└
        pop     rbp
        ret
.Lfunc_end0:
        .size   "julia_#23_4919", .Lfunc_end0-"julia_#23_4919"
; └└
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits

julia> versioninfo()
Julia Version 1.13.0-DEV.456
Commit 175ef3eb02c (2025-04-28 13:13 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 192 × AMD Ryzen Threadripper PRO 7995WX 96-Cores
  WORD_SIZE: 64
  LLVM: libLLVM-19.1.7 (ORCJIT, znver4)
  GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 192 virtual cores)

Here the assembly is much closer between the two versions, but mov and vmulss are swapped.
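The reordering can be checked mechanically. The sketch below uses the instruction sequences transcribed by hand from the two x86_64 listings above (with whitespace normalized) and verifies that both versions contain the same instructions and differ only in order. The variable names are illustrative.

```julia
# Instruction sequences copied from the x86_64 listings above.
old = ["push rbp",
       "vmulss xmm1, xmm0, xmm0",
       "mov rbp, rsp",
       "vmulss xmm0, xmm1, xmm0",
       "pop rbp",
       "ret"]                      # Julia 1.11.5

new = ["push rbp",
       "mov rbp, rsp",
       "vmulss xmm1, xmm0, xmm0",
       "vmulss xmm0, xmm1, xmm0",
       "pop rbp",
       "ret"]                      # Julia 1.13.0-DEV.456

# Same multiset of instructions in both versions?
same_instructions = sort(old) == sort(new)

# Positions at which the two sequences disagree.
moved = [i for i in eachindex(old, new) if old[i] != new[i]]

println("same instructions: ", same_instructions)
println("positions that differ: ", moved)
```

Only positions 2 and 3 differ, which matches the observation that mov and the first vmulss are swapped.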

Labels: maths (Mathematical functions), performance (Must go faster), regression (Regression in behavior compared to a previous version)