On d46b665 the power operation with a non-literal exponent is slower than on Julia v1.11.5.
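Aside: the code_native calls below use b ^ (1 + 2) rather than b ^ 3 because a literal integer exponent is lowered to Base.literal_pow, while (1 + 2) goes through the generic ^ method, i.e. the same code path the benchmark with a variable exponent e exercises. A quick way to see the different lowering at the REPL:

Meta.@lower b ^ 3        # lowered code calls Base.literal_pow
Meta.@lower b ^ (1 + 2)  # lowered code computes 1 + 2, then calls the generic ^

First, Julia v1.11.5 on Apple M1: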
julia> using BenchmarkTools
julia> @benchmark b ^ e setup=(b=randn(Float32); e=3)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
Range (min … max): 2.416 ns … 25.083 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.500 ns ┊ GC (median): 0.00%
Time (mean ± σ): 2.517 ns ± 0.375 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ █ ▅ ▁ ▁
▆▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁█ █
2.42 ns Histogram: log(frequency) by time 2.58 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> code_native((Float32,); debuginfo=:none) do b b ^ (1 + 2) end
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 14, 0
.globl "_julia_#7_5040" ; -- Begin function julia_#7_5040
.p2align 2
"_julia_#7_5040": ; @"julia_#7_5040"
; Function Signature: var"#7"(Float32)
; %bb.0: ; %top
;DEBUG_VALUE: #7:b <- $s0
;DEBUG_VALUE: #7:b <- $s0
fmul s1, s0, s0
fmul s0, s1, s0
ret
; -- End function
.subsections_via_symbols
julia> versioninfo()
Julia Version 1.11.5
Commit 760b2e5b739 (2025-04-14 06:53 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin24.0.0)
CPU: 8 × Apple M1
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)
vs
julia> using BenchmarkTools
julia> @benchmark b ^ e setup=(b=randn(Float32); e=3)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
Range (min … max): 3.333 ns … 19.875 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.458 ns ┊ GC (median): 0.00%
Time (mean ± σ): 3.454 ns ± 0.244 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▇ █ ▃ ▁
▄▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█ █
3.33 ns Histogram: log(frequency) by time 3.54 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> code_native((Float32,); debuginfo=:none) do b b ^ (1 + 2) end
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 14, 0
.globl "_julia_#14_6264" ; -- Begin function julia_#14_6264
.p2align 2
"_julia_#14_6264": ; @"julia_#14_6264"
; Function Signature: var"#14"(Float32)
; %bb.0: ; %top
;DEBUG_VALUE: #14:b <- $s0
;DEBUG_VALUE: #14:b <- $s0
stp x29, x30, [sp, #-16]! ; 16-byte Folded Spill
mov x29, sp
fmul s1, s0, s0
fmul s0, s1, s0
ldp x29, x30, [sp], #16 ; 16-byte Folded Reload
ret
; -- End function
.subsections_via_symbols
julia> versioninfo()
Julia Version 1.13.0-DEV.471
Commit d46b665067* (2025-04-29 18:02 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin23.4.0)
CPU: 8 × Apple M1
WORD_SIZE: 64
LLVM: libLLVM-19.1.7 (ORCJIT, apple-m1)
GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 4 virtual cores)
Note that the native code now has two extra instructions (stp and ldp, saving and restoring the frame record). The LLVM IR is the same for both versions, but I'm not sure the difference can be attributed to LLVM alone: with the same LLVM IR I see the same native code in https://godbolt.org/z/oaW3nbrnb.
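For reference, the LLVM IR being compared can be dumped on each version with code_llvm, which takes the same arguments as the code_native calls above:

using InteractiveUtils  # already loaded by default in the REPL
code_llvm((Float32,); debuginfo=:none) do b
    b ^ (1 + 2)
end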
I see a regression on x86_64 as well:
julia> @benchmark b ^ e setup=(b=randn(Float32); e=3)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
Range (min … max): 1.912 ns … 8.272 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.923 ns ┊ GC (median): 0.00%
Time (mean ± σ): 1.928 ns ± 0.140 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▄ ▄ █ ▃ ▁
█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁█ █
1.91 ns Histogram: log(frequency) by time 1.93 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> code_native((Float32,)) do b b ^ (1 + 2) end
.text
.file "#11"
.globl "julia_#11_5539" # -- Begin function julia_#11_5539
.p2align 4, 0x90
.type "julia_#11_5539",@function
"julia_#11_5539": # @"julia_#11_5539"
; Function Signature: var"#11"(Float32)
; ┌ @ REPL[13]:1 within `#11`
# %bb.0: # %top
; │ @ REPL[13] within `#11`
#DEBUG_VALUE: #11:b <- $xmm0
push rbp
; │ @ REPL[13]:1 within `#11`
; │┌ @ math.jl:1230 within `^`
; ││┌ @ operators.jl:596 within `*` @ float.jl:493
vmulss xmm1, xmm0, xmm0
mov rbp, rsp
vmulss xmm0, xmm1, xmm0
; ││└
pop rbp
ret
.Lfunc_end0:
.size "julia_#11_5539", .Lfunc_end0-"julia_#11_5539"
; └└
# -- End function
.section ".note.GNU-stack","",@progbits
julia> versioninfo()
Julia Version 1.11.5
Commit 760b2e5b739 (2025-04-14 06:53 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 192 × AMD Ryzen Threadripper PRO 7995WX 96-Cores
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, znver4)
Threads: 1 default, 0 interactive, 1 GC (on 192 virtual cores)
vs
julia> @benchmark b ^ e setup=(b=randn(Float32); e=3)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations per sample.
Range (min … max): 2.123 ns … 7.170 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.213 ns ┊ GC (median): 0.00%
Time (mean ± σ): 2.217 ns ± 0.158 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▂
▃▁▅▁▄▁▁▃▁▄▁▄▃▁▃▁▆▁█▅▁█▁█▁▁▅▁▃▁▂▂▁▂▁▂▁▂▂▁▂▁▂▁▁▂▁▂▁▁▂▁▂▁▄▁▂ ▃
2.12 ns Histogram: frequency by time 2.36 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> code_native((Float32,)) do b b ^ (1 + 2) end
.text
.file "#23"
.section .ltext,"axl",@progbits
.globl "julia_#23_4919" # -- Begin function julia_#23_4919
.p2align 4, 0x90
.type "julia_#23_4919",@function
"julia_#23_4919": # @"julia_#23_4919"
; Function Signature: var"#23"(Float32)
; ┌ @ REPL[17]:1 within `#23`
# %bb.0: # %top
#DEBUG_VALUE: #23:b <- $xmm0
push rbp
mov rbp, rsp
; │┌ @ math.jl:1268 within `^`
; ││┌ @ math.jl:1208 within `pow_body`
; │││┌ @ operators.jl:643 within `*` @ float.jl:494
vmulss xmm1, xmm0, xmm0
vmulss xmm0, xmm1, xmm0
; ││└└
pop rbp
ret
.Lfunc_end0:
.size "julia_#23_4919", .Lfunc_end0-"julia_#23_4919"
; └└
# -- End function
.section ".note.GNU-stack","",@progbits
julia> versioninfo()
Julia Version 1.13.0-DEV.456
Commit 175ef3eb02c (2025-04-28 13:13 UTC)
Build Info:
Official https://julialang.org release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 192 × AMD Ryzen Threadripper PRO 7995WX 96-Cores
WORD_SIZE: 64
LLVM: libLLVM-19.1.7 (ORCJIT, znver4)
GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 192 virtual cores)
Here the assembly is much closer between the two versions, but mov and vmulss are swapped.
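To relate the source-line annotations in the dumps above (math.jl:1230 within ^ on v1.11.5; math.jl:1268 within ^ and math.jl:1208 within pow_body on the dev build) to the Base code, the method hit by the non-literal call can be located with @which:

using InteractiveUtils  # already loaded by default in the REPL
b = randn(Float32); e = 3
@which b ^ e  # points at the ^ method in Base's math.jl referenced by the annotations above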