As @ToucheSir suggested, I'm opening this issue here in addition to the Discourse post. I have also switched to MKL.jl as suggested.
Example:
```julia
using LinearAlgebra: BLAS
BLAS.vendor() # :mkl
using BenchmarkTools
using Flux

conv = Conv((7,7), 3 => 64; stride=2, pad=3)
dummy = randn(Float32, 224, 224, 3, 2);
conv(dummy) # ignore compile time
@benchmark conv(dummy)
#=
BenchmarkTools.Trial:
  memory estimate:  19.29 MiB
  allocs estimate:  45
  --------------
  minimum time:     5.384 ms (0.00% GC)
  median time:      5.825 ms (0.00% GC)
  mean time:        6.245 ms (5.38% GC)
  maximum time:     7.804 ms (17.60% GC)
  --------------
  samples:          800
  evals/sample:     1
=#
```
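To check whether the overhead comes from the Flux layer wrapper or from the underlying convolution itself, one can also time NNlib's `conv`, which `Conv` dispatches to internally. A minimal sketch (variable names are mine; this measures only the convolution, without the bias addition or activation that the layer adds):

```julia
using NNlib  # Flux's convolution backend (also re-exported by Flux)

w = randn(Float32, 7, 7, 3, 64)     # (kernel_h, kernel_w, in_channels, out_channels)
x = randn(Float32, 224, 224, 3, 2)  # (H, W, C, N), same shape as `dummy` above

# Same stride/padding as the Conv layer in this issue.
cdims = DenseConvDims(x, w; stride=2, padding=3)
y = conv(x, w, cdims)  # the raw convolution Conv performs internally
size(y)                # (112, 112, 64, 2)
```

Benchmarking this call with `@benchmark conv($x, $w, $cdims)` would show how much of the gap is attributable to NNlib's im2col path rather than to Flux itself.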
```julia
# Note: PyTorch's nn.Conv2d actually computes cross-correlation (no kernel
# flip), so CrossCor is the closer apples-to-apples comparison; Flux's Conv
# flips the kernel.
cor = Flux.CrossCor((7,7), 3 => 64; stride=2, pad=3)
dummy = randn(Float32, 224, 224, 3, 2);
cor(dummy) # ignore compile time
@benchmark cor(dummy)
#=
BenchmarkTools.Trial:
  memory estimate:  19.29 MiB
  allocs estimate:  51
  --------------
  minimum time:     5.518 ms (0.00% GC)
  median time:      5.806 ms (0.00% GC)
  mean time:        6.042 ms (4.94% GC)
  maximum time:     8.333 ms (13.14% GC)
  --------------
  samples:          827
  evals/sample:     1
=#
```
```julia
# Interpolating `dummy` with `$` so the benchmark does not measure
# global-variable access; the numbers barely change.
@benchmark conv($dummy)
#=
BenchmarkTools.Trial:
  memory estimate:  19.29 MiB
  allocs estimate:  45
  --------------
  minimum time:     5.503 ms (0.00% GC)
  median time:      5.786 ms (0.00% GC)
  mean time:        6.026 ms (4.99% GC)
  maximum time:     8.327 ms (12.01% GC)
  --------------
  samples:          830
  evals/sample:     1
=#
```
It seems that `@benchmark` combined with PyCall causes a memory leak, so below are several manual `@time` runs with PyCall instead. Even so, the ~2x gap is clear.
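The repeated manual `@time` calls below can also be expressed as a small helper; a minimal sketch in plain Julia (the `min_time` name and the GC-before-each-run policy are my own choices, not from the original):

```julia
# Return the minimum elapsed time (seconds) of f() over n runs, forcing a
# garbage collection before each run so a pending collection triggered by a
# previous run is not charged to the call being measured.
function min_time(f, n::Int=5)
    best = Inf
    for _ in 1:n
        GC.gc()
        best = min(best, @elapsed f())
    end
    return best
end

# Usage with a cheap stand-in workload (the issue times conv(dummy) instead):
min_time(() -> sum(abs2, randn(Float32, 1000)))
```

Taking the minimum is the usual convention for this kind of micro-benchmark, since it is the run least perturbed by GC and OS noise.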
```julia
using PyCall
const torch = pyimport_conda("torch", "torch")
const nn = pyimport("torch.nn")

dummy_py = torch.randn(2, 3, 224, 224);
conv1_py = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=true)
conv1_py(dummy_py) # ignore compile time

@time conv1_py(dummy_py) # 0.002447 seconds (3 allocations: 144 bytes)
@time conv1_py(dummy_py) # 0.002318 seconds (3 allocations: 144 bytes)
@time conv1_py(dummy_py) # 0.002344 seconds (3 allocations: 144 bytes)
@time conv1_py(dummy_py) # 0.002278 seconds (3 allocations: 144 bytes)
@time conv1_py(dummy_py) # 0.002603 seconds (3 allocations: 144 bytes)

@time conv(dummy) # 0.015352 seconds (45 allocations: 19.287 MiB, 49.63% gc time)
@time conv(dummy) # 0.005321 seconds (45 allocations: 19.287 MiB)
@time conv(dummy) # 0.005424 seconds (45 allocations: 19.287 MiB)
@time conv(dummy) # 0.005461 seconds (45 allocations: 19.287 MiB)
@time conv(dummy) # 0.005627 seconds (45 allocations: 19.287 MiB)
```
Though the allocation numbers reported for the PyCall calls look fishy (Julia cannot see PyTorch's internal allocations), @ToucheSir's measurement in pure Python gives a similar result.