As @ToucheSir suggested, I'm opening this issue here in addition to the Discourse post. I have also switched to MKL.jl as suggested.
Example:
```julia
using LinearAlgebra: BLAS
BLAS.vendor() # :mkl
using BenchmarkTools
using Flux

conv = Conv((7,7), 3 => 64; stride=2, pad=3)
dummy = randn(Float32, 224, 224, 3, 2);
conv(dummy) # ignore compile time
@benchmark conv(dummy)
#=
BenchmarkTools.Trial:
  memory estimate:  19.29 MiB
  allocs estimate:  45
  --------------
  minimum time:     5.384 ms (0.00% GC)
  median time:      5.825 ms (0.00% GC)
  mean time:        6.245 ms (5.38% GC)
  maximum time:     7.804 ms (17.60% GC)
  --------------
  samples:          800
  evals/sample:     1
=#
```
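To check whether the overhead comes from the Flux layer wrapper or from the underlying convolution itself, one can also time NNlib's `conv`, which `Conv` dispatches to internally. A minimal sketch (variable names are mine; this measures only the convolution, without the bias addition or activation that the layer adds):

```julia
using NNlib  # Flux's convolution backend (also re-exported by Flux)

w = randn(Float32, 7, 7, 3, 64)     # (kernel_h, kernel_w, in_channels, out_channels)
x = randn(Float32, 224, 224, 3, 2)  # (H, W, C, N), same shape as `dummy` above

# Same stride/padding as the Conv layer in this issue.
cdims = DenseConvDims(x, w; stride=2, padding=3)
y = conv(x, w, cdims)  # the raw convolution Conv performs internally
size(y)                # (112, 112, 64, 2)
```

Benchmarking this call with `@benchmark conv($x, $w, $cdims)` would show how much of the gap is attributable to NNlib's im2col path rather than to Flux itself.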
```julia
# Note: PyTorch's nn.Conv2d actually computes cross-correlation (no kernel
# flip), so CrossCor is the closer apples-to-apples comparison; Flux's Conv
# flips the kernel.
cor = Flux.CrossCor((7,7), 3 => 64; stride=2, pad=3)
dummy = randn(Float32, 224, 224, 3, 2);
cor(dummy) # ignore compile time
@benchmark cor(dummy)
#=
BenchmarkTools.Trial:
  memory estimate:  19.29 MiB
  allocs estimate:  51
  --------------
  minimum time:     5.518 ms (0.00% GC)
  median time:      5.806 ms (0.00% GC)
  mean time:        6.042 ms (4.94% GC)
  maximum time:     8.333 ms (13.14% GC)
  --------------
  samples:          827
  evals/sample:     1
=#
```
```julia
# Interpolating `dummy` with `$` so the benchmark does not measure
# global-variable access; the numbers barely change.
@benchmark conv($dummy)
#=
BenchmarkTools.Trial:
  memory estimate:  19.29 MiB
  allocs estimate:  45
  --------------
  minimum time:     5.503 ms (0.00% GC)
  median time:      5.786 ms (0.00% GC)
  mean time:        6.026 ms (4.99% GC)
  maximum time:     8.327 ms (12.01% GC)
  --------------
  samples:          830
  evals/sample:     1
=#
```
It seems that `@benchmark` combined with PyCall causes a memory leak, so below are several manual `@time` runs with PyCall instead. Even so, the ~2x gap is clear.
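The repeated manual `@time` calls below can also be expressed as a small helper; a minimal sketch in plain Julia (the `min_time` name and the GC-before-each-run policy are my own choices, not from the original):

```julia
# Return the minimum elapsed time (seconds) of f() over n runs, forcing a
# garbage collection before each run so a pending collection triggered by a
# previous run is not charged to the call being measured.
function min_time(f, n::Int=5)
    best = Inf
    for _ in 1:n
        GC.gc()
        best = min(best, @elapsed f())
    end
    return best
end

# Usage with a cheap stand-in workload (the issue times conv(dummy) instead):
min_time(() -> sum(abs2, randn(Float32, 1000)))
```

Taking the minimum is the usual convention for this kind of micro-benchmark, since it is the run least perturbed by GC and OS noise.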
```julia
using PyCall
const torch = pyimport_conda("torch", "torch")
const nn = pyimport("torch.nn")

dummy_py = torch.randn(2, 3, 224, 224);
conv1_py = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=true)
conv1_py(dummy_py) # ignore compile time

@time conv1_py(dummy_py) # 0.002447 seconds (3 allocations: 144 bytes)
@time conv1_py(dummy_py) # 0.002318 seconds (3 allocations: 144 bytes)
@time conv1_py(dummy_py) # 0.002344 seconds (3 allocations: 144 bytes)
@time conv1_py(dummy_py) # 0.002278 seconds (3 allocations: 144 bytes)
@time conv1_py(dummy_py) # 0.002603 seconds (3 allocations: 144 bytes)

@time conv(dummy) # 0.015352 seconds (45 allocations: 19.287 MiB, 49.63% gc time)
@time conv(dummy) # 0.005321 seconds (45 allocations: 19.287 MiB)
@time conv(dummy) # 0.005424 seconds (45 allocations: 19.287 MiB)
@time conv(dummy) # 0.005461 seconds (45 allocations: 19.287 MiB)
@time conv(dummy) # 0.005627 seconds (45 allocations: 19.287 MiB)
```
Though the allocation numbers reported for the PyCall calls look fishy (Julia cannot see PyTorch's internal allocations), @ToucheSir's measurement in pure Python gives a similar result.