Conv is 2x slower than PyTorch Conv on CPU #234

Open
@yiyuezhuo

Description

As @ToucheSir suggested, I'm opening this issue here in addition to the Discourse post. I have also switched to MKL.jl, as was suggested there.
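One variable that may be worth checking up front (my assumption, not something measured below): the thread count on each side. As far as I know NNlib's im2col conv path is multithreaded with Julia tasks, while PyTorch uses its own intra-op thread pool, so a mismatch here could account for a constant-factor gap.

using LinearAlgebra: BLAS
Threads.nthreads()       # Julia threads; set at startup, e.g. `julia -t auto`
BLAS.get_num_threads()   # threads MKL uses for the gemm inside im2col conv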

Example:

using LinearAlgebra: BLAS
BLAS.vendor()  # :mkl

using BenchmarkTools
using Flux

conv = Conv((7,7), 3 => 64; stride=2, pad=3)
dummy = randn(Float32, 224, 224, 3, 2);
conv(dummy)  # ignore compile time
@benchmark conv(dummy) 

#=
BenchmarkTools.Trial: 
  memory estimate:  19.29 MiB
  allocs estimate:  45
  --------------
  minimum time:     5.384 ms (0.00% GC)
  median time:      5.825 ms (0.00% GC)
  mean time:        6.245 ms (5.38% GC)
  maximum time:     7.804 ms (17.60% GC)
  --------------
  samples:          800
  evals/sample:     1
=#

CrossCor (which skips Conv's kernel flip) performs essentially the same:

cor = Flux.CrossCor((7,7), 3 => 64; stride=2, pad=3)
dummy = randn(Float32, 224, 224, 3, 2);
cor(dummy)  # ignore compile time
@benchmark cor(dummy) 
#=
BenchmarkTools.Trial: 
  memory estimate:  19.29 MiB
  allocs estimate:  51
  --------------
  minimum time:     5.518 ms (0.00% GC)
  median time:      5.806 ms (0.00% GC)
  mean time:        6.042 ms (4.94% GC)
  maximum time:     8.333 ms (13.14% GC)
  --------------
  samples:          827
  evals/sample:     1
=#

Interpolating the global into @benchmark gives essentially the same timings:

@benchmark conv($dummy)
#=
BenchmarkTools.Trial: 
  memory estimate:  19.29 MiB
  allocs estimate:  45
  --------------
  minimum time:     5.503 ms (0.00% GC)
  median time:      5.786 ms (0.00% GC)
  mean time:        6.026 ms (4.99% GC)
  maximum time:     8.327 ms (12.01% GC)
  --------------
  samples:          830
  evals/sample:     1
=#
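To separate Flux's layer overhead from the convolution kernel itself, one could also benchmark NNlib directly. A minimal sketch, assuming the layer stores its kernel in the weight field and NNlib's keyword API:

using NNlib
w = conv.weight   # 7×7×3×64 Array{Float32,4}
@benchmark NNlib.conv($dummy, $w; stride=2, pad=3)

If this shows the same ~5.5 ms, the time is being spent in NNlib's im2col + gemm path rather than in Flux.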

It seems that @benchmark combined with PyCall causes a memory leak, so here are several manual @time runs for the PyCall version instead; the 2x gap is still clear.

using PyCall
const torch = pyimport_conda("torch", "torch")
const nn = pyimport("torch.nn")
dummy_py = torch.randn(2, 3, 224, 224);
conv1_py = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=true)
conv1_py(dummy_py)

@time conv1_py(dummy_py) # 0.002447 seconds (3 allocations: 144 bytes)
@time conv1_py(dummy_py) # 0.002318 seconds (3 allocations: 144 bytes)
@time conv1_py(dummy_py) # 0.002344 seconds (3 allocations: 144 bytes)
@time conv1_py(dummy_py) # 0.002278 seconds (3 allocations: 144 bytes)
@time conv1_py(dummy_py) # 0.002603 seconds (3 allocations: 144 bytes)

@time conv(dummy) # 0.015352 seconds (45 allocations: 19.287 MiB, 49.63% gc time)
@time conv(dummy) # 0.005321 seconds (45 allocations: 19.287 MiB)
@time conv(dummy) # 0.005424 seconds (45 allocations: 19.287 MiB)
@time conv(dummy) # 0.005461 seconds (45 allocations: 19.287 MiB)
@time conv(dummy) # 0.005627 seconds (45 allocations: 19.287 MiB)
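One more thing that might be worth controlling for (again an assumption, not something tested here): PyTorch manages its own intra-op thread pool, so pinning both sides to the same thread count would make the numbers more directly comparable.

torch.get_num_threads()   # PyTorch's intra-op thread count
torch.set_num_threads(1)  # pin PyTorch to a single thread
BLAS.set_num_threads(1)   # pin MKL on the Julia side to match
# ...then repeat the two @time blocks above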

The memory numbers reported for PyCall look fishy (only 144 bytes per call, presumably because the real allocations happen on the Python side where @time cannot see them), but @ToucheSir's measurement in pure Python gives a similar result.
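The 19 MiB allocated per Flux call also suggests checking how much of the gap is allocation/GC pressure. A minimal sketch, assuming NNlib's DenseConvDims/conv! API, that reuses a preallocated output buffer (note that the internal im2col scratch buffer is still allocated on each call):

using NNlib
w = conv.weight
cdims = DenseConvDims(dummy, w; stride=2, padding=3)
y = similar(dummy, NNlib.output_size(cdims)..., NNlib.channels_out(cdims), size(dummy, 4))
@benchmark NNlib.conv!($y, $dummy, $w, $cdims)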
