
conv_direct!(): The performance fix #142


Merged 3 commits on Nov 30, 2019
Conversation

staticfloat (Contributor)

Jesse brought some bad performance characteristics of NNlib.conv_direct!() to my attention; in particular, our performance on very small convolutions was abysmal. Looking into it, the fix turned out not to require a very big change: re-use some of the other infrastructure in this package (in particular, calc_padding_regions()) and eliminate some unnecessary allocations. On the kinds of tiny convolutions that NNlib.conv_direct!() should excel at, this yields some pretty drastic speedups. Given the following testing harness, we can do some quick comparisons:

using NNlib, BenchmarkTools, FastConv, ImageFiltering

x = randn(Float32, 28, 28, 1, 1)
w = randn(Float32, 3, 3, 1, 1)
cdims = DenseConvDims(x, w; padding=(size(w)[1:end-2] .- 1))
y = zeros(Float32, NNlib.output_size(cdims)..., size(w)[end], size(x)[end])

NNlib.conv_direct!(y, x, w, cdims);

# This actually gives the _old() variant a small advantage!
oy, ox, ow, ocdims = NNlib.insert_singleton_spatial_dimension.((y, x, w, cdims))
@info("Old NNlib.conv_direct!():")
@btime NNlib.conv_direct_old!(oy, ox, ow, ocdims);
@info("New NNlib.conv_direct!():")
@btime NNlib.conv_direct!(y, x, w, cdims);
@info("FastConv.convn!():")
@btime FastConv.convn!(y[:,:,1,1], x[:,:,1,1], w[:,:,1,1])
@info("ImageFiltering.imfilter():")
@btime imfilter(x[:,:], centered(w[:,:]));
@info("NNlib.conv_im2col!():")
@btime NNlib.conv_im2col!(y, x, w, cdims);
@info("NNlib.conv_nnpack!():")
@btime NNlib.conv_nnpack!(y, x, w, cdims);

Running this with the given small sizes above we get the following timings:

# (28x28 image, 3x3 kernel, single channel and batch)
[ Info: Old NNlib.conv_direct!():
  345.572 μs (4511 allocations: 403.13 KiB)
[ Info: New NNlib.conv_direct!():
  5.226 μs (22 allocations: 1.19 KiB)
[ Info: FastConv.convn!():
  8.838 μs (12 allocations: 7.38 KiB)
[ Info: ImageFiltering.imfilter():
  10.513 μs (43 allocations: 13.86 KiB)
[ Info: NNlib.conv_im2col!():
  17.233 μs (38 allocations: 34.09 KiB)
[ Info: NNlib.conv_nnpack!():
  11.939 μs (17 allocations: 1.02 KiB)

Of course, because the first three methods listed above are very naively scheduled, as soon as the image being operated upon grows larger than our L2 cache we fall way behind, and the schedule-savvy methods pull ahead. Unfortunately, virtually all of those schedule-savvy methods are designed to operate upon large batch and channel counts, which we can't simulate with FastConv and ImageFiltering, so you'll just have to mentally multiply the following timings appropriately:

# (200x200 image, 7x7 kernel, still 1 channel and 1 batch)
[ Info: Old NNlib.conv_direct!():
  19.220 ms (212191 allocations: 24.77 MiB)
[ Info: New NNlib.conv_direct!():
  1.169 ms (22 allocations: 1.19 KiB)
[ Info: FastConv.convn!():
  1.143 ms (14 allocations: 322.88 KiB)
[ Info: ImageFiltering.imfilter():
  1.073 ms (270 allocations: 2.12 MiB)

Now let's get realistic channels and batch sizes:

# (200x200 image, 7x7 kernel, 3 input channels, 32 output channels, 16 batch size)
[ Info: Old NNlib.conv_direct!():
  15.599 s (108636172 allocations: 19.76 GiB)
[ Info: New NNlib.conv_direct!():
  2.631 s (22 allocations: 1.19 KiB)
[ Info: NNlib.conv_im2col!():
  98.307 ms (113 allocations: 23.80 MiB)
[ Info: NNlib.conv_nnpack!():
  94.436 ms (19 allocations: 19.36 KiB)

As you can see, the im2col and nnpack methods are doing good work, and although the new direct method is much faster than the old one, it is not a serious competitor once you hit realistic tensor sizes. Additionally, the runtimes of the fast methods are growing at a much slower pace than those of the direct methods. Finally, this PR enables multithreading during a critical section of the im2col data-copying portion, so by setting JULIA_NUM_THREADS we can watch the im2col method start to beat even NNPACK:

# (200x200 image, 7x7 kernel, 3 input channels, 32 output channels, 16 batch size)
# JULIA_NUM_THREADS=2
[ Info: NNlib.conv_im2col!():
  69.073 ms (120 allocations: 23.80 MiB)
# JULIA_NUM_THREADS=4
[ Info: NNlib.conv_im2col!():
  61.430 ms (134 allocations: 23.81 MiB)
# JULIA_NUM_THREADS=8
[ Info: NNlib.conv_im2col!():
  58.184 ms (162 allocations: 23.81 MiB)

All these timings were done on my MBP (Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz), with Julia 1.3-rc4. If you want to be able to call NNlib.conv_direct_old!(), note that you need to roll back to just before the last commit on this branch.

…nvolutions

Our approach is two-fold: use `calc_padding_regions()` to give us a
fast path for the central part of a convolution, and eliminate
allocations.  We also move a little more information to compile time.
This was caught by the fuzzing tests.
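To illustrate the idea behind that fast path, here is a hypothetical, heavily simplified 1-D version (not the actual NNlib code, which handles N-D tensors, strides, and dilation): the output is split into a central region, where every kernel tap is guaranteed in-bounds so the hot loop carries no bounds checks, and edge regions that keep the checked slow path.

```julia
# Hypothetical 1-D sketch of the calc_padding_regions() idea.
# Computes a zero-padded cross-correlation: y[i] = Σₖ w[k] * x[i + k - 1 - pad]
function conv1d_direct!(y::AbstractVector, x::AbstractVector,
                        w::AbstractVector, pad::Int)
    N, K = length(x), length(w)
    @assert length(y) == N + 2pad - K + 1
    fill!(y, zero(eltype(y)))
    # Central region: x[i + k - 1 - pad] is in-bounds for every k in 1:K,
    # so the hot loop needs no branches at all.
    lo, hi = pad + 1, N + pad - K + 1
    hi < lo && (hi = lo - 1)  # degenerate case: no fully in-bounds region
    @inbounds for i in lo:hi, k in 1:K
        y[i] += w[k] * x[i + k - 1 - pad]
    end
    # Edge regions: keep the bounds-checked path, but only for the handful
    # of output pixels that actually overlap the padding.
    for i in Iterators.flatten((firstindex(y):lo-1, hi+1:lastindex(y))), k in 1:K
        j = i + k - 1 - pad
        1 <= j <= N && (y[i] += w[k] * x[j])
    end
    return y
end
```

Because the checked loop now runs over only O(pad) output elements instead of all of them, the per-pixel branch cost disappears from the common case, which is where the bulk of the speedup above comes from.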
codecov-io commented Nov 14, 2019

Codecov Report

Merging #142 into master will decrease coverage by 2.86%.
The diff coverage is 97.7%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #142      +/-   ##
==========================================
- Coverage   77.73%   74.86%   -2.87%     
==========================================
  Files          24       24              
  Lines         768      768              
==========================================
- Hits          597      575      -22     
- Misses        171      193      +22
Impacted Files Coverage Δ
src/impl/conv_im2col.jl 98.88% <100%> (-0.03%) ⬇️
src/dim_helpers.jl 85.71% <100%> (+0.52%) ⬆️
src/dim_helpers/DenseConvDims.jl 100% <100%> (ø) ⬆️
src/dim_helpers/DepthwiseConvDims.jl 100% <100%> (ø) ⬆️
src/impl/conv_direct.jl 85.18% <100%> (-14.82%) ⬇️
src/impl/padding_edges.jl 100% <100%> (ø) ⬆️
src/impl/depthwiseconv_direct.jl 95.91% <93.1%> (-4.09%) ⬇️
src/nnpack/error.jl 4.41% <0%> (-17.65%) ⬇️
src/dim_helpers/ConvDims.jl 79.06% <0%> (-0.48%) ⬇️
... and 4 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 30e61ef...3847a8c.
