conv_direct!(): The performance fix #142
Jesse brought some bad performance characteristics of `NNlib.conv_direct!()` to my attention; in particular, our performance on very small convolutions was abysmal. Looking into it, it turned out that not much of a change was necessary: just reuse some of the other infrastructure in this package (in particular, `calc_padding_regions()`) and eliminate some unnecessary allocations.
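For intuition, `calc_padding_regions()` computes which portions of the output are affected by padding, so the hot loop over the central region can skip boundary handling entirely. Here is a simplified 1-D sketch of that idea (not the actual NNlib code, which also handles strides, dilation, and 3-D tensors):

```julia
# With stride 1 and padding `pad`, output index `i` reads the input window
# (i - pad):(i - pad + k - 1). The "central region" is exactly the set of
# output indices whose window stays inside 1:input_len, so convolving there
# needs no bounds checks or zero-fill logic.
function central_region_1d(input_len, k, pad)
    out_len = input_len + 2pad - k + 1
    lo = pad + 1                   # first output index clear of the left padding
    hi = input_len + pad - k + 1   # last output index clear of the right padding
    return lo:hi, out_len
end

central_region_1d(8, 3, 1)  # (2:7, 8): only output indices 1 and 8 touch padding
```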
When doing this on the kinds of tiny convolutions that `NNlib.conv_direct!()` should excel at, we see some pretty drastic speedups. Given the following testing harness, we can do some quick comparisons:
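As a stand-in for that harness, here is a minimal sketch along the same lines. It assumes `BenchmarkTools` and calls NNlib's unexported `conv_direct!`/`conv_im2col!` backends directly; the sizes are placeholders, not necessarily the ones used for the PR's numbers:

```julia
using NNlib, BenchmarkTools

# Tiny workload: the regime conv_direct!() should excel at.
x = randn(Float32, 8, 8, 1, 1)   # 8×8 image, 1 channel, batch of 1
w = randn(Float32, 3, 3, 1, 1)   # 3×3 kernel, 1 → 1 channels
cdims = DenseConvDims(x, w)
y = similar(x, NNlib.output_size(cdims)..., NNlib.channels_out(cdims), size(x, 4))

@btime NNlib.conv_direct!($y, $x, $w, $cdims)   # the method this PR reworks
@btime NNlib.conv_im2col!($y, $x, $w, $cdims)   # the im2col path, for reference
```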
Running this with the small sizes given above, the reworked direct method comes out dramatically ahead. Of course, because the first three methods listed above are very naively scheduled, as soon as the image being operated upon grows larger than our L2 cache we fall way behind, and the schedule-savvy methods pull ahead. Unfortunately, virtually all of these kinds of methods are designed to operate upon large batches and channel counts; because we can't simulate that with `FastConv` and `ImageFiltering`, you're just going to have to mentally multiply their timings appropriately. Now let's get realistic channels and batch sizes:
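The exact shapes meant by "realistic" aren't spelled out above, so take the following as my own stand-ins (e.g. 32 channels and a batch of 32), in the same sketch style as before:

```julia
# Hypothetical "realistic" shapes; the PR's exact sizes may differ.
x = randn(Float32, 64, 64, 32, 32)   # 64×64 image, 32 channels, batch of 32
w = randn(Float32, 3, 3, 32, 64)     # 3×3 kernel, 32 → 64 channels
cdims = DenseConvDims(x, w)
y = similar(x, NNlib.output_size(cdims)..., NNlib.channels_out(cdims), size(x, 4))

@btime NNlib.conv_direct!($y, $x, $w, $cdims)
@btime NNlib.conv_im2col!($y, $x, $w, $cdims)
```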
At these sizes, the im2col and NNPACK methods are doing good work; although the new direct method is much faster than the old one, it is not a serious competitor once you hit realistic tensor sizes. Additionally, the fast methods grow their runtime at a much slower pace than the direct methods do. Finally, this PR enables multithreading during a critical section of the `im2col()` data-copying portion, so by setting `JULIA_NUM_THREADS` we can watch the `im2col` method start to beat even NNPACK.
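For anyone reproducing that: `JULIA_NUM_THREADS` has to be set before Julia starts, and a quick sanity check from inside a session looks like this:

```julia
# JULIA_NUM_THREADS must be set in the environment before launching Julia,
# e.g. `JULIA_NUM_THREADS=4 julia`; it cannot be raised from within a session.
using Base.Threads
@show nthreads()   # how many threads the im2col copy can spread across
```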
All these timings were done on my MBP (`Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz`) with Julia 1.3-rc4. If you want to be able to call `NNlib.conv_direct_old!()`, note that you need to roll back to just before the last commit on this branch (e.g. by checking out `HEAD^`).