conv_direct!(): The performance fix #142
Jesse brought some bad performance characteristics of `NNlib.conv_direct!()` to my attention; in particular, our performance on very small convolutions was abysmal. Looking into it, it turned out that not much of a change was necessary: just reuse some of the other infrastructure in this package (in particular, `calc_padding_regions()`) and eliminate some unnecessary allocations.
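For intuition, `calc_padding_regions()` computes which portions of the output are affected by padding, so the hot loop over the central region can skip boundary handling entirely. Here is a simplified 1-D sketch of that idea (not the actual NNlib code, which also handles strides, dilation, and 3-D tensors):

```julia
# With stride 1 and padding `pad`, output index `i` reads the input window
# (i - pad):(i - pad + k - 1). The "central region" is exactly the set of
# output indices whose window stays inside 1:input_len, so convolving there
# needs no bounds checks or zero-fill logic.
function central_region_1d(input_len, k, pad)
    out_len = input_len + 2pad - k + 1
    lo = pad + 1                   # first output index clear of the left padding
    hi = input_len + pad - k + 1   # last output index clear of the right padding
    return lo:hi, out_len
end

central_region_1d(8, 3, 1)  # (2:7, 8): only output indices 1 and 8 touch padding
```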
When doing this on the kinds of tiny convolutions that `NNlib.conv_direct!()` should excel at, we see some pretty drastic speedups. Given the following testing harness, we can do some quick comparisons:
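As a stand-in for that harness, here is a minimal sketch along the same lines. It assumes `BenchmarkTools` and calls NNlib's unexported `conv_direct!`/`conv_im2col!` backends directly; the sizes are placeholders, not necessarily the ones used for the PR's numbers:

```julia
using NNlib, BenchmarkTools

# Tiny workload: the regime conv_direct!() should excel at.
x = randn(Float32, 8, 8, 1, 1)   # 8×8 image, 1 channel, batch of 1
w = randn(Float32, 3, 3, 1, 1)   # 3×3 kernel, 1 → 1 channels
cdims = DenseConvDims(x, w)
y = similar(x, NNlib.output_size(cdims)..., NNlib.channels_out(cdims), size(x, 4))

@btime NNlib.conv_direct!($y, $x, $w, $cdims)   # the method this PR reworks
@btime NNlib.conv_im2col!($y, $x, $w, $cdims)   # the im2col path, for reference
```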
Running this with the small sizes given above, the reworked direct method comes out dramatically ahead. Of course, because the first three methods listed above are very naively scheduled, as soon as the image being operated upon grows larger than our L2 cache we fall way behind, and the schedule-savvy methods pull ahead. Unfortunately, virtually all of these kinds of methods are designed to operate upon large batches and channel counts; because we can't simulate that with `FastConv` and `ImageFiltering`, you're just going to have to mentally multiply their timings appropriately. Now let's get realistic channels and batch sizes:
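The exact shapes meant by "realistic" aren't spelled out above, so take the following as my own stand-ins (e.g. 32 channels and a batch of 32), in the same sketch style as before:

```julia
# Hypothetical "realistic" shapes; the PR's exact sizes may differ.
x = randn(Float32, 64, 64, 32, 32)   # 64×64 image, 32 channels, batch of 32
w = randn(Float32, 3, 3, 32, 64)     # 3×3 kernel, 32 → 64 channels
cdims = DenseConvDims(x, w)
y = similar(x, NNlib.output_size(cdims)..., NNlib.channels_out(cdims), size(x, 4))

@btime NNlib.conv_direct!($y, $x, $w, $cdims)
@btime NNlib.conv_im2col!($y, $x, $w, $cdims)
```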
At these sizes, the im2col and NNPACK methods are doing good work; although the new direct method is much faster than the old one, it is not a serious competitor once you hit realistic tensor sizes. Additionally, the fast methods grow their runtime at a much slower pace than the direct methods do. Finally, this PR enables multithreading during a critical section of the `im2col()` data-copying portion, so by setting `JULIA_NUM_THREADS` we can watch the `im2col` method start to beat even NNPACK.
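For anyone reproducing that: `JULIA_NUM_THREADS` has to be set before Julia starts, and a quick sanity check from inside a session looks like this:

```julia
# JULIA_NUM_THREADS must be set in the environment before launching Julia,
# e.g. `JULIA_NUM_THREADS=4 julia`; it cannot be raised from within a session.
using Base.Threads
@show nthreads()   # how many threads the im2col copy can spread across
```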
All these timings were done on my MBP (`Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz`) with Julia 1.3-rc4. If you want to be able to call `NNlib.conv_direct_old!()`, note that you need to roll back to just before the last commit on this branch (e.g. by checking out `HEAD^`).