New high level interface for cuDNN #523
Conversation
@maleadt I need some advice on https://github.com/denizyuret/Knet.jl/blob/2dec7d58a9d40899c5c38d6f4f17af463163409f/src/cudnn/dropout.jl#L66-L88 As far as I can understand from the cudnn docs, …
If you rebase this on …
I occasionally merge master. I don't know how/why to rebase.

I was able to do a successful …
lib/cudnn/nnlib.jl (outdated):

```julia
algo=CUDNN_SOFTMAX_FAST, mode=cudnnSoftmaxMode_t(dims-1))
return y
mode, xsize = softmaxhelper(size(x), dims)
cudnnSoftmaxForward!(reshape(y,xsize), reshape(x,xsize); algo=CUDNN_SOFTMAX_FAST, mode)
```
Does this always use fast mode? Would it be possible to respect the default math mode here?
It does right now, but I agree it should be consistent. I am not sure what the "default math mode" should be here: there is a `CUDNN.math_mode`, but that's about tensor optimizations.
`CUDA.math_mode` is a more generic property, which `CUDNN.math_mode` copies for handle creation. But the former should be used for other purposes:
CUDA.jl/lib/cublas/wrappers.jl, lines 757 to 763 in d16a4b3:

```julia
math_mode = CUDA.math_mode()
reduced_precision = CUDA.math_precision()
if sig === (Float16, Float16)
    # NOTE: Float16=Float16*Float16 can also happen in 32-bit compute
    return math_mode==CUDA.PEDANTIC_MATH ? CUBLAS_COMPUTE_16F_PEDANTIC : CUBLAS_COMPUTE_16F
end
```
I think we should have the current math mode as the default and have libraries request the fast paths where possible.
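A minimal sketch of what that could look like for the softmax wrapper, assuming we map the task-level math mode onto cuDNN's softmax algorithms (`choose_softmax_algo` is a hypothetical helper, not an existing API, and the mapping is an assumption, not what this PR implements):

```julia
# Pick a cuDNN softmax algorithm from the global math mode.
# CUDA.math_mode() and the math-mode enum values are existing CUDA.jl APIs.
function choose_softmax_algo()
    mm = CUDA.math_mode()
    if mm == CUDA.FAST_MATH
        CUDNN_SOFTMAX_FAST        # opt into the faster, less accurate path
    else
        CUDNN_SOFTMAX_ACCURATE    # default/pedantic: scale inputs to avoid overflow
    end
end
```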
Codecov Report

```
@@            Coverage Diff             @@
##           master     #523      +/-   ##
==========================================
+ Coverage   77.91%   79.79%   +1.88%
==========================================
  Files         118      122       +4
  Lines        7117     7329     +212
==========================================
+ Hits         5545     5848     +303
+ Misses       1572     1481      -91
```

Continue to review the full report at Codecov.
CI failure is likely a regression in Julia, fixed with LLVM#master.

Rebasing on master should fix that.

@maleadt I think this PR is ready to merge. We can meet up to take care of squash/rebase etc. if you'd like.
I've been playing with the PR for a couple of days now, and I was noticing some performance regressions training with several models, but resnet shows it fairly simply.

Current: …

This: …

This is with an input of …
We'll probably want to sync releases and compat accordingly.

@denizyuret what are your thoughts on using the RNN primitives from CUDNN vs naive Julia ones? Some frameworks opt not to use the ones in CUDNN because of poorer performance.

@DhairyaLGandhi, do you have any examples where CUDNN has poorer performance than an alternative implementation of RNN? In my experience a manually programmed RNN (e.g. https://github.com/denizyuret/Knet.jl/blob/master/tutorial/90.s2s.ipynb) is significantly slower than a CUDNN call (e.g. https://github.com/denizyuret/Knet.jl/blob/master/tutorial/60.rnn.ipynb). This is expected because CUDNN parallelizes certain operations (e.g. the matmul of the whole input sequence rather than of individual time steps) that cannot be done with a single-time-step implementation.
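For intuition, here is a rough sketch (hypothetical helpers, not Knet or CUDA.jl API) of the batching cuDNN exploits: the input projection for all time steps is a single large GEMM, whereas a step-by-step RNN issues one small GEMM per step; only the recurrent part is inherently sequential.

```julia
# One GEMM over the whole sequence: X is (features, batch*steps).
input_projection_fused(Wi, X) = Wi * X

# One GEMM (and one kernel launch) per time step: xs is a vector of
# (features, batch) slices. The recurrent matmul forces this serialization,
# but the input projection does not.
input_projection_stepwise(Wi, xs) = [Wi * x for x in xs]
```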
```julia
# Allocate the maximum reasonable amount of memory for algorithm discovery
function cudnnFindConvolutionAlgorithmWorkspaceSize(x)
    gpufree = Mem.info()[1] + CUDA.cached_memory()
    min(gpufree ÷ 10, sizeof(x) * 100)
end
```
100 seems a bit aggressive; it might mean we start going to slower algos as we get more memory constrained, when tensors should have been freed. Maybe @maleadt can clarify if my thinking is incorrect here.
`Mem.available_memory()` instead of `Mem.info()[1]`.

100x seems aggressive indeed; is there any guide in the documentation, or how did you come up with that?
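For concreteness, a sketch of the function with the suggested substitution applied (the 10x and 100x constants are the PR's own heuristics; whether 100x is right is exactly the open question above):

```julia
# Allocate the maximum reasonable amount of memory for algorithm discovery,
# using the suggested Mem.available_memory() instead of Mem.info()[1].
function cudnnFindConvolutionAlgorithmWorkspaceSize(x)
    gpufree = Mem.available_memory() + CUDA.cached_memory()
    min(gpufree ÷ 10, sizeof(x) * 100)
end
```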
Still noticing some slowness:

```julia
# Release
julia> benchmark_bw_cu(64)
BenchmarkTools.Trial:
  memory estimate:  2.23 MiB
  allocs estimate:  48336
  --------------
  minimum time:     552.016 ms (1.72% GC)
  median time:      601.561 ms (1.44% GC)
  mean time:        598.092 ms (1.45% GC)
  maximum time:     637.229 ms (1.47% GC)
  --------------
  samples:          4
  evals/sample:     1

# This PR
julia> benchmark_bw_cu(64)
BenchmarkTools.Trial:
  memory estimate:  3.58 MiB
  allocs estimate:  57069
  --------------
  minimum time:     730.687 ms (2.78% GC)
  median time:      753.419 ms (5.19% GC)
  mean time:        809.694 ms (10.59% GC)
  maximum time:     944.977 ms (20.92% GC)
  --------------
  samples:          3
  evals/sample:     1
```

And I am OOMing at higher batch sizes, but I haven't tried TF to see how they do with a 32 GB V100.
There are some conflicts because of #672.

I will take a look. @DhairyaLGandhi, when can we merge this to avoid future conflicts?

Let's go ahead and merge this once the conflicts are resolved.

Yes, we were discussing it for a couple of days. We will need to change some nnlib APIs and get 3D recurrent layers with cudnn, and look into some regressions, but this is the base to grow from.
@DhairyaLGandhi @maleadt, I checked the conflict with the #672 merge. This is practically impossible to merge with #523 -- it would be easier for me to rewrite whatever functionality was intended. I spent a month on #523 redesigning all of lib/cudnn and test/cudnn. #672 used the master versions, e.g. the test file it is using has been moved, renamed and redesigned, and nnlib.jl has been redesigned. If all that is needed is adding a …
That should be fine; we could also revert #672 and rebase the functionality on top of this. Whatever would be easier, I think.
I can add alpha/beta keyword args in CUDA/lib/cudnn/nnlib.jl in this PR. I assume these have the same semantics as cuDNN and the defaults are alpha=1, beta=0. @DhairyaLGandhi you mentioned that these were supported in NNlib.jl, but I did not see them in master; am I looking at the wrong branch, or did I misunderstand what #672 is trying to do?
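For reference, cuDNN's alpha/beta convention blends the op result with the existing contents of the output, so alpha=1, beta=0 reduces to a plain overwrite. In pseudocode (`conv_result` stands for whatever the op computes, not an actual call):

```julia
y .= alpha .* conv_result .+ beta .* y   # beta=0: overwrite; beta!=0: blend into y
```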
Yes, these are implemented here: https://github.com/FluxML/NNlib.jl/blob/d9aaaf7ac4df683f1d3361efb3cbaef9aa8112a1/src/impl/conv_im2col.jl#L25, and the defaults are consistent with cuDNN.
@DhairyaLGandhi I added explicit alpha/beta kwargs to all NNlib.conv related functions in this branch (lib/cudnn/nnlib.jl), along with associated tests (test/cudnn/nnlib.jl). All tests pass except for …

At this point we need to decide the semantics of alpha/beta in the backward functions: …

If you choose (2), and I remember correctly: …

Right now NNlib seems inconsistent in this regard: backward-filter adds when beta is non-zero, backward-data doesn't. I also noticed that conv_bias_act! does not use alpha/beta consistently in nnlib. In any case, these are NNlib issues that can be addressed later. Can we revert #672 and merge this PR?
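The elided options presumably contrast overwrite versus accumulate semantics for the gradient buffers; sketched in pseudocode (`dconv` stands for the freshly computed gradient):

```julia
dx .= alpha .* dconv                # overwrite: ignore existing contents of dx
dx .= alpha .* dconv .+ beta .* dx  # accumulate: cuDNN-style alpha/beta blending
```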
I'll revert/rebase/push; we can merge this once it passes a final round of CI.
Yeah, I'm thinking that … I think …
@denizyuret Why did you push a merge commit? This was ready to go.

@maleadt sorry about that, it was an automatic reflex (open laptop, git pull, merge if new things). How can I fix it?
I'll drop the merge commit. |
Force-pushed from 15565f0 to 0b57f52.
New high level interface for cuDNN
Please see README.md for some design decisions, comments welcome. I tried not to break any existing code by defining deprecated methods. Here is a todo list.
- `cudnnDataType()`.
- Replace `c_null(x)`, `cu_null(x)` -> `something(x, C/CU_NULL)`; `cudnnWorkspace` -> `@workspace`/`cudnnTempSpace`; `scalr` -> `scalingParameter`; `@retry` -> `@retry_reclaim`; `DevArray` -> `CuArray`.
- `@cudnnDescriptor` macro.
- tensor.jl (old): defines TensorDesc, add, op, reduce wrappers. These are defined in separate files in my implementation.
- filter.jl: merged with tensor.jl.
- conv.jl -> convolution.jl: including conv-bias-act; removed NNlib dependency. Tested.
- batchnorm.jl -> normalization.jl: tested. TODO: rethink arg order (adopted the one from tf.nn). batchnorm needs to be deprecated and implemented in terms of normalization.jl: leaving this to nnlib.
- TODO: profile and pick a faster version of softmax.
- compat.jl: tested, merged into nnlib.jl.
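A hypothetical usage sketch of the new high-level interface, based only on the `cudnnSoftmaxForward!` signature that appears in the review above (the handling of non-4D inputs is an assumption):

```julia
using CUDA, CUDA.CUDNN

x = CUDA.randn(Float32, 10, 32)   # (classes, batch)
y = similar(x)
# algo and mode are standard cuDNN enums; defaults in the PR may differ.
cudnnSoftmaxForward!(y, x; algo=CUDNN_SOFTMAX_ACCURATE,
                     mode=CUDNN_SOFTMAX_MODE_INSTANCE)
```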