New high level interface for cuDNN #523


Merged: 3 commits merged into JuliaGPU:master on Feb 2, 2021

Conversation

@denizyuret (Contributor) commented Nov 6, 2020

Please see README.md for some design decisions; comments welcome. I tried not to break any existing code by defining deprecated methods. Here is a todo list.

  • README: Describes design choices
  • util.jl: Completed cudnnDataType(). Replace c_null(x), cu_null(x) -> something(x, C/CU_NULL), cudnnWorkspace -> @workspace/cudnnTempSpace, scalr -> scalingParameter, @retry -> @retry_reclaim, DevArray -> CuArray.
  • base.jl
  • descriptors.jl: defines all descriptors consistently using @cudnnDescriptor macro.
  • tensor.jl (old): defines TensorDesc, add, op, reduce wrappers. These are defined in separate files in my implementation.
  • filter.jl: merged with tensor.jl
  • tensor.jl: alternative array based constructors for Tensor and Filter. Define compat function with old TensorDesc constructor. tested
  • inplace.jl: set, scale, add. tested
  • optensor.jl: tested
  • reduce.jl: tested
  • activation.jl: tested
  • softmax.jl: tested
  • pooling.jl: removed NNlib dependency. tested
  • dropout.jl: tested
  • conv.jl -> convolution.jl: including conv-bias-act: removed NNlib dependency. tested
  • rnn.jl: new interface done. The old interface is complicated and incomplete, so I am skipping deprecated methods for this one. tested
  • multiheadattn.jl: tested
  • batchnorm.jl -> normalization.jl: tested. TODO: rethink arg order (adopted the one from tf.nn); batchnorm needs to be deprecated and implemented in terms of normalization.jl: leaving this to nnlib.
  • nnlib.jl: tested. Should rename test file test/cudnn.jl -> test/cudnn/nnlib.jl. TODO: profile and pick the faster version of softmax.
  • compat.jl -> tested merged into nnlib.jl
  • CUDNN.jl
  • docs
  • tests: julia17 --project -e 'using Pkg; Pkg.API.test(; test_args=`--memcheck --jobs=1 cudnn`)'
  • deprecations
  • profiling

@maleadt marked this pull request as draft November 6, 2020 13:09
@maleadt added the cuda libraries (Stuff about CUDA library wrappers.) and enhancement (New feature or request) labels Nov 6, 2020
@denizyuret self-assigned this Nov 7, 2020
@denizyuret (Contributor Author)

@maleadt I need some advice on https://github.com/denizyuret/Knet.jl/blob/2dec7d58a9d40899c5c38d6f4f17af463163409f/src/cudnn/dropout.jl#L66-L88

As far as I can understand from cudnn docs, cudnnDropoutState should be similar to Random.GLOBAL_RNG. There is this warning about concurrently running dropouts using the same state. I don't know if this concerns threads on the CPU or streams on the GPU or both. How should we handle this? Did you have a similar issue with CURAND?
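For illustration only, one generic way to avoid sharing a single dropout state across concurrent callers is to keep one state per Julia task, similar to how library handles are usually scoped. This is a sketch under that assumption, not what the wrapper does; `make_dropout_state` is a hypothetical placeholder for however the state actually gets allocated:

```julia
# Sketch: cache one dropout state per task so concurrent tasks never share one.
# `make_dropout_state` is a hypothetical placeholder, not an existing function.
const DROPOUT_STATES = IdDict{Task,Any}()

make_dropout_state() = nothing  # placeholder for the real state allocation

function task_local_dropout_state()
    get!(DROPOUT_STATES, current_task()) do
        make_dropout_state()
    end
end
```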

@maleadt (Member) commented Nov 10, 2020

If you rebase this on master, CI will work.

@denizyuret (Contributor Author)

> If you rebase this on master, CI will work.

I occasionally merge master. I don't know how/why to rebase.

@maleadt (Member) commented Nov 10, 2020

> I occasionally merge master. I don't know how/why to rebase.

`git rebase master` (after pulling it, else `git fetch && git rebase origin/master`). Better to do so, to avoid ugly merge commits.

@denizyuret (Contributor Author)

> I occasionally merge master. I don't know how/why to rebase.

> `git rebase master` (after pulling it, else `git fetch && git rebase origin/master`). Better to do so, to avoid ugly merge commits.

I was able to do a successful `git rebase master` once, but now I get scary conflict messages when I try. We'll clean up the history when it is time to review.

Excerpt from the diff under review (softmax wrapper):

    algo=CUDNN_SOFTMAX_FAST, mode=cudnnSoftmaxMode_t(dims-1))
    return y
    mode, xsize = softmaxhelper(size(x), dims)
    cudnnSoftmaxForward!(reshape(y,xsize), reshape(x,xsize); algo=CUDNN_SOFTMAX_FAST, mode)
@simeonschaub Nov 18, 2020

Does this always use fast mode? Would it be possible to respect the default math mode here?

@denizyuret (Contributor Author)

It does right now, but I agree it should be consistent. I am not sure what the "default math mode" should be here: there is a CUDNN.math_mode, but that's about tensor optimizations.

Member

CUDA.math_mode is a more generic property, which CUDNN.math_mode copies for handle creation. But the former should be used for other purposes:

    math_mode = CUDA.math_mode()
    reduced_precision = CUDA.math_precision()
    if sig === (Float16, Float16)
        # NOTE: Float16=Float16*Float16 can also happen in 32-bit compute
        return math_mode==CUDA.PEDANTIC_MATH ? CUBLAS_COMPUTE_16F_PEDANTIC : CUBLAS_COMPUTE_16F
    end

Member

I think we should have the current math mode as the default and have libraries request the fast paths where possible.
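As a rough illustration of that suggestion (not code from this PR; `default_softmax_algo` is a made-up name, and CUDNN_SOFTMAX_ACCURATE is cuDNN's accurate-mode enum), the softmax wrapper could derive its default algorithm from the global math mode:

```julia
using CUDA
using CUDA.CUDNN: CUDNN_SOFTMAX_ACCURATE, CUDNN_SOFTMAX_FAST

# Sketch: pick the default softmax algorithm from CUDA.math_mode().
# `default_softmax_algo` is an illustrative name, not part of this PR.
function default_softmax_algo()
    if CUDA.math_mode() == CUDA.PEDANTIC_MATH
        return CUDNN_SOFTMAX_ACCURATE   # numerically safer path
    else
        return CUDNN_SOFTMAX_FAST       # the fast path requested above
    end
end
```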

@codecov (bot) commented Nov 20, 2020

Codecov Report

Merging #523 (3d016ba) into master (df9b4df) will increase coverage by 1.88%.
The diff coverage is 73.72%.


@@            Coverage Diff             @@
##           master     #523      +/-   ##
==========================================
+ Coverage   77.91%   79.79%   +1.88%     
==========================================
  Files         118      122       +4     
  Lines        7117     7329     +212     
==========================================
+ Hits         5545     5848     +303     
+ Misses       1572     1481      -91     
Impacted Files            Coverage Δ
lib/cudnn/CUDNN.jl        66.03% <ø> (ø)
lib/cudnn/tensor.jl       34.48% <34.48%> (-22.67%) ⬇️
lib/cudnn/batchnorm.jl    35.29% <42.85%> (ø)
lib/cudnn/util.jl         44.44% <50.00%> (-8.50%) ⬇️
lib/cudnn/reduce.jl       59.25% <59.25%> (ø)
lib/cudnn/pooling.jl      63.33% <62.06%> (-32.32%) ⬇️
lib/cudnn/softmax.jl      69.23% <66.66%> (-30.77%) ⬇️
lib/cudnn/activation.jl   71.42% <71.42%> (-22.33%) ⬇️
lib/cudnn/rnn.jl          71.92% <71.92%> (+71.92%) ⬆️
lib/cudnn/nnlib.jl        72.94% <72.22%> (+3.51%) ⬆️
... and 19 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update df9b4df...0b57f52.

@maleadt (Member) commented Dec 9, 2020

CI failure is likely a regression in Julia fixed with LLVM#master.

@maleadt (Member) commented Dec 9, 2020

Rebasing on master should fix that.

@denizyuret (Contributor Author)

@maleadt I think this PR is ready to merge. We can meet up to take care of squash/rebase etc. if you'd like.

@maleadt marked this pull request as ready for review December 17, 2020 14:51
@maleadt changed the title dy/cudnn: New high level interface for cuDNN [WIP] New high level interface for cuDNN Dec 17, 2020
@DhairyaLGandhi (Member) commented Dec 23, 2020

I've been playing with the PR for a couple of days now and noticed some performance regressions when training several models; ResNet shows it fairly simply.

Current:

BenchmarkTools.Trial:
  memory estimate:  676.37 KiB
  allocs estimate:  22051
  --------------
  minimum time:     413.535 ms (0.99% GC)
  median time:      496.596 ms (0.82% GC)
  mean time:        571.457 ms (0.78% GC)
  maximum time:     952.702 ms (0.63% GC)
  --------------
  samples:          5
  evals/sample:     1

This:

BenchmarkTools.Trial:
  memory estimate:  348.56 KiB
  allocs estimate:  9584
  --------------
  minimum time:     619.104 ms (0.00% GC)
  median time:      685.281 ms (0.00% GC)
  mean time:        868.306 ms (0.00% GC)
  maximum time:     1.484 s (0.00% GC)
  --------------
  samples:          4
  evals/sample:     1

This is with an input of (224,224,3,128) using ResNet from Metalhead.jl. The gap seems to grow with batch size, although in some cuDNN-bound cases I see smaller batches outperform the current implementation 🎉

@DhairyaLGandhi (Member)

We'll probably want to sync releases and compat accordingly.

@DhairyaLGandhi (Member)

@denizyuret, what are your thoughts on using the RNN primitives from cuDNN vs. naive Julia ones? Some frameworks opt not to use the cuDNN ones because of poorer performance.

@denizyuret (Contributor Author)

> @denizyuret, what are your thoughts on using the RNN primitives from cuDNN vs. naive Julia ones? Some frameworks opt not to use the cuDNN ones because of poorer performance.

@DhairyaLGandhi, do you have any examples where CUDNN has poorer performance than an alternative implementation of RNN? In my experience a manually programmed RNN (e.g. https://github.com/denizyuret/Knet.jl/blob/master/tutorial/90.s2s.ipynb) is significantly slower than a CUDNN call (e.g. https://github.com/denizyuret/Knet.jl/blob/master/tutorial/60.rnn.ipynb).

This is expected because CUDNN parallelizes certain operations (e.g. matmul of the whole input sequence rather than individual time steps) that cannot be done with a single time-step implementation.
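To make the parallelization point concrete, here is an illustrative sketch (not code from this PR) contrasting a per-timestep input projection with a single GEMM over the flattened sequence, which is the kind of fusion cuDNN's RNN kernels exploit; the sizes are arbitrary:

```julia
using CUDA

W = CUDA.rand(Float32, 512, 512)        # input projection weights
x = CUDA.rand(Float32, 512, 32, 100)    # features × batch × time

# Naive: one small GEMM per time step
per_step = [W * x[:, :, t] for t in 1:size(x, 3)]

# Fused: one large GEMM over the flattened (batch*time) dimension
fused = reshape(W * reshape(x, 512, :), 512, 32, 100)

# Both compute the same projections; the fused form keeps the GPU busier.
@assert isapprox(Array(per_step[1]), Array(fused[:, :, 1]))
```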

# Allocate the maximum reasonable amount of memory for algorithm discovery
function cudnnFindConvolutionAlgorithmWorkspaceSize(x)
    gpufree = Mem.info()[1] + CUDA.cached_memory()
    min(gpufree ÷ 10, sizeof(x) * 100)
end
Member

100× seems a bit aggressive; it might mean we start falling back to slower algorithms as we get more memory constrained, even when tensors should have been freed. Maybe @maleadt can clarify if my thinking is incorrect here.

Member

Mem.available_memory() instead of Mem.info()[1].

100x seems aggressive indeed; is there any guide in the documentation, or how did you come up with that?
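For reference, applying that substitution the helper would read roughly as follows; this is only a sketch of the suggestion above, and the ÷ 10 and × 100 caps are the heuristics being questioned, not settled values:

```julia
using CUDA
using CUDA: Mem

# Sketch only: Mem.available_memory() is the accessor suggested in the review;
# the 10x / 100x caps are the heuristics under discussion.
function cudnnFindConvolutionAlgorithmWorkspaceSize(x)
    gpufree = Mem.available_memory() + CUDA.cached_memory()
    min(gpufree ÷ 10, sizeof(x) * 100)
end
```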

@DhairyaLGandhi (Member)

Still noticing some slowness:

# Release
julia> benchmark_bw_cu(64)
BenchmarkTools.Trial:
  memory estimate:  2.23 MiB
  allocs estimate:  48336
  --------------
  minimum time:     552.016 ms (1.72% GC)
  median time:      601.561 ms (1.44% GC)
  mean time:        598.092 ms (1.45% GC)
  maximum time:     637.229 ms (1.47% GC)
  --------------
  samples:          4
  evals/sample:     1

# This PR
julia> benchmark_bw_cu(64)
BenchmarkTools.Trial:
  memory estimate:  3.58 MiB
  allocs estimate:  57069
  --------------
  minimum time:     730.687 ms (2.78% GC)
  median time:      753.419 ms (5.19% GC)
  mean time:        809.694 ms (10.59% GC)
  maximum time:     944.977 ms (20.92% GC)
  --------------
  samples:          3
  evals/sample:     1

And I am OOMing at higher batch sizes, but I haven't tried TF to see how it does with a 32 GB V100.

@maleadt (Member) commented Jan 27, 2021

There's some conflicts because of #672.

@denizyuret (Contributor Author)

I will take a look. @DhairyaLGandhi when can we merge this to avoid future conflicts?

@maleadt (Member) commented Jan 28, 2021

Let's go ahead and merge this once the conflicts are resolved.

@DhairyaLGandhi (Member)

Yes, we were discussing it for a couple of days. We will need to change some NNlib APIs, get 3D recurrent layers with cuDNN, and look into some regressions, but this is the base to grow from.

@denizyuret (Contributor Author)

@DhairyaLGandhi @maleadt, I checked the conflict with the #672 merge. This is practically impossible to reconcile with #523 -- it would be easier for me to rewrite whatever functionality was intended. I spent a month on #523 redesigning all of lib/cudnn and test/cudnn, while #672 builds on the master versions: the test file it uses has been moved, renamed, and redesigned, and nnlib.jl has been redesigned. If all that is needed is adding a beta keyword to the NNlib conv functions, why don't I just add that to #523 instead of trying to merge?

@DhairyaLGandhi (Member)

That should be fine; we could also revert #672 and rebase the functionality on top of this. Whatever would be easier, I think.

@denizyuret (Contributor Author)

I can add alpha/beta keyword args in CUDA/lib/cudnn/nnlib.jl in this PR. I assume these have the same semantics as cuDNN, with defaults alpha=1, beta=0. @DhairyaLGandhi, you mentioned that these were supported in NNlib.jl, but I did not see them on master; am I looking at the wrong branch, or did I misunderstand what #672 is trying to do?

@DhairyaLGandhi (Member)

Yes, these are implemented here https://github.com/FluxML/NNlib.jl/blob/d9aaaf7ac4df683f1d3361efb3cbaef9aa8112a1/src/impl/conv_im2col.jl#L25

And the defaults are consistent with cuDNN.

@denizyuret (Contributor Author)

@DhairyaLGandhi I added explicit alpha/beta kwargs to all NNlib.conv related functions in this branch (lib/cudnn/nnlib.jl), along with associated tests (test/cudnn/nnlib.jl). All tests pass except for NNlib.∇conv_data! with nonzero beta: the CPU version ignores the existing output, while the GPU version accumulates into it.

At this point we need to decide the semantics of alpha/beta in backward functions NNlib.∇conv*: there are two possibilities:
(1) They simply scale / add to the output as in the forward call and are independent of whether the forward call had alpha/beta.
(2) They are assumed to be the alpha/beta from the forward call and do the right thing for the gradient.

If you choose (2) and I remember correctly:
(a) both backward-filter and backward-data should use beta=0, alpha=forward-alpha.
(b) backward-bias should use alpha=1, beta=0 regardless of the forward params.
(c) Finally the gradient for the residual (the cudnn z argument, or the original y argument that got added to) should multiply dy by beta.
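For concreteness, a minimal sketch of what rule set (2) could look like, assuming the forward pass computed y = alpha*conv(x, w) + beta*z in WHCN layout; the function name is illustrative and this is not code from the PR or from NNlib:

```julia
using NNlib

# Hypothetical sketch of option (2): derive gradients assuming the forward pass
# computed y = alpha * conv(x, w) + beta * z. Illustrative only.
function conv_scaled_pullback(dy, x, w, cdims; alpha=1, beta=0)
    dx = alpha .* NNlib.∇conv_data(dy, w, cdims)     # (a) scale by the forward alpha, beta=0
    dw = alpha .* NNlib.∇conv_filter(x, dy, cdims)   # (a) likewise for the filter gradient
    db = sum(dy; dims=(1, 2, 4))                     # (b) bias gradient ignores alpha/beta
    dz = beta .* dy                                  # (c) gradient of the residual input z
    return dx, dw, db, dz
end
```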

Right now NNlib seems inconsistent in this regard: backward-filter adds when beta is non-zero, backward-data doesn't.

I also noticed that conv_bias_act! does not use alpha/beta consistently in nnlib.

In any case, these are NNlib issues that can be addressed later. Can we revert #672 and merge this PR?

@maleadt (Member) commented Feb 1, 2021

I'll revert/rebase/push, we can merge this once it passes a final round of CI.

@DhairyaLGandhi (Member)

Yeah, I'm thinking that conv_bias_act needs to be compatible with conv.

I think alpha and beta should be consistent with the CPU case. Do you think using the cuDNN default as in (2) should be triggered only when these conditions are met?

@maleadt (Member) commented Feb 2, 2021

@denizyuret Why did you push a merge commit? This was ready to go.

@denizyuret (Contributor Author)

@maleadt sorry about that, it was an automatic reflex (open laptop, git pull, merge if there is something new). How can I fix it?

@maleadt (Member) commented Feb 2, 2021

I'll drop the merge commit.

@maleadt force-pushed the dy/cudnn branch 2 times, most recently from 15565f0 to 0b57f52 on February 2, 2021 08:22
@maleadt merged commit f1caf8d into JuliaGPU:master Feb 2, 2021
maleadt added a commit that referenced this pull request Feb 10, 2021
New high level interface for cuDNN
maleadt added a commit that referenced this pull request Mar 16, 2021
New high level interface for cuDNN
Development

Successfully merging this pull request may close these issues.

  • softmax has problem with dim parameter
  • softmax(x) and logsoftmax(x) update their arguments
  • Incomplete CUDNN wrappers
  • dims support for softmax
6 participants