
Backport new CUDNN interface to 1.5 #635

Closed
wants to merge 1 commit into compat

Conversation

maleadt
Member

@maleadt commented Jan 11, 2021

This should be easier to test with current packages. Fixes should be pushed to #523. If this PR is fine, we can go ahead and merge #523 too. cc @denizyuret @DhairyaLGandhi @CarloLucibello

@maleadt added the enhancement and cuda libraries labels Jan 11, 2021
@codecov

codecov bot commented Jan 14, 2021

Codecov Report

Merging #635 (80c846f) into compat (50815c4) will increase coverage by 1.64%.
The diff coverage is 65.05%.

Impacted file tree graph

@@            Coverage Diff             @@
##           compat     #635      +/-   ##
==========================================
+ Coverage   77.30%   78.94%   +1.64%     
==========================================
  Files         116      120       +4     
  Lines        6438     6650     +212     
==========================================
+ Hits         4977     5250     +273     
+ Misses       1461     1400      -61     
Impacted Files Coverage Δ
lib/cudnn/CUDNN.jl 42.85% <ø> (ø)
lib/cudnn/tensor.jl 31.03% <31.03%> (-19.99%) ⬇️
lib/cudnn/util.jl 40.74% <40.00%> (-6.32%) ⬇️
lib/cudnn/batchnorm.jl 33.33% <42.85%> (ø)
lib/cudnn/pooling.jl 53.33% <51.72%> (-24.93%) ⬇️
lib/cudnn/reduce.jl 51.85% <51.85%> (ø)
lib/cudnn/nnlib.jl 57.05% <57.40%> (-6.58%) ⬇️
lib/cudnn/softmax.jl 61.53% <58.33%> (-5.13%) ⬇️
lib/cudnn/activation.jl 64.28% <64.28%> (-16.97%) ⬇️
lib/cudnn/rnn.jl 69.29% <69.29%> (+69.29%) ⬆️
... and 20 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 50815c4...80c846f. Read the comment docs.

@CarloLucibello
Contributor

The tests of the last tagged Flux release fail on this branch:

ERROR: LoadError: LoadError: LoadError: UndefVarError: RNNDesc not defined
Stacktrace:
 [1] top-level scope at /home/carlo/.julia/packages/Flux/YO4zY/src/cuda/curnn.jl:8
 [2] include at ./Base.jl:368 [inlined]
 [3] include(::String) at /home/carlo/.julia/packages/Flux/YO4zY/src/cuda/cuda.jl:1
 [4] top-level scope at /home/carlo/.julia/packages/Flux/YO4zY/src/cuda/cuda.jl:6
 [5] include at ./Base.jl:368 [inlined]
 [6] include(::String) at /home/carlo/.julia/packages/Flux/YO4zY/src/Flux.jl:1
 [7] top-level scope at /home/carlo/.julia/packages/Flux/YO4zY/src/Flux.jl:51
 [8] top-level scope at none:2
 [9] eval at ./boot.jl:331 [inlined]
in expression starting at /home/carlo/.julia/packages/Flux/YO4zY/src/cuda/curnn.jl:8
in expression starting at /home/carlo/.julia/packages/Flux/YO4zY/src/cuda/cuda.jl:6
in expression starting at /home/carlo/.julia/packages/Flux/YO4zY/src/Flux.jl:51
ERROR: LoadError: Failed to precompile Flux [587475ba-b771-5e3f-ad9e-33799f191a9c] to /home/carlo/.julia/compiled/v1.5/Flux/QdkVy_aa1Cd.ji.
Stacktrace:
 [1] include(::String) at ./client.jl:457
 [2] top-level scope at none:6
in expression starting at /home/carlo/.julia/packages/Flux/YO4zY/test/runtests.jl:1
ERROR: Package Flux errored during testing

because we used to rely on the RNNDesc:

# From Flux's src/cuda/curnn.jl: build a CUDNN.RNNDesc for a Flux GPU recurrent cell
function CUDNN.RNNDesc(m::CuRNNs{T}) where T
  h, i = length(m.h), size(m.Wi, 2)
  mode = m isa CuRNN ?
    (m.σ == tanh ? CUDNN.CUDNN_RNN_TANH : CUDNN.CUDNN_RNN_RELU) :
    m isa CuGRU ? CUDNN.CUDNN_GRU : CUDNN.CUDNN_LSTM
  r = CUDNN.RNNDesc{T}(mode, i, h)
  return r
end

# Cache one descriptor per layer instance and refresh its weights before each call
const descs = WeakKeyDict()

function desc(rnn)
  d = haskey(descs, rnn) ? descs[rnn] : (descs[rnn] = CUDNN.RNNDesc(rnn))
  CUDNN.setweights!(d, rnn.Wi, rnn.Wh, rnn.b)
  return d
end

# Forward a single step through CUDNN, returning the new hidden state and the output
function (m::CuRNN{T})(h::CuArray{T}, x::CuArray{T}) where T <: Union{Float32,Float64}
  y, h′ = CUDNN.forward(desc(m), x, h)
  return h′, y
end

This code was likely removed by @jeremiedb, and tests are passing on Flux's master.
@denizyuret, can this be addressed?

@denizyuret
Contributor

because we used to rely on the RNNDesc:

One of the things that prompted me to write the new cudnn interface was the incomplete rnn interface: #343. RNNDesc was a badly designed interface that did not support many of the functionalities of the cudnn RNN API (no multiple layers, no bidirectional networks, etc., if I understand correctly). The new cudnnRNNForward call supports the cudnn API fully. A new interface based on cudnnRNNForward can be implemented in Flux/NNlib. For now, the old RNNDesc code can be revived in Flux if needed.

@CarloLucibello
Contributor

Can a compatibility layer between the old RNNDesc-based interface and the new one be kept in CUDA.jl, at least here in the backport? Otherwise, the next Julia 1.5-compatible release will have to be marked as breaking.

@denizyuret
Contributor

Certainly. I can try to write something next week if nobody else gets to it before me.

@jeremiedb

jeremiedb commented Jan 16, 2021

Regarding the RNN design, I'm unclear whether plugging CUDNN back in would currently be the right course, given the different design approach taken in Flux, which only defines recurrent cells at the individual-step level. This results in a model being broadcast over a sequence, m.(x), rather than taking the full 3D input expected by CUDNN.
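
To make the contrast concrete, here is a minimal sketch under the Flux v0.11-style API (the sizes are arbitrary, and the 3D call at the end is hypothetical, not an existing Flux method):

using Flux

m  = RNN(2, 3)                           # single-step recurrent layer
xs = [rand(Float32, 2, 5) for _ in 1:7]  # sequence of 7 steps, batch size 5
ys = m.(xs)                              # broadcast over steps: a Vector of 3×5 outputs

# What CUDNN expects instead is a single call on the whole 3D batch,
# roughly features × batch × time (no such method in Flux today):
# x3 = rand(Float32, 2, 5, 7); y3 = m(x3)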

So under Flux's current approach to RNNs, the CUDNN functionality was not really leveraged, as it was called for each individual step of the sequence. From my understanding, that explains why falling back to the base CUDA.jl implementation resulted in similar performance (while also solving some pending issues).

All this to say that while I see it as desirable to have an RNN approach that matches CUDNN's 3D input, I think it would involve some departure from the "single step" design. Otherwise, I'm not sure there's any benefit to integrating the CUDNN RNN if the Flux RNN design isn't moved to that 3D structure. @denizyuret, was your intent to revamp the Flux RNN in that direction?

@denizyuret
Contributor

Regarding the RNN design, I'm unclear whether plugging CUDNN back in would currently be the right course, given the different design approach taken in Flux, which only defines recurrent cells at the individual-step level. This results in a model being broadcast over a sequence, m.(x), rather than taking the full 3D input expected by CUDNN.

That's interesting. Though wouldn't a more appropriate Julia metaphor for RNNs be not map but reduce? There is a changing state at every time step, and at the end you usually want to reduce a sequence (of vectors) to a single vector. In applications where the intermediate steps matter, these can be thought of as akin to, e.g., cumulative sums.

So under Flux's current approach to RNNs, the CUDNN functionality was not really leveraged, as it was called for each individual step of the sequence. From my understanding, that explains why falling back to the base CUDA.jl implementation resulted in similar performance (while also solving some pending issues).

This is not surprising because most of the gain for a GPU rnn comes from being able to process the whole sequence in parallel.

All this to say that while I see it as desirable to have an RNN approach that matches CUDNN's 3D input, I think it would involve some departure from the "single step" design. Otherwise, I'm not sure there's any benefit to integrating the CUDNN RNN if the Flux RNN design isn't moved to that 3D structure. @denizyuret, was your intent to revamp the Flux RNN in that direction?

I think for research use it is essential to support cudnn features like 3D input, multiple layers, easy bidirectionality, etc. A "single step" design is useful for demonstration and teaching but not for most large-scale research. NNlib should probably be extended with such an interface, along with supporting data structures like PaddedSequence etc. Flux can choose to support both the old and the new interfaces. As to how to proceed in Flux, I am not sure how the decision process works. Should @DhairyaLGandhi or @CarloLucibello have the final say?

As an example, I have had an RNN interface in Knet that supports 3D input, multiple layers, bidirectionality, etc. since cudnn introduced them:

https://github.com/denizyuret/Knet.jl/blob/master/src/ops20/rnn.jl

Nobody is completely happy with my interface, though, and @ekinakyurek wrote another:

https://github.com/ekinakyurek/KnetLayers.jl/blob/master/src/rnn.jl

These days I am looking at PyTorch/TensorFlow/MXnet/ONNX etc to see what different API ideas are out there, and probably will design a new interface under Knet.Layers21, keeping the old one for backward compatibility.

@jeremiedb

I'm not sure I understood the map vs reduce reference correctly. Under the current Flux approach, passing a sequence with m.(x) (equivalent to map(m, x)) produces the output vector for each timestep, but only the state of the last timestep is accessible as m.state. So in a text-to-text-with-attention scenario, my understanding is that the best approach would be to stack those output vectors into an array, as expected by a typical QKV attention.
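
For instance, a minimal sketch of that stacking (the layer and sizes are just placeholders):

m  = RNN(2, 4)                            # placeholder recurrent layer
xs = [rand(Float32, 2, 8) for _ in 1:10]  # 10 timesteps, batch of 8
ys = m.(xs)                               # per-step outputs, each 4×8
Y  = cat(ys...; dims = 3)                 # 4×8×10 array, ready for a QKV-style attention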

The above scenario effectively supports the need for an RNN operator on 3D input that returns the full sequence array as output. That being said, the current single-step approach's performance isn't terrible, and I can see benefits to it for certain use cases given its flexibility. It already supports multiple layers well, e.g. Chain(RNN(2,3), RNN(3,4), ...), but bidirectional support is missing.

To follow the general development pattern, should the current RNN CPU implementation be moved from Flux to NNlib?

As for compatibility between CPU and GPU, a simple fix could be to keep the current Flux RNN implementation and dispatch on 3D input by mapping the recurrent model over the seq_length dimension.
However, I still see an issue with CUDNN parameters being packed, so the simple CPU/GPU conversion of a model with m |> gpu would not work here. Would it be better then to migrate the CPU implementation to a packed-parameters representation, or would packing/unpacking be feasible as a model is switched between CPU and GPU? I'm missing perspective on those concerns.

I noticed that the linked implementation even includes an embedding layer. The level of abstraction Flux should target is subject to some debate, though I think it falls more on the lean side. For instance, if I'm not mistaken, Flax/JAX doesn't even have those fused CUDNN RNN operators, only the cell operations. I'd be curious to have @DhairyaLGandhi's/@CarloLucibello's take on where to draw the line on the fundamental building blocks to include.

@ToucheSir
Contributor

ToucheSir commented Jan 19, 2021

I think @denizyuret's point is that map is generally assumed to be stateless and does not have a defined iteration order. In contrast, reduce (or rather foldl/foldr) feels more semantically correct since it carries over and accumulates state.

Flax/JAX doesn't even have those fused CUDNN RNN operators, only the cell operations

AIUI XLA will fuse the cell operations, so there's no need to define a manually fused kernel. I assume this level of fusion doesn't happen in Flux's RNN cells, but would have to compare HLO vs Julia IR to be sure.

@jeremiedb

I agree that the current broadcasting/map approach gives a slightly off feeling, like "let's hope it will iterate in order" :)
Though with foldl, considering the following:

foldl((a, b) -> cell(a, b)[1], x, init=cell.state0)

The semantics feel better; however, I don't see how it could provide the output of each step of the sequence rather than just the final hidden state. The current Recur struct also has a flexible way to handle continuously feeding the model state into following batches if needed. Could a straight loop over the seq_length of the input be an acceptable implementation?
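
As a sketch of that straight-loop idea, assuming the same cell(h, x) -> (new state, output) convention as above (scan is just a hypothetical helper name):

function scan(cell, xs)
  h  = cell.state0
  ys = []                  # output of every step
  for x in xs
    h, y = cell(h, x)      # each step returns (new state, output)
    push!(ys, y)
  end
  return h, ys             # final state plus the per-step outputs
end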

Anyhow, if you have a clear view on how to provide CPU support compatible with the CUDNN RNN interface, @denizyuret, that would be great; otherwise I can take a shot at leveraging the current Flux implementation to support 3D input and bidirectionality.

@denizyuret
Contributor

denizyuret commented Jan 19, 2021

A better API design that can utilize the complete CUDNN RNN interface with a CPU implementation would take significantly more time than just providing backward compatibility in Flux, which is the only thing blocking the CUDA.jl release right now. My suggestion would be:

  1. Provide a backward-compatible implementation of the old RNNDesc-based interface (single layer, no bidirectionality, etc.). This is to be deprecated, but should not take much time. I prefer this code to live in Flux.
  2. Release CUDA.jl with the new CUDNN interface for Julia 1.5 once Flux passes all its tests (Knet and NNlib already do).
  3. Continue our discussion of a better RNN interface that supports single-step as well as 3D input, bidirectionality, multiple layers, various sequence-batch structures (padded, masked, packed, etc., for batches with multiple sequence lengths), an API for accessing and setting the hidden state, etc.
  4. Once we have a better design, create a sample CPU implementation in NNlib as a reference.
  5. Knet, Flux, etc. can inherit from this design and base their GPU implementations on the new CUDNN interface.

To give an idea about the complexity, this link I have shared is an (almost) complete CPU implementation of the CUDNN RNN interface:

https://github.com/denizyuret/Knet.jl/blob/master/src/ops20/rnn.jl

It supports 1-D (single-step), 2-D (single-step with batch), and 3-D (sequence batch) inputs, multiple layers, bidirectionality, etc. So it would allow you to use single-step broadcasting, e.g. r = RNN(inputsize, hiddensize); r.(x) where x is an iterable of inputs, as well as r(x) where x is a 3D sequence batch. It allows getting and setting the hidden states with r.h = r.c = init; y = r(x); newstate = r.h, r.c, which I am not too happy about. Also, the packed sequence batch support is incomplete. But it could be a starting point for a better API discussion.
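
Roughly, that usage looks like the following sketch (the sizes and data are arbitrary placeholders, and the 3D layout is features × batch × time if I remember it right):

using Knet

inputsize, hiddensize = 2, 3                        # placeholder sizes
r  = RNN(inputsize, hiddensize)                     # Knet ops20 RNN layer
xs = [rand(Float32, inputsize, 8) for _ in 1:10]    # iterable of single steps, batch of 8
ys = r.(xs)                                         # broadcast over the single steps
x3 = rand(Float32, inputsize, 8, 10)                # 3D sequence batch
y3 = r(x3)                                          # one call on the whole sequence

r.h = r.c = 0                                       # set the hidden/cell state (0, nothing, or an array)
y  = r(x3)
newstate = r.h, r.c                                 # read the state back after the forward pass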

@jeremiedb

Thanks, that sounds like a good plan.

Regarding:

2. Release CUDA.jl with the new CUDNN interface for Julia 1.5 once Flux passes all its tests (Knet and NNlib already do).

I'm not sure what's keeping this backport from being compatible with current Flux. Given that RNNDesc has been removed since Flux v0.11.3, this CUDA PR shouldn't break current Flux. @CarloLucibello, were your tests failing on v0.11.2, or also on later versions? If Flux v0.11.3+ is fine without any change, then the simplest option seems to be to keep Flux as it is so this PR can move on?
BTW, I'm fairly new to the Flux ecosystem, so my apologies if I'm adding more noise than anything!

@CarloLucibello
Contributor

Flux master is fine, but the next release, v0.12, will take some time since we are pipelining a few breaking changes.
Since Flux v0.11.2 (and v0.11.3/v0.11.4, which contain some backported changes) will be around for a while and is compatible with CUDA.jl 2.x, we should be careful to respect semantic versioning and not introduce breaking changes in CUDA.jl v2.x releases (which I assume will come out of this branch).

@maleadt
Member Author

maleadt commented Jan 27, 2021

This seems to be taking longer than expected, so I don't think we'll be backporting this to a 1.5 release. Let's keep the discussion in #523 then.

@maleadt closed this Jan 27, 2021
@CarloLucibello
Contributor

CarloLucibello commented Jan 30, 2021

It's going to be hard for Flux to keep supporting Julia 1.5 without this backport. And in general, it seems hard to keep up with CUDA.jl's aggressive dropping of older Julia versions. @maleadt, was Julia 1.6 kind of special in this regard, and are things expected to be more retro-fittable in the future? Otherwise, we could think about excising the cudnn and nnlib stuff, although that would be a little annoying and contrary to the consolidation direction that GPU packages have taken recently.

@maleadt
Member Author

maleadt commented Jan 30, 2021

It's going to be hard for Flux to keep supporting Julia 1.5 without this backport.

I'm happy to include this in a 1.5 backport release, but it seemed risky wrt. breaking or slowing down existing Flux.jl releases.

About the Julia compatibility: CUDA.jl (and other GPU back-ends) is no "ordinary" package; it extends the Julia compiler and as such is much more sensitive to internal changes. In 1.6, for example, the entire AOT compilation API changed completely. It's very hard to support multiple releases that way, and I don't expect that to improve much. We've been trying to "containerize" that functionality in GPUCompiler.jl, but because that package's API hasn't matured yet, changes in its Julia support quickly end up limiting the support of downstream packages like CUDA.jl.

@maleadt
Member Author

maleadt commented Feb 24, 2021

@CarloLucibello A backport release of CUDA.jl for Julia 1.5, including the new CUDNN API, is up on the compat branch; please have a look. I can tag a release if the ML ecosystem is ready for that.

@CarloLucibello
Contributor

Thanks! I think the current Flux releases (v0.11) are not compatible with that because of the change in the RNN descriptors. Master (soon to be v0.12) is fine.
We can fix that by creating a compatibility layer or just upper-bounding the CUDA.jl compat in the v0.11 series (probably the latter). Give me a few days to take care of that and do a few routine checks, and I'll give the green light.

@maleadt
Member Author

maleadt commented Feb 25, 2021

If there's an actual incompatibility, the release needs to be breaking and we can't backport this to a 1.5-compatible CUDA.jl release. The next release will be a breaking version, 3.0, but only for Julia 1.6.

@CarloLucibello
Contributor

Yep, #523 breaks the RNN interface without a deprecation path. That's the only breaking part of that PR, I think, and it leaked onto both compat and master. Since Knet and Flux are likely the only direct consumers of that API, it can probably be handled downstream without much effort even if you don't bump the major version, especially if the alternative is dropping Julia 1.5 support in Flux. Alternatives I can think of are:

  • adding a deprecation for the old RNN descriptor here in the compat branch (this can probably be done easily, but I can't help since I have really little knowledge of cudnn). cc @denizyuret
  • reserving the v3 series for the next tags of the compat branch and v4 for master / Julia 1.6

@CarloLucibello
Contributor

All of this seems to suggest that at some point we should factor out an NNlibCUDA package, either here as a subpackage (so that you can keep an eye on it and manage it easily) or somewhere else.

@denizyuret
Contributor

Yep, #523 breaks the RNN interface without a deprecation path. That's the only breaking part of that PR, I think, and it leaked onto both compat and master.

This is a single file that can be found in the last CUDA.jl release; can't we just copy and paste it into Flux, or revive it in CUDA.jl for now until the interface is fixed?

Since Knet and Flux are likely the only direct consumers of that API, it can probably be handled downstream without much effort even if you don't bump the major version, especially if the alternative is dropping Julia 1.5 support in Flux. Alternatives I can think of are:

  • adding a deprecation for the old RNN descriptor here in the compat branch (this can probably be done easily, but I can't help since I have really little knowledge of cudnn). cc @denizyuret
  • reserving the v3 series for the next tags of the compat branch and v4 for master / Julia 1.6

@denizyuret
Contributor

All of this seems to suggest that at some point we should factor out an NNlibCUDA package, either here as a subpackage (so that you can keep an eye on it and manage it easily) or somewhere else.

Yes, I have been advocating this too. CUDA/lib/cudnn can be a light wrapper around the cudnn library calls, and what we do with them can move to their own packages (Knet.CUDNN, NNlibCUDA, etc.).

@DhairyaLGandhi
Member

The interface is basically there already; we have to bring back the cudnn bit from earlier, and then we will be able to dispatch back into cudnn.

@denizyuret
Contributor

I just copied it to #738

As far as I can tell, it doesn't break or conflict with anything in the new cudnn interface. See if it fixes Flux.
