
GRU implementation details #1671

Closed
mkschleg opened this issue Jul 21, 2021 · 22 comments

@mkschleg
Contributor

There are multiple variations of the gated recurrent unit, tied to different versions of the arXiv paper. The variations differ only in where the reset vector (r) is applied (either before or after the matrix operation). Currently, we implement the version found in v1 of the paper, which follows what PyTorch does. Tensorflow implements these different variations, with the v3 version being the default. Another driver of this decision is that CuDNN seems to only support the GRU from v1 of the paper.
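
For reference, a minimal sketch of the difference (illustrative names, not Flux's actual fields; the r and z gate computations are identical in both variants, only the candidate state changes):

```julia
nin, nh = 4, 3
Wi, Wh, b = randn(nh, nin), randn(nh, nh), zeros(nh)   # candidate-state weights/bias
x, h = randn(nin), randn(nh)                           # input and previous hidden state
r = rand(nh)                                           # reset gate (normally σ of its own pre-activation)

# v1 (arXiv:1406.1078v1; what PyTorch/CuDNN and Flux's current GRU follow):
# the reset gate is applied after the hidden-to-hidden matrix multiplication.
h̃_v1 = tanh.(Wi * x .+ r .* (Wh * h) .+ b)

# v3: the reset gate is applied before the matrix multiplication.
h̃_v3 = tanh.(Wi * x .+ Wh * (r .* h) .+ b)
```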

Would there be interest in a PR to add the v3 variation?

If we were to add v3 (and maybe other variations), we should likely keep v1 as the default (so as not to break assumptions). But I think it is unclear in the literature exactly what people are using when they use GRUs. This paper from Yoshua Bengio's group seems to use v3 to empirically study GRUs. But everyone using Tensorflow or Pytorch is likely using v1 of the GRU. Tensorflow says the default is v3, but it looks like they are setting the argument reset_after=true. This to me means they are actually using v1 (I haven't dug into their code to figure out what it is doing yet, because I am never a happy camper when looking at Tensorflow...).

@DhairyaLGandhi
Member

That sounds interesting! If you have an implementation of the different versions it would be great to have a PR.

@ToucheSir
Member

This came up recently elsewhere, thanks for digging into it! Unsurprisingly, it seems like CuDNN is the bottleneck.

@mkschleg
Contributor Author

I mentioned this in Slack, so that might be what you are thinking of. I can work on a PR for this.

@CarloLucibello
Member

Not strictly related, but we are also missing the bridge from v1 to CuDNN right now. From what I remember from @jeremiedb's posts, we would need to move away from the vector-of-vectors representation of a series to a matrix representation.

@ToucheSir
Member

Yeah, I don't see us trying to re-integrate with CuDNN unless the RNN interface itself undergoes some major changes. And even then, there may be a way to get comparable performance without bending over backwards to accommodate it like other frameworks have.

@mkschleg
Contributor Author

mkschleg commented Jul 22, 2021

I do wonder how much performance we are leaving on the table by not using CuDNN. We might be able to get it back with some specialized kernels written in Julia? It might also be related to how memory is laid out. But I'm not sure what CuDNN offers in terms of optimizations over what we can do ourselves. The biggest hurdle would be how we assume data is presented to a network (i.e. array of matrices in Flux vs a 3 dimensional array for CuDNN).
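
To make the layout difference concrete, a small sketch of the two conventions being compared (the exact dimension order, features × batch × seq here, is just for illustration):

```julia
features, batch, seq_len = 8, 16, 10

# Flux's current convention: a sequence is a vector of (features × batch) matrices,
# one per time step, and the recurrent layer is iterated/broadcast over it.
xs = [rand(Float32, features, batch) for _ in 1:seq_len]

# The CuDNN-style convention: a single dense 3-dimensional array for the whole sequence.
x3d = cat(xs...; dims=3)                       # features × batch × seq_len
xs_again = [x3d[:, :, t] for t in 1:seq_len]   # and back to per-step slices
```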

@mkschleg
Contributor Author

mkschleg commented Jul 22, 2021

In any case, there are a ton of recurrent architectures out there. And a lot of variants for LSTMs/GRUs (which have support in Pytorch/Tensorflow). It might be useful to figure out how to incorporate new variants of existing models without adding too much complexity. Flux has been a savior for developing novel recurrent architectures, mostly because you can very easily look at the update functions and know how the entire stack is working. So maintaining simplicity where possible would be a must.

The obvious answer is to just have a new type for each variant and have a more complex constructor or just many constructors. But maybe there is another way?

@ToucheSir
Member

#1367 (comment) has some performance numbers comparing the old CuDNN path and the current implementation.

As I noted on that thread, Flux is not the only library that eschews CuDNN for its RNN implementation. Certainly they are much easier to implement efficiently than e.g. convs because the only ops involved are simple matmuls and pointwise broadcasting.

@DhairyaLGandhi
Member

There's a bunch of optimizations to be had with either approach. It's also always better to have multiple backends to serve as ground-truth implementations.

@mkschleg what kinds of variations were you thinking about? It's likely that we can share a lot of code and API, so we should avoid many complicated constructors. One key is to make the API transparent, and to be explicit about what we will do with user inputs and what we expect from them. Usually we don't need sentinel/magic values, which would be good to maintain. We also likely want a fairly flat type hierarchy, so there may be multiple types involved, but we can break down the problem so as to maximize how many of them can share interfaces/behaviour. As you said, maintaining simplicity (and having generic but performant code) is a must.

@mkschleg
Contributor Author

This paper lists a bunch. But I'm not sold we actually need to support all the variants, just some. Looking at what Pytorch and Tensorflow support might be a place to start. Another concrete place in the literature to look could be the competitors in this paper.

My research focus doesn't tend to go outside of GRUs and LSTMs, as I tend to use them as competitors when developing RL-specific recurrent architectures. So I'm less familiar with what is currently becoming popular, but it might be good to think through how we would want to structure this if there is interest in expanding the architectures included in Flux.

I will just start with v1 vs v3 for the GRU and we can go from there.

@mkschleg
Contributor Author

@ToucheSir It seems that comparison was made without a fused 3D array, no? I suspect the optimizations to be had would be more apparent when using the CuDNN RNN kernel as intended (likely in how memory is handled).

I'm not sure how we would back that with the current paradigm Flux has (i.e. use broadcasting to iterate over time steps). Maybe we could have a special function that does this? But that seems like it might be hard to design to work generally across the library.

@ToucheSir
Member

Correct on all points. The old CuDNN path was making pretty poor use of the library because it fed in one timestep at a time. Unfortunately, there's not a consensus on how to (re)design the RNN interface to work more efficiently with 3D+ input data either. That said, if anyone is interested in drafting a proposal for it, I'd be happy to help review and give feedback :)

@jeremiedb
Contributor

I see 2 different components to the discussion.

  1. Expanding the available recurrent cells provided by Flux (currently RNN, LSTM, GRU)
  2. Leveraging CuDNN for performance.

Regarding 1, I think the current form of the recurrent structure in Flux makes it pretty much as straightforward as it gets to define a new/custom recurrent structure. That said, expanding the recurrent cell zoo with new ones would be nice. When it comes to variations on LSTM or GRU, I guess having a constructor that takes some extra input configuration options to generate the appropriate structure such as {GRUv3} or {LSTMpeephole} may be the simplest way to go.
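
To illustrate the point about custom cells, a rough sketch of the pattern (made-up names, not Flux internals): any struct callable as cell(h, x) returning (new_h, output) can be wrapped in Flux.Recur, which threads the hidden state across time steps.

```julia
using Flux

struct MinimalCell{A,V}
    Wx::A; Wh::A; b::V
end

MinimalCell(in::Int, out::Int) =
    MinimalCell(randn(Float32, out, in) ./ 10, randn(Float32, out, out) ./ 10, zeros(Float32, out))

# the Recur contract: cell(h, x) -> (new hidden state, output for this step)
function (c::MinimalCell)(h, x)
    h′ = tanh.(c.Wx * x .+ c.Wh * h .+ c.b)
    return h′, h′
end

Flux.@functor MinimalCell

layer = Flux.Recur(MinimalCell(4, 3), zeros(Float32, 3))
ys = [layer(rand(Float32, 4)) for _ in 1:5]   # iterate over 5 time steps
```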

As for 2, in the context of the "single time step" paradigm as currently implemented in Flux, I'd disagree with relying on CuDNN, since it doesn't improve performance. And as for conformity to a ground truth, I'd personally prefer the safety of having a single code definition of the cell which dispatches on CPU or GPU thanks to CUDA.jl, rather than an independent CuDNN implementation taking over when on GPU. Moreover, given the lightweight implementation of the recurrent cells, I feel such an approach is more transparent and easier for the user to assess. It's also my perception that this transparent CPU/GPU execution model is part of the attractiveness of the Julia/Flux ecosystem, more so than CuDNN bindings.

That said, having support for a fused 3D mode sounds like a good addition to leverage the optimizations that then become available. I'd see something of the form `FusedRNN`/`FusedLSTM`/..., which would be restricted to 3D inputs and dispatch to CuDNN. As for CPU support, maybe the simplest way to start would be to define the fused function as a loop over the seq_length dimension, thus reusing the current non-fused implementation (rather than trying to build an optimised 3D CPU implementation)?

@ToucheSir
Member

I wonder if we could intercept the broadcasting of certain RNN layers on 3d inputs and use that to dispatch to backends like CuDNN. Layers/functions that map or fold (in either direction) an RNNCell over a sequence would also be an option.

@jeremiedb
Contributor

intercept the broadcasting of certain RNN layers on 3d inputs and use that to dispatch to backends like CuDNN

Could you expand a bit on what you would have in mind? My first thought was that if CuDNN is involved, then we are dealing with GPU execution, and in such a case CuDNN's RNN/LSTM/GRU is already set to consume the entire 3D array as a single input and apply the full stack of layers onto it. So if CuDNN is involved, I don't see a case for broadcasting, as this would seem to break the optimisation which is the motivation for its usage.

Roughly speaking, my take on 3D support would go along these lines:

function (m::FusedLSTM)(x::CuArray{T,3}) where {T}
    cudnn_LSTM(x)
end

function (m::FusedLSTM)(x::Array{T,3}) where {T}
    # init out
    for layer in m.layers # support a stack of layers as in CuDNN
        out = [layer(x[:, seq, :]) for seq in 1:size(x, 2)]
    end
    # need to re-concat into a 3D output
    return out
end

@mkschleg
Contributor Author

Right. We can definitely do this. But I guess the question is how does it interact with the rest of Flux?

For example, say we have dense layers before the FusedLSTM (which would be expected); how would we get to a 3D array? Currently, we always assume Dense takes a matrix, and we broadcast over the time steps. We could add support through dispatch for all layer types (or maybe some default, but then all layers would need to inherit from a base abstract type). Unfortunately, this would also start to get pretty memory intensive if we have to convert an array of matrices to 3D input for every layer. Maybe there is a way to construct it directly, i.e. we could likely do it with Tullio or something. But then this adds another layer of complexity on top of the implementation, which may or may not be warranted.

I think the appeal of trying to overload broadcast is that it would preserve the current recurrent API as well. We might even be able to get away with lazily constructing the 3d array right before the LSTM/GRU.

@jeremiedb
Contributor

My take on this is that I would have a model that is cohesive with the input structure throughout, that is, stick to that 3D shape. For example, the initial embedding and Dense layers would be applied to that 3D input. This is the structure I've been accustomed to in MXNet, and AFAIK the Dense layer already handles a 3D input, where it is assumed that the first dimension is the feature dimension. I'm unsure about the new embedding layer, but my understanding is that such a 3D structure is supported.

If going with a mix of vector-of-matrices alongside a 3D RNN, my understanding is that it would result in a convoluted execution of the model. For example:

out1 = dense_in.(vec_of_2D) # broadcasting dense over vector of 2D inputs
out2 = FusedRNN(out1) # application of fused RNN over out1 - with or without explicit reshape into 3D
out3 = dense_out.(out2) # another broadcasting of a dense layer to generate output - assumes the fused RNN outputs a vec of 2D, otherwise needs a reshape

The above seems to require some more subjective calls on the expected in/out shapes, and is less straightforward than going all the way through with the 3D input:

Dense_in(x_3d) |> FusedRNN |> Dense_out

Nor does it have the benefit of the 2D-only approach, where the entire stack of layers from embedding/dense_in/rnn_cell/dense_out can be applied at a single timestep, resulting in a complete single-step model that can be broadcast (actually iterated, or left-folded, to ensure proper execution order) over the vec of inputs.

Where I might be missing the point: in cases where someone wants to handle a sequence as a vec of 2D inputs, is there actually a need to build a mechanism to send the execution to the CuDNN fused operator? I'm assuming that where a fused RNN execution is desired, a full 3D pipeline would also be expected, as is the case in other frameworks using CuDNN's RNN.

@jeremiedb
Contributor

Just to add to why it sounds reasonable to me to encourage an end-to-end 3D approach when using a fused RNN operator: it is also the way data would be processed in a Transformer model.

@mkschleg
Contributor Author

Cool! I definitely had no idea that Dense supported arbitrarily-dimensioned arrays!

I wouldn't be bothered by having support for both.
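
For anyone else surprised by this, a quick sketch of the behaviour being described (assuming a recent Flux where Dense reshapes N-dimensional input and acts along the first, feature, dimension):

```julia
using Flux

d = Dense(8, 5)                     # 8 input features, 5 outputs
x3d = rand(Float32, 8, 10, 16)      # features × seq_len × batch
y = d(x3d)
size(y)                             # (5, 10, 16): the trailing dimensions pass through
```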

@ToucheSir
Member

On further thought, I take back the point about broadcasting. I forgot about the discussion we had in FluxML/Zygote.jl#1001, where it was pointed out that both map and broadcasting provide no guarantees about the order of evaluation.

I still think there's value in a wrapper type or something else that composes with the existing RNN cells instead of an entirely different set of layers. Something like TF's TimeDistributed. The semantics would be similar to FusedRNN, but the default behaviour for layers that don't support arbitrary dimensions (e.g. Conv or an entire Chain) could be to run them as a simple loop over each slice of the input. One could imagine variants of this that handle different directions (left-right, right-left, bidirectional) and output types (single output, seq-to-seq)—whether it would be better to express those as type params or separate types I'm not sure.
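
As a concrete sketch of what such a wrapper could look like (this is not an existing Flux layer; the name, the features × seq_len × batch dimension order, and the simple per-slice loop are all assumptions):

```julia
using Flux

struct TimeDistributed{L}
    layer::L
end

Flux.@functor TimeDistributed

# Fallback behaviour: run the wrapped layer on each time slice and re-stack the results.
function (t::TimeDistributed)(x::AbstractArray{T,3}) where {T}
    slices = [t.layer(x[:, i, :]) for i in 1:size(x, 2)]    # one 2D call per step
    return cat((reshape(s, size(s, 1), 1, size(s, 2)) for s in slices)...; dims=2)
end

# e.g. wrapping a Chain that only knows about (features × batch) matrices:
m = TimeDistributed(Chain(Dense(8, 5, relu), Dense(5, 2)))
y = m(rand(Float32, 8, 10, 16))    # 2 × 10 × 16
```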

@darsnack
Member

I guess having a constructor that takes some extra input configuration options to generate the appropriate structure such as {GRUv3} or {LSTMpeephole} may be the simplest way to go.

Minor suggestion (don't want to derail the rest of the discussion): LSTM{peephole}/LSTM(...; peephole = true) are just as verbose as LSTMPeephole. I would suggest a separate type here instead of constructor options and conditionals in the forward pass. You can still share forward code with a plain function when desired.
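
A rough sketch of what that could look like (hypothetical types and fields, not a proposed Flux API): each variant gets its own type, the shared gate math lives in plain functions, and the forward passes differ only in where the reset gate is applied, with no flags or conditionals:

```julia
using Flux   # for σ

struct MyGRUv1Cell{A,V}; Wxr::A; Whr::A; br::V; Wxz::A; Whz::A; bz::V; Wxh::A; Whh::A; bh::V; end
struct MyGRUv3Cell{A,V}; Wxr::A; Whr::A; br::V; Wxz::A; Whz::A; bz::V; Wxh::A; Whh::A; bh::V; end

# forward code shared between the variants
gru_gates(c, h, x) = (σ.(c.Wxr * x .+ c.Whr * h .+ c.br),   # reset gate r
                      σ.(c.Wxz * x .+ c.Whz * h .+ c.bz))   # update gate z
gru_combine(z, h̃, h) = (1 .- z) .* h̃ .+ z .* h

function (c::MyGRUv1Cell)(h, x)
    r, z = gru_gates(c, h, x)
    h̃ = tanh.(c.Wxh * x .+ r .* (c.Whh * h) .+ c.bh)   # reset applied after the matmul
    h′ = gru_combine(z, h̃, h)
    return h′, h′
end

function (c::MyGRUv3Cell)(h, x)
    r, z = gru_gates(c, h, x)
    h̃ = tanh.(c.Wxh * x .+ c.Whh * (r .* h) .+ c.bh)   # reset applied before the matmul
    h′ = gru_combine(z, h̃, h)
    return h′, h′
end
```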

@mkschleg mkschleg mentioned this issue Jul 23, 2021
@mkschleg
Contributor Author

We can use the PR I made to discuss how to design the interface. I added a first pass doing the minimum (i.e. just a separate type, as suggested by @darsnack).

bors bot added a commit that referenced this issue Aug 2, 2021
1675: Adding GRUv3 support. r=darsnack a=mkschleg

As per the starting discussion in #1671, we should provide support for variations on the GRU and LSTM cell. 

In this PR, I added support for the GRU found in v3 of the [original GRU paper](https://arxiv.org/abs/1406.1078). Current support in Flux is for v1 only. [Tensorflow](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU) supports several variations, with this as one of the variations. 

While the feature is added and usable in this PR, this is only a first pass at a design and could use further iterations. Some questions I have:
- Should we have new types for each variation of these cells? (another possibility is through parametric options)
- Should we have a shared constructor similar to Tensorflow/Pytorch? (it might make sense to rename the current GRU to GRUv1 if we want to do this).

### PR Checklist

- [x] Tests are added
- [x] Entry in NEWS.md
- [x] Documentation, if applicable
- [ ] API changes require approval from a committer (different from the author, if applicable)


Co-authored-by: Matthew Schlegel <mkschleg@gmail.com>
Co-authored-by: Matthew Schlegel <mkschleg@users.noreply.github.com>
@mkschleg mkschleg closed this as completed Aug 2, 2021
bors bot added a commit that referenced this issue Sep 14, 2021
1686: Adding support for folding RNNs over 3d arrays r=DhairyaLGandhi a=mkschleg

From #1678, adding a Recur-like interface for a folded operation with support for 3-dimensional arrays. This is how many users expect RNNs to work if they are familiar with Pytorch and Tensorflow, and there seems to be some desire for this feature as per the discussion in #1671 and with `@jeremiedb`. This will also streamline the push towards implementing support for the CuDNN versions of RNNs/GRUs/LSTMs, as this is the data layout that API expects.

I did a barebones implementation to add support so we can start iterating on API.

There are several questions that I have lingering with this interface:
- ~Should we support different modes where we return all or only the last hidden state? Is there a better way to do the concat of the hidden states?~
- What kind of tests should we have? Just follow what we currently do for RNNs/LSTMs/GRUs?
- ~For the CPU version, does it make sense not to specialize on the different rnn types? We might be able to take more advantage of BLAS if we specialized on say `Folded{GRU}`.~
- ~Do we want to force the temporal dimension to be the 2nd?~
- ~Do we want this to be stateful? (i.e. allow the user to change what the starting hidden state is rather than state0).~

### PR Checklist

- [x] Tests are added
- [ ] Entry in NEWS.md
- [x] Documentation, if applicable
- [ ] API changes require approval from a committer (different from the author, if applicable)


Co-authored-by: Matthew Schlegel <mkschleg@gmail.com>
Co-authored-by: Matthew Schlegel <mkschleg@users.noreply.github.com>
Co-authored-by: Dhairya Gandhi <dhairya@juliacomputing.com>