GRU implementation details

There are multiple variations of the gated recurrent unit tied to different versions of the [arxiv](https://arxiv.org/abs/1406.1078) paper. The variation only changes where the reset vector (`r`) is applied (either before or after the matrix operation). Currently, we are implementing the version found in v1 of the paper, which follows from what [PyTorch does](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html) . Tensorflow implements these different [variations](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU), with the v3 version being the default. Another driver of this decision is CuDNN seems to only support GRU from v1 of the paper.

Would there be interest in a PR to add the v3 variation? 

If we were to add v3 (and maybe other variations), we should likely keep v1 as the default (so as not to break assumptions). But I think it is unclear in the literature exactly what people are using when using GRUs. This [paper](https://arxiv.org/pdf/1412.3555.pdf) from Yoshua Bengio's group seems to use v3 to empirically study GRUs. But everyone using Tensorflow or Pytorch are likely using v1 of the GRUs. Tensorflow says the default is v3, but it looks like they are setting the argument `reset_after=true`. This to me means they are actually using v1 (I haven't dug into their code to figure out what is doing yet because I am never a happy camper when looking at tensorflow...).  

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

GRU implementation details #1671

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

GRU implementation details #1671

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions