Description
There are multiple variations of the gated recurrent unit, tied to different versions of the arXiv paper. The variations differ only in where the reset vector (r) is applied (either before or after the recurrent matrix multiplication). Currently, we implement the version found in v1 of the paper, which matches what PyTorch does. TensorFlow implements these different variations, with the v3 version being the default. Another driver of this decision is that CuDNN seems to support only the GRU from v1 of the paper.
Would there be interest in a PR to add the v3 variation?
If we were to add v3 (and maybe other variations), we should likely keep v1 as the default (so as not to break existing assumptions). But I think it is unclear in the literature exactly which variation people mean when they use GRUs. This paper from Yoshua Bengio's group seems to use v3 to empirically study GRUs, but everyone using TensorFlow or PyTorch is likely using v1. TensorFlow says the default is v3, yet it looks like they set the argument `reset_after=True`, which to me means they are actually using v1 (I haven't dug into their code to figure out what it is doing yet, because I am never a happy camper when looking at TensorFlow...).