
Unify rec ...Step variants with on-seq variants? #81

Closed
albertz opened this issue Nov 16, 2021 · 5 comments

albertz commented Nov 16, 2021

I'm continuing to think about this (see #31 for some earlier discussion), at least for modules, not necessarily functions. The reasoning for having two separate variants of rec modules, like LSTM and LSTMStep, or SelfAttention and SelfAttentionStep, is to make it very explicit whether a module operates on a single step or on a whole sequence. This is again the argument for explicitness.

One issue, though, is when the user wants to use the same module twice: once on a single step, and once on the whole sequence. Conceptually, nothing speaks against this. However, with this strict separation, it is not possible.

Or maybe the user wants to prepare a model which has multiple modes: in one mode it operates per step, while in another mode it operates on a whole sequence. It would be somewhat annoying to have a branch like self.lstm = (LSTMStep if per_step else LSTM)(...) and similar for every module. Also, the intention would be that the params match in both cases, but there would not really be much error checking, because nothing guarantees that the params of LSTM are compatible with the params of LSTMStep. We would just hope that they match.
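For illustration, a minimal sketch of such a branch, assuming the LSTM/LSTMStep modules discussed above; the constructor arguments here are placeholders, not the actual API:

```python
from returnn_common import nn


class MyModel(nn.Module):
  def __init__(self, *, per_step: bool):
    super().__init__()
    # Pick the variant depending on the mode. The intention is that both variants
    # create matching params, but nothing checks that the params of LSTM are
    # actually compatible with the params of LSTMStep.
    self.lstm = (nn.LSTMStep if per_step else nn.LSTM)(out_dim=nn.FeatureDim("lstm-out", 512))
```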

So, I'm questioning whether it is really the best design to have two separate modules for each case.

Some alternatives, all using a single module:

  • Have an option on the module, like per_step: bool or so, which defines the mode (per step vs. on sequence).
  • Use the special single_step_dim (via Axis option for per-step execution returnn#847).
  • The module call itself would do one variant (maybe on sequence?) and a separate module.step(...) function would operate per step. This basically would work already, except that it would not get wrapped through the __call__ logic inside its own subnetwork. But we could probably do that. Maybe we can also introduce a generic function decorator which does that, like @module_call def step(...). (Such a decorator could also be useful in other contexts. And maybe the user would not even define forward but just @module_call def __call__(...) instead...)
  • Maybe we would have the generic convention that we define all rec modules always per step, and then have a generic functional wrapper like iterate(module, inputs..., axis...) which logically wraps the module in a Loop() (see the sketch after this list). Some modules could choose to provide their own specific fast implementation for this on the returnn-common side, although RETURNN itself should probably handle most cases already.
    (This would not work for all cases, though, where the on-seq variant maybe has other modes which are not possible per step, like SelfAttention or BLSTMs.)
  • Extending the last two options: maybe no __call__ (forward) at all, but just two separate functions step and on_seq or so, both maybe via such a @module_call decorator. There could be a default on_seq implementation, maybe via some Rec mixin class or so.
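A rough sketch of what such an iterate() wrapper could look like. The nn.Loop / loop.state usage below is approximate, and the per-step module interface assumed here (default_initial_state(), call returning (output, new state)) is only for illustration, not the actual API:

```python
from returnn_common import nn


def iterate(module: nn.Module, source: nn.Tensor, *, axis: nn.Dim, initial_state=None) -> nn.Tensor:
  """Apply a per-step `module` over `axis` by logically wrapping it in a Loop."""
  with nn.Loop(axis=axis) as loop:
    # Assumed module interface: default_initial_state(), and
    # __call__(frame, state=...) -> (output, new_state).
    loop.state.rec = initial_state if initial_state is not None else module.default_initial_state()
    frame = loop.unstack(source)  # one frame of `source` per loop iteration
    out_frame, loop.state.rec = module(frame, state=loop.state.rec)
    out = loop.stack(out_frame)  # collect per-step outputs back into a sequence over `axis`
  return out
```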

Maybe there are also further options. Suggestions are welcome.

albertz commented Dec 8, 2021

Note that SelfAttention is not left-to-right, while SelfAttentionStep is left-to-right, so they are not really the same, although they surely could (and maybe should) share some common code.

albertz commented Dec 9, 2021

Maybe we should rename SelfAttentionStep to CausalSelfAttentionStep. And then maybe also have a variant CausalSelfAttention operating on a sequence.

Or just have a unified CausalSelfAttention and use some of the suggestions above.

Or have a unified SelfAttention where causal is an option, although then it becomes very much like the original SelfAttentionLayer with its attention_left_only option. Not sure...

albertz added a commit that referenced this issue Dec 9, 2021
Some related discussion in #81.

albertz commented Dec 16, 2021

We have that decorator now, @nn.scoped.

I like the step() method variant.
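For illustration, a minimal sketch of that variant using the decorator; the method bodies, signatures and state handling are placeholders, not the actual implementation:

```python
from returnn_common import nn


class MyRec(nn.Module):
  """Hypothetical rec module with both an on-sequence and a per-step entry point."""

  @nn.scoped
  def __call__(self, source: nn.Tensor, *, axis: nn.Dim) -> nn.Tensor:
    ...  # operate on the whole sequence over `axis`

  @nn.scoped
  def step(self, frame: nn.Tensor, *, state):
    ...  # operate on a single step, returning (output, new state)
```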

albertz commented Dec 16, 2021

Note there is also nn.single_step_dim (used on the RETURNN side for a similar purpose).

But using that instead of a separate function (step()) or module is maybe a bit strange, as we then have two outputs, including the final state.

Or would we always have those two outputs (and also an input for the initial state) anyway?

albertz commented Dec 16, 2021

As axis: nn.Dim is required for all layers, this already makes it explicit whether the operation is per step (nn.single_step_dim) or over the given axis. I think the easiest solution currently is to just use that consistently. This would also be a very direct mapping to the RETURNN logic, which works the same way.
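A sketch of that unified calling convention, with the mode selected purely by the axis argument. The nn.LSTM constructor arguments, default_initial_state(), and the exact call signature are assumptions here; inputs, time_dim, frame and state are placeholders defined elsewhere:

```python
from returnn_common import nn

lstm = nn.LSTM(out_dim=nn.FeatureDim("lstm-out", 512))  # constructor args assumed

# On a whole sequence: iterate over the given time axis.
out_seq, last_state = lstm(inputs, axis=time_dim, state=lstm.default_initial_state())

# On a single step (e.g. inside a Loop): same module, same params, just a different axis.
out_frame, state = lstm(frame, axis=nn.single_step_dim, state=state)
```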
