The encoder interface is quite trivial: basically any [LayerRef] -> LayerRef function, although the interface should also imply the tensor format, e.g. {B,T,D}.
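For illustration, a minimal sketch of that signature (Python; `LayerRef` is just a stand-in for the real returnn-common type here, and the `IEncoder`/`Protocol` formulation is an assumption, not the actual draft):

```python
from typing import Protocol


class LayerRef:
    """Stand-in for the actual returnn-common LayerRef type (placeholder)."""


class IEncoder(Protocol):
    """Encoder interface sketch: any LayerRef -> LayerRef callable.

    Input and output are assumed to be sequences in {B,T,D} format
    (batch, time, feature); the output time axis may be downsampled."""

    def __call__(self, source: LayerRef) -> LayerRef:
        ...
```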
The idea was to have a generic interface for the decoder which allows defining both a transducer (in its most generic form, including RNN-T, RNA, etc.), either time-sync or alignment-sync, and a standard attention-based label-sync decoder.
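To make this more concrete, one possible shape of such an interface (purely hypothetical; `IDecoder`, `SyncType` and `step` are invented names for illustration, not the actual draft):

```python
from enum import Enum
from typing import Optional, Protocol, Tuple


class LayerRef:
    """Stand-in for the actual layer reference type, as above."""


class SyncType(Enum):
    """How decoder steps are synchronized (illustrative)."""
    TIME_SYNC = "time"    # one step per encoder frame (time-sync transducer)
    ALIGN_SYNC = "align"  # one step per alignment position (alignment-sync)
    LABEL_SYNC = "label"  # one step per output label (attention decoder)


class IDecoder(Protocol):
    """Decoder interface sketch: one decoding step at a time.

    Each step maps the previous label and recurrent state to
    log-probabilities over the output vocabulary, so one search loop can
    drive RNN-T, RNA and attention-based decoders alike."""

    sync_type: SyncType

    def step(self, encoder_output: LayerRef, prev_label: LayerRef,
             state: Optional[dict]) -> Tuple[LayerRef, dict]:
        """Returns (log_probs, new_state)."""
        ...
```

A single step function would keep the search loop independent of the sync type; the loop only differs in when it advances the encoder frame vs. the label position.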
The interface should allow for easy integration of an external LM, as well as for internal language model (ILM) estimation and subtraction.
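In search, external LM fusion and ILM subtraction typically reduce to a log-linear score combination per label; a minimal sketch (the scale values are placeholders, normally tuned on a dev set):

```python
def combined_log_prob(am_log_prob: float, lm_log_prob: float,
                      ilm_log_prob: float,
                      lm_scale: float = 0.5, ilm_scale: float = 0.4) -> float:
    """Log-linear score combination with ILM subtraction, per label:

        log p = log p_AM + lm_scale * log p_LM - ilm_scale * log p_ILM

    The scale values here are placeholders."""
    return am_log_prob + lm_scale * lm_log_prob - ilm_scale * ilm_log_prob


# Toy check: the external LM flips the decision from "a" to "b".
am = {"a": -1.0, "b": -1.2}
lm = {"a": -3.0, "b": -0.5}
ilm = {"a": -1.5, "b": -1.5}
best = max(am, key=lambda w: combined_log_prob(am[w], lm[w], ilm[w]))
print(best)  # -> "b"
```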
A current draft is here.
As examples, we should implement an attention-based encoder-decoder model and a transducer, each using an external LM with ILM estimation and subtraction.
Transformer should then also be refactored to make use of this interface.