Description
First of all, thanks for the great work and clean code!
For the purpose of training a model on the discrete codes (as opposed to just encoding and decoding a signal), the current chunked inference is not ideal. As nicely summarized by @cinjon in #35, the current implementation slices the input into chunks of roughly the requested chunk length, encodes them separately, and saves the blocks of latent codes along with the chunk length. However, concatenating the separately encoded chunks gives a different sequence of discrete codes than encoding the whole signal at once (or, more generally, than using any other chunk size). Specifically, decoding with a larger chunk size leads to repeated audio segments at the original chunk boundaries (about 5 ms per boundary with the default settings).

This means a model cannot be fed arbitrary excerpts from the discrete code sequence: to be meaningful, excerpts have to be aligned on chunk boundaries, and the model has to learn to reproduce the boundary artifacts at the expected positions. It also means I cannot jump to a specific position in the audio by simply multiplying the timestamp by the 86 Hz code rate.
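To make the mismatch concrete, here is a self-contained toy where a single strided `Conv1d` with built-in padding stands in for the encoder (the layer and sizes are illustrative, not the repo's actual model). Encoding two chunks separately and concatenating does not reproduce the single-pass result:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the convolutional encoder: one strided Conv1d
# with built-in "same"-style padding, as used per layer in the model.
enc = nn.Conv1d(1, 4, kernel_size=7, stride=2, padding=3)

x = torch.randn(1, 1, 1024)

with torch.no_grad():
    full = enc(x)                                    # single full pass
    chunked = torch.cat([enc(x[..., :512]),          # two separate chunks
                         enc(x[..., 512:])], dim=-1)

print(full.shape == chunked.shape)    # True: same number of frames
print(torch.allclose(full, chunked))  # False: frames near the chunk
                                      # boundary differ, because each chunk
                                      # saw zeros where its neighbor's
                                      # samples should have been
```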
Since the model is convolutional, it is possible to implement chunked inference that gives the same result as passing the full signal (except at the very beginning and end, since the zero-padding of hidden layers cannot be simulated by padding the input). This entails setting padding to 0 in all Conv1d layers, zero-padding the input signal / code sequence before chunking, and overlapping the chunks by the amount of padding (sketched below). The current implementation already sets padding to 0 and pads the input, but takes a different strategy: to obtain the same codes as a full pass, the input signal chunks would have to overlap by the amount they are padded with, and the code chunks would have to be padded and overlapped as well, but the decompression routine neither pads nor overlaps the codes. Instead, it relies on the input signal being padded and overlapped enough to cater for both the encoder and the decoder (i.e., the stored code chunks are already overlapped and padded for the particular chunk length used at encoding time).
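To illustrate that the equivalent-chunking scheme works, here is the single-layer toy from above reworked accordingly: the convolution's own padding is disabled, the whole signal is zero-padded once up front, and the chunks overlap so each one sees the real neighboring samples. Again, this is a sketch for one layer with a stride-aligned chunk boundary, not the repo's actual compression code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

pad = 3  # (kernel_size - 1) // 2 for kernel_size=7
enc = nn.Conv1d(1, 4, kernel_size=7, stride=2, padding=0)

x = torch.randn(1, 1, 1024)
xp = F.pad(x, (pad, pad))  # zero-pad the full signal before chunking

with torch.no_grad():
    full = enc(xp)
    # Overlap the chunks by the total padding (2 * pad samples); the chunk
    # boundary (512) must be stride-aligned so the output frames line up.
    # For the real multi-layer model, the overlap would have to cover the
    # accumulated padding / receptive field of all layers instead.
    chunked = torch.cat([enc(xp[..., :512 + 2 * pad]),
                         enc(xp[..., 512:])], dim=-1)

print(torch.allclose(full, chunked))  # True: identical to the full pass
```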