Description
Hello, I would like to use a model's codebook to label a dataset of audio files in chunks. The first problem I'm running into is finding the model's actual stride, that is, how much audio is covered by a single entry.

So I tried the following. First, on the 24 kHz model, I divided `model.hop_length` by 24000 and got 0.013333333333333334 (which should be a value in seconds), but that doesn't make much sense to me, as it's not that close to a round number.

Then I labeled an audio file 79.55990929705216 seconds long. In the resulting object, `chunk_length` is 72, so I divided the number of resulting representations (`result.codes.shape[-1]`), which is 8928, by 72, and got 124.0. Dividing the total length in seconds by this value gives 0.6416121717504206 seconds, which is not even close to my first value, even after trying to adjust the numbers to account for some sort of padding.

This might be a stupid question, but I think it's worth asking to save me and everyone else some time, as I'm quite lost. I also tried looking through the very clear source code and the paper, but I think I'm missing something obvious. Thanks!

P.S. If necessary, please include verbal descriptions of any images you post in your comments, as I am totally blind.
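
For reference, here is the arithmetic above as a small self-contained script. The hop length of 320 is inferred from the first result (0.013333… × 24000 = 320); every other number is copied from my run, so no model is needed to reproduce it:

```python
# Reproducing the numbers from the description above; no model required.
# hop_length = 320 is inferred from model.hop_length / 24000 == 0.01333...
sample_rate = 24_000
hop_length = 320

frame_seconds = hop_length / sample_rate
print(frame_seconds)  # 0.013333333333333334 s per code frame

audio_seconds = 79.55990929705216  # length of my test file
n_frames = 8928                    # result.codes.shape[-1]
chunk_length = 72                  # reported on the resulting object

n_chunks = n_frames / chunk_length
print(n_chunks)  # 124.0

print(audio_seconds / n_chunks)      # 0.6416121717504206 s per chunk, measured
print(chunk_length * frame_seconds)  # 0.96 s per chunk, going by hop_length
print(n_frames * frame_seconds)      # 119.04 s covered by 8928 frames,
                                     # vs 79.56 s of actual audio -- this gap
                                     # is what I can't account for
```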