-
Notifications
You must be signed in to change notification settings - Fork 696
[BC] Standardization of Transforms/Functionals #152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@keunwoochoi @ksanjeevan please let us know what you think. This is going to resolve a lot of discussion :) |
@cpuhrsch on the dimension side all great 👍. There still will be much to discuss on some variable/transform names I guess. |
Great! All the dimension configs are aligned with what we've discussed in the If the var names were an issue, we reached at this conclusion if you wanna borrow some idea @cpuhrsch @jamarshon. They're probably slightly more verbose than usual, and that's what we wanted so that even beginners in either Pytorch or audio processing would get it only by reading the code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamarshon please look into the mentioned standardization of input flags and add a discussion / changes to this PR.
In terms of variable name standardization, some concepts were taken from keunwoochoi/torchaudio-contrib#36 (comment), mainly:
Other variable names Currently there is no complex tensors anywhere so no |
@vincentqb @keunwoochoi could you take a look and let me know what you think? |
Looks good to me. The things mentioned in the original thread and the comment seem aligned to what we had discussed :) |
@keunwoochoi - see this comment for a wider discussion around this topic. |
#131 (comment) |
I'm happy with the convention, and the code looks good to me. Thanks for updating the code and the tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#131 got merged and requires some cleaning
- remove stft with batching until we decide to support batching everywhere
- standardize naming to the convention agreed
If we are making the code leaner, should we only have DBScale/MelScale transforms without having the SpectrogramDB/MelSpectrogram if they are easy to chain? |
If we do this, we'll do it as part of a separate PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
Based on the work done here, here are things we can consider for a later PR:
|
'reflect', | ||
]) | ||
@unittest.skipIf(not IMPORT_LIBROSA, 'Librosa is not available') | ||
def test_stft(waveform, fft_length, hop_length, pad_mode): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why might want to bring this back / add it to core later on for the regular torch.stft
Rename MuLawExpanding to MuLawDecoding. #159 |
awesome!!! |
tl;dr Signals will be (channels, time) (c, n). spectrograms will be (channels, freq, time) (c, f, t) and various transformations will be similar format with channel as first dimension and time as last dimension e.g (c, n_mels, t) or (c, n_mfcc, t)
Removal of Legacy Transforms
-Compose: From #110, "we wouldn't need it because once things are based on Layers people can simply build a nn.Sequential()."
-Aliases and deprecated modules: SPECTROGRAM, F2M, MEL If we are going to start breaking things in this new release, then it makes sense to do as much as possible so that the future stable versions do not have technical debt
-BLC2CBL: there is no input/output to this library that uses BLC or CBL format (going to keep LC2CL mainly because all the inputs are (c, n) meaning it would be convenient for users to switch this format).
Channel x Time
The input (some outputs) of all transforms/functions would be (c, n) e.g Spectrogram, MFCC, MelSpectrogram, Resample, etc. This is to be consistent with PyTorch which has channel followed by the number of samples.
This means removing the channel dimension flags that some transforms were using which could accept either (n, c) or (c, n).
Channel x Frequency x Time
The output of stft is (c, f, t, 2) meaning for each channel, the columns are the fft of a certain window so as we travel horizontally we can see each column (fft) change over time. This matches the output of librosa so we no longer need to transpose in our test comparisions with spectrogram, melscale, and mfcc. Previously it was time x frequency.

DCT and Filterbank matrices
Going to keep them as they are. These matrices should be shaped so that they can be multiplied without a transpose e.g for filterbank (..., n_freqs) x (n_freqs, n_mels) meaning each column is a filterbank and DCT (..., n_mels) x (n_mels, n_mfcc) meaning each column is a cos function
Misc
Going to keep scale, padtrim, downmixmono as while they are simple, they could be operations that are performed often so it would be convenient to have the code to reduce amount of code the user has to write as well as provides documentation as the name of the transforms/functions communicates the intent clearly