Skip to content

[BC] Standardization of Transforms/Functionals #152

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 31 commits into from
Jul 24, 2019

Conversation

jamarshon
Copy link
Contributor

@jamarshon jamarshon commented Jul 18, 2019

tl;dr Signals will be (channels, time) (c, n). spectrograms will be (channels, freq, time) (c, f, t) and various transformations will be similar format with channel as first dimension and time as last dimension e.g (c, n_mels, t) or (c, n_mfcc, t)

Removal of Legacy Transforms

-Compose: From #110, "we wouldn't need it because once things are based on Layers people can simply build a nn.Sequential()."
-Aliases and deprecated modules: SPECTROGRAM, F2M, MEL If we are going to start breaking things in this new release, then it makes sense to do as much as possible so that the future stable versions do not have technical debt
-BLC2CBL: there is no input/output to this library that uses BLC or CBL format (going to keep LC2CL mainly because all the inputs are (c, n) meaning it would be convenient for users to switch this format).

Channel x Time

The input (some outputs) of all transforms/functions would be (c, n) e.g Spectrogram, MFCC, MelSpectrogram, Resample, etc. This is to be consistent with PyTorch which has channel followed by the number of samples.
This means removing the channel dimension flags that some transforms were using which could accept either (n, c) or (c, n).

Channel x Frequency x Time

The output of stft is (c, f, t, 2) meaning for each channel, the columns are the fft of a certain window so as we travel horizontally we can see each column (fft) change over time. This matches the output of librosa so we no longer need to transpose in our test comparisions with spectrogram, melscale, and mfcc. Previously it was time x frequency.
image

DCT and Filterbank matrices

Going to keep them as they are. These matrices should be shaped so that they can be multiplied without a transpose e.g for filterbank (..., n_freqs) x (n_freqs, n_mels) meaning each column is a filterbank and DCT (..., n_mels) x (n_mels, n_mfcc) meaning each column is a cos function

Misc

Going to keep scale, padtrim, downmixmono as while they are simple, they could be operations that are performed often so it would be convenient to have the code to reduce amount of code the user has to write as well as provides documentation as the name of the transforms/functions communicates the intent clearly

@jamarshon jamarshon changed the title [WIP] [BC] Standardization of Transforms/Functionals [BC] Standardization of Transforms/Functionals Jul 18, 2019
@cpuhrsch
Copy link
Contributor

@keunwoochoi @ksanjeevan please let us know what you think. This is going to resolve a lot of discussion :)

@ksanjeevan
Copy link
Contributor

@cpuhrsch on the dimension side all great 👍. There still will be much to discuss on some variable/transform names I guess.

@keunwoochoi
Copy link

Great! All the dimension configs are aligned with what we've discussed in the contrib repo :)

If the var names were an issue, we reached at this conclusion if you wanna borrow some idea @cpuhrsch @jamarshon. They're probably slightly more verbose than usual, and that's what we wanted so that even beginners in either Pytorch or audio processing would get it only by reading the code.

Copy link
Contributor

@cpuhrsch cpuhrsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamarshon please look into the mentioned standardization of input flags and add a discussion / changes to this PR.

@jamarshon
Copy link
Contributor Author

jamarshon commented Jul 22, 2019

In terms of variable name standardization, some concepts were taken from keunwoochoi/torchaudio-contrib#36 (comment), mainly:

waveform: a tensor of audio samples
sample_rate: the rate of audio samples (samples per second)
specgram: a tensor of spectrogram
mel_specgram: a mel spectrogram
hop_length: the number of samples between the starts of consecutive frames
n_freqs: the number of bins in a linear spectrogram (i.e., n_freqs == n_fft // 2 + 1)
min_freq: the lowest frequency of the lowest band in a mel/CQT spectrogram
max_freq: the highest frequency of the highest band in a mel/CQT spectrogram

Other variable names
n_fft: the number of fourier bins (matches torch.stft naming)
win_length: the length of the STFT window (matches torch.stft naming)
n_mfcc, n_mels: just to be consistent with other n_* variables
window_fn: for functions that creates windows e.g. torch.hann_window. This is because torch.stft considers window to be a tensor so having window being a function as well would be confusing so instead renaming it to be window_fn to make it clear that it produces window

Currently there is no complex tensors anywhere so no complex_specgram. Also no batching so no waveforms, specgrams, etc.

@jamarshon
Copy link
Contributor Author

@vincentqb @keunwoochoi could you take a look and let me know what you think?

@keunwoochoi
Copy link

Looks good to me. The things mentioned in the original thread and the comment seem aligned to what we had discussed :)
Curious though - what do you mean by no batching?

@cpuhrsch
Copy link
Contributor

@keunwoochoi - see this comment for a wider discussion around this topic.

@jamarshon
Copy link
Contributor Author

#131 (comment)
None of the current transforms/functional receive batched inputs/outputs. I think the goal is to have no batch support in this version and then add batching in the next version

@vincentqb
Copy link
Contributor

I'm happy with the convention, and the code looks good to me. Thanks for updating the code and the tests.

Copy link
Contributor

@vincentqb vincentqb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#131 got merged and requires some cleaning

  • remove stft with batching until we decide to support batching everywhere
  • standardize naming to the convention agreed

@vincentqb
Copy link
Contributor

If we are making the code leaner, should we only have DBScale/MelScale transforms without having the SpectrogramDB/MelSpectrogram if they are easy to chain?

@vincentqb
Copy link
Contributor

If we are making the code leaner, should we only have DBScale/MelScale transforms without having the SpectrogramDB/MelSpectrogram if they are easy to chain?

If we do this, we'll do it as part of a separate PR.

Copy link
Contributor

@vincentqb vincentqb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@vincentqb
Copy link
Contributor

vincentqb commented Jul 24, 2019

Based on the work done here, here are things we can consider for a later PR:

  • Remove SpectrogramToDB, and create a DBScale transformation like so.
  • Remove MelSpectrogram and keep only MelScale? (comment in favor of keeping)
  • Remove PadTrim.
  • Introduce Padding-only transformation, like so or so. [already in pytorch]
  • Rename MuLawExpanding to MuLawDecoding.

'reflect',
])
@unittest.skipIf(not IMPORT_LIBROSA, 'Librosa is not available')
def test_stft(waveform, fft_length, hop_length, pad_mode):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why might want to bring this back / add it to core later on for the regular torch.stft

@cpuhrsch cpuhrsch merged commit b29a463 into pytorch:master Jul 24, 2019
@jamarshon jamarshon deleted the standardization branch July 24, 2019 17:58
@jamarshon
Copy link
Contributor Author

jamarshon commented Jul 24, 2019

Based on the work done here, here are things we can consider for a later PR:

  • Remove SpectrogramToDB, and create a DBScale transformation like so.
  • Remove MelSpectrogram and keep only MelScale.
  • Remove PadTrim.
  • Introduce Padding-only transformation, like so.
  • Rename MuLawExpanding to MuLawDecoding.

Rename MuLawExpanding to MuLawDecoding. #159

@keunwoochoi
Copy link

awesome!!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants