torchaudio-contrib: Adding (some) functionals #131
Conversation
```diff
@@ -305,3 +305,176 @@ def mu_law_expanding(x_mu, qc):
     x = ((x_mu) / mu) * 2 - 1.
     x = torch.sign(x) * (torch.exp(torch.abs(x) * torch.log1p(mu)) - 1.) / mu
     return x
+
+
+def stft(waveforms, fft_length, hop_length=None, win_length=None, window=None,
```
My understanding is this function takes waveforms of size (..., signal_len) and does stft on the last dimension to return size (..., freq, time, 2).
Is there a way to incorporate this into spectrogram? I think it would be preferable to have a single function for stft.
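For context, a minimal sketch of the shape handling described here (the hann-window default discussed later in the thread is omitted; `return_complex`/`view_as_real` belong to the current `torch.stft` API, which postdates this PR):

```python
import torch

def stft(waveforms, fft_length, hop_length=None, win_length=None, window=None):
    """Multichannel STFT sketch: accepts (..., time), returns (..., freq, time, 2)."""
    leading_dims = waveforms.shape[:-1]
    # Flatten every leading dim into one batch dim, since torch.stft
    # only accepts 1D or 2D input.
    flat = waveforms.reshape(-1, waveforms.shape[-1])
    spec = torch.stft(flat, n_fft=fft_length, hop_length=hop_length,
                      win_length=win_length, window=window,
                      return_complex=True)
    spec = torch.view_as_real(spec)  # (batch, freq, time, 2)
    return spec.reshape(*leading_dims, *spec.shape[1:])
```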
It'd definitely be possible, and we could also remove _stft() then. But I understood this PR to not be changing the previous implementation and behavior, hence I'm not sure about it. (It should definitely happen after the freezing anyway.)
What's the advantage of this over torch.stft? We're in the process of adding istft, which will be built around being a sane inverse of torch.stft. Having both torchaudio.stft and torch.stft will be confusing.
AFAIR the issue was that we didn't like the default names/values/behaviors of torch.stft. I think this also boils down to the extent to which we'd like these functionals to be less hacky in phase 1.
@keunwoochoi - we'll need to revisit the discussion around the default names, values and behavior. We cannot add a torchaudio.stft; it would be very confusing, and it hacks around an issue that's related to the core of pytorch and that we have the power to resolve over there.
@keunwoochoi maybe initially, but at this point the default values are the same as torch.stft. I think it was more about the names and torchaudio.stft being multichannel.
I see there is keunwoochoi/torchaudio-contrib#36 in torchaudio-contrib around this. The main things I can tell from your code are that you want batch support for the input (waveforms) and a hann window as the default if none is specified. You're also implying a preference for naming the input "waveforms" instead of "input" as in torch core.
You're also implying a channel-first standard, which is something @jamarshon is looking into as the framework-wide standard. We talked about this today and it seems the most sane choice, since it's ML-specific (and this is an ML library) and since it's consistent with the standards used in common operations such as conv1d.
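To make the conv1d consistency argument concrete (the shapes below are just an illustration):

```python
import torch
import torch.nn.functional as F

# conv1d expects (batch, channel, time), so a channel-first audio tensor
# slots straight in without a transpose.
x = torch.randn(8, 2, 16000)        # 8 stereo clips, 16000 samples each
weight = torch.randn(4, 2, 9)       # 4 out-channels, 2 in-channels, kernel 9
y = F.conv1d(x, weight, padding=4)  # -> (8, 4, 16000)
```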
cc @ksanjeevan
We talked about this separately again today, and the main thing this wrapper does is add batching support.
Batching is a big topic overall. I'd prefer that we add batching support in a separate PR and, if we add it, do so globally, since none of the other transforms have batching support either.
Once we add batching support, users will probably want support for masks etc., so we should think about it in a more principled way, or maybe even wait for something like NestedTensor.
This also doesn't block people from using batching if they're aware of the reshape trick (sketched below).
So, could I please ask you to remove this from the PR and send it again later, after we've talked more about batching (and after the standardization)?
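The reshape trick mentioned above, spelled out (`return_complex` is the modern `torch.stft` API, assumed here):

```python
import torch

# Fold the leading dims into torch.stft's single batch dim, then restore them.
x = torch.randn(8, 2, 16000)                   # (batch, channel, time)
spec = torch.stft(x.reshape(-1, x.shape[-1]),  # (batch * channel, time)
                  n_fft=512, return_complex=True)
spec = spec.reshape(8, 2, *spec.shape[1:])     # (batch, channel, freq, frames)
```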
cc @ksanjeevan
Complex number support is something we should maybe revisit separately. torch has a complex dtype, but no support for the ops used here. If we treat complex as a dtype, we should dispatch on that and use it instead of doing shape-based inference (i.e. on the size of the inner dimension, as done here). This comes down to a discussion around standards.
cc @ezyang We need to revisit our policy around complex numbers. It doesn't make sense to have torchaudio-specific complex number support, since complex numbers are used across all kinds of domains.

Other than that, implementing complex numbers via an inner dimension of 2 can lead to subtle errors. For example, say we standardize (most likely) on channel x time, and we have a real signal from an audio file with, say, 2 channels and 1000 samples. Aligned with channel x time, torchaudio.load gives us a Tensor of 2 x 1000, and passing it into torchaudio.stft gives 2 x 1000 x 2. Now if someone loads an audio file from somewhere else, or isn't familiar with this, they might pass 1000 x 2 into stft, which is then interpreted as a complex signal of length 1000 instead of what was intended: a signal with 2 channels and 1000 samples.

So, if we emulate complex numbers using two real numbers, we should imo create a "ComplexTensor" extension, which provides support via two real tensors hidden from users. In the above case that'd yield 1000 x 2 being interpreted as 1000 channels of 2 samples, which, while still wrong, is at least consistent and could more easily raise a warning. But then again, this kind of extension might not be what we want given that we already have a complex dtype. There it might make more sense to write a torchaudio C++ operator that dispatches on this dtype for things like magnitude etc.
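A small, self-contained illustration of the ambiguity described above (`torch.view_as_complex` stands in for the shape-based inference; it is a later API, used here only to demonstrate the failure mode):

```python
import torch

stereo = torch.randn(2, 1000)        # channel x time: 2 channels, 1000 samples
ambiguous = stereo.t().contiguous()  # 1000 x 2: the same audio, transposed

# Shape-based inference reads a trailing dim of 2 as (real, imag), so the
# transposed signal is silently reinterpreted as 1000 complex samples --
# no error is raised, the results are just wrong.
as_complex = torch.view_as_complex(ambiguous)
print(as_complex.shape)  # torch.Size([1000])
```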
I think we should start with the (..., 2) convention.
+1, at least for this PR. I feel like having our own complex data type is a bit of a stretch when torch already has a complex dtype.
@soumith - we run the risk of introducing and releasing a standard we need to revert later on. We already have the complex dtype and we also already have C++ extensions for torchaudio.
If the currently accepted solution to complex number support is the complex dtype, then we should run with that. Otherwise we're introducing a domain-specific unicorn solution with domain-specific interfaces to generic operations such as stft and its inverse. I propose we instead look into adding a complex flag (or such) to the core stft. I want to get a sense of the urgency and eventual operator coverage of complex number support. @keunwoochoi, on your end, could you advise?
@cpuhrsch Alright, the main reason we had this less-neat operation was because of the way the current torch.stft works. I don't think we'd need anything super specific. We definitely need …
cc @Roger-luo
@keunwoochoi - @ezyang and I just spoke offline about this issue. Depending on your timeline there are two proposals here: …
On torchaudio.stft: in general there cannot be a torchaudio.stft if there is already a torch.stft, and even torchaudio.istft will be moved into torch.istft once it's great, because we also have a torch.ifft for torch.fft.

In summary, I suggest we run with the size (... x 2) convention for complex number support, if we need to have this soon, and not add duplicate functions (stft) to torchaudio to work around an issue specific to torch.
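Under the (... x 2) convention, the relevant functionals reduce to simple real-tensor ops; a sketch (the names match the functionals this PR adds, but the bodies here are illustrative, not necessarily the PR's exact code):

```python
import torch

def complex_norm(spec, power=1.0):
    # spec: (..., 2) real tensor holding (real, imag) pairs in the last dim
    return spec.pow(2).sum(-1).pow(power / 2.0)

def angle(spec):
    # element-wise phase angle of the (real, imag) pairs
    return torch.atan2(spec[..., 1], spec[..., 0])
```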
Makes sense; we still keep it and change the name to …?
@ksanjeevan - I'd say "batched_stft" or we just add these batching semantics to the core.
@cpuhrsch I think we were rather relying on your timeline for finishing Phase 1. The plan of phases 1 and 2 (freezing, then making breaking changes) is still valid and this PR is in Phase 1, right? Then I'd be inclined toward option 1.
I think in the short term you could just use two real tensors as a complex tensor in the Python or C++ API (tho be careful, this may cause a 10x slowdown compared to a BLAS binding); for things like complex-real matrix multiplication you might actually want to use the real BLAS anyway. And if you want a bit better support, I think I once made element-wise things like …
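For the matrix-multiplication case in particular, the two-real-tensors approach expands into four real matmuls so every call goes through the real BLAS path; a sketch (the function name is hypothetical):

```python
import torch

def complex_mm(a_re, a_im, b_re, b_im):
    # (A_re + i*A_im) @ (B_re + i*B_im), expanded into four real matmuls.
    out_re = a_re @ b_re - a_im @ b_im
    out_im = a_re @ b_im + a_im @ b_re
    return out_re, out_im
```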
@keunwoochoi - yes indeed, we're currently in the process of pushing the binaries and have a release branch cut. Option 1 sounds reasonable.
@ksanjeevan - Is there anything blocking you at this point for this PR?
@cpuhrsch don't think so, there are a couple of discussions that seem to be semi-ongoing, but I think I've addressed most changes. Should I add …?
@ksanjeevan I think it should look okay after that 😄
See inline comment
we talked through this internally again and are happy with it :) - We'll merge this now.
Thank you so much for your contribution! torchaudio will be much better now.
Adding functionals

stft multichannel wrapper, complex_norm, angle, magphase and phase_vocoder from torchaudio_contrib.

As discussed with @cpuhrsch & @jamarshon through slack and issues, here is the PR corresponding to the functionals for phase 1. We'll definitely have to review/change things, and we said to do it in the context of this PR.

Some points that come to mind to discuss (and get the ball rolling):

Scope of this first PR
- (batch, channel, freq, time), which is the same as torch.stft but not torchaudio.
- pytest vs. unittest: there have been some comments through slack indicating that using pytest could be preferable (?).

Larger scope
- In torchaudio_contrib we've made all the low-level transforms nn.Modules, to then be chained with nn.Sequential to make common transforms (e.g. calling Spectrogram(...) returns an nn.Sequential(STFT(), ComplexNorm(power=1.0)), a stretched Melspectrogram is nn.Sequential(STFT(), TimeStretch(fixed_rate=1.3), ComplexNorm(power=2), ApplyFilterbank()), etc.); a sketch of this chaining pattern is at the end of this description.
- state_dict, which maybe we should revisit.

(cc @keunwoochoi, @faroit)
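To illustrate the chaining pattern from the larger-scope bullet, a minimal sketch (the class bodies are stand-ins, not torchaudio_contrib's actual code, and the parameters are trimmed down):

```python
import torch
from torch import nn

class STFT(nn.Module):
    def __init__(self, fft_length=512):
        super().__init__()
        self.fft_length = fft_length

    def forward(self, waveforms):
        spec = torch.stft(waveforms, n_fft=self.fft_length, return_complex=True)
        return torch.view_as_real(spec)  # (..., freq, time, 2)

class ComplexNorm(nn.Module):
    def __init__(self, power=1.0):
        super().__init__()
        self.power = power

    def forward(self, spec):
        return spec.pow(2).sum(-1).pow(self.power / 2.0)

def Spectrogram(fft_length=512, power=1.0):
    # Common transforms are just nn.Sequential chains of low-level modules.
    return nn.Sequential(STFT(fft_length), ComplexNorm(power))

spectrogram = Spectrogram()
mag = spectrogram(torch.randn(16000))  # (freq, time)
```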