
torchaudio-contrib: Adding (some) functionals #131

Merged
merged 10 commits into from Jul 24, 2019

Conversation

ksanjeevan
Contributor

Adding functionals from torchaudio_contrib: a multichannel stft wrapper, complex_norm, angle, magphase, and phase_vocoder.

As discussed with @cpuhrsch & @jamarshon through Slack and issues, here is the PR with the functionals for phase 1. We'll definitely have to review/change things, and we agreed to do that in the context of this PR.

Some points that come to mind to discuss (and to get the ball rolling):

Scope of this first PR

  • naming convention. More thorough discussion going on in #36.
  • shape of grams: we're using (batch, channel, freq, time) which is the same as torch.stft but not torchaudio.
  • pytest vs. unittest: there have been some comments through Slack indicating that using pytest could be preferable (?).

Larger scope

  • in torchaudio_contrib we've made all the low-level transforms nn.Modules, which can then be chained with nn.Sequential to build the common transforms (e.g. calling Spectrogram(...) returns nn.Sequential(STFT(), ComplexNorm(power=1.0)); a stretched Melspectrogram is nn.Sequential(STFT(), TimeStretch(fixed_rate=1.3), ComplexNorm(power=2), ApplyFilterbank()); etc.).
  • We also register the transform parameters as buffers and exclude them from the state_dict, which maybe we should revisit.
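As an illustration of the chaining idea, here is a rough sketch. The module names follow the torchaudio_contrib ones mentioned above, but the minimal implementations below are hypothetical stand-ins, not the actual code:

```python
import torch
import torch.nn as nn

class STFT(nn.Module):
    """Hypothetical stand-in: STFT over the last dimension of a
    (batch, channel, time) input, returning (batch, channel, freq, time, 2)."""
    def __init__(self, fft_length=512, hop_length=256):
        super().__init__()
        self.fft_length = fft_length
        self.hop_length = hop_length
        # kept as a buffer so it follows .to(device) calls with the module
        self.register_buffer("window", torch.hann_window(fft_length))

    def forward(self, waveforms):
        b, c, t = waveforms.shape
        # flatten (batch, channel) into one dim, stft, then restore
        out = torch.view_as_real(
            torch.stft(waveforms.reshape(b * c, t),
                       n_fft=self.fft_length, hop_length=self.hop_length,
                       window=self.window, return_complex=True))
        return out.reshape(b, c, *out.shape[1:])

class ComplexNorm(nn.Module):
    """Norm of a (..., 2) real/imag tensor, optionally raised to a power."""
    def __init__(self, power=1.0):
        super().__init__()
        self.power = power

    def forward(self, x):
        return x.pow(2).sum(-1).pow(self.power / 2)

# A "Spectrogram" is then just a chain of low-level modules:
spectrogram = nn.Sequential(STFT(), ComplexNorm(power=1.0))
spec = spectrogram(torch.randn(2, 1, 4096))  # -> (2, 1, 257, 17)
```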

(cc @keunwoochoi, @faroit)

@@ -305,3 +305,176 @@ def mu_law_expanding(x_mu, qc):
x = ((x_mu) / mu) * 2 - 1.
x = torch.sign(x) * (torch.exp(torch.abs(x) * torch.log1p(mu)) - 1.) / mu
return x


def stft(waveforms, fft_length, hop_length=None, win_length=None, window=None,
Contributor
My understanding is this function takes waveforms of size (..., signal_len) and does stft on the last dimension to return size (..., freq, time, 2).

Is there a way to incorporate this into spectrogram? I think it would be preferable to have a single function for stft


It'd definitely be possible, and we could also remove _stft() then. But I understood this PR as not changing the previous implementation and behavior, so I'm not sure about it. (It should definitely happen after the freeze anyway.)

Contributor

What's the advantage of this over torch.stft? We're in the process of adding istft, which will be built around being a sane inverse of torch.stft. Having both torchaudio.stft and torch.stft will be confusing.


AFAIR the issue was that we didn't like the default names/values/behaviors of torch.stft. I think this also boils down to the extent to which we'd like these functionals to be less hacky in phase 1.

Contributor

@keunwoochoi - we'll need to revisit the discussion around the default names, values and behavior. We cannot add a torchaudio.stft; it'll be very confusing, and it hacks around an issue that's related to the core of pytorch and that we have the power to resolve over there.

Contributor Author

@keunwoochoi maybe initially, but at this point the default values are the same as torch.stft. I think it was more about the names and torchaudio.stft being multichannel.

Contributor

I see there is keunwoochoi/torchaudio-contrib#36 in torchaudio-contrib around this. The main things I can tell from your code are that you want batch support for the input (waveforms) and the hann_window as a default if none is specified. You're also implying a preference for the input to be named "waveforms" instead of input, as in torch core.

You're also implying the standard of having "channel" first, which is something @jamarshon is looking into as the framework-wide standard. We talked about this today and it seems the most sane, since it's ML-specific (and this is an ML library) and since it's consistent with the standards used in common operations such as conv1d.

Contributor

We talked about this separately again today and the main thing this wrapper does is add batching support.

Batching is a big topic overall. I'd prefer if we would add batching support in a separate PR and, if we add it, do so globally, since none of the other transforms have batching support either.

Once you add batching support, users will probably want support for masks etc., so then we should think about it in a more principled way, or maybe even wait for something like NestedTensor.

This also doesn't block people from using batching if they're aware of the reshape trick.

So, could I please ask you to remove this from the PR and send it again later, after we've talked more about batching (and after the standardization)?
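For reference, the "reshape trick" mentioned above is simply flattening all leading dimensions into one batch dimension before calling torch.stft and restoring them afterwards. A rough sketch (the function name and defaults are illustrative, not the PR's code):

```python
import torch

def stft_any_shape(x, n_fft=400, hop_length=200):
    """Apply torch.stft to a (..., time) tensor by flattening the leading
    dimensions into a single batch dimension (the 'reshape trick')."""
    leading, t = x.shape[:-1], x.shape[-1]
    out = torch.view_as_real(
        torch.stft(x.reshape(-1, t), n_fft=n_fft, hop_length=hop_length,
                   window=torch.hann_window(n_fft), return_complex=True))
    # restore the leading dims: (..., freq, frames, 2)
    return out.reshape(*leading, *out.shape[1:])

y = stft_any_shape(torch.randn(3, 2, 1600))  # -> (3, 2, 201, 9, 2)
```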


@jamarshon jamarshon mentioned this pull request Jul 5, 2019
@cpuhrsch
Contributor

cpuhrsch commented Jul 5, 2019

Complex number support is something we should maybe revisit separately. torch has a complex dtype, but no support for the required ops. If we treat complex as a dtype, we should dispatch on that and use it instead of doing shape-based inference (i.e. on the size of the inner dimension, as done here).

This comes down to a discussion around standards.

@cpuhrsch
Contributor

cpuhrsch commented Jul 5, 2019

cc @ezyang

We need to revisit our policy around complex numbers. It doesn't make sense to have torchaudio-specific complex number support, since complex numbers are used across all kinds of domains.

Other than that, implementing complex numbers via an inner dimension of 2 can lead to subtle errors.

For example, let's say we standardize (most likely) on channel x time. We can then have a real signal from an audio file of say 2 channels and 1000 samples. So, if we align with channel x time, we get a Tensor from torchaudio.load of 2 x 1000. If you pass this into torchaudio.stft you get 2 x 1000 x 2.

Now if someone loads an audio file from somewhere else or isn't familiar with this they might pass 1000 x 2 into stft, which then is interpreted as a complex signal of length 1000 instead of what was intended, which is a signal with 2 channels and 1000 samples.
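To make the ambiguity concrete (using torch.view_as_complex here as a stand-in for any "last dimension of size 2 means complex" convention):

```python
import torch

stereo = torch.randn(2, 1000)      # 2 channels x 1000 samples (channel x time)
flipped = stereo.t().contiguous()  # 1000 x 2: same data, transposed layout

# Under a shape-based convention, the transposed tensor is silently
# reinterpreted as 1000 complex samples instead of being rejected:
as_complex = torch.view_as_complex(flipped)
print(as_complex.shape)  # torch.Size([1000]) -- no error, just a wrong meaning
```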

So, if we emulate complex numbers using two real numbers, we should imo create a "ComplexTensor" extension, which provides support via two real tensors hidden from users. In the above case that'd then yield 1000 x 2 being interpreted as 1000 channels of 2 samples, which, while still wrong, is at least consistent and could more easily raise a warning.

But then again, this kind of extension might not be what we want given that we already have a complex dtype. There it might make more sense to write a torchaudio c++ operator that dispatches on this dtype for things like magnitude etc.

@soumith
Member

soumith commented Jul 6, 2019

I think we should start with the convention of trailing dimension size = 2, and then graduate to a Complex boxed Tensor

@keunwoochoi

we should start with the convention of trailing dimension size = 2

+1, at least for this PR. I feel like having our own complex data type is a bit of a stretch when torch.stft doesn't use it.

@cpuhrsch
Contributor

cpuhrsch commented Jul 8, 2019

@soumith - we run the risk of introducing and releasing a standard we need to revert later on. We already have the complex dtype and we also already have C++ extensions for torchaudio.

@cpuhrsch
Contributor

cpuhrsch commented Jul 8, 2019

If the currently accepted solution for complex number support is the complex dtype, we should run with that. Otherwise we're introducing a domain-specific unicorn solution with domain-specific interfaces for generic operations such as stft and its inverse. I propose we instead look into adding a complex flag (or the like) to the core stft.

I want to get a sense of the urgency and the eventual operator coverage of complex number support on your end, @keunwoochoi - could you advise?

@keunwoochoi

@cpuhrsch Alright, the main reason we had this less-neat operation was the way the current torch.stft represents complex numbers. Especially since we're implementing istft here as well, I'm sure you have a plan to cover all of these consistently :)

I don't think we'd need anything super specific. We definitely need angle and norm. Beyond that, the very basic stuff should be fine: +, -, *, / (complex-complex and complex-real) would already be on your list, I assume, and some of them are used in applying the phase vocoder in our implementation.
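Under the (..., 2) convention, the ops listed above are small to write. Roughly along the lines of the torchaudio_contrib functionals (a sketch, not the exact implementation):

```python
import torch

def angle(z):
    """Elementwise phase of a (..., 2) real/imag tensor."""
    return torch.atan2(z[..., 1], z[..., 0])

def complex_norm(z, power=1.0):
    """Elementwise magnitude of a (..., 2) tensor, raised to `power`."""
    return z.pow(2).sum(-1).pow(power / 2)

def magphase(z, power=1.0):
    """Split a (..., 2) tensor into magnitude and phase."""
    return complex_norm(z, power), angle(z)

mag, phase = magphase(torch.tensor([[1.0, 1.0]]))  # the value 1 + 1j
# mag is sqrt(2), phase is pi/4
```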

@ezyang
Contributor

ezyang commented Jul 8, 2019

cc @Roger-luo

@cpuhrsch
Contributor

cpuhrsch commented Jul 8, 2019

@keunwoochoi - @ezyang and I just spoke offline about this issue. Depending on your timeline there are two proposals here

  1. If we need complex number support quickly, we'll go with the implicit PyTorch convention of returning a Tensor with last dimension 2. torch.fft does this and so does torch.stft. This will happen with the expectation that it'll be changed in the future.

  2. If there is time, we'd want to introduce a ComplexTensor wrapper, which links into our wider story around wrappers. It'll be used in core and will also change the type torch.stft returns.

On torchaudio.stft: In general there cannot be a torchaudio.stft if there is already a torch.stft, and even torchaudio.istft will be moved into torch.istft once it's ready, because we also have a torch.ifft for torch.fft.


In summary, I suggest we run with the (... x 2) size convention for complex number support if we need this soon, and that we not add duplicate functions (stft) to torchaudio to work around an issue specific to torch.

@ksanjeevan
Contributor Author

On torchaudio.stft: In general there cannot be a torchaudio.stft if there is already a torch.stft and even torchaudio.istft will be moved into torch.istft once it's great, because we also have a torch.ifft for torch.fft.

Makes sense; we could still keep it and just change the name to multichannel_stft or something.

@cpuhrsch
Contributor

cpuhrsch commented Jul 9, 2019

@ksanjeevan - I'd say "batched_stft" or we just add these batching semantics to the core.

@keunwoochoi

Depending on your timeline

@cpuhrsch I think we were rather relying on your timeline for finishing phase 1. The plan of phases 1 and 2 (freezing, then making breaking changes) is still valid and this PR is in phase 1, right? Then I'd be inclined toward option 1.

@Roger-luo

Roger-luo commented Jul 10, 2019

I think in the short term you could just use two real tensors as a complex tensor in the Python or C++ API (though be careful: this may cause a 10x slowdown compared to a BLAS binding), since for things like complex-real matrix multiplication you might actually want to use the real BLAS anyway. If you want somewhat better support, I once made element-wise things like + and - work in the torch-complex extension for CPU, but it broke due to changes on the C++ side and caused some issues with CUDA. The ATen library has a generic implementation of broadcasting, I think, which should make this not hard to implement.

@cpuhrsch
Contributor

@keunwoochoi - yes indeed, we're currently in the process of pushing the binaries and have a release branch cut.

Option 1 sounds reasonable.

@cpuhrsch
Contributor

@ksanjeevan - Is there anything blocking you at this point for this PR?

@ksanjeevan
Contributor Author

ksanjeevan commented Jul 11, 2019

@cpuhrsch I don't think so; there are a couple of discussions that seem to be semi-ongoing, but I think I've addressed most changes.

Should I add pytest in requirements.txt?

@jamarshon
Contributor

@ksanjeevan
Yes that sounds good. Could you also do the following:

  • merge master into this PR (so we can see that it merges cleanly and to get the status of the CI)
  • add pytest to requirements.txt
  • change the CI, e.g. python -m pytest -v $FILE in test_script.sh
  • resolve the comments (here and here)

I think it should look okay after that 😄

Contributor

@cpuhrsch cpuhrsch left a comment

See inline comment

Contributor

@cpuhrsch cpuhrsch left a comment

We talked through this internally again and are happy with it :) - we'll merge this now.

Thank you so much for your contribution! torchaudio will be much better now.
