
torchaudio-contrib: Adding (some) functionals #131

Merged
merged 10 commits into from Jul 24, 2019

Conversation

ksanjeevan
Contributor

Adding functionals from torchaudio_contrib: a multichannel stft wrapper, complex_norm, angle, magphase, and phase_vocoder.

As discussed with @cpuhrsch & @jamarshon through Slack and issues, here is the PR with the functionals for phase 1. We'll definitely have to review/change things, and we agreed to do that in the context of this PR.

Some points that come to mind to discuss (and to get the ball rolling):

Scope of this first PR

  • naming convention. More thorough discussion going on in #36.
  • shape of grams: we're using (batch, channel, freq, time) which is the same as torch.stft but not torchaudio.
  • pytest vs. unittest: there have been some comments through Slack indicating that using pytest could be preferable (?).

Larger scope

  • in torchaudio_contrib we've made all the low-level transforms nn.Modules, which can then be chained with nn.Sequential to build the common transforms (e.g. calling Spectrogram(...) returns nn.Sequential(STFT(), ComplexNorm(power=1.0)); a stretched Melspectrogram is nn.Sequential(STFT(), TimeStretch(fixed_rate=1.3), ComplexNorm(power=2), ApplyFilterbank()); etc.).
  • We also register the transform parameters as buffers and exclude them from the state_dict, which maybe we should revisit.
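As an illustration of the chaining idea, here is a rough sketch. The module names follow the torchaudio_contrib ones mentioned above, but the minimal implementations below are hypothetical stand-ins, not the actual code:

```python
import torch
import torch.nn as nn

class STFT(nn.Module):
    """Hypothetical stand-in: STFT over the last dimension of a
    (batch, channel, time) input, returning (batch, channel, freq, time, 2)."""
    def __init__(self, fft_length=512, hop_length=256):
        super().__init__()
        self.fft_length = fft_length
        self.hop_length = hop_length
        # kept as a buffer so it follows .to(device) calls with the module
        self.register_buffer("window", torch.hann_window(fft_length))

    def forward(self, waveforms):
        b, c, t = waveforms.shape
        # flatten (batch, channel) into one dim, stft, then restore
        out = torch.view_as_real(
            torch.stft(waveforms.reshape(b * c, t),
                       n_fft=self.fft_length, hop_length=self.hop_length,
                       window=self.window, return_complex=True))
        return out.reshape(b, c, *out.shape[1:])

class ComplexNorm(nn.Module):
    """Norm of a (..., 2) real/imag tensor, optionally raised to a power."""
    def __init__(self, power=1.0):
        super().__init__()
        self.power = power

    def forward(self, x):
        return x.pow(2).sum(-1).pow(self.power / 2)

# A "Spectrogram" is then just a chain of low-level modules:
spectrogram = nn.Sequential(STFT(), ComplexNorm(power=1.0))
spec = spectrogram(torch.randn(2, 1, 4096))  # -> (2, 1, 257, 17)
```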

(cc @keunwoochoi, @faroit)

@@ -305,3 +305,176 @@ def mu_law_expanding(x_mu, qc):
x = ((x_mu) / mu) * 2 - 1.
x = torch.sign(x) * (torch.exp(torch.abs(x) * torch.log1p(mu)) - 1.) / mu
return x


def stft(waveforms, fft_length, hop_length=None, win_length=None, window=None,
Contributor
My understanding is this function takes waveforms of size (..., signal_len) and does stft on the last dimension to return size (..., freq, time, 2).

Is there a way to incorporate this into spectrogram? I think it would be preferable to have a single function for stft


It'd definitely be possible, and we could also remove _stft() then. But I understood this PR as not changing the previous implementation and behavior, so I'm not sure about it. (It should definitely happen after the freeze anyway.)

Contributor

What's the advantage of this over torch.stft? We're in the process of adding istft, which will be built around being a sane inverse of torch.stft. Having both torchaudio.stft and torch.stft will be confusing.


AFAIR the issue was that we didn't like the default names/values/behaviors of torch.stft. I think this also boils down to the extent to which we'd like these functionals to be less hacky in phase 1.

Contributor

@keunwoochoi - we'll need to revisit the discussion around the default names, values and behavior. We cannot add a torchaudio.stft; it'll be very confusing, and it hacks around an issue that's related to the core of pytorch and that we have the power to resolve over there.

Contributor Author

@keunwoochoi maybe initially, but at this point the default values are the same as torch.stft. I think it was more about the names and torchaudio.stft being multichannel.

Contributor

I see there is keunwoochoi/torchaudio-contrib#36 in torchaudio-contrib around this. The main things I can tell from your code are that you want batch support for the input (waveforms) and the hann_window as a default if none is specified. You're also implying a preference for the input to be named "waveforms" instead of input, as in torch core.

You're also implying the standard of having "channel" first, which is something @jamarshon is looking into as the framework-wide standard. We talked about this today and it seems the most sane, since it's ML-specific (and this is an ML library) and since it's consistent with the standards used in common operations such as conv1d.

Contributor

We talked about this separately again today and the main thing this wrapper does is add batching support.

Batching is a big topic overall. I'd prefer if we would add batching support in a separate PR and, if we add it, do so globally, since none of the other transforms have batching support either.

Once you add batching support, users will probably want support for masks etc., so then we should think about it in a more principled way, or maybe even wait for something like NestedTensor.

This also doesn't block people from using batching if they're aware of the reshape trick.

So, could I please ask you to remove this from the PR and send it again later, after we've talked more about batching (and after the standardization)?
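For reference, the "reshape trick" mentioned above is simply flattening all leading dimensions into one batch dimension before calling torch.stft and restoring them afterwards. A rough sketch (the function name and defaults are illustrative, not the PR's code):

```python
import torch

def stft_any_shape(x, n_fft=400, hop_length=200):
    """Apply torch.stft to a (..., time) tensor by flattening the leading
    dimensions into a single batch dimension (the 'reshape trick')."""
    leading, t = x.shape[:-1], x.shape[-1]
    out = torch.view_as_real(
        torch.stft(x.reshape(-1, t), n_fft=n_fft, hop_length=hop_length,
                   window=torch.hann_window(n_fft), return_complex=True))
    # restore the leading dims: (..., freq, frames, 2)
    return out.reshape(*leading, *out.shape[1:])

y = stft_any_shape(torch.randn(3, 2, 1600))  # -> (3, 2, 201, 9, 2)
```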


@jamarshon jamarshon mentioned this pull request Jul 5, 2019
@cpuhrsch
Contributor

cpuhrsch commented Jul 5, 2019

Complex number support is something we should maybe revisit separately. torch has a complex dtype, but no support for the required ops. If we treat complex as a dtype, we should dispatch on that and use it instead of doing shape-based inference (i.e. on the size of the inner dimension, as done here).

This comes down to a discussion around standards.

@cpuhrsch
Contributor

cpuhrsch commented Jul 5, 2019

cc @ezyang

We need to revisit our policy around complex numbers. It doesn't make sense to have torchaudio-specific complex number support, since complex numbers are used across all kinds of domains.

Other than that, implementing complex numbers via an inner dimension of 2 can lead to subtle errors.

For example, let's say we standardize (most likely) on channel x time. We can then have a real signal from an audio file of say 2 channels and 1000 samples. So, if we align with channel x time, we get a Tensor from torchaudio.load of 2 x 1000. If you pass this into torchaudio.stft you get 2 x 1000 x 2.

Now if someone loads an audio file from somewhere else or isn't familiar with this they might pass 1000 x 2 into stft, which then is interpreted as a complex signal of length 1000 instead of what was intended, which is a signal with 2 channels and 1000 samples.
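To make the ambiguity concrete (using torch.view_as_complex here as a stand-in for any "last dimension of size 2 means complex" convention):

```python
import torch

stereo = torch.randn(2, 1000)      # 2 channels x 1000 samples (channel x time)
flipped = stereo.t().contiguous()  # 1000 x 2: same data, transposed layout

# Under a shape-based convention, the transposed tensor is silently
# reinterpreted as 1000 complex samples instead of being rejected:
as_complex = torch.view_as_complex(flipped)
print(as_complex.shape)  # torch.Size([1000]) -- no error, just a wrong meaning
```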

So, if we emulate complex numbers using two real numbers, we should imo create a "ComplexTensor" extension, which provides support via two real tensors hidden from users. In the above case that'd then yield 1000 x 2 being interpreted as 1000 channels of 2 samples, which, while still wrong, is at least consistent and could more easily raise a warning.

But then again, this kind of extension might not be what we want given that we already have a complex dtype. There it might make more sense to write a torchaudio c++ operator that dispatches on this dtype for things like magnitude etc.

@soumith
Member

soumith commented Jul 6, 2019

I think we should start with the convention of trailing dimension size = 2, and then graduate to a Complex boxed Tensor

@keunwoochoi

we should start with the convention of trailing dimension size = 2

+1, at least for this PR. I feel like having our own complex data type is a bit of a stretch when torch.stft doesn't use it.

@cpuhrsch
Contributor

cpuhrsch commented Jul 8, 2019

@soumith - we run the risk of introducing and releasing a standard we need to revert later on. We already have the complex dtype and we also already have C++ extensions for torchaudio.

@cpuhrsch
Contributor

cpuhrsch commented Jul 8, 2019

If the currently accepted solution for complex number support is the complex dtype, we should run with that. Otherwise we're introducing a domain-specific unicorn solution with domain-specific interfaces for generic operations such as stft and its inverse. I propose we instead look into adding a complex flag (or the like) to the core stft.

I want to get a sense of the urgency and the eventual operator coverage of complex number support on your end, @keunwoochoi - could you advise?

@keunwoochoi

@cpuhrsch Alright, the main reason we had this less-neat operation was the way the current torch.stft represents complex numbers. Especially since we're implementing istft here as well, I'm sure you have a plan to cover all of these consistently :)

I don't think we'd need anything super specific. We definitely need angle and norm. Beyond that, the very basic stuff should be fine: +, -, *, / (complex-complex and complex-real) would already be on your list, I assume, and some of them are used in applying the phase vocoder in our implementation.
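Under the (..., 2) convention, the ops listed above are small to write. Roughly along the lines of the torchaudio_contrib functionals (a sketch, not the exact implementation):

```python
import torch

def angle(z):
    """Elementwise phase of a (..., 2) real/imag tensor."""
    return torch.atan2(z[..., 1], z[..., 0])

def complex_norm(z, power=1.0):
    """Elementwise magnitude of a (..., 2) tensor, raised to `power`."""
    return z.pow(2).sum(-1).pow(power / 2)

def magphase(z, power=1.0):
    """Split a (..., 2) tensor into magnitude and phase."""
    return complex_norm(z, power), angle(z)

mag, phase = magphase(torch.tensor([[1.0, 1.0]]))  # the value 1 + 1j
# mag is sqrt(2), phase is pi/4
```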

@ezyang
Contributor

ezyang commented Jul 8, 2019

cc @Roger-luo

@cpuhrsch
Contributor

cpuhrsch commented Jul 8, 2019

@keunwoochoi - @ezyang and I just spoke offline about this issue. Depending on your timeline there are two proposals here

  1. If we need complex number support quickly, we'll go with the implicit PyTorch convention of returning a Tensor with last dimension 2. torch.fft does this and so does torch.stft. This will happen with the expectation that it'll be changed in the future.

  2. If there is time, we'd want to introduce a ComplexTensor wrapper, which links into our wider story around wrappers. It'll be used in core and will also change the type torch.stft returns.

On torchaudio.stft: In general there cannot be a torchaudio.stft if there is already a torch.stft, and even torchaudio.istft will be moved into torch.istft once it's ready, because we also have a torch.ifft for torch.fft.


In summary, I suggest we run with the (... x 2) size convention for complex number support if we need this soon, and that we not add duplicate functions (stft) to torchaudio to work around an issue specific to torch.

@ksanjeevan
Contributor Author

On torchaudio.stft: In general there cannot be a torchaudio.stft if there is already a torch.stft and even torchaudio.istft will be moved into torch.istft once it's great, because we also have a torch.ifft for torch.fft.

Makes sense; we could still keep it and just change the name to multichannel_stft or something.

@cpuhrsch
Contributor

cpuhrsch commented Jul 9, 2019

@ksanjeevan - I'd say "batched_stft" or we just add these batching semantics to the core.

@keunwoochoi

Depending on your timeline

@cpuhrsch I think we were rather relying on your timeline for finishing phase 1. The plan of phases 1 and 2 (freezing, then making breaking changes) is still valid and this PR is in phase 1, right? Then I'd be inclined toward option 1.

@Roger-luo

Roger-luo commented Jul 10, 2019

I think in the short term you could just use two real tensors as a complex tensor in the Python or C++ API (though be careful: this may cause a 10x slowdown compared to a BLAS binding), since for things like complex-real matrix multiplication you might actually want to use the real BLAS anyway. If you want somewhat better support, I once made element-wise things like + and - work in the torch-complex extension for CPU, but it broke due to changes on the C++ side and caused some issues with CUDA. The ATen library has a generic implementation of broadcasting, I think, which should make this not hard to implement.

@cpuhrsch
Contributor

@keunwoochoi - yes indeed, we're currently in the process of pushing the binaries and have a release branch cut.

Option 1 sounds reasonable.

@cpuhrsch
Contributor

@ksanjeevan - Is there anything blocking you at this point for this PR?

@ksanjeevan
Contributor Author

ksanjeevan commented Jul 11, 2019

@cpuhrsch I don't think so; there are a couple of discussions that seem to be semi-ongoing, but I think I've addressed most changes.

Should I add pytest in requirements.txt?

@jamarshon
Contributor

@ksanjeevan
Yes that sounds good. Could you also do the following:

  • merge master into this PR (so we can see that it merges cleanly and to get the status of the CI)
  • add pytest to requirements.txt
  • change the CI, e.g. python -m pytest -v $FILE in test_script.sh
  • resolve the comments (here and here)

I think it should look okay after that 😄

Contributor

@cpuhrsch cpuhrsch left a comment

See inline comment

Contributor

@cpuhrsch cpuhrsch left a comment

We talked through this internally again and are happy with it :) - we'll merge this now.

Thank you so much for your contribution! torchaudio will be much better now.
