Hi all,
I think it's good timing to discuss a potential plan for merging torchaudio-contrib into this repo, especially because new features and changes are coming from @jamarshon and @cpuhrsch.
Main idea
A lot of things are well summarized in https://github.com/keunwoochoi/torchaudio-contrib. In short, we wanted to re-design torch-based audio processing so that:
- things can be `Layers`, which are based on corresponding `Functional`s (see the sketch after this list)
- names for layers and arguments are carefully chosen
- everything works for multi-channel input
- complex numbers are supported where it makes sense (e.g., STFTs)
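To make the `Layer`-on-`Functional` split concrete, here is a minimal sketch of the pattern. The names, defaults, and shapes are illustrative only (not the actual torchaudio-contrib API), and it assumes a PyTorch recent enough for `torch.stft(..., return_complex=True)`:

```python
import torch
import torch.nn as nn


# Functional: a plain function that does the actual computation.
def spectrogram(waveforms, n_fft=2048, hop_length=512, power=2.0):
    """Magnitude (power=1.0) or power (power=2.0) spectrogram of a
    (batch, channel, time) tensor."""
    batch, channel, time = waveforms.shape
    window = torch.hann_window(n_fft, device=waveforms.device)
    stft = torch.stft(
        waveforms.reshape(batch * channel, time),
        n_fft=n_fft, hop_length=hop_length,
        window=window, return_complex=True,
    )
    spec = stft.abs().pow(power)  # (batch * channel, freq, frames)
    return spec.reshape(batch, channel, *spec.shape[-2:])


# Layer: only stores arguments and delegates to the functional, so it
# composes naturally with nn.Sequential and other nn.Modules.
class Spectrogram(nn.Module):
    def __init__(self, n_fft=2048, hop_length=512, power=2.0):
        super().__init__()
        self.n_fft, self.hop_length, self.power = n_fft, hop_length, power

    def forward(self, waveforms):
        return spectrogram(waveforms, self.n_fft, self.hop_length, self.power)
```

The point of keeping the functional separate is that people who don't want an `nn.Module` (e.g., for one-off preprocessing) can still call the function directly, while the layer stays a thin, stateful wrapper.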
Review - layers
torchaudio-contrib already covers a lot of what `transforms.py` covers now, but not all of it, which is why I feel it's time to discuss this here. Let me go through the classes in `transforms.py` one by one with some notes.
1. Already in torchaudio-contrib. Hoping we'd replace these.
- `class Spectrogram`: we have it in torchaudio-contrib. On top of this, we also have an `STFT` layer which outputs complex representations (same as `torch.stft`, since we're wrapping it).
- `class MelScale`: we have it, and we'd suggest changing the name to something more general. We named it `class MelFilterbank`, assuming there can be other types of filterbanks, too. It also supports `htk` and non-`htk` mel filterbanks.
- `class SpectrogramToDB`: we'd like to propose a more general approach -- `class AmplitudeToDb(ref=1.0, amin=1e-7)` and `class DbToAmplitude(ref=1.0)` -- because decibel scaling changes the unit of the input, not its core content (rough sketch after this list).
- `class MelSpectrogram`: we have it; it returns an `nn.Sequential` model consisting of a Spectrogram and a mel-scale filterbank.
- `class MuLawEncoding`, `class MuLawExpanding`: we have them, actually a 99% copy of the implementation here.
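A rough sketch of what I mean by the `AmplitudeToDb` / `DbToAmplitude` pair. The argument names follow the proposal above; the formula is the usual 20 * log10 amplitude-to-dB conversion, and everything else is illustrative rather than final:

```python
import torch
import torch.nn as nn


class AmplitudeToDb(nn.Module):
    """Convert an amplitude spectrogram to decibels: 20 * log10(max(x, amin) / ref)."""
    def __init__(self, ref=1.0, amin=1e-7):
        super().__init__()
        self.ref, self.amin = ref, amin

    def forward(self, x):
        return 20.0 * torch.log10(torch.clamp(x, min=self.amin) / self.ref)


class DbToAmplitude(nn.Module):
    """Inverse of AmplitudeToDb: ref * 10 ** (x_db / 20)."""
    def __init__(self, ref=1.0):
        super().__init__()
        self.ref = ref

    def forward(self, x_db):
        return self.ref * torch.pow(10.0, x_db / 20.0)
```

Because the pair only changes units, it works on any non-negative input (spectrograms, mel spectrograms, envelopes), which is the generality we're after.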
2. Wouldn't need these
- `class Compose`: we wouldn't need it, because once things are based on `Layers`, people can simply build an `nn.Sequential()` (example after this list).
- `class Scale`: it converts `int16` --> `float`. I think we should deprecate this; if we really need it, it should have a more intuitive and precise name, and probably support other conversions as well.
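For example, reusing the `Spectrogram` and `AmplitudeToDb` sketches from above (so, hypothetical classes, not the current API), the role of `Compose` is already covered by plain PyTorch:

```python
import torch
import torch.nn as nn

# Once every transform is a plain nn.Module, nn.Sequential does what Compose does.
pipeline = nn.Sequential(
    Spectrogram(n_fft=1024, hop_length=256, power=1.0),  # magnitude spectrogram
    AmplitudeToDb(ref=1.0),
)

waveforms = torch.randn(4, 2, 16000)  # (batch, channel, time)
log_spec = pipeline(waveforms)        # (batch, channel, freq, frames)
```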
3. To-be-added
- `class DownmixMono`: I would like to have one. But we are also considering a time-frequency representation-based downmix (an energy-preserving operation) (@faroit). I'm open to discussion. Personally I'd prefer separate classes, `DownmixWaveform()` and `DownmixSpecgram()` (sketch after this list). Maybe until we have something better, we should keep it as it is.
- `class MFCC`: we currently don't have it. The current torch/audio implementation uses `s2db (SpectrogramToDB)`, but that class seems a little arbitrary to me, so we might want to re-implement it.
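To make the distinction concrete, something along these lines -- the names follow the proposal above, and the exact semantics (especially what "energy-preserving" should mean for the spectrogram version) are precisely what I'd like to discuss:

```python
import torch
import torch.nn as nn


class DownmixWaveform(nn.Module):
    """Average the channels of a (batch, channel, time) waveform tensor."""
    def forward(self, waveforms):
        return waveforms.mean(dim=1, keepdim=True)


class DownmixSpecgram(nn.Module):
    """One possible energy-preserving downmix: sum the channels of a
    (batch, channel, freq, time) power spectrogram."""
    def forward(self, real_specgrams):
        return real_specgrams.sum(dim=1, keepdim=True)
```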
4. Not sure about these
- `class PadTrim`: I don't actually know why we need it exactly; I'd love to hear about this!
- `class LC2CL`: so far, torchaudio-contrib code hasn't considered channel-first tensors. If that's a thing, we'd i) update our code to make it compatible and ii) have the same or a similar class to this. But do we really need it? (See the snippet after this list.)
- `class BLC2CBL`: same as `LC2CL` -- I'd like to know its use cases.
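For reference, here is what I understand `LC2CL` to do today (if that's right, it's a one-liner in plain torch, which is part of why I'm asking about its use cases):

```python
import torch

# Swap the last two dimensions so a (length, channel) tensor
# becomes (channel, length).
x_lc = torch.randn(16000, 2)   # (length, channel)
x_cl = x_lc.transpose(-2, -1)  # (channel, length)
print(x_cl.shape)              # torch.Size([2, 16000])
```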
Review - argument and variable names
As summarised in keunwoochoi/torchaudio-contrib#46, we'd like to use
- `waveforms` for a batch of waveforms
- `real_specgrams` for magnitude spectrograms
- `complex_specgrams` for complex spectrograms

(This is relatively less-discussed; example shapes below.)
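For illustration, the shapes I have in mind for these names (batch-first, channel-first layout; the trailing dimension of 2 for real/imaginary parts follows what `torch.stft` returned as a real tensor, and the layout itself is of course part of the discussion):

```python
import torch

waveforms = torch.randn(4, 2, 16000)                # (batch, channel, time)
real_specgrams = torch.randn(4, 2, 1025, 64)        # (batch, channel, freq, time)
complex_specgrams = torch.randn(4, 2, 1025, 64, 2)  # (..., 2) = real/imag parts
```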
Audio loading
@faroit has been working on replacing Sox with alternatives, but in this issue I'd like to focus on the topics above.
So,
- Any opinions on this?
- Any answers to the questions I have?
- If it looks good, what else would you like to have in the one-shot PR that would replace the current `transforms.py`?