
BREAKING: add support for targets as discussed in #3 #123

Merged — 15 commits merged into asteroid-team:master on Apr 1, 2022

Conversation

@hbredin (Collaborator) commented Mar 22, 2022

This is an initial attempt at addressing #3.
The proposed API assumes the following shapes:

  • samples: (batch_size, num_channels, num_samples)
  • targets: (batch_size, num_channels, num_frames, num_classes)
| Input | Target | num_frames | num_classes |
| ----- | ------ | ---------- | ----------- |
| Audio | class(es), fixed length, not correlated with the input length (low priority) | 1 | 1 or more |
| Audio | time series of class(es) | more than 1 | 1 or more |
| Audio | audio (same length as the input) | num_samples | 1 |

and is used like this:

```python
augment = MixSamples()
augmented = augment(samples, targets=targets)
mixed_samples = augmented.samples
mixed_targets = augmented.targets
```
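
For concreteness, here is a rough sketch of inputs matching those shapes (the tensor sizes, the 3-class targets, and the 1 s / 16 kHz duration are illustrative assumptions, not part of the proposed API; `MixSamples` follows the usage example above, although the transform added later in this PR is called `Mix`):

```python
import torch

batch_size, num_channels, num_samples = 4, 1, 16000  # assumed sizes: 1 s of mono audio at 16 kHz
num_frames, num_classes = 100, 3                      # assumed: one target frame every 10 ms, 3 classes

# samples: (batch_size, num_channels, num_samples)
samples = torch.randn(batch_size, num_channels, num_samples)

# targets: (batch_size, num_channels, num_frames, num_classes)
targets = torch.randint(0, 2, (batch_size, num_channels, num_frames, num_classes)).float()

augment = MixSamples()
augmented = augment(samples, targets=targets)

# mixing within the batch should leave the shapes unchanged
assert augmented.samples.shape == samples.shape
assert augmented.targets.shape == targets.shape
```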

@hbredin (Collaborator, Author) commented Mar 24, 2022

I have just added a Mix transform as an example.

@iver56 (Collaborator) commented Mar 28, 2022

A good next step would be to have it return an ObjectDict (see #126) instead of a tensor or a tuple

@hbredin (Collaborator, Author) commented Mar 28, 2022

Will do.

@hbredin (Collaborator, Author) commented Mar 29, 2022

What do you think of automatically inferring target_rate from num_samples = samples.shape[2], sample_rate, and num_frames = targets.shape[2]?

target_rate = sample_rate * num_frames / num_samples

with the num_frames = 1 corner case where target_rate does not even exist (one label for the whole sample).
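
For illustration, a tiny sketch of that inference rule including the corner case (the helper name and the concrete numbers are made up for this example):

```python
def infer_target_rate(sample_rate, num_samples, num_frames):
    # num_frames == 1 means one label for the whole sample: no frame rate exists
    if num_frames == 1:
        return None
    return sample_rate * num_frames / num_samples

print(infer_target_rate(16000, 160000, 1000))  # 100.0 Hz (10 s of 16 kHz audio, one frame every 10 ms)
print(infer_target_rate(16000, 160000, 1))     # None
```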

@iver56 (Collaborator) commented Mar 29, 2022

I would prefer to have it explicitly provided by the user. Here's an example that explains why:

Yamnet has a rate of 2.0833 Hz (one classification output for every step of 480 ms). In this example the rate cannot be accurately inferred, especially if the audio is short, because e.g. 1.1 seconds and 1.3 seconds of audio will give the same number of frames.
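
For concreteness, a quick numeric check of that example (assuming the frame count is floor(duration × 2.0833 Hz), which is an assumption about how YAMNet frames short audio):

```python
import math

true_rate = 1 / 0.48  # ≈ 2.0833 Hz: one classification output per 480 ms step

for duration in (1.1, 1.3):
    num_frames = math.floor(duration * true_rate)  # both durations give 2 frames
    inferred_rate = num_frames / duration          # what shape-based inference would yield
    print(f"{duration} s -> {num_frames} frames -> inferred {inferred_rate:.3f} Hz")

# 1.1 s -> 2 frames -> inferred 1.818 Hz
# 1.3 s -> 2 frames -> inferred 1.538 Hz
# Neither matches the true 2.0833 Hz, so shape-based inference would be wrong here.
```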

@hbredin (Collaborator, Author) commented Mar 29, 2022

> I would prefer to have it explicitly provided by the user. Here's an example that explains why:
>
> Yamnet has a rate of 2.0833 Hz (one classification output for every step of 480 ms). In this example the rate cannot be accurately inferred, especially if the audio is short, because e.g. 1.1 seconds and 1.3 seconds of audio will give the same number of frames.

Noted.

When target_rate is required but missing, can we still do our best to infer it automatically and warn the user about it?

```python
warnings.warn(
    f"target_rate is required by {self.__class__.__name__}. "
    f"It has been automatically inferred from the targets shape to be {inferred_target_rate}. "
    f"If this is incorrect, please use target_rate to pass it directly."
)
```

@iver56 (Collaborator) commented Mar 29, 2022

Yes, I'm okay with that

@hbredin (Collaborator, Author) commented Mar 29, 2022

Updated with the API discussed in Slack.

@hbredin hbredin marked this pull request as ready for review March 29, 2022 12:35
@hbredin hbredin changed the title from "wip: add support for targets as discussed in #3" to "BREAKING: add support for targets as discussed in #3" on Mar 29, 2022
@iver56 (Collaborator) commented Mar 30, 2022

Thanks :) I'll have a look soon-ish. Thanks for your patience.

@hbredin hbredin mentioned this pull request Mar 30, 2022
@iver56 (Collaborator) commented Mar 31, 2022

This is starting to look quite good already!

Here's what I think needs to be done:

Before it gets merged:

  • Mix deserves a unit test or two. Can you make that happen @hbredin?

After it gets merged:

  • Add an output_format parameter that defaults to the old-style API, so we can release at least one version with a backwards-compatible API (and then have the new API style be opt-in)
  • Have a demo script that processes targets as well as the usual inputs. Maybe also document some examples of how to use the targets feature.
  • Update changelog

@iver56 iver56 mentioned this pull request Mar 31, 2022
@hbredin (Collaborator, Author) commented Mar 31, 2022

> Before it gets merged:
>
> * Mix deserves a unit test or two. Can you make that happen @hbredin?

I definitely can (and will) but I am having trouble thinking about what to test.

I can do the equivalent of AddBackgroundNoise's test_varying_snr_within_batch.
Can you think of a test focusing on targets?

@iver56 (Collaborator) commented Apr 1, 2022

> I definitely can (and will) but I am having trouble thinking about what to test.
>
> I can do the equivalent of AddBackgroundNoise's test_varying_snr_within_batch. Can you think of a test focusing on targets?

At the very least I like to have the shape and the dtype of the output tested. And that the output isn't the same as the input. But of course, smart test assertions are useful 👍
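
For example, a minimal test along those lines might look like the sketch below (the Mix import path, the p and sample_rate arguments, and the tensor sizes are assumptions; only the shape, dtype, and output-differs-from-input assertions come from the suggestion above):

```python
import torch
from torch_audiomentations import Mix  # assumed import path for the new transform

def test_mix_preserves_shapes_and_changes_audio():
    batch_size, num_channels, num_samples = 8, 1, 16000
    num_frames, num_classes = 1, 4  # one label per sample

    samples = torch.randn(batch_size, num_channels, num_samples)
    targets = torch.rand(batch_size, num_channels, num_frames, num_classes)

    augment = Mix(p=1.0)  # p=1.0 so every example in the batch gets mixed (assumed parameter)
    output = augment(samples, sample_rate=16000, targets=targets)  # sample_rate keyword is assumed

    # shapes and dtypes are preserved
    assert output.samples.shape == samples.shape
    assert output.samples.dtype == samples.dtype
    assert output.targets.shape == targets.shape

    # the transform actually changed the audio
    assert not torch.equal(output.samples, samples)
```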

@iver56 iver56 merged commit 7bc37e5 into asteroid-team:master Apr 1, 2022
@iver56 (Collaborator) commented Apr 1, 2022

Thanks for the contribution 🚀

@hbredin (Collaborator, Author) commented Apr 1, 2022

Thanks for merging this!

> After it gets merged:
>
> * Add an output_format parameter that defaults to the old-style API, so we can release at least one version with a backwards-compatible API (and then have the new API style be opt-in)

I can take care of it but can you please clarify the API?

> * Have a demo script that processes targets as well as the usual inputs. Maybe also document some examples of how to use the targets feature.

I can easily create a notebook showing how I use this new feature in pyannote.audio.
Would that be OK or do you want this demo to be 100% torch-audiomentations?

> * Update changelog

I suggest you do that :)

@iver56 (Collaborator) commented Apr 1, 2022

> I can take care of it but can you please clarify the API?

Great! Hmm, how about something like this, for example:

```python
transform = PolarityInversion()
transform(my_audio)  # outputs a tensor, as before. Also emit a FutureWarning with info on the
                     # new output_type arg and a recommendation on transitioning to object_dict

transform = PolarityInversion(output_type="tensor")
transform(my_audio)  # outputs a tensor, as before. Maybe also emit a deprecation warning?

transform = PolarityInversion(output_type="object_dict")
transform(my_audio)  # outputs an ObjectDict
```

> I can easily create a notebook showing how I use this new feature in pyannote.audio.
> Would that be OK or do you want this demo to be 100% torch-audiomentations?

Sounds good to me :)
If the notebook is large, consider using git LFS
