Room Impulse Response Simulation Support in TorchAudio #2624


Description

@nateanl

For release 2.0, we plan to add support for multi-channel room impulse response (RIR) simulation methods under torchaudio.functional. The implementation is based on pyroomacoustics, which supports both the "image source" method and the hybrid "image source + ray tracing" method. We will support the two modes in two separate functions.

Diagram

Here is the diagram of how the code works:
[Diagram: workflow of the image source and hybrid RIR simulation methods]

Both methods compute image sources as the first step. The difference is that the pure image source method uses only absorption_coefficient to estimate the attenuation for each reflection order, while the hybrid method uses both absorption_coefficient and scattering, if scattering is provided by the user. The image source locations are then used to estimate the impulse responses (IRs) in the _build_rir method. The hybrid method additionally applies ray tracing to estimate the IRs of late reverberation.
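
For reference, here is a minimal sketch of the corresponding workflow in pyroomacoustics, which the proposed functions mirror; the room size, coefficients, and positions below are arbitrary examples:

import numpy as np
import pyroomacoustics as pra

# Example 3D shoebox room; dimensions and coefficients are arbitrary.
room = pra.ShoeBox(
    [6.0, 4.0, 3.0],                  # room size in meters
    fs=16000,                         # sample rate of the simulated RIR
    materials=pra.Material(0.2),      # flat absorption coefficient for all walls
    max_order=10,                     # maximum image source order
    ray_tracing=True,                 # enable the hybrid ISM + ray tracing mode
    air_absorption=True,
)
room.add_source([2.0, 3.0, 1.5])      # single sound source
mic_positions = np.array([            # (D, channel) microphone coordinates
    [3.5, 3.6],
    [2.0, 2.0],
    [1.2, 1.2],
])
room.add_microphone_array(mic_positions)

room.compute_rir()                    # image sources -> attenuations -> RIR build
rir = room.rir[0][0]                  # RIR from source 0 to microphone 0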

Besides the above two methods, we plan to support an Array Transfer Function (ATF) based simulation method; here is the diagram:
[Diagram: workflow of the ATF-based simulation method]
The first few steps are the same as in simulate_rir_ism or simulate_rir_hybrid, depending on the selected mode.

API Design

The API of simulate_rir_ism will be like:

simulate_rir_ism(
    room : Tensor,
    mic_array : Tensor,
    source : Tensor,
    sample_rate : int,
    max_order: int,
    wall_material : str = "",
    ceiling_material : str = "",
    floor_material : str = "",
    air_absorption : bool = False,
    temperature: float = None,
    humidity: float = None,
) -> Tensor

where:

  • room is a 1D Tensor with D values representing the room size, where D is 2 or 3 depending on whether the room is 2D or 3D.
  • mic_array is a 2D Tensor with dimensions (channel, D), representing the coordinates of the microphones in the array.
  • source is a 1D Tensor with D values representing the coordinates of the sound source.
  • sample_rate is an integer specifying the sample rate of the simulated RIRs.
  • max_order is the maximum order of wall reflections, used to limit the computation in the image source method.
  • temperature and humidity are used to compute the sound speed; by default the sound speed is 343 m/s.

The returned Tensor is a 2D Tensor with dimensions (channel, max_rir_length), where channel is the number of microphones in the array. Given max_order, we compute the maximum distance d_max from any qualified image source to the microphone array; max_rir_length is then d_max / C * sample_rate + filter_len, where C is the sound speed and filter_len is the length of the filter used in impulse response simulation.
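
A hypothetical usage sketch of the proposed function, assuming the signature above with the Option 1 style material arguments discussed below; the function does not exist yet, and all tensor values and material names are arbitrary examples:

import torch
import torchaudio.functional as F

room = torch.tensor([6.0, 4.0, 3.0])        # 3D room, so D = 3
mic_array = torch.tensor([[3.5, 2.0, 1.2],  # (channel, D): 2 microphones
                          [3.6, 2.0, 1.2]])
source = torch.tensor([2.0, 3.0, 1.5])      # (D,)
sample_rate = 16000

# Hypothetical call; keyword names follow the proposed signature.
rir = F.simulate_rir_ism(
    room, mic_array, source, sample_rate,
    max_order=10,
    wall_material="brickwork",
    ceiling_material="hard_surface",
    floor_material="carpet_cotton",
)

# Expected output shape: (channel, max_rir_length), with
#   max_rir_length = d_max / C * sample_rate + filter_len
# where d_max is the largest image-source-to-microphone distance,
# C the sound speed (343 m/s by default), and filter_len the filter length.
print(rir.shape)  # e.g. torch.Size([2, max_rir_length])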

The material argument is the trickiest. In pyroomacoustics, it can be a single floating-point value, assumed to be the same for all 6 walls (4 walls + ceiling + floor), or a dictionary of materials where each wall has a different absorption coefficient. In the most general case, it is a dictionary of 6 materials, where each material has a list of absorption coefficients, one per center frequency; in that case the list of center frequencies must also be provided to compute the attenuations.
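
For illustration, the three ways pyroomacoustics accepts materials look roughly as follows; the material names are examples taken from materials.json:

import pyroomacoustics as pra

# Case 1: a single absorption coefficient shared by all six walls.
mat_flat = pra.Material(0.25)

# Case 2: a different named material per wall, looked up in materials.json.
mat_named = pra.make_materials(
    ceiling="hard_surface",
    floor="carpet_cotton",
    east="brickwork",
    west="brickwork",
    north="brickwork",
    south="brickwork",
)

# Case 3: frequency-dependent coefficients with their center frequencies.
mat_banded = pra.Material(
    energy_absorption={
        "coeffs": [0.10, 0.15, 0.25, 0.30],
        "center_freqs": [125.0, 250.0, 500.0, 1000.0],
    }
)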

Based on the above use cases, there are two possible APIs for the materials:

Option 1

Offer a limited set of str choices for the walls, ceiling, and floor.
The input arguments will be wall_material, ceiling_material, and floor_material, respectively. The options can be found in https://github.com/LCAV/pyroomacoustics/blob/master/pyroomacoustics/data/materials.json, which records the coefficients of real materials.
The shortcoming of this option is that it is not differentiable, which matters if users want to estimate the absorption coefficients via a neural network.
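
As a small illustration of why this option is not differentiable: a named material resolves to fixed coefficients tabulated in materials.json, so there is no tensor to backpropagate through (the material name below is an example):

import pyroomacoustics as pra

# Looking up a material by name returns fixed, tabulated coefficients.
mat = pra.Material("brickwork")
print(mat.energy_absorption)  # {'coeffs': [...], 'center_freqs': [...]}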

Option 2

Use absorption and center_frequency as input arguments; the type will be Union[float, Tensor].

  • In the float case, the coefficient is assumed to be the same for all walls.
  • In the Tensor case, there are two possible use cases:
    • If it is a 1D Tensor, the shape should be (4,) (2D room) or (6,) (3D room), meaning each wall has its own coefficient.
    • If it is a 2D Tensor, the shape should be (num_bands, 4) or (num_bands, 6), where num_bands is the number of center frequencies; center_frequency must also be provided in this case.

The shortcoming of this option is that it can accept unrealistic materials that don't exist (the best we can do is ensure the coefficients are smaller than 1). The advantage is that the module can be differentiable, i.e., the room size, source location, and absorption coefficients can be passed as inputs and the RIRs generated as the output.
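
A hypothetical sketch of Option 2, assuming absorption and center_frequency keyword arguments on the proposed function; the parameter names and shapes follow the description above and are not final:

import torch
import torchaudio.functional as F

room = torch.tensor([6.0, 4.0, 3.0])
mic_array = torch.tensor([[3.5, 2.0, 1.2], [3.6, 2.0, 1.2]])
source = torch.tensor([2.0, 3.0, 1.5])

# float case: one coefficient shared by all walls.
rir = F.simulate_rir_ism(room, mic_array, source, 16000, max_order=10, absorption=0.2)

# 1D Tensor case: per-wall coefficients for a 3D room, shape (6,).
absorption_1d = torch.tensor([0.1, 0.1, 0.2, 0.2, 0.3, 0.4])
rir = F.simulate_rir_ism(room, mic_array, source, 16000, max_order=10, absorption=absorption_1d)

# 2D Tensor case: frequency-dependent coefficients, shape (num_bands, 6),
# together with the matching center frequencies. Marking the coefficients
# as requiring grad lets gradients flow back to them through the simulation.
absorption_2d = torch.rand(4, 6, requires_grad=True)
center_frequency = torch.tensor([125.0, 250.0, 500.0, 1000.0])
rir = F.simulate_rir_ism(
    room, mic_array, source, 16000, max_order=10,
    absorption=absorption_2d, center_frequency=center_frequency,
)
rir.sum().backward()  # gradients w.r.t. absorption_2d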

We would like to hear users' feedback to decide how to proceed with the API design.
