Room Impulse Response Simulation Support in TorchAudio #2624


Description

@nateanl

For release 2.0, we plan to add support for multi-channel room impulse response (RIR) simulation methods under torchaudio.functional. The implementation is based on pyroomacoustics, which supports both the "image source" method and the hybrid "image source + ray tracing" method. We will support the two modes in two separate functions.

Diagram

Here is the diagram of how the code works:
[Diagram: workflow of the image source and hybrid RIR simulation methods]

Both methods compute image sources as the first step. The difference is that the pure image source method uses only absorption_coefficient to estimate the attenuation for each reflection order, while the hybrid method uses both absorption_coefficient and scattering, if scattering is provided by the user. The image source locations are then used to estimate the impulse responses (IRs) in the _build_rir method. The hybrid method additionally applies ray tracing to estimate the IRs of late reverberation.
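
For reference, here is a minimal sketch of the corresponding workflow in pyroomacoustics, which the proposed functions mirror; the room size, coefficients, and positions below are arbitrary examples:

import numpy as np
import pyroomacoustics as pra

# Example 3D shoebox room; dimensions and coefficients are arbitrary.
room = pra.ShoeBox(
    [6.0, 4.0, 3.0],                  # room size in meters
    fs=16000,                         # sample rate of the simulated RIR
    materials=pra.Material(0.2),      # flat absorption coefficient for all walls
    max_order=10,                     # maximum image source order
    ray_tracing=True,                 # enable the hybrid ISM + ray tracing mode
    air_absorption=True,
)
room.add_source([2.0, 3.0, 1.5])      # single sound source
mic_positions = np.array([            # (D, channel) microphone coordinates
    [3.5, 3.6],
    [2.0, 2.0],
    [1.2, 1.2],
])
room.add_microphone_array(mic_positions)

room.compute_rir()                    # image sources -> attenuations -> RIR build
rir = room.rir[0][0]                  # RIR from source 0 to microphone 0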

Besides the above two methods, we plan to support an Array Transfer Function (ATF) based simulation method; here is the diagram:
[Diagram: workflow of the ATF-based simulation method]
The first few steps are the same as in simulate_rir_ism or simulate_rir_hybrid, depending on the selected mode.

API Design

The API of simulate_rir_ism will be like:

simulate_rir_ism(
    room : Tensor,
    mic_array : Tensor,
    source : Tensor,
    sample_rate : int,
    max_order: int,
    wall_material : str = "",
    ceiling_material : str = "",
    floor_material : str = "",
    air_absorption : bool = False,
    temperature: float = None,
    humidity: float = None,
) -> Tensor

where:

  • room is a 1D Tensor with D values representing the room size, where D is 2 or 3 depending on whether the room is 2D or 3D.
  • mic_array is a 2D Tensor with dimensions (channel, D), representing the coordinates of the microphones in the array.
  • source is a 1D Tensor with D values representing the coordinates of the sound source.
  • sample_rate is an integer specifying the sample rate of the simulated RIRs.
  • max_order is the maximum order of wall reflections, used to limit the computation in the image source method.
  • temperature and humidity are used to compute the sound speed; by default the sound speed is 343 m/s.

The returned Tensor is a 2D Tensor with dimensions (channel, max_rir_length), where channel is the number of microphones in the array. Given max_order, we compute the maximum distance d_max from any qualified image source to the microphone array; max_rir_length is then d_max / C * sample_rate + filter_len, where C is the sound speed and filter_len is the length of the filter used in impulse response simulation.
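
A hypothetical usage sketch of the proposed function, assuming the signature above with the Option 1 style material arguments discussed below; the function does not exist yet, and all tensor values and material names are arbitrary examples:

import torch
import torchaudio.functional as F

room = torch.tensor([6.0, 4.0, 3.0])        # 3D room, so D = 3
mic_array = torch.tensor([[3.5, 2.0, 1.2],  # (channel, D): 2 microphones
                          [3.6, 2.0, 1.2]])
source = torch.tensor([2.0, 3.0, 1.5])      # (D,)
sample_rate = 16000

# Hypothetical call; keyword names follow the proposed signature.
rir = F.simulate_rir_ism(
    room, mic_array, source, sample_rate,
    max_order=10,
    wall_material="brickwork",
    ceiling_material="hard_surface",
    floor_material="carpet_cotton",
)

# Expected output shape: (channel, max_rir_length), with
#   max_rir_length = d_max / C * sample_rate + filter_len
# where d_max is the largest image-source-to-microphone distance,
# C the sound speed (343 m/s by default), and filter_len the filter length.
print(rir.shape)  # e.g. torch.Size([2, max_rir_length])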

The material argument is the trickiest. In pyroomacoustics, it can be a single floating-point value, assumed to be the same for all 6 walls (4 walls + ceiling + floor), or a dictionary of materials where each wall has a different absorption coefficient. In the most general case, it is a dictionary of 6 materials, where each material has a list of absorption coefficients, one per center frequency; in that case the list of center frequencies must also be provided to compute the attenuations.
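
For illustration, the three ways pyroomacoustics accepts materials look roughly as follows; the material names are examples taken from materials.json:

import pyroomacoustics as pra

# Case 1: a single absorption coefficient shared by all six walls.
mat_flat = pra.Material(0.25)

# Case 2: a different named material per wall, looked up in materials.json.
mat_named = pra.make_materials(
    ceiling="hard_surface",
    floor="carpet_cotton",
    east="brickwork",
    west="brickwork",
    north="brickwork",
    south="brickwork",
)

# Case 3: frequency-dependent coefficients with their center frequencies.
mat_banded = pra.Material(
    energy_absorption={
        "coeffs": [0.10, 0.15, 0.25, 0.30],
        "center_freqs": [125.0, 250.0, 500.0, 1000.0],
    }
)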

Based on the above use cases, there are two possible APIs for the materials:

Option 1

Offer a limited set of str choices for the walls, ceiling, and floor.
The input arguments will be wall_material, ceiling_material, and floor_material, respectively. The options can be found in https://github.com/LCAV/pyroomacoustics/blob/master/pyroomacoustics/data/materials.json, which records the coefficients of real materials.
The shortcoming of this option is that it is not differentiable, which matters if users want to estimate the absorption coefficients via a neural network.
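
As a small illustration of why this option is not differentiable: a named material resolves to fixed coefficients tabulated in materials.json, so there is no tensor to backpropagate through (the material name below is an example):

import pyroomacoustics as pra

# Looking up a material by name returns fixed, tabulated coefficients.
mat = pra.Material("brickwork")
print(mat.energy_absorption)  # {'coeffs': [...], 'center_freqs': [...]}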

Option 2

Use absorption and center_frequency as input arguments; the type will be Union[float, Tensor].

  • In the float case, the coefficient is assumed to be the same for all walls.
  • In the Tensor case, there are two possible use cases:
    • If it is a 1D Tensor, the shape should be (4,) (2D room) or (6,) (3D room), meaning each wall has its own coefficient.
    • If it is a 2D Tensor, the shape should be (num_bands, 4) or (num_bands, 6), where num_bands is the number of center frequencies; center_frequency must also be provided in this case.

The shortcoming of this option is that it can accept unrealistic materials that don't exist (the best we can do is ensure the coefficients are smaller than 1). The advantage is that the module can be differentiable, i.e., the room size, source location, and absorption coefficients can be passed as inputs and the RIRs generated as the output.
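
A hypothetical sketch of Option 2, assuming absorption and center_frequency keyword arguments on the proposed function; the parameter names and shapes follow the description above and are not final:

import torch
import torchaudio.functional as F

room = torch.tensor([6.0, 4.0, 3.0])
mic_array = torch.tensor([[3.5, 2.0, 1.2], [3.6, 2.0, 1.2]])
source = torch.tensor([2.0, 3.0, 1.5])

# float case: one coefficient shared by all walls.
rir = F.simulate_rir_ism(room, mic_array, source, 16000, max_order=10, absorption=0.2)

# 1D Tensor case: per-wall coefficients for a 3D room, shape (6,).
absorption_1d = torch.tensor([0.1, 0.1, 0.2, 0.2, 0.3, 0.4])
rir = F.simulate_rir_ism(room, mic_array, source, 16000, max_order=10, absorption=absorption_1d)

# 2D Tensor case: frequency-dependent coefficients, shape (num_bands, 6),
# together with the matching center frequencies. Marking the coefficients
# as requiring grad lets gradients flow back to them through the simulation.
absorption_2d = torch.rand(4, 6, requires_grad=True)
center_frequency = torch.tensor([125.0, 250.0, 500.0, 1000.0])
rir = F.simulate_rir_ism(
    room, mic_array, source, 16000, max_order=10,
    absorption=absorption_2d, center_frequency=center_frequency,
)
rir.sum().backward()  # gradients w.r.t. absorption_2d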

We would like to hear users' feedback to decide how to proceed with the API design.
