Description
For release 2.0, we plan to add support for multi-channel room impulse response (RIR) simulation methods under `torchaudio.functional`. The implementation is based on pyroomacoustics, which supports both the pure "image source" method and the hybrid "image source + ray tracing" method. We will support the two modes in two separate functions.
Diagram
Here is the diagram of how the code works:
Both methods compute image sources as the first step. The difference is that the pure image source method uses only `absorption_coefficient` to estimate the attenuation for each reflection order, while the hybrid method uses both `absorption_coefficient` and `scattering`, if `scattering` is provided by the user. The image source locations are then used to estimate the impulse responses (IRs) in the `_build_rir` method. The hybrid method additionally applies ray tracing to estimate the IRs of late reverberation.
Besides the above two methods, we plan to support an Array Transfer Function (ATF) based simulation method; here is the diagram:
The first few steps are the same as in `simulate_rir_ism` or `simulate_rir_hybrid`, depending on the selected mode.
API Design
The API of `simulate_rir_ism` will be:

    simulate_rir_ism(
        room: Tensor,
        mic_array: Tensor,
        source: Tensor,
        sample_rate: int,
        max_order: int,
        wall_material: str = "",
        ceiling_material: str = "",
        floor_material: str = "",
        air_absorption: bool = False,
        temperature: Optional[float] = None,
        humidity: Optional[float] = None,
    ) -> Tensor
where `room` is a 1D Tensor with `D` values representing the room size, where `D` is 2 or 3 depending on whether the room is a 2D or 3D room. `mic_array` is a 2D Tensor with dimensions `(channel, D)`, representing the coordinates of the microphones in the array.
`source` is a 1D Tensor with `D` values representing the coordinates of the sound source. `sample_rate` is an integer that determines the sample rate of the simulated RIRs.
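For example, the inputs for a 3-D shoebox room could look like the following (plain Python lists are used here purely to illustrate the shapes; the actual API takes torch Tensors):

```python
# Hypothetical inputs for a 3-D room (D = 3)
room = [6.0, 4.0, 3.0]            # 1D, D values: room size in meters
mic_array = [[1.0, 1.0, 1.5],
             [1.2, 1.0, 1.5]]     # 2D, (channel, D): a 2-mic array
source = [3.0, 2.0, 1.0]          # 1D, D values: source coordinates
sample_rate = 16000
```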
`max_order` is the maximum order of wall reflections, which bounds the computation in the image source method. `temperature` and `humidity` are used to compute the sound speed; by default, the sound speed is 343 m/s.
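For reference, a common linear approximation for the sound speed as a function of temperature and humidity is sketched below; whether the implementation will use exactly this formula is an assumption:

```python
def speed_of_sound(temperature=None, humidity=None):
    """Sketch only: a widely used linear approximation for the speed of
    sound in air (temperature in Celsius, humidity in percent). Falls
    back to the 343 m/s default when either parameter is missing."""
    if temperature is None or humidity is None:
        return 343.0
    return 331.4 + 0.6 * temperature + 0.0124 * humidity
```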
The returned Tensor is a 2D Tensor with dimensions `(channel, max_rir_length)`, where `channel` is the number of microphones in the array. Given `max_order`, we compute the maximum distance `d_max` from all qualified image sources to the microphone array; `max_rir_length` is then computed as `d_max / C * sample_rate + filter_len`, where `C` is the sound speed and `filter_len` is the filter length used in impulse response simulation.
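The length computation can be sketched as follows (the exact rounding of the fractional delay is an assumption):

```python
import math

def max_rir_length(d_max, sample_rate, filter_len, c=343.0):
    """Sketch of the output-length formula
    max_rir_length = d_max / C * sample_rate + filter_len,
    rounding the propagation delay up to a whole sample."""
    return int(math.ceil(d_max / c * sample_rate)) + filter_len
```

For instance, with `d_max = 343.0` m, `sample_rate = 16000`, and `filter_len = 81`, the farthest image source arrives after exactly 1 second, so the RIR is 16081 samples long.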
`material` is the trickiest argument. In `pyroomacoustics`, it can be a single floating-point value, assumed to be the same for all 6 walls (4 walls + ceiling + floor), or a dictionary of materials where each wall has a different absorption coefficient. In the most extreme case, it is a dictionary of 6 materials, where each material has a list of absorption coefficients, one per center frequency; in that case, the list of center frequencies must also be provided to compute the attenuations.
Based on the above use cases, there are two possible APIs for the materials:
Option 1
Offer a limited set of `str` choices for the wall, ceiling, and floor materials. The input arguments will be `wall_material`, `ceiling_material`, and `floor_material`, respectively. The available options can be found in https://github.com/LCAV/pyroomacoustics/blob/master/pyroomacoustics/data/materials.json, which records the coefficients of real materials.
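A sketch of what the string-based lookup could look like. The material names mirror entries in pyroomacoustics' materials.json, but the coefficient values and the default here are illustrative placeholders, not the real data:

```python
# Illustrative material table (placeholder coefficients, not real data)
MATERIALS = {
    "hard_surface": 0.02,
    "brickwork": 0.04,
    "carpet": 0.30,
}

def resolve_material(name):
    """Hypothetical helper: map a material name to an absorption
    coefficient, with an assumed default for the empty string."""
    if name == "":
        return MATERIALS["hard_surface"]  # assumed default
    if name not in MATERIALS:
        raise ValueError(f"Unsupported material: {name!r}")
    return MATERIALS[name]
```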
The shortcoming of this option is that it is not differentiable, which matters if users want to estimate the absorption coefficients via a neural network.
Option 2
Use `absorption` and `center_frequency` as input arguments, with type `Union[float, Tensor]`.

- In the `float` case, the coefficient is assumed to be the same for all walls.
- In the `Tensor` case, there are two possible uses:
  - If it is a 1D Tensor, the shape should be `(4,)` (2D room) or `(6,)` (3D room), meaning each wall has its own coefficient.
  - If it is a 2D Tensor, the shape should be `(num_bands, 4)` or `(num_bands, 6)`, where `num_bands` is the number of center frequencies. `center_frequency` must also be provided in this case.
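A sketch of how the three accepted shapes of `absorption` could be validated and normalized to a single `(num_bands, num_walls)` layout. The helper name is hypothetical, and nested lists stand in for Tensors to keep the example dependency-free:

```python
def normalize_absorption(absorption, num_walls=6, center_frequency=None):
    """Hypothetical helper: accept a float, a 1-D per-wall list, or a
    2-D (num_bands, num_walls) list, and return the 2-D form."""
    if isinstance(absorption, float):
        # one coefficient shared by all walls, single band
        coeffs = [[absorption] * num_walls]
    elif absorption and isinstance(absorption[0], list):
        # 2-D case: multi-band coefficients require center frequencies
        if center_frequency is None:
            raise ValueError("center_frequency is required for multi-band absorption")
        coeffs = absorption
    else:
        # 1-D case: one coefficient per wall, single band
        coeffs = [list(absorption)]
    for band in coeffs:
        if len(band) != num_walls or any(not (0.0 <= a <= 1.0) for a in band):
            raise ValueError("invalid absorption coefficients")
    return coeffs
```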
The shortcoming of this option is that it can accept unrealistic materials that don't exist (the best we can do is ensure the coefficients are smaller than 1). The advantage is that the module can be made differentiable, i.e., the room size, source location, and coefficients can be passed as inputs and the RIRs generated as output.
We would like to hear users' feedback to decide how to proceed with the API design.