fast-audiomentations
is a Python library that leverages GPU acceleration for efficient audio augmentation,
suitable for high-throughput audio analysis and machine learning applications.
- Performance Showcase: Highlights the advantages of using GPU for audio processing tasks.
- Diverse Audio Augmentations: Offers a variety of transformations including noise addition, filtering, and gain adjustments, optimized for GPU performance.
- User-Friendly API: Designed for ease of use, enabling quick integration into audio processing pipelines.
- Triton-based GPU code: Triton enables maximized GPU utilization. Although real-world examples of Triton for straightforward kernels are limited, this library serves as an invaluable practical study resource.
Currently, there is no ready-to-use package on PyPI, so you need to:
git clone https://github.com/Lallapallooza/fast-audiomentations
cd fast-audiomentations
python -m pip install -r requirements.txt
python -m examples.test
Assuming "SUCCESS" is printed, the library is now ready to use.
Examples can be found in the examples/
and benchmark/
directories.
Here's an example of how to apply background noise addition to an audio sample
using fast-audiomentations
:
from fast_audiomentations import AddBackgroundNoise
import torch
batch_size = 128
dataloader = AddBackgroundNoise.get_dali_dataloader(
'some/path/to/noises/in/dali/format',
buffer_size=batch_size,
n_workers=4
)
add_background_noise = AddBackgroundNoise(
noises_dataloader=dataloader,
min_snr=-10,
max_snr=10,
buffer_size=batch_size,
p=1.0
)
sample_rate = 48000
audio = load_my_audio_batch() # load as torch.Tensor
audio_lens = torch.tensor([audio.shape[0] for i in range(batch_size)], device='cuda')
# torch.Tensor
audios_with_noise = add_background_noise(
samples=audio,
samples_lens=audio_lens,
sample_rate=sample_rate
)
To validate the hypothesis that GPU can accelerate audio augmentation processing, benchmarks were conducted in comparison with two libraries: audiomentations and torch-audiomentations.
Details about the specific batch sizes, augmentation types, and parameters can be found in the code.
See the benchmark/
directory.
For my configuration:
- NVIDIA GeForce RTX 3090 Ti
- 64GB RAM
- 12th Gen Intel(R) Core(TM) i9-12900KF
- Samsung 980 PRO
The results are available in the benchmark_results_local/
directory.
To run benchmarks, you need to:
- Change the parameters in
benchmark/benchmark_data.py
. - Run the command:
python -m benchmark.run_all
.
Observations show that for stateless operations (no IO required), the performance increase is significant (dozens of times). However, for augmentations like adding background noise, the performance benefit is less significant due to IO/PCIe overhead. Even utilizing DALI does not change the situation much, as the GPU is heavily underutilized. Nonetheless, it's observed that the NVMe device is not fully utilized, suggesting there might be some code inefficiency or a limitation in Python's IO performance.
- Add more augmentations (RIR convolve, SpecAug, etc.). We welcome requests for additional augmentations and will endeavor to implement them promptly.
- Rewrite the IO process for stateful cases (RIR, add_background_noise) to fully utilize IO capabilities.
- Conduct benchmarks with network devices for stateful scenarios to demonstrate that asynchronous data reading may reduce latency.
- Support XLA backend
- Add wrappers like
Compose
,OneOf
, etc...