hubertsiuzdak/snac


SNAC 🍿

Multi-Scale Neural Audio Codec (SNAC) compresses audio into discrete codes at a low bitrate.

🎸 Music samples: snac-audio-samples.mp4
🗣️ Speech samples: speech-samples.mp4

🎧 More audio samples available at https://hubertsiuzdak.github.io/snac/

Overview

SNAC encodes audio into hierarchical tokens similarly to SoundStream, EnCodec, and DAC (see the image on the left). However, SNAC introduces a simple change where coarse tokens are sampled less frequently, covering a broader time span (see the image on the right).

This not only saves bitrate but, more importantly, can be very useful for language modeling approaches to audio generation. For example, with coarse tokens at ~10 Hz and a context window of 2048 tokens, a model can capture the consistent structure of an audio track for about 3 minutes (2048 / 10 ≈ 205 seconds).
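
The context-window arithmetic above can be checked with a small sketch. The helper name `coverage_seconds` is hypothetical and not part of the snac package, and the 75 Hz comparison rate is only an illustrative figure for a codec whose tokens all run at a single, higher rate.

```python
def coverage_seconds(context_tokens: int, token_rate_hz: float) -> float:
    """Seconds of audio spanned by `context_tokens` tokens emitted at `token_rate_hz`."""
    if token_rate_hz <= 0:
        raise ValueError("token_rate_hz must be positive")
    return context_tokens / token_rate_hz


# Coarse tokens at ~10 Hz with a 2048-token context (figures from the text above).
coarse = coverage_seconds(2048, 10.0)
print(f"{coarse:.1f} s = {coarse / 60:.1f} min")  # 204.8 s = 3.4 min

# Illustrative comparison only: the same context at an assumed 75 Hz token rate.
flat_rate = coverage_seconds(2048, 75.0)
print(f"{flat_rate:.1f} s")  # 27.3 s
```

The lower the rate of the coarsest tokens, the longer the stretch of audio a fixed context window can see.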

snac.png

Pretrained models

Currently, all models support only a single audio channel (mono).

| Model | Bitrate | Sample Rate | Params | Recommended use case |
|---|---|---|---|---|
| hubertsiuzdak/snac_24khz | 0.98 kbps | 24 kHz | 19.8 M | 🗣️ Speech |
| hubertsiuzdak/snac_32khz | 1.9 kbps | 32 kHz | 54.5 M | 🎸 Music / Sound Effects |
| hubertsiuzdak/snac_44khz | 2.6 kbps | 44 kHz | 54.5 M | 🎸 Music / Sound Effects |

Usage

Install it using:

pip install snac

To encode (and decode) audio with SNAC in Python, use the following code:

import torch
from snac import SNAC

model = SNAC.from_pretrained("hubertsiuzdak/snac_32khz").eval().cuda()
audio = torch.randn(1, 1, 32000).cuda()  # placeholder for actual audio with shape (B, 1, T)

with torch.inference_mode():
    codes = model.encode(audio)
    audio_hat = model.decode(codes)
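
The placeholder tensor above has shape (B, 1, T), and the models are mono only, so real audio usually needs reshaping first. Below is a minimal sketch of one way to do that with plain PyTorch. The helper `to_snac_input` is hypothetical (it is not part of the snac package), and it assumes the waveform has already been loaded as a `(channels, T)` or `(T,)` tensor at the model's sample rate.

```python
import torch


def to_snac_input(waveform: torch.Tensor) -> torch.Tensor:
    """Downmix a (channels, T) or (T,) waveform to a mono float tensor of shape (1, 1, T)."""
    if waveform.dim() == 1:
        waveform = waveform.unsqueeze(0)  # (T,) -> (1, T)
    if waveform.dim() != 2:
        raise ValueError("expected a waveform of shape (channels, T) or (T,)")
    mono = waveform.float().mean(dim=0, keepdim=True)  # average channels -> (1, T)
    return mono.unsqueeze(0)  # add batch dimension -> (1, 1, T)


stereo = torch.randn(2, 32000)  # stand-in for one second of stereo audio at 32 kHz
audio = to_snac_input(stereo)
print(tuple(audio.shape))  # (1, 1, 32000)
```

Resampling is not handled here; audio recorded at a different rate would need to be resampled to the chosen model's sample rate before encoding.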

You can also encode and reconstruct in a single call:

with torch.inference_mode():
    audio_hat, codes = model(audio)

⚠️ Note that codes is a list of token sequences of variable lengths, each corresponding to a different temporal resolution.

>>> [code.shape[1] for code in codes]
[12, 24, 48, 96]
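
Because each level has a different length, a language model that expects one token stream needs some way to interleave them. The sketch below shows one possible layout, not a method prescribed by SNAC: each coarse token is followed by the finer tokens that cover the same time span. It uses plain Python lists for a single batch item; `flatten_codes` and `unflatten_codes` are hypothetical helpers, and the token IDs are dummy values.

```python
from typing import List


def flatten_codes(levels: List[List[int]]) -> List[int]:
    """Interleave multi-scale codes, coarsest level first, one coarse frame at a time."""
    n = len(levels[0])
    if n == 0 or any(len(level) % n for level in levels):
        raise ValueError("each level length must be a multiple of the coarsest length")
    flat: List[int] = []
    for i in range(n):
        for level in levels:
            ratio = len(level) // n  # tokens of this level per coarse token
            flat.extend(level[i * ratio:(i + 1) * ratio])
    return flat


def unflatten_codes(flat: List[int], ratios: List[int]) -> List[List[int]]:
    """Invert flatten_codes given the per-level ratios, e.g. [1, 2, 4, 8]."""
    frame = sum(ratios)
    if len(flat) % frame:
        raise ValueError("flat length must be a multiple of the frame size")
    levels: List[List[int]] = [[] for _ in ratios]
    for start in range(0, len(flat), frame):
        pos = start
        for k, ratio in enumerate(ratios):
            levels[k].extend(flat[pos:pos + ratio])
            pos += ratio
    return levels


# Dummy codes with the lengths shown above; level k uses IDs k*1000, k*1000 + 1, ...
levels = [[k * 1000 + i for i in range(n)] for k, n in enumerate([12, 24, 48, 96])]
flat = flatten_codes(levels)
print(len(flat))  # 180
print(flat[:7])   # [0, 1000, 1001, 2000, 2001, 2002, 2003]
assert unflatten_codes(flat, [1, 2, 4, 8]) == levels
```

With these lengths, each coarse frame contributes 1 + 2 + 4 + 8 = 15 tokens. In practice the levels would likely also need distinct ID ranges (for example, a per-level offset) so that the model can tell them apart; that choice is left open here.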

Acknowledgements

Module definitions are adapted from the Descript Audio Codec.