Replies: 8 comments 22 replies
-
@eonglints thank you for this! as you can tell, i'm not very familiar with the audio domain (but plan to deepen my knowledge in this arena). i will plan on incorporating hubert intermediate features then! do you think the additional detail of normalizing across each dimension is very important, or inconsequential?
-
@eonglints also, while i have you on the line, what do you think about this paper? is it still worth implementing given the results of audiolm? my take is that perhaps, with the right conditioning added to audiolm, the architecture would obsolete the prior work, but i'm not well read enough in the field to know for sure. thank you in advance!
-
@eonglints only the smallest HuBERT has its quantizer released: https://github.com/facebookresearch/fairseq/blob/main/examples/hubert/README.md#hubert
-
Just another little note about the semantic tokens: in the paper, section IV.B, they mention that "in the first two stages [semantic and coarse acoustic], we follow the previously proposed practice of removing consecutive repetitions of the semantic tokens [14]". In [14] they state, "we found that removing sequential repetitions of units improves performance, hence we apply it universally." Given these statements, we should probably look to include this step. It's fairly easy to implement using torch.unique_consecutive, which is essentially how the FAIR researchers do it. AudioLM doesn't use the counts of the repetitions, so there would be no need to return them.
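A minimal sketch of that de-duplication step (my own example, not the repo's actual code), assuming the semantic tokens arrive as a 1D tensor of HuBERT cluster ids:

```python
import torch

# toy sequence of semantic token ids with consecutive repeats
semantic_tokens = torch.tensor([12, 12, 12, 7, 300, 300, 300, 9, 9])

# AudioLM doesn't need the repetition counts, so we only keep the de-duplicated ids
deduped = torch.unique_consecutive(semantic_tokens)
print(deduped)  # tensor([ 12,   7, 300,   9])
```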
-
Hi, I'm digging into audio quantization, and this discussion is very helpful. Thanks! vq-wav2vec is used in vq_wav2vec.py, but I can't figure out how it works. codebook_indices is a list of tuples of integers.
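For what it's worth, this is roughly how the indices come out of fairseq's vq-wav2vec (a sketch based on fairseq's documented example usage; the checkpoint path is a placeholder and the exact shapes depend on the input length):

```python
import torch
import fairseq

# load a pre-trained vq-wav2vec checkpoint (placeholder path)
cp_path = '/path/to/vq-wav2vec.pt'
model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([cp_path])
model = model[0].eval()

wav_input_16khz = torch.randn(1, 10000)  # stand-in for ~0.6 s of 16 kHz audio

z = model.feature_extractor(wav_input_16khz)
_, idxs = model.vector_quantizer.forward_idx(z)
print(idxs.shape)  # e.g. torch.Size([1, 60, 2]): 60 frames, each with 2 indices (one per codebook group)
```

So each frame yields one index per quantizer group, which is why codebook_indices ends up as a list of tuples of integers.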
-
Just to say that I'm successfully training the semantic token transformer right now using the LJ dataset and mHuBERT 1000 tokens (these tokens were generated offline beforehand, so I haven't tested in-line token extraction). I just copied the training script and autoregressive wrapper from PaLM-pytorch and made a couple of minor changes as required. I was going to put in an issue with the
-
Hey, so I was wondering what you think, @lucidrains, about the placement of the semantic token extraction and de-duplication of consecutive tokens. I can see from your recent commits that we have to deal with the fact that the sequence lengths will be variable because of unique_consecutive. Just a thought, and this may be against your general design philosophy, but would it make more sense to move the semantic token extraction and unique_consecutive de-duplication into a dataloader rather than doing it as part of the SemanticTransformer forward pass? That way the sequence length will be fixed, and we can also set up the dataloader to use pre-computed semantic tokens if they've already been computed in a previous run. Anyway, just a thought.
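To make the suggestion concrete, here's a rough sketch of what such a dataset could look like (hypothetical class and file layout, nothing from the repo; it assumes the semantic tokens were saved to disk as one 1D LongTensor per utterance):

```python
import torch
from pathlib import Path
from torch.utils.data import Dataset

class PrecomputedSemanticTokenDataset(Dataset):
    # hypothetical sketch: serve pre-computed semantic tokens, de-duplicated and padded to a fixed length
    def __init__(self, token_dir, seq_len = 512, pad_id = -1):
        self.paths = sorted(Path(token_dir).glob('*.pt'))  # assumes one tensor of cluster ids per file
        self.seq_len = seq_len
        self.pad_id = pad_id

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        tokens = torch.load(self.paths[idx])           # 1D LongTensor of HuBERT cluster ids
        tokens = torch.unique_consecutive(tokens)      # de-duplicate here instead of in the model's forward
        tokens = tokens[:self.seq_len]                 # crop long sequences ...
        padding = tokens.new_full((self.seq_len - tokens.numel(),), self.pad_id)
        return torch.cat((tokens, padding))            # ... and pad short ones, so every batch is fixed-length
```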
-
Hey, I wonder whether the semantic tokens from the released 16 kHz HuBERT also support 24 kHz sound synthesis, or whether a 24 kHz HuBERT is needed?
-
Obviously the semantic tokens in the original paper are difficult to reproduce because w2v-BERT XL is closed source. However, I'm not sure wav2vec 2.0 features are the most appropriate replacement. AudioLM was heavily influenced by Text-Free Prosody-Aware Generative Spoken Language Modeling (paper, code); indeed, the lead author of that paper left FAIR, joined Google, and contributed to AudioLM.
In that paper, they use clustered features from an intermediate layer (layer 6, I believe) of HuBERT (paper, code, HF). This model and this clustering approach have been used in many speech-related papers over the last year, including several from FAIR (if I have time, I'll find some references). For phoneme recognition, for example, HuBERT does a better job than wav2vec 2.0 (see this leaderboard). If you're wondering, WavLM is essentially HuBERT with more data and a couple more tricks, and it's also available, but there are more papers using HuBERT.
Facebook even released a small library (which you may well already be aware of) that makes it easy to extract clustered features using pre-trained HuBERT and pre-trained k-means models. The only issue is that most of the clustering models have considerably fewer centroids than AudioLM (the exception being the mHuBERT 1000 model).
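For reference, here's a rough sketch of that extraction pipeline (my own approximation, not the library's actual code; it assumes a 16 kHz waveform, the HuggingFace HuBERT base checkpoint, and a pre-trained k-means model saved with joblib, with "km.bin" as a placeholder path):

```python
import torch
import joblib
from transformers import HubertModel, Wav2Vec2FeatureExtractor

hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()
processor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
kmeans = joblib.load("km.bin")  # placeholder path to a pre-trained k-means model

wav = torch.randn(16000)  # stand-in for 1 second of 16 kHz mono audio

with torch.no_grad():
    inputs = processor(wav.numpy(), sampling_rate = 16000, return_tensors = "pt")
    hidden_states = hubert(**inputs, output_hidden_states = True).hidden_states
    layer6 = hidden_states[6][0]  # (num_frames, 768) features from intermediate layer 6

semantic_tokens = kmeans.predict(layer6.numpy())  # one cluster id per 20 ms frame
semantic_tokens = torch.unique_consecutive(torch.from_numpy(semantic_tokens))
```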
Anyway, I guess all this is to say that perhaps HuBERT (or potentially WavLM) would be a better base for semantic tokens than wav2vec 2.0. Unfortunately, all three models have a higher frame rate (50 Hz) than w2v-BERT (25 Hz), which will make the sequences longer.
Thanks so much for your work on this!