The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language. NAACL 2024.
We are gradually releasing the data and code. Thank you for your patience.
git clone
cd clap-ipa
pip install .
import torch
import torch.nn.functional as F
from transformers import DebertaV2Tokenizer, AutoProcessor
from clap.encoders import *
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
speech_encoder = SpeechEncoder.from_pretrained('anyspeech/clap-ipa-tiny-speech')
phone_encoder = PhoneEncoder.from_pretrained('anyspeech/clap-ipa-tiny-phone')
tokenizer = DebertaV2Tokenizer.from_pretrained('charsiu/IPATokenizer')
processor = AutoProcessor.from_pretrained('openai/whisper-tiny')
audio_input = processor(some_audio)
ipa_input = tokenizer(some_ipa_string)
with torch.no_grad():
    speech_embed = speech_encoder(audio_input)
    phone_embed = phone_encoder(ipa_input)
similarity = F.cosine_similarity(speech_embed, phone_embed, dim=-1)
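To make the final similarity step concrete, here is a minimal, framework-free sketch of ranking candidate keywords against one speech embedding by cosine similarity. The vectors are made-up toy values rather than model outputs, and `cosine_similarity` and `rank_keywords` are hypothetical helper names that are not part of this repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_keywords(speech_embed, phone_embeds):
    """Sort candidate IPA keywords by similarity to the speech embedding, best first."""
    scores = {ipa: cosine_similarity(speech_embed, vec) for ipa, vec in phone_embeds.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy 3-dimensional embeddings (assumed values, for illustration only).
speech_embed = [0.9, 0.1, 0.0]
phone_embeds = {
    "kæt": [1.0, 0.0, 0.0],
    "dɔɡ": [0.0, 1.0, 0.0],
    "bɜːd": [0.0, 0.0, 1.0],
}

ranking = rank_keywords(speech_embed, phone_embeds)
best_keyword, best_score = ranking[0]
```

With real encoders, the same ranking would be applied to the embeddings returned by `speech_encoder` and `phone_encoder`; a detection threshold on the score would have to be tuned on held-out data.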
For IPA-Aligner
import torch
import torch.nn.functional as F
from clap.encoders import *
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
speech_encoder = SpeechEncoder.from_pretrained('anyspeech/clap-ipa-tiny-speech')
phone_encoder = PhoneEncoder.from_pretrained('anyspeech/clap-ipa-tiny-phone')
Forced-alignment code is in evaluate/. This aligner will be incorporated into charsiu in the coming months.
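For readers new to forced alignment, the general idea can be sketched as a monotonic dynamic program over a frame-by-phone similarity matrix. This is a generic illustration under stated assumptions, not the code in evaluate/: the function names `align` and `to_segments` are hypothetical, the similarity values are invented, and every phone is assumed to cover at least one frame.

```python
def align(sim):
    """Assign each frame to one phone, in order, maximizing total similarity.

    sim[t][p] is the similarity between frame t and phone p.
    Returns a list with the phone index chosen for every frame.
    """
    n_frames, n_phones = len(sim), len(sim[0])
    if n_frames < n_phones:
        raise ValueError("need at least one frame per phone")
    neg_inf = float("-inf")
    score = [[neg_inf] * n_phones for _ in range(n_frames)]
    advanced = [[False] * n_phones for _ in range(n_frames)]
    score[0][0] = sim[0][0]  # the first frame must belong to the first phone
    for t in range(1, n_frames):
        for p in range(n_phones):
            stay = score[t - 1][p]
            move = score[t - 1][p - 1] if p > 0 else neg_inf
            if move > stay:
                score[t][p] = move + sim[t][p]
                advanced[t][p] = True
            else:
                score[t][p] = stay + sim[t][p]
    # Backtrack from the last frame, which must belong to the last phone.
    path = [0] * n_frames
    p = n_phones - 1
    for t in range(n_frames - 1, -1, -1):
        path[t] = p
        if advanced[t][p]:
            p -= 1
    return path


def to_segments(path, phones):
    """Collapse a per-frame path into (phone, start_frame, end_frame) with exclusive end."""
    segments, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            segments.append((phones[path[start]], start, t))
            start = t
    return segments


# Toy similarity matrix: 5 frames by 2 phones (assumed values).
sim = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9], [0.2, 0.8]]
path = align(sim)
segments = to_segments(path, ["k", "æ"])
```

Frame indices can be converted to timestamps by multiplying by the frame stride of the speech encoder, which depends on the model and is not specified here.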
For training, you can download the data from the Hugging Face Hub. Sample train/val filelists are available in data/.
python -c config/clap_ipa/base.yaml
Evaluation code is available in evaluate. Each evaluation script has almost the same organization, so you can simply pass the .ckpt checkpoint saved after training to evaluate its performance. Please check the evaluation code for usage.
python --data ucla --checkpoint "last.ckpt"
Model | Phone Encoder | Speech Encoder |
CLAP-IPA-tiny | anyspeech/clap-ipa-tiny-phone | anyspeech/clap-ipa-tiny-speech |
CLAP-IPA-base | anyspeech/clap-ipa-base-phone | anyspeech/clap-ipa-base-speech |
CLAP-IPA-small | anyspeech/clap-ipa-small-phone | anyspeech/clap-ipa-small-speech |
IPA-Aligner-tiny | anyspeech/ipa-align-tiny-phone | anyspeech/ipa-align-tiny-speech |
IPA-Aligner-base | anyspeech/ipa-align-base-phone | anyspeech/ipa-align-base-speech |
IPA-Aligner-small | anyspeech/ipa-align-small-phone | anyspeech/ipa-align-base-speech |
All datasets are distributed as WebDataset (wds) files on the Hugging Face Hub.
After this study, we found that these datasets still contain inconsistent Unicode encodings of IPA symbols.
A cleaner version will be released when we finish another round of data cleaning.
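Until the cleaner version is available, one way to reduce such inconsistencies on your own side is Unicode normalization plus a small lookalike mapping. The sketch below is an assumption about what "inconsistent encoding" may involve, not the authors' cleaning procedure; `normalize_ipa` is a hypothetical helper and the lookalike table is deliberately incomplete.

```python
import unicodedata

# ASCII characters that are often typed in place of their IPA counterparts
# (assumed, incomplete mapping for illustration).
LOOKALIKES = {
    "g": "\u0261",  # LATIN SMALL LETTER G -> LATIN SMALL LETTER SCRIPT G
    ":": "\u02d0",  # COLON -> MODIFIER LETTER TRIANGULAR COLON (length mark)
}


def normalize_ipa(text):
    """Apply NFC normalization, then replace known ASCII lookalikes."""
    text = unicodedata.normalize("NFC", text)
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)


composed = "\u00e3"      # nasalized a as a single precomposed code point
decomposed = "a\u0303"   # "a" followed by a combining tilde

same_before = composed == decomposed
same_after = normalize_ipa(composed) == normalize_ipa(decomposed)
```

Whatever normalization you choose, apply it identically at training and inference time so that the tokenizer sees the same code points in both settings.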
from huggingface_hub import snapshot_download
snapshot_download(repo_id="anyspeech/fleurs_ipa", repo_type="dataset", local_dir="your_own_folder", local_dir_use_symlinks=False, resume_download=False, max_workers=4)
import webdataset as wds # Note the typical import shorthand
dataset = (
    wds.WebDataset("data-archives/shard-00{00..24}.tar") # 25 shards; brace ranges use two dots
    .decode() # Automagically decode files
    .shuffle(size=1000) # Shuffle on-the-fly in a buffer
    .batch(batchsize=10) # Create batches
)
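It can help to know what a shard looks like on disk: a WebDataset shard is a plain tar archive in which files sharing the same basename form one sample. The sketch below builds a toy shard in memory and regroups it using only the standard library. The `wav` and `ipa` extensions and the `utt0001` keys are assumptions for illustration; the actual field names in the released datasets may differ.

```python
import io
import tarfile
from collections import defaultdict


def build_toy_shard(samples):
    """Write {key: {extension: bytes}} into an in-memory tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key, files in samples.items():
            for ext, payload in files.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def group_by_key(shard):
    """Group tar members into samples by the part of the name before the first dot."""
    grouped = defaultdict(dict)
    with tarfile.open(fileobj=shard, mode="r") as tar:
        for member in tar.getmembers():
            key, ext = member.name.split(".", 1)
            grouped[key][ext] = tar.extractfile(member).read()
    return dict(grouped)


toy = {
    "utt0001": {"wav": b"fake-audio", "ipa": "kæt".encode("utf-8")},
    "utt0002": {"wav": b"more-audio", "ipa": "dɔɡ".encode("utf-8")},
}
samples = group_by_key(build_toy_shard(toy))
```

Running `group_by_key` on one downloaded shard (opened in binary mode) is a quick way to check which fields each sample carries before writing a full loading pipeline.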
Jian Zhu, Changbing Yang, Farhan Samir, and Jahurul Islam. 2024. The taste of IPA: Towards open-vocabulary keyword
spotting and forced alignment in any language. In Proceedings of the 2024 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),
pages 750–772, Mexico City, Mexico. Association for Computational Linguistics.