This repository contains the implementation of the paper:
STTATTS: Unified Speech-To-Text and Text-To-Speech Model
MBZUAI
EMNLP 2024 (Findings)
- Oct 2024: Preprint released on arXiv
Fine-tuned checkpoints are available for Arabic and English-small. To fine-tune on your own dataset, download the pretrained checkpoints, tokenizer, and dictionary from ArTST and SpeechT5. See the fine-tuning scripts here. Installation and inference follow the ArTST repo.
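Before launching fine-tuning, it can help to verify that a downloaded checkpoint loads correctly. The snippet below is a minimal sketch, assuming a standard PyTorch/fairseq-style `.pt` checkpoint as used by ArTST and SpeechT5; the file name is hypothetical, so point it at the checkpoint you actually downloaded.

```python
# Minimal sketch (not the official workflow): load a downloaded
# checkpoint and inspect its contents before fine-tuning.
# The path below is hypothetical -- substitute your own checkpoint.
import torch

ckpt_path = "checkpoints/sttatts_arabic.pt"  # hypothetical file name
ckpt = torch.load(ckpt_path, map_location="cpu")

# fairseq-style checkpoints typically keep the weights under "model",
# alongside metadata such as "cfg" and "optimizer_history".
print(list(ckpt.keys()))
state_dict = ckpt.get("model", ckpt)
print(f"{len(state_dict)} parameter tensors")
```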
STTATTS is built on ArTST and SpeechT5. If you use any of the STTATTS models, please cite the following papers:
@misc{toyin2024sttattsunifiedspeechtotexttexttospeech,
  title={STTATTS: Unified Speech-To-Text And Text-To-Speech Model},
  author={Toyin, Hawau Olamide and Li, Hao and Aldarmaki, Hanan},
  year={2024},
  eprint={2410.18607},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.18607},
}

@inproceedings{toyin2023artst,
  title={ArTST: Arabic Text and Speech Transformer},
  author={Toyin, Hawau and Djanibekov, Amirbek and Kulkarni, Ajinkya and Aldarmaki, Hanan},
  booktitle={Proceedings of ArabicNLP 2023},
  pages={41--51},
  year={2023}
}

@article{ao2021speecht5,
  title={Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing},
  author={Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and others},
  journal={arXiv preprint arXiv:2110.07205},
  year={2021}
}