Generate precise word-level timings from audio and text input. Built using torchaudio's MMS model.
The model aligns each word of a transcript to its position in the audio recording. The resulting per-word timings are useful for:
- Generating accurate subtitles/captions
- Creating word-level audio segmentation
- Synchronizing text with audio
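As an illustration of the subtitle use case, the sketch below turns a list of word timings into SRT cues. The input schema (a list of dicts with `word`, `start`, and `end` keys, times in seconds) is an assumption made for this example and may not match the model's actual output; `format_timestamp`, `words_to_srt`, and `max_words` are hypothetical names.

```python
# Minimal sketch: group word-level timings into SRT subtitle cues.
# ASSUMPTION: each word is a dict {"word": str, "start": float, "end": float}
# with times in seconds. Adapt the keys to the model's real output format.


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Build SRT text with at most `max_words` words per cue."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        # A cue spans from the first word's start to the last word's end.
        start = format_timestamp(chunk[0]["start"])
        end = format_timestamp(chunk[-1]["end"])
        text = " ".join(w["word"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(cues)


if __name__ == "__main__":
    sample = [
        {"word": "Hello", "start": 0.0, "end": 0.5},
        {"word": "world", "start": 0.5, "end": 1.25},
    ]
    print(words_to_srt(sample, max_words=2))
```

Grouping by a fixed word count keeps the sketch short; a real captioning tool would more likely break cues on punctuation or pauses between words.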
Try it out on Replicate!
- Install Cog:
sudo curl -o /usr/local/bin/cog -L "https://github.com/replicate/cog/releases/latest/download/cog_$(uname -s)_$(uname -m)"
sudo chmod +x /usr/local/bin/cog
- Run predictions:
cog predict -i audio=@audio.mp3 -i script="Your transcript here"
- Push to Replicate:
cog push r8.im/username/forced-alignment
MIT License