This project applies state-of-the-art sentence alignment tools to subtitle extraction and alignment, achieving a substantial improvement in subtitle alignment quality. By combining sentence embeddings, dynamic programming, cosine similarity, and partitioning, we attained F1 scores exceeding 93% and estimate an overall improvement of 31% relative to other subtitle alignment techniques.
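
To give a feel for how cosine similarity and dynamic programming fit together, here is a minimal sketch. It is not the algorithm used by the scripts in this repository: it only handles one-to-one alignments with skips, the toy vectors stand in for real sentence embeddings, and the names `cosine`, `align`, and `skip_penalty` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(src_vecs, tgt_vecs, skip_penalty=0.3):
    """Monotonic one-to-one alignment by dynamic programming.

    Returns a list of (source_index, target_index) pairs. Sentences that
    have no good partner are skipped at a cost of skip_penalty (an assumed,
    illustrative value).
    """
    n, m = len(src_vecs), len(tgt_vecs)
    # score[i][j]: best total for the first i source and first j target sentences
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:
                sim = cosine(src_vecs[i - 1], tgt_vecs[j - 1])
                options.append((score[i - 1][j - 1] + sim, "pair"))
            if i > 0:
                options.append((score[i - 1][j] - skip_penalty, "skip_src"))
            if j > 0:
                options.append((score[i][j - 1] - skip_penalty, "skip_tgt"))
            score[i][j], back[i][j] = max(options)

    # Follow the back-pointers from the end to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "pair":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Toy "embeddings": the second source sentence has no counterpart.
src = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
tgt = [[1, 0, 0], [0, 0, 1]]
pairs = align(src, tgt)
```

With these toy vectors the first and third source sentences pair with the two target sentences, and the second source sentence is skipped.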
The scripts directory contains the tools needed to import subtitles from a directory structure provided by OpenSubtitles, align sentences, run evaluation against gold-labeled alignments, and so on. See the README in the scripts directory for usage.
The gold directory contains gold alignments for 5 titles. The alignments are in each title's subdirectory, in files with names like eng-spa-gold.txt and eng-ger-gold.txt. The subtitles themselves are in the subdirectories eng, spa, ger, etc.
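
As an illustration of what evaluating against gold alignments involves, the sketch below computes precision, recall, and F1 under strict exact-match scoring. It makes assumptions that may not match the evaluation script in this repository: each alignment is represented as a pair of tuples of line indices, and the name `score_alignments` is hypothetical. It does not parse the gold files.

```python
def score_alignments(hypothesis, gold):
    """Exact-match precision, recall, and F1 between two sets of alignments.

    Each alignment is assumed to be a pair (source_indices, target_indices),
    where both elements are tuples of line numbers.
    """
    hyp, ref = set(hypothesis), set(gold)
    correct = len(hyp & ref)
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical data: the hypothesis misses the 2-to-1 alignment in the gold set.
gold = [((0,), (0,)), ((1, 2), (1,)), ((3,), (2,))]
hyp = [((0,), (0,)), ((1,), (1,)), ((3,), (2,))]
precision, recall, f1 = score_alignments(hyp, gold)
```

Here two of the three hypothesis alignments match the gold set exactly, so precision, recall, and F1 are all 2/3.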
An annotation tool is also provided, implemented in Python with curses. You must first run scripts/run_vecalign.py on the title you want to annotate in order to generate the hypothesis alignments. The annotation tool (scripts/annotator.py) then loads those alignments into a Vim-like editor where you can approve, edit, or delete them. The tool supports the following operations:
| Key | Action |
|---|---|
| d | Delete current alignment. |
| e | Edit current alignment. Will open the current alignment in Vim. |
| u | Union (merge) current subtitle with the following subtitle. |
| s | Split alignment into two. This duplicates the current alignment, allowing you to edit both it and the subsequent duplicate. Ideal for splitting alignments when multiple sentences have been merged together. |
| w | Write (save) all alignments including those that have not yet been reviewed. |
| n | Move to Next alignment. |
| p | Move to Previous alignment. |
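
The delete, union, and split operations in the table above can be sketched as pure functions on a list of alignments. This is not the code in scripts/annotator.py: the function names are hypothetical, each alignment is assumed to be a (source_text, target_text) pair, and union is assumed to join both sides with a space.

```python
def delete(alignments, i):
    """Return a copy of the list without alignment i."""
    return alignments[:i] + alignments[i + 1:]


def union(alignments, i):
    """Merge alignment i with the one that follows it.

    Assumption: both the source and target sides are concatenated.
    """
    if i + 1 >= len(alignments):
        return list(alignments)  # nothing follows, so nothing to merge
    (src_a, tgt_a), (src_b, tgt_b) = alignments[i], alignments[i + 1]
    merged = (f"{src_a} {src_b}", f"{tgt_a} {tgt_b}")
    return alignments[:i] + [merged] + alignments[i + 2:]


def split(alignments, i):
    """Duplicate alignment i so each copy can be edited down to one half."""
    return alignments[:i + 1] + [alignments[i]] + alignments[i + 1:]


# Hypothetical example data.
rows = [("Hello.", "Hola."), ("How are", "¿Cómo"), ("you?", "estás?")]
merged = union(rows, 1)
duplicated = split(rows, 0)
trimmed = delete(rows, 2)
```

Returning new lists rather than mutating in place is only a design choice for this sketch; it keeps each operation easy to test in isolation.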
Once alignments are generated they will look like the following (2 separate files where the line numbers correspond):
