Subtitle Alignment

About

This project applies state-of-the-art sentence alignment tools to subtitle extraction and alignment, substantially improving subtitle alignment quality. Using sentence embeddings, cosine similarity, dynamic programming, and partitioning, we attained F1 scores exceeding 93% and estimate an overall improvement of 31% over other subtitle alignment techniques.
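
To illustrate how cosine similarity and dynamic programming combine, here is a minimal sketch under simplifying assumptions. It is not the pipeline used in this repository: it only produces 1-to-1 matches, the vectors are toy stand-ins for sentence embeddings, and the names `cosine`, `align`, and `skip_penalty` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(src_vecs, tgt_vecs, skip_penalty=0.3):
    """Return monotonic 1-to-1 (source, target) index pairs.

    Maximizes the summed cosine similarity of matched pairs, charging
    skip_penalty (an assumed value) for every unmatched sentence.
    """
    n, m = len(src_vecs), len(tgt_vecs)
    # score[i][j]: best total for the first i source and first j target items.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = score[i - 1][0] - skip_penalty, "skip_src"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = score[0][j - 1] - skip_penalty, "skip_tgt"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (score[i - 1][j - 1] + cosine(src_vecs[i - 1], tgt_vecs[j - 1]), "match"),
                (score[i - 1][j] - skip_penalty, "skip_src"),
                (score[i][j - 1] - skip_penalty, "skip_tgt"),
            ]
            score[i][j], back[i][j] = max(options, key=lambda o: o[0])
    # Walk the back-pointers from the end to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


source = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
target = [[0.9, 0.1, 0], [0, 0, 1]]
print(align(source, target))
```

In this toy input the second source sentence has no good counterpart, so the sketch skips it rather than forcing a low-similarity match.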

Scripts for preprocessing and aligning subtitles

The scripts directory contains the tools needed to import subtitles from the directory structure provided by OpenSubtitles, align sentences, and run evaluation against gold-labeled alignments, among other tasks. See the README in the scripts directory for usage.

Gold Standard Subtitle Alignments

Gold alignments for 5 titles are in the gold directory. Within each title's subdirectory, the alignment files have names like eng-spa-gold.txt and eng-ger-gold.txt. The subtitles themselves are in the subdirectories eng, spa, ger, etc.
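
As a rough illustration of scoring hypothesis alignments against gold ones, the sketch below computes precision, recall, and F1 by exact match. It makes two assumptions: an alignment is represented as a pair of index tuples, and a hypothesis counts as correct only if it matches a gold alignment exactly. The evaluation script in this repository may parse the gold files and score matches differently, and `alignment_f1` is a hypothetical name.

```python
def alignment_f1(hypothesis, gold):
    """Return (precision, recall, f1) for exact-match alignments.

    Each alignment is assumed to be a hashable pair such as
    ((source indices...), (target indices...)).
    """
    hyp, ref = set(hypothesis), set(gold)
    true_positives = len(hyp & ref)
    precision = true_positives / len(hyp) if hyp else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [((0,), (0,)), ((1, 2), (1,)), ((3,), (2,))]
hypothesis = [((0,), (0,)), ((1,), (1,)), ((3,), (2,))]
precision, recall, f1 = alignment_f1(hypothesis, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Here the hypothesis misses the 2-to-1 gold alignment, so two of three alignments agree on each side.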

SubAlign Annotation tool

The annotation tool is implemented in Python with curses. First run scripts/run_vecalign.py on the title you want to annotate to generate the hypothesis alignments. The annotation tool (scripts/annotator.py) then loads those alignments into a vim-like editor where you can approve, edit, or delete them. It supports the following operations:

Key	Action
d	Delete current alignment.
e	Edit current alignment. Will open the current alignment in Vim.
u	Union (merge) current subtitle with the following subtitle.
s	Split alignment into two. This duplicates the current alignment so that you can edit both it and the duplicate that follows. Ideal for splitting alignments when multiple sentences have been merged together.
w	Write (save) all alignments including those that have not yet been reviewed.
n	Move to Next alignment.
p	Move to Previous alignment.
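
The delete, union, and split operations in the table can be pictured as edits to a list of (source, target) pairs. The sketch below is one possible reading, not the code in scripts/annotator.py: it assumes union joins both sides of the current alignment with the next one, and the function names are hypothetical.

```python
def delete(alignments, index):
    """Return a copy of the list without the alignment at index."""
    return alignments[:index] + alignments[index + 1:]


def union(alignments, index):
    """Merge the alignment at index with the one after it (assumed: both sides)."""
    if index + 1 >= len(alignments):
        return list(alignments)  # nothing follows, so nothing to merge
    (src_a, tgt_a), (src_b, tgt_b) = alignments[index], alignments[index + 1]
    merged = (f"{src_a} {src_b}", f"{tgt_a} {tgt_b}")
    return alignments[:index] + [merged] + alignments[index + 2:]


def split(alignments, index):
    """Duplicate the alignment at index so each copy can be edited separately."""
    return alignments[:index + 1] + [alignments[index]] + alignments[index + 1:]


alignments = [("Hello.", "Hola."), ("How are", "¿Cómo"), ("you?", "estás?")]
print(union(alignments, 1))
print(split(alignments, 0))
print(delete(alignments, 2))
```

Each function returns a new list, which keeps the original available if an edit needs to be abandoned.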


What alignments look like

Once alignments are generated, they are written to two separate files whose line numbers correspond: line N of one file is aligned with line N of the other.
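
Because the two files correspond line by line, they can be read back in lockstep. The following is a minimal sketch; `pair_lines` is a hypothetical helper, and the sample sentences are made up for illustration rather than taken from the repository.

```python
from itertools import zip_longest


def pair_lines(source_lines, target_lines):
    """Pair up two line-parallel iterables (open files or lists of lines)."""
    pairs = []
    for number, (src, tgt) in enumerate(zip_longest(source_lines, target_lines), start=1):
        if src is None or tgt is None:
            # One side ran out first, so the files are not line-parallel.
            raise ValueError(f"inputs differ in length at line {number}")
        pairs.append((src.rstrip("\n"), tgt.rstrip("\n")))
    return pairs


english = ["Where are you going?\n", "I don't know.\n"]
spanish = ["¿A dónde vas?\n", "No lo sé.\n"]
for src, tgt in pair_lines(english, spanish):
    print(f"{src}\t{tgt}")
```

To use it on real output, pass two file objects opened with `encoding="utf-8"` in place of the lists.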
