fast_align is a simple, fast, unsupervised word aligner.
If you use this software, please cite:
- Chris Dyer, Victor Chahuneau, and Noah A. Smith. (2013). A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proc. of NAACL.
The source code in this repository is provided under the terms of the Apache License, Version 2.0.
A variant of fast_align is included in the cdec translation system. It uses the same model and produces identical alignments, but it has a few extra features for online alignment with pre-built models.
Input to fast_align must be tokenized and aligned into parallel sentences. Each line is a source language sentence and its target language translation, separated by a triple pipe symbol with leading and trailing white space ( ||| ). An example is as follows.
doch jetzt ist der Held gefallen . ||| but now the hero has fallen .
neue Modelle werden erprobt . ||| new models are being tested .
doch fehlen uns neue Ressourcen . ||| but we lack new resources .
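The line format above can be produced and checked with a few lines of Python. This is a minimal sketch, not part of fast_align: the helper names format_pair and parse_pair are hypothetical, and it assumes tokens are separated by single spaces and never contain the separator themselves.

```python
SEPARATOR = " ||| "  # triple pipe with leading and trailing white space


def format_pair(source_tokens, target_tokens):
    """Join two token lists into one input line (hypothetical helper)."""
    return " ".join(source_tokens) + SEPARATOR + " ".join(target_tokens)


def parse_pair(line):
    """Split one input line into (source_tokens, target_tokens)."""
    parts = line.rstrip("\n").split(SEPARATOR)
    if len(parts) != 2:
        raise ValueError(f"expected exactly one ' ||| ' separator: {line!r}")
    source, target = parts
    return source.split(), target.split()


if __name__ == "__main__":
    src, tgt = parse_pair(
        "neue Modelle werden erprobt . ||| new models are being tested ."
    )
    print(src)
    print(tgt)
    print(format_pair(src, tgt))
```

Running parse_pair over every line of a corpus before alignment is a cheap way to catch lines with a missing or duplicated separator.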
fast_align requires only a C++ compiler; it can be compiled by typing make at the command line prompt. Run fast_align to see a list of command line options.
The usual way to run fast_align to generate source–target alignments is:
./fast_align -i text.fr-en -d -o -v > forward.align
The usual way to run fast_align to generate target–source alignments is:
./fast_align -i text.fr-en -d -o -v -r > reverse.align
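For scripted pipelines, the two commands above could be wrapped in Python roughly as follows. This is only a sketch: the helper names fast_align_command and run_fast_align are hypothetical, the binary is assumed to sit at ./fast_align, and only the flags shown in the commands above are used.

```python
import subprocess


def fast_align_command(corpus_path, reverse=False, binary="./fast_align"):
    """Build the argument list for one alignment run (hypothetical helper)."""
    cmd = [binary, "-i", corpus_path, "-d", "-o", "-v"]
    if reverse:
        cmd.append("-r")  # target-source direction
    return cmd


def run_fast_align(corpus_path, output_path, reverse=False, binary="./fast_align"):
    """Run fast_align and redirect its standard output to output_path."""
    with open(output_path, "w", encoding="utf-8") as out:
        subprocess.run(
            fast_align_command(corpus_path, reverse, binary),
            stdout=out,
            check=True,  # raise CalledProcessError on a non-zero exit status
        )


if __name__ == "__main__":
    run_fast_align("text.fr-en", "forward.align")
    run_fast_align("text.fr-en", "reverse.align", reverse=True)
```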
fast_align produces one line of output per input sentence pair in the i-j "Pharaoh" format, where a pair i-j indicates that the ith word of the source sentence is aligned to the jth word of the target sentence, with both positions counted from zero. For example, a good alignment of the above example corpus would be:
0-0 1-1 2-4 3-2 4-3 5-5 6-6
0-0 1-1 2-2 2-3 3-4 4-5
0-0 1-2 2-1 3-3 4-4 5-5
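To make the index pairs easier to read, they can be mapped back onto the words they link. The following is a minimal sketch with hypothetical helper names (parse_alignment, aligned_words); it assumes zero-based indices, as in the example output above, and white-space tokenization.

```python
def parse_alignment(line):
    """Parse one Pharaoh-format line into a list of (i, j) index pairs."""
    pairs = []
    for link in line.split():
        i, j = link.split("-")
        pairs.append((int(i), int(j)))
    return pairs


def aligned_words(source_tokens, target_tokens, pairs):
    """Replace each (i, j) index pair with the words it links."""
    return [(source_tokens[i], target_tokens[j]) for i, j in pairs]


if __name__ == "__main__":
    source = "doch fehlen uns neue Ressourcen .".split()
    target = "but we lack new resources .".split()
    pairs = parse_alignment("0-0 1-2 2-1 3-3 4-4 5-5")
    for src_word, tgt_word in aligned_words(source, target, pairs):
        print(src_word, "->", tgt_word)
```

For the third sentence pair of the example corpus, this links fehlen to lack and uns to we, reflecting the word-order difference between the two languages.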
The development of this software was sponsored by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant number W911NF-10-1-0533.