zaemyung/fast_align

fast_align

fast_align is a simple, fast, unsupervised word aligner.

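If you use this software, please cite:

Chris Dyer, Victor Chahuneau, and Noah A. Smith. (2013). A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proc. of NAACL.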

The source code in this repository is provided under the terms of the Apache License, Version 2.0.

A variant of fast_align is included in the cdec translation system. It uses the same model and produces identical alignments, but it has a few extra features for online alignment with pre-built models.

Input format

Input to fast_align must be tokenized and aligned into parallel sentences. Each line holds a source-language sentence and its target-language translation, separated by a triple pipe symbol with a space on each side ( ||| ). An example 3-sentence German–English parallel corpus is:

doch jetzt ist der Held gefallen . ||| but now the hero has fallen .
neue Modelle werden erprobt . ||| new models are being tested .
doch fehlen uns neue Ressourcen . ||| but we lack new resources .
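
As a minimal sketch of the format described above, the following Python helpers write and read one corpus line. The function names format_pair and parse_pair are hypothetical and are not part of fast_align; the only things taken from this README are the whitespace-separated tokens and the " ||| " separator.

```python
SEPARATOR = " ||| "  # triple pipe with a space on each side, as described above


def format_pair(source_tokens, target_tokens):
    """Join two token lists into one fast_align input line (hypothetical helper)."""
    if not source_tokens or not target_tokens:
        raise ValueError("both sides must contain at least one token")
    for tok in list(source_tokens) + list(target_tokens):
        # A token must be non-empty, contain no whitespace, and not be the separator itself.
        if not tok or any(ch.isspace() for ch in tok) or tok == "|||":
            raise ValueError(f"invalid token: {tok!r}")
    return " ".join(source_tokens) + SEPARATOR + " ".join(target_tokens)


def parse_pair(line):
    """Split one input line back into (source_tokens, target_tokens)."""
    source, sep, target = line.rstrip("\n").partition(SEPARATOR)
    if not sep:
        raise ValueError("missing ' ||| ' separator")
    return source.split(), target.split()
```

For example, format_pair(["neue", "Modelle"], ["new", "models"]) produces the line "neue Modelle ||| new models", and parse_pair reverses it. Tokenization itself is assumed to have happened beforehand.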

Compiling and using fast_align

Building fast_align requires only a C++ compiler; this can be done by typing make at the command line prompt. Run fast_align to see a list of command line options.

fast_align generates asymmetric alignments: treating the left or the right language in the parallel corpus as the primary language being modeled yields slightly different alignments. The usually recommended way to generate source–target (left language–right language) alignments is:

./fast_align -i text.fr-en -d -o -v > forward.align

The usually recommended way to run fast_align to generate target–source alignments is:

./fast_align -i text.fr-en -d -o -v -r > reverse.align
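
When both directions are run from a script, the two commands above can be built from one helper. This is only a sketch: fast_align_command and run_fast_align are hypothetical names, the binary is assumed to sit at ./fast_align, and the flags are copied verbatim from the commands above (-r being the only difference between the two directions).

```python
import subprocess


def fast_align_command(corpus_path, reverse=False, binary="./fast_align"):
    """Build the argument list for the recommended invocation (hypothetical helper)."""
    cmd = [binary, "-i", corpus_path, "-d", "-o", "-v"]
    if reverse:
        cmd.append("-r")  # target-source direction, as in the second command above
    return cmd


def run_fast_align(corpus_path, output_path, reverse=False):
    """Run fast_align and redirect its standard output to output_path."""
    with open(output_path, "w", encoding="utf-8") as out:
        subprocess.run(fast_align_command(corpus_path, reverse), stdout=out, check=True)
```

Calling run_fast_align("text.fr-en", "forward.align") and then run_fast_align("text.fr-en", "reverse.align", reverse=True) would mirror the two shell commands, assuming the binary has been built in the current directory.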

Output

fast_align produces output in the i-j "Pharaoh" format, where a pair i-j indicates that the ith word of the left sentence (by convention, the "source") is aligned to the jth word of the right sentence (by convention, the "target"). Word positions are counted from zero. For example, a good alignment of the above example corpus would be:

0-0 1-1 2-4 3-2 4-3 5-5 6-6
0-0 1-1 2-2 2-3 3-4 4-5
0-0 1-2 2-1 3-3 4-4 5-5
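
To make the index pairs concrete, here is a small Python sketch that turns one alignment line back into word pairs. The helper names parse_alignment and aligned_words are hypothetical; the sketch assumes zero-based indices, as in the example above, and one alignment line per corpus line.

```python
def parse_alignment(line):
    """Parse a line such as '0-0 1-2' into a list of (i, j) index pairs."""
    pairs = []
    for item in line.split():
        i, j = item.split("-")
        pairs.append((int(i), int(j)))
    return pairs


def aligned_words(corpus_line, alignment_line):
    """Map each i-j pair to the (left word, right word) it refers to."""
    source_text, target_text = corpus_line.rstrip("\n").split(" ||| ")
    source, target = source_text.split(), target_text.split()
    return [(source[i], target[j]) for i, j in parse_alignment(alignment_line)]
```

Applied to the third example sentence and its alignment line, this pairs "fehlen" with "lack" (1-2) and "uns" with "we" (2-1), which shows how the format expresses word reordering.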

Acknowledgements

The development of this software was sponsored in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant number W911NF-10-1-0533.
