zaemyung/fast_align

fast_align

fast_align is a simple, fast, unsupervised word aligner.

If you use this software, please cite:

The source code in this repository is provided under the terms of the Apache License, Version 2.0.

A variant of fast_align is included in the cdec translation system. It uses the same model and produces identical alignments, but it has a few extra features for online alignment with pre-built models.

Input format

Input to fast_align must be tokenized and aligned into parallel sentences. Each line holds a source-language sentence and its target-language translation, separated by a triple pipe symbol with leading and trailing white space ( ||| ). For example:

doch jetzt ist der Held gefallen . ||| but now the hero has fallen .
neue Modelle werden erprobt . ||| new models are being tested .
doch fehlen uns neue Ressourcen . ||| but we lack new resources .
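
As a rough illustration of the format described above, the following Python sketch joins already-tokenized sentences into one input line and splits such a line back into its two sides. The helper names `format_pair` and `parse_pair` are hypothetical and not part of fast_align, and the rejection of empty sides is a precaution assumed here rather than documented behavior.

```python
SEPARATOR = " ||| "  # triple pipe with leading and trailing white space


def format_pair(source_tokens, target_tokens):
    """Join two token lists into one line of parallel input."""
    # Assumption: an empty side is treated as an error by this helper.
    if not source_tokens or not target_tokens:
        raise ValueError("both sides need at least one token")
    return " ".join(source_tokens) + SEPARATOR + " ".join(target_tokens)


def parse_pair(line):
    """Split one input line back into (source_tokens, target_tokens)."""
    parts = line.rstrip("\n").split(SEPARATOR)
    if len(parts) != 2:
        raise ValueError("expected exactly one ' ||| ' separator")
    source, target = parts
    return source.split(), target.split()
```

Writing one `format_pair` result per line to a file gives a corpus in the shape shown in the example above.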

Compiling and using fast_align

fast_align requires only a C++ compiler; it can be compiled by typing make at the command line prompt. Run fast_align to see a list of command line options.

The usual way to run fast_align to generate source–target alignments is:

./fast_align -i text.fr-en -d -o -v > forward.align

The usual way to run fast_align to generate target–source alignments is:

./fast_align -i text.fr-en -d -o -v -r > reverse.align
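
If the two runs need to be scripted, a minimal Python sketch along these lines may help. It only assembles the same command lines shown above and redirects standard output to a file. The function names `build_command` and `run_alignment` are hypothetical, and the default binary location `./fast_align` is assumed from the commands above.

```python
import subprocess


def build_command(corpus_path, reverse=False, binary="./fast_align"):
    """Return the argument list for a forward or reverse alignment run."""
    command = [binary, "-i", corpus_path, "-d", "-o", "-v"]
    if reverse:
        command.append("-r")  # same flags plus -r for target-source alignments
    return command


def run_alignment(corpus_path, output_path, reverse=False):
    """Run fast_align and write its standard output to output_path."""
    with open(output_path, "w", encoding="utf-8") as out:
        # check=True raises CalledProcessError if fast_align exits non-zero.
        subprocess.run(build_command(corpus_path, reverse), stdout=out, check=True)
```

Calling `run_alignment("text.fr-en", "forward.align")` and then `run_alignment("text.fr-en", "reverse.align", reverse=True)` would correspond to the two commands above.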

Output

fast_align produces output in the i-j "Pharaoh" format, where a pair i-j indicates that the ith word of the source sentence is aligned to the jth word of the target sentence, with both indices counted from zero. For example, a good alignment of the above example corpus would be:

0-0 1-1 2-4 3-2 4-3 5-5 6-6
0-0 1-1 2-2 2-3 3-4 4-5
0-0 1-2 2-1 3-3 4-4 5-5
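
To make the index pairs easier to read, the sketch below maps one alignment line back onto the words of its sentence pair. It assumes zero-based indices, as the example output suggests, and the helper names `parse_alignment` and `aligned_words` are hypothetical rather than part of fast_align.

```python
def parse_alignment(line):
    """Turn a line such as '0-0 1-2 2-1' into a list of (i, j) index pairs."""
    pairs = []
    for token in line.split():
        i, j = token.split("-")
        pairs.append((int(i), int(j)))
    return pairs


def aligned_words(sentence_pair, alignment_line):
    """Return (source_word, target_word) tuples for one sentence pair."""
    source_text, target_text = sentence_pair.split(" ||| ")
    source, target = source_text.split(), target_text.split()
    # Indices are assumed to be zero-based positions into each token list.
    return [(source[i], target[j]) for i, j in parse_alignment(alignment_line)]


pairs = aligned_words(
    "doch fehlen uns neue Ressourcen . ||| but we lack new resources .",
    "0-0 1-2 2-1 3-3 4-4 5-5",
)
print(pairs)
```

For the third example sentence this pairs fehlen with lack and uns with we, which reflects the word-order swap encoded by 1-2 and 2-1.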

Acknowledgements

The development of this software was sponsored by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant number W911NF-10-1-0533.
