Measuring Neural Translation Difficulty by Cross-Mutual Information

This is the implementation of the approaches described in the paper:

Emanuele Bugliarello, Sabrina J. Mielke, Antonios Anastasopoulos, Ryan Cotterell and Naoaki Okazaki. It’s Easier to Translate out of English than into it: Measuring Neural Translation Difficulty by Cross-Mutual Information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, July 2020.

Requirements

You can clone this repository, including its submodules, by running: git clone --recurse-submodules git@github.com:e-bug/nmt-difficulty

The requirements can be installed by setting up a conda environment:
conda env create -f environment.yml followed by source activate nmt

Data Preparation

The pre-processing steps to generate our data sets are as follows:

cd scripts/data
./download_data.sh
./preprocess_data.sh
./binarize.sh (for Fairseq)

You may want to update the default data directories used in the provided files.
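
The steps above can be chained so that a failure in one step stops the rest. The following is a minimal sketch, not part of the repository: the function name run_data_pipeline is hypothetical, and it assumes the three scripts are executable and live in the directory passed as the first argument (scripts/data by default).

```shell
# Hypothetical helper: run the data-preparation scripts in order,
# stopping at the first one that fails.
run_data_pipeline() {
    local dir="${1:-scripts/data}"
    # The subshell keeps the caller's working directory unchanged.
    (
        cd "$dir" || exit 1
        for step in download_data.sh preprocess_data.sh binarize.sh; do
            echo "Running ${step}..."
            ./"$step" || { echo "Step ${step} failed" >&2; exit 1; }
        done
    )
}
```

Calling run_data_pipeline from the repository root would then run the download, pre-processing and binarization scripts in sequence.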

Training and Evaluation

Scripts for training and evaluating each model are provided in scripts/experiments. To run an experiment, enter its directory (e.g. experiments/en2de) and run the corresponding script (e.g. ./test.sh).

Note that we trained our models on an SGE cluster, but we also provide the associated Bash file (e.g. train_mt.sh).
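
To run the same script across several experiments, the "enter the directory, run the script" pattern can be wrapped in a loop. This is a sketch under assumptions: run_in_experiments is a hypothetical helper, not something the repository ships, and it only assumes that each experiment directory holds an executable script of the given name.

```shell
# Hypothetical helper: run one script (e.g. test.sh) inside each
# experiment directory given as an argument.
run_in_experiments() {
    local script="$1"
    shift
    local exp
    for exp in "$@"; do
        if [ ! -x "${exp}/${script}" ]; then
            echo "Skipping ${exp}: no executable ${script}" >&2
            continue
        fi
        echo "== ${exp}: ${script} =="
        # The subshell keeps the caller's working directory unchanged.
        ( cd "$exp" && ./"$script" ) || return 1
    done
}
```

For example, run_in_experiments test.sh experiments/en2de runs the test script for that single experiment; further directories can be appended as extra arguments.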

Description of this repository

experiments/
Contains the following scripts to train and evaluate each model:
- train.sh: train the LM/MT model
- valid.sh: validate the model
- test.sh: test the model
- xmi_valid.sh (MT only): evaluate XMI on the validation set
- xmi_test.sh (MT only): evaluate XMI on the test set
fairseq-0.6.2/
Our code is based on Fairseq (version 0.6.2). Here, we introduce the following two files to evaluate our approximation of the cross-entropy of a model:
- eval_lm.py
- eval_mt.py
results/: collects CSV files aggregating the values of each evaluated metric
scripts/: main scripts, divided into the following subdirectories (you may want to update data and checkpoints directories in these files):
- data/: contains scripts for data generation
- experiments/: contains scripts for training and evaluating models
- results/: contains scripts to generate the CSV files in results/ as well as correlation coefficients and our bar plot.
tools/: third-party software (i.e. Moses and BPE)
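
As defined in the paper, the XMI from a source S to a target T is the difference between the cross-entropy of the target under a language model and under the translation model: XMI(S → T) = H_LM(T) − H_MT(T | S). The sketch below shows only that final subtraction, assuming both cross-entropy values have already been obtained and are expressed in the same unit. The function name xmi and the example numbers are hypothetical, and this is not the repository's own implementation.

```shell
# Hypothetical helper: XMI = H_LM(T) - H_MT(T|S), both in the same unit.
xmi() {
    local h_lm="$1" h_mt="$2"
    # LC_ALL=C keeps the decimal separator a dot regardless of locale.
    LC_ALL=C awk -v lm="$h_lm" -v mt="$h_mt" 'BEGIN { printf "%.4f\n", lm - mt }'
}
```

With made-up inputs, xmi 2.75 1.25 prints 1.5000. A larger value means the source sentence removes more uncertainty about the target.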

License

This work is licensed under the MIT license. See LICENSE for details. Third-party software and data sets are subject to their respective licenses.
If you find our code/models or ideas useful in your research, please consider citing the paper:

@inproceedings{bugliarello-etal-2020-easier,
    title = "It{'}s Easier to Translate out of {E}nglish than into it: {M}easuring Neural Translation Difficulty by Cross-Mutual Information",
    author = "Bugliarello, Emanuele  and
      Mielke, Sabrina J.  and
      Anastasopoulos, Antonios  and
      Cotterell, Ryan  and
      Okazaki, Naoaki",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.149",
    pages = "1640--1649",
}
