The project requires Python ^3.11. The remaining dependencies are listed in the `pyproject.toml` file.
The project uses Poetry to manage dependencies. To install them on Snellius, run the following commands:
```bash
# Snellius-specific: initialize conda
/sw/arch/RHEL8/EB_production/2023/software/Anaconda3/2023.07-2/bin/conda init bash
# restart the shell, then:
conda create python=3.11 -n venv
conda activate venv
pip install poetry

# or use Poetry to create the virtual environment instead:
# poetry env use python3.11
# poetry shell

poetry install
poetry run pre-commit install
```
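To confirm the environment is set up correctly, you can optionally check the tool versions:

```bash
python --version            # expect Python 3.11.x
poetry --version
poetry run pre-commit --version
```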
You can request interactive resources using the following commands:
```bash
srun --partition=gpu --gpus=1 --ntasks=1 --cpus-per-task=18 --time=00:01:00 --pty bash -i
conda activate venv
```
Alternatively, submit a job specification file that runs in the background on Snellius. Submitted jobs keep running even if you disconnect from the server. An example of such a file is `train_job.slurm`; adjust it to your needs and run:

```bash
sbatch train_job.slurm
```
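For reference, a job specification file is a bash script with `#SBATCH` directives. The sketch below shows the general shape of such a file, mirroring the resource flags used with `srun` above; the actual `train_job.slurm` in the repository may differ:

```bash
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --time=12:00:00        # adjust the wall time to your training run
#SBATCH --job-name=train

# activate the conda environment created during setup
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate venv

# run training (the config name is an example; see the training section below)
python scripts/train.py train_bt_ee.yaml
```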
To view the status of the job, you can use the following command:

```bash
squeue -u $USER
```
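By default, SLURM writes a job's output to `slurm-<jobid>.out` in the submission directory (unless the job file overrides this with `#SBATCH --output`), so you can follow a running job's log with:

```bash
# replace 1234567 with the job ID reported by sbatch or squeue
tail -f slurm-1234567.out
```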
Download the dataset for the chosen language pair. Run the script and choose the source and target languages:

```bash
python scripts/download_nllb.py
```
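The later steps assume the download produces a layout like the following (shown here for the English–Estonian pair; the exact directory names depend on the chosen languages):

```bash
ls data/opus.nllb.en-ee/en-ee.txt/
# NLLB.en-ee.en   (English side)
# NLLB.en-ee.ee   (Estonian side)
```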
To augment the dataset using backtranslation, run the following script and provide all the necessary arguments:
- `dataset_path`
- `output_path`
- `lang_from`
- `lang_to`

Set `lang_from` to the target language (e.g. Estonian) and `lang_to` to the source language, which is usually English.
Example:

```bash
python scripts/augment_data_backtranslate.py --dataset_path data/opus.nllb.en-ee/en-ee.txt/NLLB.en-ee.en --output_path out/backtranslated-ee --lang_from ee --lang_to en
```
Alternatively, run the script via its job specification file:

```bash
sbatch backtranslate_job.slurm
```
Convert the generated `.txt` files from backtranslation to a `.parquet` file for training:

```bash
python scripts/convert_to_parquet.py --data_dir=out/backtranslated-ee --original_data_dir=data/opus.nllb.en-ee/en-ee.txt/NLLB.en-ee.ee --output_parquet_file=data/bt-opus.nllb.en-ee/nllb-ee-backtranslated.parquet --orig_lang=ee
```
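To sanity-check the resulting file, you can load it with pandas (assuming `pandas` and `pyarrow` are installed in the environment):

```bash
python -c "import pandas as pd; df = pd.read_parquet('data/bt-opus.nllb.en-ee/nllb-ee-backtranslated.parquet'); print(df.shape); print(df.head())"
```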
Run the training script with the name of the configuration file as an argument:

```bash
python scripts/train.py train_bt_ee.yaml
```
Or run the training script with the job specification file and adjust it to your needs:

```bash
sbatch train_job.slurm
```
To augment the dataset using the LLM, run the following script and provide all the necessary arguments:
- `dataset_path`
- `output_path`
- `lang_from`

Set `lang_from` to the target language (e.g. Estonian, Afrikaans).

```bash
python scripts/augment_data_llm.py --dataset_path=data/opus.nllb.en-ee/en-ee.txt/NLLB.en-ee.ee --output_path=out/llm-ee --lang_from=Estonian
```
Or run the script with the job specification file and adjust it to your needs:

```bash
sbatch augment_llm_job.slurm
```
To join the LLM-augmented data with the original data, run the following script and provide all the necessary arguments:
- `dataset_path`
- `output_file_path`

```bash
python scripts/join_llm_augmented_data.py --dataset_path=out/llm-ee --output_file_path=data/llm.nllb.en-ee/nllb-ee-llm.txt
```
To translate the data using the LLM, run the following script and provide all the necessary arguments:
- `dataset_path`
- `output_path`

```bash
python scripts/translate_data_llm.py --dataset_path=data/opus.nllb.en-ee/en-ee.txt/NLLB.en-ee.en --output_path=out/translated-llm-ee
```
Or run the script with the job specification file and adjust it to your needs:

```bash
sbatch translate_llm_job.slurm
```
To join the LLM-translated data with the original data, run the same script as for joining the LLM-augmented data, but adjust the parameters, as shown below.
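For example (the output path below is an assumption; point `dataset_path` at the translation output from the previous step):

```bash
python scripts/join_llm_augmented_data.py --dataset_path=out/translated-llm-ee --output_file_path=data/llm.nllb.en-ee/nllb-ee-translated.txt
```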
To train on the LLM-augmented or LLM-translated data, run the same training script as for the backtranslated data, but point it at the corresponding configuration file.
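For example, with a configuration file for the LLM-augmented data (the config name below is hypothetical, following the `train_bt_ee.yaml` pattern):

```bash
python scripts/train.py train_llm_ee.yaml
```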