Balancing Fluency and Adherence: Hybrid Fallback Term Injection in Low-Resource Terminology Translation

License: MIT

This repository contains the accompanying code for the paper:

Balancing Fluency and Adherence: Hybrid Fallback Term Injection in Low-Resource Terminology Translation
Kurt Abela¹, Marc Tanti², and Claudia Borg¹
¹Department of Artificial Intelligence, University of Malta
²Institute of Linguistics and Language Technology, University of Malta
Accepted at LoResMT 2026.

Citation

If you use this code or our findings in your research, please cite:

@inproceedings{abela2026balancing,
  title={Balancing Fluency and Adherence: Hybrid Fallback Term Injection in Low-Resource Terminology Translation},
  author={Abela, Kurt and Tanti, Marc and Borg, Claudia},
  booktitle={Proceedings of the 9th Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT)},
  year={2026}
}

Overview

This project explores strategies for injecting terminology into Machine Translation (MT) models for low-resource languages (Maltese and Slovak). We propose a Hybrid Fallback approach that combines the fluency of static constrained training (Acontextual Drill) with the high adherence of Constrained Beam Search (CBS), using the latter only when the static model fails to include the required terminology.
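
A minimal sketch of the fallback decision (with hypothetical drill_model/cbs_model objects and a simple substring check standing in for the term-matching actually used; this is not the repository's API):

# Sketch (hypothetical helpers, not the repository's actual API): keep the fluent
# output of the drill-trained model when it already contains every required target
# term; otherwise re-decode the sentence with Constrained Beam Search.
def hybrid_fallback(source, required_terms, drill_model, cbs_model):
    hyp = drill_model.translate(source)  # unconstrained decoding, usually more fluent
    if all(term.lower() in hyp.lower() for term in required_terms):
        return hyp  # required terminology already present; keep the fluent output
    # Fall back to CBS only for sentences that miss at least one required term.
    return cbs_model.translate(source, constraints=required_terms)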

Directory Structure

To use these scripts, organize your project directory as follows. Note that the data/ subdirectories and the model/result directories are language-specific (e.g., data/processed/, models/, and results/ for Maltese; data/slovak/, models_sk/, and results_sk/ for Slovak).

.
├── fairseq/              # Modified Fairseq library
├── scripts/              # Training, generation, and evaluation scripts
├── data/                 # Training and test datasets (language-specific subdirectories)
│   ├── processed/        # Maltese processed data
│   └── slovak/           # Slovak processed data
├── models/               # Maltese model checkpoints
├── models_sk/            # Slovak model checkpoints
├── results/              # Maltese translation outputs and analysis
├── results_sk/           # Slovak translation outputs and analysis
└── static_vocabs/        # Pre-defined dictionaries for Fairseq

Installation

The scripts are designed to run in a Slurm-managed environment with CUDA support.

  1. Clone the repository:

    git clone <repository-url>
    cd Balancing-Fluency-and-Adherence
  2. Environment Setup: The scripts (train_baseline.sh, etc.) will automatically attempt to create a Conda environment and install dependencies. To do this manually:

    conda create -n constrained_mt python=3.8 -y
    conda activate constrained_mt
    pip install -r requirements.txt
    cd fairseq
    pip install --editable ./
    python setup.py build_ext --inplace

Usage

All main workflows are provided as Slurm shell scripts in the scripts/ directory. Each script takes a language pair as its first argument (en-mt or en-sk).

1. Training the Baseline Model

Trains a standard Transformer model on the initial parallel dataset.

sbatch scripts/train_baseline.sh en-mt

2. Fine-tuning Models

Fine-tunes the baseline model using various strategies:

  • M1_control: Fine-tuned on in-domain parallel data.
  • M2_augmented: Fine-tuned on data with inline term annotations.
  • M3_drill: Static constrained training (Acontextual Drill); see the sketch after this list.
  • M4_generic_drill: Drill training using only the terminology dictionary.
  • M5_drill_seen: Drill training limited to terms present in the training set.
  • M6_noun_drill: Drill training focused on noun-classified terms.

sbatch scripts/finetune_models.sh en-mt
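
As a rough illustration of the drill strategies above (an assumption about the data format, not the exact construction performed by finetune_models.sh), acontextual drill examples can be thought of as bare term pairs from the dictionary written as additional parallel segments:

# Sketch (assumption about the data format, not the construction used in
# finetune_models.sh): write acontextual drill examples as bare term pairs in the
# usual one-segment-per-line parallel format, to be appended to the fine-tuning data.
def write_drill_pairs(term_dict, src_path, tgt_path):
    """term_dict: {source_term: target_term} taken from the terminology dictionary."""
    with open(src_path, "w", encoding="utf-8") as fs, \
         open(tgt_path, "w", encoding="utf-8") as ft:
        for src_term, tgt_term in term_dict.items():
            fs.write(src_term + "\n")
            ft.write(tgt_term + "\n")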

3. Generating Translations

Generates translations for all models, including Constrained Beam Search (CBS) variants and the Hybrid Fallback model.

sbatch scripts/generate_translations.sh en-mt
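
For the CBS variants, Fairseq's lexically constrained decoding expects each input line to carry its own constraints. A minimal sketch of preparing such an input (a hypothetical helper; the exact pipeline in generate_translations.sh may differ):

# Sketch (hypothetical helper; the exact pipeline in generate_translations.sh may
# differ): build an input file for fairseq-interactive --constraints, where each
# line is the source sentence followed by its tab-separated target-side constraints.
def write_constrained_input(sources, term_dict, path):
    """sources: iterable of source sentences; term_dict: {source_term: target_term}."""
    with open(path, "w", encoding="utf-8") as f:
        for src in sources:
            # Require the target term for every dictionary entry found in the source.
            constraints = [tgt for s, tgt in term_dict.items() if s in src]
            f.write("\t".join([src] + constraints) + "\n")

Note that constraints generally need the same tokenization and subword segmentation as the model's training data before decoding.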

4. Evaluation and Analysis

Calculates metrics (BLEU, chrF++, COMET, TIR) and performs significance testing. Results are saved in an Excel report.

sbatch scripts/evaluate.sh en-mt
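
Assuming TIR here measures how often required target terms actually appear in the system output (evaluate.sh may define it differently), a minimal sketch of the metric:

# Sketch (assumption: TIR as the fraction of required target terms that appear
# verbatim in the corresponding hypothesis; evaluate.sh may define it differently).
def term_injection_rate(hypotheses, required_terms):
    """hypotheses: list of output sentences; required_terms: one list of expected
    target-side terms per sentence."""
    hits = total = 0
    for hyp, terms in zip(hypotheses, required_terms):
        for term in terms:
            total += 1
            if term.lower() in hyp.lower():
                hits += 1
    return hits / total if total else 0.0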

Configuration

The scripts use a PROJECT_ROOT variable that defaults to the current working directory ($(pwd)). If you run the scripts from a different location, update this path at the top of each shell script.
