Bicleaner support + fixes (mozilla#13)

SacreBLEU is a regular importer now and evaluation is not limited to sacrebleu datasets. fixes
Added bicleaner-ai and bicleaner filtering (one or another based on available pretrained language packs). fixes
Added script to find all datasets based on language pair and importer type, ready to use in config fixes
Fixed conda environment activation to be reproducible on GCP
Other minor reproducibility fixes
MaksymDel · Jul 26, 2021 · ec783cf
1 parent af2abbf
commit ec783cf
Showing 35 changed files with 500 additions and 123 deletions.
diff --git a/.gitmodules b/.gitmodules
@@ -7,3 +7,6 @@
 [submodule "marian-dev"]
 	path = 3rd_party/marian-dev
 	url = https://github.com/browsermt/marian-dev
+[submodule "3rd_party/kenlm"]
+	path = 3rd_party/kenlm
+	url = https://github.com/kpu/kenlm
diff --git a/3rd_party/kenlm b/3rd_party/kenlm
diff --git a/README.md b/README.md
@@ -13,7 +13,7 @@ It was tested on relatively high resource language pair `ru-en`. Low resource pa
 - Ubuntu 18.04 (it can work on other Linux distributions, but might require `setup` scripts fixes; see more details in [marian installation instructions](https://marian-nmt.github.io/quickstart/)).
 - One or several Nvidia GPUs with CUDA drivers installed and at least 8 GB of memory.
 - At least 16 CPU cores (some steps of the pipeline utilize multiple cores pretty well, so the more the better).
-- 64GB RAM
+- 64 GB RAM (128 GB might be required for bigger datasets)
 - 200+ GB of disk space (mostly for datasets and transformations).
   It depends on chosen datasets and can be significantly higher.
 
@@ -87,9 +87,9 @@ bash ./pipeline/.../<script>.sh <args>
 #### To download exported models:
 
 ```
-pit pull home firefox-translations-training/models/ru-en/exported/model.ruen.intgemm.alphas.bin.gz .
-pit pull home firefox-translations-training/models/ru-en/exported/lex.50.50.ruen.s2t.bin.gz .
-pit pull home firefox-translations-training/models/ru-en/exported/vocab.ruen.spm.gz .
+pit pull home firefox-translations-training/models/ru-en/test/exported/model.ruen.intgemm.alphas.bin.gz .
+pit pull home firefox-translations-training/models/ru-en/test/exported/lex.50.50.ruen.s2t.bin.gz .
+pit pull home firefox-translations-training/models/ru-en/test/exported/vocab.ruen.spm.gz .
 ```
 
 ### Tensorboard
@@ -110,14 +110,15 @@ Step | Description | Bottleneck | Comments
 --- | --- | --- | ---
 Installation | Installing dependencies and compiling | CPU | Takes ~1 hour
 Data downloading | Downloads datasets, samples sentences | Network, Disk | Time depends on dataset size, sampling of huge mono datasets (100M+ sentences) is the most intensive operation.
-Data cleaning | Basic preprocessing, language specific, rule based, deduplication and other attempts to clean noisy data | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](/pipeline/clean/clean_parallel.py).
+Data cleaning | Basic preprocessing, language specific, rule based, deduplication, and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](/pipeline/clean/clean_parallel.py).
+Bicleaner | Filters noisy sentence pairs in a parallel corpus using [bicleaner](https://github.com/bitextor/bicleaner) or [bicleaner-ai](https://github.com/bitextor/bicleaner-ai) depending on available language packs. | CPU, GPU | If there are no pretrained language packs for bicleaner-ai, it uses bicleaner. If there are none for bicleaner either, this step is skipped. The cleaning threshold is controlled by the `BICLEANER_THRESHOLD` config setting.
 Training s2s | Trains a backward shallow s2s model, which is useful for back-translations and ce-filtering | GPU | Inspired by a [marian example](https://github.com/marian-nmt/marian-examples/tree/master/training-basics-sentencepiece).
 Augmentation with back-translations | Translates mono corpus combined from `MONO_DATASETS_TRG` using shallow s2s model. | GPU | It is more useful for low-resource languages and can be skipped for others.
 Training teacher | Trains one or multiple big transformer models | GPU | You might want to adjust [early stopping](pipeline/train/configs/training/teacher.transformer.train.yml) parameters depending on datasets size. Inspired by [transformer](https://github.com/marian-nmt/marian-examples/tree/master/transformer) and [wmt2017-uedin](https://github.com/marian-nmt/marian-examples/tree/master/wmt2017-uedin) marian examples and extended with [SentencePiece](https://github.com/google/sentencepiece).
 Translation by teacher | Translates a corpus and monolingual data combined from `MONO_DATASETS_SRC` using the teacher model (ensemble is not supported yet) | GPU | The slowest part of the pipeline. Can take days. It is possible to speed it up launching the same scripts ([corpus](pipeline/translate/translate-corpus.sh), [mono](pipeline/translate/translate-mono.sh)) in parallel from another machine with access to the same network directory.
 Cross-entropy filtering | Scores translated corpus with backward s2s model and removes a part of the corpus with the lowest scores to reduce noise | GPU, CPU, Disk | At this point we work with huge datasets, so it utilizes copying to a local disk to make things faster.
 Training alignments and shortlist | Trains alignments using [fast_align](https://github.com/clab/fast_align) and extracts lexical shortlist using [extract_lex](https://github.com/marian-nmt/extract-lex) tool | CPU, Disk | Some tools require uncompressed datasets on disk and they are huge at this point. Data is copied to a local disk to make things faster. Might take 100+ GB of local disk depending on dataset size. Good CPU parallelization.
-Training student | Trains a small transformer student model on filtered data and using alignments | GPU | Run [Tensorboard](pipeline/train/tensorboard/tensorboard.sh) manually to see training visualization.
+Training student | Trains a small transformer student model on filtered data and using alignments | GPU | Run [Tensorboard](utils/tensorboard/tensorboard.sh) manually to see training visualization.
 Fine-tuning student | Finetunes the student model by emulating 8bit GEMM during training | GPU | Converges very quickly and then degrades. It's quick but you might want to reduce early stopping threshold.
 Quantization | Applies 8-bit quantization to the fine-tuned student model and evaluates on CPU | CPU | CPU threads must be set to 1 for this step.
 Export | Exports trained model and shortlist to [bergamot-translator](https://github.com/mozilla/bergamot-translator) format | |
@@ -129,35 +130,28 @@ Dataset importers can be used in `TRAIN_DATASETS, DEVTEST_DATASETS, MONO_DATASET
 Example:
 ```
 TRAIN_DATASETS="opus_OPUS-ParaCrawl/v7.1 mtdata_newstest2019_ruen"
+TEST_DATASETS="sacrebleu_wmt20 sacrebleu_wmt18"
 ```
 
 Data source | Prefix | Name example | Type | Comments
 --- | --- | --- | ---| ---
 [MTData](https://github.com/thammegowda/mtdata) | mtdata | newstest2017_ruen | corpus | Supports many datasets. Run `mtdata list -l ru-en` to see datasets for a specific language pair.
-[OPUS](opus.nlpl.eu/) | opus | OPUS-ParaCrawl/v7.1 | corpus | Many open source datasets. Go to the website, choose a language pair, check links under Moses column to see what names and version is used in a link.
+[OPUS](https://opus.nlpl.eu/) | opus | ParaCrawl/v7.1 | corpus | Many open source datasets. Go to the website, choose a language pair, and check the links under the Moses column to see which name and version are used in a link.
+[SacreBLEU](https://github.com/mjpost/sacrebleu) | sacrebleu | wmt20 | corpus | Official evaluation datasets available in SacreBLEU tool. Recommended to use in `TEST_DATASETS`. Look up supported datasets and language pairs in `sacrebleu.dataset` python module.
 [Paracrawl](https://paracrawl.eu/) | paracrawl-mono | paracrawl8 | mono | Datasets that are crawled from the web. Only [mono datasets](https://paracrawl.eu/index.php/moredata) are used in this importer. Parallel corpus is available using opus importer.
 [News crawl](http://data.statmt.org/news-crawl) | news-crawl | news.2019 | mono | Some news monolingual datasets from [WMT21](https://www.statmt.org/wmt21/translation-task.html)
 [Common crawl](https://commoncrawl.org/) | commoncrawl | wmt16 | mono | Huge web crawl datasets. The links are posted on [WMT21](https://www.statmt.org/wmt21/translation-task.html)
 
-### Adding a new importer
-
-Just add a shell script to [corpus](pipeline/data/importers/corpus) or [mono]() which is named as `<prefix>.sh` 
-and accepts the same parameters as the other scripts from the same folder.
+You can also use [find-corpus](pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config.
 
+Example:
 
-## Evaluation datasets
-
-Only [SacreBLEU](https://github.com/mjpost/sacrebleu) datasets are supported at the moment.
+`python ./pipeline/utils/find-corpus.py en ru opus`
 
-Example:
-```
-TEST_DATASETS="wmt20 wmt18"
-```
+### Adding a new importer
 
-To see what datasets are available for a language pair (for example, `ru-en`) run:
-```
-sacrebleu --list -l ru-en
-```
+Just add a shell script to [corpus](pipeline/data/importers/corpus) or [mono]() which is named as `<prefix>.sh` 
+and accepts the same parameters as the other scripts from the same folder.
 
 ## Development
 
@@ -217,8 +211,15 @@ At the same time it is possible to run it all locally end to end or to do intera
 - Scripts should automatically inspect resources available for computation and utilize them to make things faster
   (number of cores, memory).
 
-## TODO
 
-1. Add [bicleaner](https://github.com/bitextor/bicleaner/)
-2. Add translation with an ensemble of teacher models
-3. Add more importers
+## References
+
+1. V. M. Sánchez-Cartagena, M. Bañón, S. Ortiz-Rojas and G. Ramírez-Sánchez, 
+"[Prompsit's submission to WMT 2018 Parallel Corpus Filtering shared task](http://www.statmt.org/wmt18/pdf/WMT116.pdf)",
+in *Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers*.
+Brussels, Belgium: Association for Computational Linguistics, October 2018
+
+2. Gema Ramírez-Sánchez, Jaume Zaragoza-Bernabeu, Marta Bañón and Sergio Ortiz Rojas 
+"[Bifixer and Bicleaner: two open-source tools to clean your parallel data.](https://eamt2020.inesc-id.pt/proceedings-eamt2020.pdf#page=311)",
+in *Proceedings of the 22nd Annual Conference of the European Association for Machine Translation*.
+Lisboa, Portugal: European Association for Machine Translation, November 2020
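
Note on the README hunks above: dataset identifiers follow a `<prefix>_<name>` pattern (for example `opus_ParaCrawl/v7.1` or `sacrebleu_wmt20`). The pipeline's own parsing code is not part of this excerpt, so the following is only a minimal sketch of how such an identifier could be split on the first underscore. The helper names `dataset_prefix` and `dataset_name` are hypothetical.

```bash
#!/bin/bash
# Hypothetical helpers: split "<prefix>_<name>" on the FIRST underscore only,
# so names that themselves contain underscores (newstest2019_ruen) stay intact.
dataset_prefix() {
  # Strip the longest suffix that starts with an underscore.
  printf '%s\n' "${1%%_*}"
}

dataset_name() {
  # Strip the shortest prefix that ends with an underscore.
  printf '%s\n' "${1#*_}"
}

for dataset in "opus_ParaCrawl/v7.1" "mtdata_newstest2019_ruen" "paracrawl-mono_paracrawl8"; do
  echo "importer=$(dataset_prefix "${dataset}") name=$(dataset_name "${dataset}")"
done
```

Splitting on the first underscore rather than the last keeps prefixes with hyphens (`paracrawl-mono`) and names with underscores working under the same rule.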
diff --git a/config.sh b/config.sh
@@ -16,28 +16,30 @@ MODELS_DIR=${MODELS_DIR:-${WORKDIR}/models}
 MARIAN=${MARIAN:-${WORKDIR}/3rd_party/marian-dev/build}
 CLEAN_TOOLS=${WORKDIR}/pipeline/clean/tools
 BIN=${WORKDIR}/bin
+CONDA_DIR=${HOME}/miniconda3
 TMP=/tmp
 
 EXPERIMENT=test
 SRC=ru
 TRG=en
 
 # parallel corpus
-TRAIN_DATASETS="opus_OPUS-ParaCrawl/v7.1"
+TRAIN_DATASETS="opus_ada83/v1 opus_UN/v20090831 opus_GNOME/v1 opus_wikimedia/v20210402 opus_CCMatrix/v1 opus_Wikipedia/v1.0 opus_tico-19/v2020-10-28 opus_KDE4/v2 opus_OpenSubtitles/v2018 opus_MultiUN/v1 opus_GlobalVoices/v2018q4 opus_ELRC_2922/v1 opus_PHP/v1 opus_Tatoeba/v2021-03-10 opus_Tanzil/v1 opus_XLEnt/v1.1 opus_TildeMODEL/v2018 opus_Ubuntu/v14.10 opus_TED2013/v1.1 opus_infopankki/v1 opus_EUbookshop/v2 opus_ParaCrawl/v8 opus_Books/v1 opus_WMT-News/v2019 opus_bible-uedin/v1 opus_WikiMatrix/v1 opus_QED/v2.0a opus_CCAligned/v1 opus_TED2020/v1 opus_News-Commentary/v16 opus_UNPC/v1.0"\
+" mtdata_cc_aligned mtdata_airbaltic mtdata_GlobalVoices_2018Q4 mtdata_UNv1_test mtdata_neulab_tedtalksv1_train mtdata_neulab_tedtalksv1_dev mtdata_wmt13_commoncrawl mtdata_czechtourism mtdata_paracrawl_bonus mtdata_worldbank mtdata_wiki_titles_v1 mtdata_WikiMatrix_v1 mtdata_wmt18_news_commentary_v13 mtdata_wiki_titles_v2 mtdata_news_commentary_v14 mtdata_UNv1_dev mtdata_neulab_tedtalksv1_test mtdata_JW300"
 DEVTEST_DATASETS="mtdata_newstest2019_ruen mtdata_newstest2017_ruen mtdata_newstest2015_ruen mtdata_newstest2014_ruen"
-# sacrebleu
-TEST_DATASETS="wmt20 wmt18 wmt16 wmt13"
+TEST_DATASETS="sacrebleu_wmt20 sacrebleu_wmt18 sacrebleu_wmt16 sacrebleu_wmt13"
 # monolingual datasets (ex. paracrawl-mono_paracrawl8, commoncrawl_wmt16, news-crawl_news.2020)
 # to be translated by the teacher model
 MONO_DATASETS_SRC="news-crawl_news.2020 news-crawl_news.2019 news-crawl_news.2018 news-crawl_news.2017 "\
 "news-crawl_news.2016 news-crawl_news.2015 news-crawl_news.2014 news-crawl_news.2013 news-crawl_news.2012 "\
 "news-crawl_news.2011"
 # to be translated by the shallow s2s model to augment teacher corpus with back-translations
 # leave empty to skip augmentation step (high resource languages)
-MONO_DATASETS_TRG="news-crawl_news.2020"
+MONO_DATASETS_TRG=""
 # limits per downloaded dataset
 MONO_MAX_SENTENCES_SRC=100000000
 MONO_MAX_SENTENCES_TRG=20000000
+BICLEANER_THRESHOLD=0.5
 
 
 # marian --devices parameter for GPUs to use, for example 0 1 2 3
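
Note on the `TRAIN_DATASETS` change above: the value is built from adjacent quoted strings joined by a trailing backslash, and the second string starts with a space so the two lists do not run together. A minimal sketch of that idiom, using three dataset names taken from the hunk; the `datasets` array and the loop are illustrative assumptions, not code from the pipeline.

```bash
#!/bin/bash
# Adjacent quoted strings joined by a trailing backslash form ONE value;
# the leading space in the second string keeps the two lists separated.
TRAIN_DATASETS="opus_ParaCrawl/v8 opus_Books/v1"\
" mtdata_JW300"

# Split the space-separated value into an array without glob expansion.
read -r -a datasets <<<"${TRAIN_DATASETS}"

echo "count=${#datasets[@]}"
for dataset in "${datasets[@]}"; do
  echo "dataset=${dataset}"
done
```

Dropping the leading space in the continuation string would fuse `opus_Books/v1` and `mtdata_JW300` into a single bogus dataset name, which is an easy mistake to make when extending the list.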

diff --git a/pipeline/alignment/generate-alignment-and-shortlist.sh b/pipeline/alignment/generate-alignment-and-shortlist.sh
@@ -19,17 +19,26 @@ corpus_prefix=$1
 vocab_path=$2
 output_dir=$3
 
+if [ -e "${output_dir}/corpus.aln.gz" ] && [ -e "${output_dir}/lex.s2t.pruned.gz" ]; then
+  echo "### Alignments and shortlist already exist, skipping"
+  echo "###### Done: Generating alignments and shortlist"
+  exit 0
+fi
+
+
 test -e "${BIN}/atools" || exit 1
 test -e "${BIN}/extract_lex" || exit 1
 test -e "${BIN}/fast_align" || exit 1
 
 mkdir -p "${output_dir}"
-dir="${TMP}/alignment"
+dir="${output_dir}/tmp"
 mkdir -p "${dir}"
 
 corpus_src="${corpus_prefix}.${SRC}.gz"
 corpus_trg="${corpus_prefix}.${TRG}.gz"
 
+source "${WORKDIR}/pipeline/setup/activate-python.sh"
+
 echo "### Subword segmentation with SentencePiece"
 test -s "${dir}/corpus.spm.${SRC}.gz" ||
   pigz -dc "${corpus_src}" |
@@ -41,42 +50,35 @@ test -s "${dir}/corpus.spm.${TRG}.gz" ||
   pigz >"${dir}/corpus.spm.${TRG}.gz"
 
 echo "### Creating merged corpus"
-test -s "${dir}/corpus.aln.gz" || test -s "${dir}/corpus" ||
+test -s "${output_dir}/corpus.aln.gz" || test -s "${dir}/corpus" ||
   paste <(pigz -dc "${dir}/corpus.spm.${SRC}.gz") <(pigz -dc "${dir}/corpus.spm.${TRG}.gz") |
   sed 's/\t/ ||| /' >"${dir}/corpus"
 
 echo "### Training alignments"
-test -s "${dir}/corpus.aln.gz" ||
-  test -s "${dir}/align.s2t.gz" ||
+test -s "${output_dir}/corpus.aln.gz" || test -s "${dir}/align.s2t.gz" ||
   "${BIN}/fast_align" -vod -i "${dir}/corpus" |
   pigz >"${dir}/align.s2t.gz"
-test -s "${dir}/corpus.aln.gz" ||
-  test -s "${dir}/align.t2s.gz" ||
+test -s "${output_dir}/corpus.aln.gz" || test -s "${dir}/align.t2s.gz" ||
   "${BIN}/fast_align" -vodr -i "${dir}/corpus" |
   pigz >"${dir}/align.t2s.gz"
-test -s "${dir}/corpus" && rm "${dir}/corpus"
 
 echo "### Symmetrizing alignments"
-test -s "${dir}/corpus.aln.gz" || pigz -d "${dir}/align.s2t.gz" "${dir}/align.t2s.gz"
-test -s "${dir}/corpus.aln.gz" ||
+test -s "${output_dir}/corpus.aln.gz" || test -s "${dir}/align.t2s" ||
+  pigz -d "${dir}/align.s2t.gz" "${dir}/align.t2s.gz"
+test -s "${output_dir}/corpus.aln.gz" ||
   "${BIN}/atools" -i "${dir}/align.s2t" -j "${dir}/align.t2s" -c grow-diag-final-and |
-  pigz >"${dir}/corpus.aln.gz"
-test -s "${dir}/align.s2t" && rm "${dir}"/align.???
+  pigz >"${output_dir}/corpus.aln.gz"
 
 echo "### Creating shortlist"
 test -s "${dir}/lex.s2t.gz" ||
   "${BIN}/extract_lex" \
     "${dir}/corpus.spm.${TRG}.gz" \
     "${dir}/corpus.spm.${SRC}.gz" \
-    "${dir}/corpus.aln.gz" \
+    "${output_dir}/corpus.aln.gz" \
     "${dir}/lex.s2t" \
     "${dir}/lex.t2s"
 test -s "${dir}/lex.s2t" && pigz "${dir}/lex.s2t"
 
-echo "### Cleaning"
-test -s "${output_dir}/corpus.aln.gz" || rsync "${dir}/corpus.aln.gz" "${output_dir}/corpus.aln.gz"
-test -e "${dir}/lex.t2s" && rm "${dir}/lex.t2s"
-
 echo "### Shortlist pruning"
 test -s "${dir}/vocab.txt" ||
   "${MARIAN}/spm_export_vocab" --model="${vocab_path}" --output="${dir}/vocab.txt"

diff --git a/pipeline/clean/bicleaner.sh b/pipeline/clean/bicleaner.sh
@@ -0,0 +1,60 @@
+#!/bin/bash
+##
+# Cleans corpus using bicleaner-ai or bicleaner
+#
+# Usage:
+#   bash bicleaner.sh corpus_prefix output_prefix
+#
+
+set -x
+set -euo pipefail
+
+echo "###### Bicleaner filtering"
+
+test -v SRC
+test -v TRG
+test -v CLEAN_TOOLS
+test -v BICLEANER_THRESHOLD
+
+corpus_prefix=$1
+output_prefix=$2
+
+output_dir=$(dirname "${output_prefix}")
+tmp_dir="${output_dir}/tmp"
+mkdir -p "${tmp_dir}"
+
+source "${WORKDIR}/pipeline/setup/activate-python.sh"
+
+# bicleaner and bicleaner-ai have conflicting dependencies. installing on demand
+if [ ! -e "${output_prefix}.${SRC}.gz" ]; then
+  if bash "${CLEAN_TOOLS}/download-bicleaner-pack.sh" "${tmp_dir}" "bicleaner-ai"; then
+    echo "### Using bicleaner-ai"
+    pip install bicleaner-ai==1.0.1
+    cmd=bicleaner-ai-classify
+  elif bash "${CLEAN_TOOLS}/download-bicleaner-pack.sh" "${tmp_dir}" "bicleaner"; then
+    echo "### Using bicleaner"
+    pip install bicleaner==0.14
+    cmd=bicleaner-classify
+  else
+    echo "### Bicleaner language pack is not supported, skipping."
+    cp "${corpus_prefix}.${SRC}.gz" "${output_prefix}.${SRC}.gz"
+    cp "${corpus_prefix}.${TRG}.gz" "${output_prefix}.${TRG}.gz"
+    exit 0
+  fi
+fi
+
+echo "### Classifying and filtering"
+test -s "${output_prefix}.${SRC}.gz" || test -s "${tmp_dir}/best.gz" ||
+  paste <(pigz -dc "${corpus_prefix}.${SRC}.gz") <(pigz -dc "${corpus_prefix}.${TRG}.gz") |
+  ${cmd} --scol 1 --tcol 2 - - "${tmp_dir}"/*.yaml |
+  awk -F'\t' -v threshold="${BICLEANER_THRESHOLD}" '{if ($3>threshold) {print $0}}' |
+  pigz >"${tmp_dir}/best.gz"
+
+echo "### Writing output corpus"
+test -s "${output_prefix}.${SRC}.gz" || pigz -dc "${tmp_dir}/best.gz" | cut -f1 | pigz >"${output_prefix}.${SRC}.gz"
+test -s "${output_prefix}.${TRG}.gz" || pigz -dc "${tmp_dir}/best.gz" | cut -f2 | pigz >"${output_prefix}.${TRG}.gz"
+
+echo "### Cleaning files"
+rm -rf "${tmp_dir}"
+
+echo "###### Done: Bicleaner filtering"
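
Note on the classify-and-filter step above: pairs are kept when their score exceeds `BICLEANER_THRESHOLD`. The sketch below runs that kind of filter on made-up rows, assuming the classifier appends the score as a third tab-separated column after source and target. `filter_by_score` is a hypothetical helper, not part of the pipeline.

```bash
#!/bin/bash
# Hypothetical helper: keep rows whose third TAB-separated field (the score)
# is strictly above the threshold. -F'\t' stops awk from splitting on the
# spaces inside sentences.
filter_by_score() {
  local threshold=$1
  awk -F'\t' -v threshold="${threshold}" '$3 > threshold'
}

# Made-up sample rows: source, target, score.
printf 'first source sentence\tfirst target sentence\t0.91\nsecond source\tunrelated target text\t0.12\n' |
  filter_by_score 0.5 |
  cut -f1,2
```

With awk's default whitespace splitting, `$3` would be the third word of the source sentence rather than the score, so the explicit field separator is the part worth checking when reviewing a filter like this.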
diff --git a/pipeline/clean/ce-filter.sh b/pipeline/clean/ce-filter.sh
@@ -21,15 +21,23 @@ model_dir=$1
 corpus_prefix=$2
 output_prefix=$3
 
+if [ -e "${output_prefix}.${TRG}.gz" ]; then
+  echo "### Dataset already exists, skipping"
+  echo "###### Done: Cross entropy filtering"
+  exit 0
+fi
+
 # Part of the data to be removed (0.05 is 5%)
 remove=0.05
 model="${model_dir}/model.npz.best-ce-mean-words.npz"
 vocab="${model_dir}/vocab.spm"
-dir="${TMP}/scored"
 output_dir=$(dirname "${output_prefix}")
+dir="${output_dir}/scored"
 mkdir -p "${output_dir}"
 mkdir -p "${dir}"
 
+source "${WORKDIR}/pipeline/setup/activate-python.sh"
+
 echo "### Decompressing corpus"
 test -s "${dir}/corpus.${TRG}" || pigz -dc "${corpus_prefix}.${TRG}.gz" >"${dir}/corpus.${TRG}"
 test -s "${dir}/corpus.${SRC}" || pigz -dc "${corpus_prefix}.${SRC}.gz" >"${dir}/corpus.${SRC}"
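
Note on the skip logic in this commit: several scripts now exit early when their output exists and guard individual steps with `test -s <file> || <command>`. A minimal sketch of that pattern as a reusable function follows; `run_step` is a hypothetical helper, and the temporary `.part` file is an added assumption to avoid treating a half-written file as finished.

```bash
#!/bin/bash
# Hypothetical helper: run a command only if its output file is missing or empty.
# Usage: run_step <output_file> <command> [args...]
# The command's stdout becomes the content of <output_file>.
run_step() {
  local output=$1
  shift
  if [ -s "${output}" ]; then
    echo "### ${output} already exists, skipping" >&2
    return 0
  fi
  # Write to a temporary name first so an interrupted run never leaves
  # a partial file that a later `test -s` would mistake for a finished one.
  "$@" >"${output}.part" && mv "${output}.part" "${output}"
}

work_dir=$(mktemp -d)
run_step "${work_dir}/numbers.txt" seq 1 3    # runs seq and writes the file
run_step "${work_dir}/numbers.txt" seq 1 100  # skipped: file is non-empty
cat "${work_dir}/numbers.txt"
rm -rf "${work_dir}"
```

The trade-off is that a plain `test -s out || cmd >out` is shorter, but it cannot tell a complete output from one truncated by a crash.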

diff --git a/pipeline/clean/clean-corpus.sh b/pipeline/clean/clean-corpus.sh
@@ -19,12 +19,16 @@ test -v CLEAN_TOOLS
 data=$1
 output=$2
 
-mkdir -p "$(dirname "${output}")"
+dir="$(dirname "${output}")"
+tmp="${dir}/tmp"
+mkdir -p "${tmp}"
 
 # Check if files exist
 test -s "${data}.${SRC}.gz" || exit 1
 test -s "${data}.${TRG}.gz" || exit 1
 
+source "${WORKDIR}/pipeline/setup/activate-python.sh"
+
 echo "### Cleaning ${data}"
 
 ######################################################################
@@ -41,7 +45,7 @@ done
 echo "### Deduplication"
 test -s "${output}.${SRC}.gz" || test -s "${output}.${SRC}${TRG}.nrm.uniq.gz" ||
   paste <(pigz -dc "${output}.${SRC}.nrm.gz") <(pigz -dc "${output}.${TRG}.nrm.gz") |
-  LC_ALL=C sort -S 10G |
+  LC_ALL=C sort -S 10G -T "${tmp}" |
   uniq |
   pigz >"${output}.${SRC}${TRG}.nrm.uniq.gz"
 
@@ -58,7 +62,8 @@ test -s "${output}.${SRC}.gz" || test -s "${output}.${SRC}${TRG}.rule-based.gz"
 echo "### Language identification"
 test -s "${output}.${SRC}.gz" || test -s "${output}.${SRC}${TRG}.langid.gz" ||
   pigz -dc "${output}.${SRC}${TRG}.rule-based.gz" |
-  parallel --no-notice --pipe -k -j "$(nproc)" --block 50M \
+  # memory intensive
+  parallel --no-notice --pipe -k -j "$(echo "$(nproc)"/4 | bc)" --block 50M \
     "python3 -Wi ${CLEAN_TOOLS}/langid_fasttext.py -f 1 | python3 -Wi ${CLEAN_TOOLS}/langid_fasttext.py -f 1" |
   grep -P "^${SRC}\t${TRG}\t" |
   cut -f3,4 |
@@ -84,6 +89,7 @@ test -s "${output}.${TRG}.gz" || exit 1
 
 echo "### Remove ${data} from intermediate steps"
 rm -f "${output}".*.nrm.gz "${output}".*.nrm.uniq.gz "${output}".*.langid.gz "${output}".*.rule-based.gz
+rm -rf "${tmp}"
 
 echo "### Clean data is written to  ${output}"
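
Note on the language identification hunk above: the job count is now `nproc` divided by 4 because the step is memory intensive. On a machine with fewer than four cores the integer division yields 0, which GNU parallel documents as "run as many jobs as possible". A minimal sketch of clamping the result to at least one job follows; `quarter_jobs` is a hypothetical helper and is not part of the pipeline.

```bash
#!/bin/bash
# Hypothetical helper: a quarter of the given core count, but never less than 1.
quarter_jobs() {
  local cores=$1
  local jobs=$((cores / 4))
  if [ "${jobs}" -lt 1 ]; then
    jobs=1
  fi
  echo "${jobs}"
}

quarter_jobs 16  # prints 4
quarter_jobs 2   # prints 1
```

In a script this would be called as `quarter_jobs "$(nproc)"`. Shell arithmetic also removes the dependency on `bc` for this one calculation.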