Bicleaner support + fixes (mozilla#13)
SacreBLEU is a regular importer now, and evaluation is no longer limited to SacreBLEU datasets.

Added bicleaner-ai and bicleaner filtering (one or the other is used based on the available pretrained language packs).

Added a script to find all datasets for a language pair and importer type, formatted and ready to use in the config.

Fixed conda environment activation to be reproducible on GCP.

Other minor reproducibility fixes
eu9ene authored Jul 26, 2021
1 parent af2abbf commit ec783cf
Showing 35 changed files with 500 additions and 123 deletions.
3 changes: 3 additions & 0 deletions .gitmodules
@@ -7,3 +7,6 @@
[submodule "marian-dev"]
path = 3rd_party/marian-dev
url = https://github.com/browsermt/marian-dev
[submodule "3rd_party/kenlm"]
path = 3rd_party/kenlm
url = https://github.com/kpu/kenlm
1 change: 1 addition & 0 deletions 3rd_party/kenlm
Submodule kenlm added at bbf4fc
53 changes: 27 additions & 26 deletions README.md
@@ -13,7 +13,7 @@ It was tested on relatively high resource language pair `ru-en`. Low resource pa
- Ubuntu 18.04 (it can work on other Linux distributions, but might require fixes to the `setup` scripts; see more details in [marian installation instructions](https://marian-nmt.github.io/quickstart/)).
- One or several Nvidia GPUs with CUDA drivers installed and at least 8 GB of memory.
- At least 16 CPU cores (some steps of the pipeline utilize multiple cores pretty well, so the more the better).
- 64GB RAM
- 64 GB RAM (128 GB might be required for bigger datasets)
- 200+ GB of disk space (mostly for datasets and transformations).
It depends on the chosen datasets and can be significantly higher.

@@ -87,9 +87,9 @@ bash ./pipeline/.../<script>.sh <args>
#### To download exported models:

```
pit pull home firefox-translations-training/models/ru-en/exported/model.ruen.intgemm.alphas.bin.gz .
pit pull home firefox-translations-training/models/ru-en/exported/lex.50.50.ruen.s2t.bin.gz .
pit pull home firefox-translations-training/models/ru-en/exported/vocab.ruen.spm.gz .
pit pull home firefox-translations-training/models/ru-en/test/exported/model.ruen.intgemm.alphas.bin.gz .
pit pull home firefox-translations-training/models/ru-en/test/exported/lex.50.50.ruen.s2t.bin.gz .
pit pull home firefox-translations-training/models/ru-en/test/exported/vocab.ruen.spm.gz .
```

### Tensorboard
@@ -110,14 +110,15 @@ Step | Description | Bottleneck | Comments
--- | --- | --- | ---
Installation | Installing dependencies and compiling | CPU | Takes ~1 hour
Data downloading | Downloads datasets, samples sentences | Network, Disk | Time depends on dataset size, sampling of huge mono datasets (100M+ sentences) is the most intensive operation.
Data cleaning | Basic preprocessing, language specific, rule based, deduplication and other attempts to clean noisy data | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](/pipeline/clean/clean_parallel.py).
Data cleaning | Basic preprocessing, language specific, rule based, deduplication, and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](/pipeline/clean/clean_parallel.py).
Bicleaner | Filters noisy sentence pairs in a parallel corpus using [bicleaner](https://github.com/bitextor/bicleaner) or [bicleaner-ai](https://github.com/bitextor/bicleaner-ai) depending on available language packs. | CPU, GPU | If there are no pretrained language packs for bicleaner-ai, it uses bicleaner. If there are none for bicleaner either, this step is skipped. The cleaning threshold is controlled by the `BICLEANER_THRESHOLD` config setting.
Training s2s | Trains a backward shallow s2s model, which is useful for back-translations and ce-filtering | GPU | Inspired by a [marian example](https://github.com/marian-nmt/marian-examples/tree/master/training-basics-sentencepiece).
Augmentation with back-translations | Translates mono corpus combined from `MONO_DATASETS_TRG` using shallow s2s model. | GPU | It is more useful for low-resource languages and can be skipped for others.
Training teacher | Trains one or multiple big transformer models | GPU | You might want to adjust [early stopping](pipeline/train/configs/training/teacher.transformer.train.yml) parameters depending on the dataset size. Inspired by the [transformer](https://github.com/marian-nmt/marian-examples/tree/master/transformer) and [wmt2017-uedin](https://github.com/marian-nmt/marian-examples/tree/master/wmt2017-uedin) marian examples and extended with [SentencePiece](https://github.com/google/sentencepiece).
Translation by teacher | Translates a corpus and monolingual data combined from `MONO_DATASETS_SRC` using the teacher model (ensemble is not supported yet) | GPU | The slowest part of the pipeline. Can take days. It is possible to speed it up launching the same scripts ([corpus](pipeline/translate/translate-corpus.sh), [mono](pipeline/translate/translate-mono.sh)) in parallel from another machine with access to the same network directory.
Cross-entropy filtering | Scores the translated corpus with the backward s2s model and removes the part of the corpus with the lowest scores to reduce noise (see the sketch after this table) | GPU, CPU, Disk | At this point we work with huge datasets, so data is copied to a local disk to make things faster.
Training alignments and shortlist | Trains alignments using [fast_align](https://github.com/clab/fast_align) and extracts a lexical shortlist using the [extract_lex](https://github.com/marian-nmt/extract-lex) tool | CPU, Disk | Some tools require uncompressed datasets on disk, and they are huge at this point. Data is copied to a local disk to make things faster. Might take 100+ GB of local disk depending on the dataset size. Good CPU parallelization.
Training student | Trains a small transformer student model on filtered data and using alignments | GPU | Run [Tensorboard](pipeline/train/tensorboard/tensorboard.sh) manually to see training visualization.
Training student | Trains a small transformer student model on filtered data and using alignments | GPU | Run [Tensorboard](utils/tensorboard/tensorboard.sh) manually to see training visualization.
Fine-tuning student | Fine-tunes the student model by emulating 8-bit GEMM during training | GPU | Converges very quickly and then degrades. It's quick, but you might want to reduce the early-stopping threshold.
Quantization | Applies 8-bit quantization to the fine-tuned student model and evaluates on CPU | CPU | CPU threads must be set to 1 for this step.
Export | Exports the trained model and shortlist to the [bergamot-translator](https://github.com/mozilla/bergamot-translator) format | |
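
The cross-entropy filtering step boils down to attaching a backward-model score to every translated sentence pair and dropping the lowest-scoring fraction. The sketch below only illustrates that idea; the file names and the "higher is better" score convention are assumptions made for the example, and the real implementation lives in [ce-filter.sh](pipeline/clean/ce-filter.sh).

```
# Illustration only: keep the best-scoring (1 - remove) fraction of sentence pairs.
# Assumed inputs: scores.txt (one score per line, higher = better) and corpus.tsv (src<TAB>trg),
# produced by scoring the translated corpus with the backward model.
remove=0.05                                      # fraction of worst pairs to drop (as in ce-filter.sh)
total=$(wc -l < corpus.tsv)
keep=$(python3 -c "print(int(${total} * (1 - ${remove})))")

paste scores.txt corpus.tsv |                    # attach the score to each pair
  sort -g -r -k1,1 |                             # best-scoring pairs first
  head -n "${keep}" |                            # drop the lowest-scoring tail
  cut -f2,3 >corpus.filtered.tsv                 # strip the score column again
```
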
@@ -129,35 +130,28 @@ Dataset importers can be used in `TRAIN_DATASETS, DEVTEST_DATASETS, MONO_DATASET
Example:
```
TRAIN_DATASETS="opus_OPUS-ParaCrawl/v7.1 mtdata_newstest2019_ruen"
TEST_DATASETS="sacrebleu_wmt20 sacrebleu_wmt18"
```

Data source | Prefix | Name example | Type | Comments
--- | --- | --- | ---| ---
[MTData](https://github.com/thammegowda/mtdata) | mtdata | newstest2017_ruen | corpus | Supports many datasets. Run `mtdata list -l ru-en` to see datasets for a specific language pair.
[OPUS](opus.nlpl.eu/) | opus | OPUS-ParaCrawl/v7.1 | corpus | Many open source datasets. Go to the website, choose a language pair, and check the links in the Moses column to see which name and version are used in a link.
[OPUS](opus.nlpl.eu/) | opus | ParaCrawl/v7.1 | corpus | Many open source datasets. Go to the website, choose a language pair, and check the links in the Moses column to see which name and version are used in a link.
[SacreBLEU](https://github.com/mjpost/sacrebleu) | sacrebleu | wmt20 | corpus | Official evaluation datasets available in SacreBLEU tool. Recommended to use in `TEST_DATASETS`. Look up supported datasets and language pairs in `sacrebleu.dataset` python module.
[Paracrawl](https://paracrawl.eu/) | paracrawl-mono | paracrawl8 | mono | Datasets that are crawled from the web. Only [mono datasets](https://paracrawl.eu/index.php/moredata) are used in this importer. Parallel corpus is available using opus importer.
[News crawl](http://data.statmt.org/news-crawl) | news-crawl | news.2019 | mono | Some news monolingual datasets from [WMT21](https://www.statmt.org/wmt21/translation-task.html)
[Common crawl](https://commoncrawl.org/) | commoncrawl | wmt16 | mono | Huge web crawl datasets. The links are posted on [WMT21](https://www.statmt.org/wmt21/translation-task.html)

### Adding a new importer

Just add a shell script named `<prefix>.sh` to [corpus](pipeline/data/importers/corpus) or [mono]()
that accepts the same parameters as the other scripts in the same folder.
You can also use the [find-corpus](pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted for use in the config.

Example:

## Evaluation datasets

Only [SacreBLEU](https://github.com/mjpost/sacrebleu) datasets are supported at the moment.
`python ./pipeline/utils/find-corpus en ru opus`

Example:
```
TEST_DATASETS="wmt20 wmt18"
```
### Adding a new importer

To see what datasets are available for a language pair (for example, `ru-en`) run:
```
sacrebleu --list -l ru-en
```
Just add a shell script named `<prefix>.sh` to [corpus](pipeline/data/importers/corpus) or [mono]()
that accepts the same parameters as the other scripts in the same folder.
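
For illustration only, a new importer might look like the sketch below. The parameter order and the download URL are assumptions made for the example; copy the real signature from an existing script in the same folder.

```
#!/bin/bash
# pipeline/data/importers/corpus/<prefix>.sh - hypothetical importer sketch
# Assumed parameters (check an existing importer for the real signature):
#   $1 - source language code, e.g. ru
#   $2 - target language code, e.g. en
#   $3 - output prefix the pipeline expects the data under
#   $4 - dataset name/version from the config
set -x
set -euo pipefail

src=$1
trg=$2
output_prefix=$3
dataset=$4

# Download both sides of the corpus and store them compressed like the other importers do
wget -O "${output_prefix}.${src}.gz" "https://example.com/${dataset}/corpus.${src}.gz"
wget -O "${output_prefix}.${trg}.gz" "https://example.com/${dataset}/corpus.${trg}.gz"
```
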

## Development

@@ -217,8 +211,15 @@ At the same time it is possible to run it all locally end to end or to do intera
- Scripts should automatically inspect resources available for computation and utilize them to make things faster
(number of cores, memory).
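
As an example of this convention, a step can derive its parallelism from the machine it runs on instead of hard-coding it; the snippet below is a generic sketch (the filter command and file names are placeholders), in the spirit of what clean-corpus.sh does with `nproc`.

```
# Sketch: size parallelism from the available resources instead of hard-coding it.
cores=$(nproc)                                   # logical CPU cores
mem_gb=$(free -g | awk '/^Mem:/ {print $2}')     # total RAM in GB (Linux)

# Memory-hungry steps (e.g. language identification) get fewer workers.
heavy_jobs=$(( cores / 4 > 0 ? cores / 4 : 1 ))
echo "cores=${cores} ram=${mem_gb}GB -> using ${heavy_jobs} workers for memory-intensive steps"

pigz -dc corpus.gz |                             # placeholder input
  parallel --no-notice --pipe -k -j "${heavy_jobs}" --block 50M "python3 some_filter.py" |  # placeholder filter
  pigz >corpus.filtered.gz                       # placeholder output
```
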

## TODO

1. Add [bicleaner](https://github.com/bitextor/bicleaner/)
2. Add translation with an ensemble of teacher models
3. Add more importers
## References

1. V. M. Sánchez-Cartagena, M. Bañón, S. Ortiz-Rojas and G. Ramírez-Sánchez,
"[Prompsit's submission to WMT 2018 Parallel Corpus Filtering shared task](http://www.statmt.org/wmt18/pdf/WMT116.pdf)",
in *Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers*.
Brussels, Belgium: Association for Computational Linguistics, October 2018

2. Gema Ramírez-Sánchez, Jaume Zaragoza-Bernabeu, Marta Bañón and Sergio Ortiz Rojas
"[Bifixer and Bicleaner: two open-source tools to clean your parallel data.](https://eamt2020.inesc-id.pt/proceedings-eamt2020.pdf#page=311)",
in *Proceedings of the 22nd Annual Conference of the European Association for Machine Translation*.
Lisboa, Portugal: European Association for Machine Translation, November 2020
10 changes: 6 additions & 4 deletions config.sh
@@ -16,28 +16,30 @@ MODELS_DIR=${MODELS_DIR:-${WORKDIR}/models}
MARIAN=${MARIAN:-${WORKDIR}/3rd_party/marian-dev/build}
CLEAN_TOOLS=${WORKDIR}/pipeline/clean/tools
BIN=${WORKDIR}/bin
CONDA_DIR=${HOME}/miniconda3
TMP=/tmp

EXPERIMENT=test
SRC=ru
TRG=en

# parallel corpus
TRAIN_DATASETS="opus_OPUS-ParaCrawl/v7.1"
TRAIN_DATASETS="opus_ada83/v1 opus_UN/v20090831 opus_GNOME/v1 opus_wikimedia/v20210402 opus_CCMatrix/v1 opus_Wikipedia/v1.0 opus_tico-19/v2020-10-28 opus_KDE4/v2 opus_OpenSubtitles/v2018 opus_MultiUN/v1 opus_GlobalVoices/v2018q4 opus_ELRC_2922/v1 opus_PHP/v1 opus_Tatoeba/v2021-03-10 opus_Tanzil/v1 opus_XLEnt/v1.1 opus_TildeMODEL/v2018 opus_Ubuntu/v14.10 opus_TED2013/v1.1 opus_infopankki/v1 opus_EUbookshop/v2 opus_ParaCrawl/v8 opus_Books/v1 opus_WMT-News/v2019 opus_bible-uedin/v1 opus_WikiMatrix/v1 opus_QED/v2.0a opus_CCAligned/v1 opus_TED2020/v1 opus_News-Commentary/v16 opus_UNPC/v1.0"\
" mtdata_cc_aligned mtdata_airbaltic mtdata_GlobalVoices_2018Q4 mtdata_UNv1_test mtdata_neulab_tedtalksv1_train mtdata_neulab_tedtalksv1_dev mtdata_wmt13_commoncrawl mtdata_czechtourism mtdata_paracrawl_bonus mtdata_worldbank mtdata_wiki_titles_v1 mtdata_WikiMatrix_v1 mtdata_wmt18_news_commentary_v13 mtdata_wiki_titles_v2 mtdata_news_commentary_v14 mtdata_UNv1_dev mtdata_neulab_tedtalksv1_test mtdata_JW300"
DEVTEST_DATASETS="mtdata_newstest2019_ruen mtdata_newstest2017_ruen mtdata_newstest2015_ruen mtdata_newstest2014_ruen"
# sacrebleu
TEST_DATASETS="wmt20 wmt18 wmt16 wmt13"
TEST_DATASETS="sacrebleu_wmt20 sacrebleu_wmt18 sacrebleu_wmt16 sacrebleu_wmt13"
# monolingual datasets (ex. paracrawl-mono_paracrawl8, commoncrawl_wmt16, news-crawl_news.2020)
# to be translated by the teacher model
MONO_DATASETS_SRC="news-crawl_news.2020 news-crawl_news.2019 news-crawl_news.2018 news-crawl_news.2017 "\
"news-crawl_news.2016 news-crawl_news.2015 news-crawl_news.2014 news-crawl_news.2013 news-crawl_news.2012 "\
"news-crawl_news.2011"
# to be translated by the shallow s2s model to augment teacher corpus with back-translations
# leave empty to skip augmentation step (high resource languages)
MONO_DATASETS_TRG="news-crawl_news.2020"
MONO_DATASETS_TRG=""
# limits per downloaded dataset
MONO_MAX_SENTENCES_SRC=100000000
MONO_MAX_SENTENCES_TRG=20000000
BICLEANER_THRESHOLD=0.5


# marian --devices parameter for GPUs to use, for example 0 1 2 3
34 changes: 18 additions & 16 deletions pipeline/alignment/generate-alignment-and-shortlist.sh
@@ -19,17 +19,26 @@ corpus_prefix=$1
vocab_path=$2
output_dir=$3

if [ -e "${output_dir}/corpus.aln.gz" ] && [ -e "${output_dir}/lex.s2t.pruned.gz" ]; then
echo "### Alignments and shortlist already exist, skipping"
echo "###### Done: Generating alignments and shortlist"
exit 0
fi


test -e "${BIN}/atools" || exit 1
test -e "${BIN}/extract_lex" || exit 1
test -e "${BIN}/fast_align" || exit 1

mkdir -p "${output_dir}"
dir="${TMP}/alignment"
dir="${output_dir}/tmp"
mkdir -p "${dir}"

corpus_src="${corpus_prefix}.${SRC}.gz"
corpus_trg="${corpus_prefix}.${TRG}.gz"

source "${WORKDIR}/pipeline/setup/activate-python.sh"

echo "### Subword segmentation with SentencePiece"
test -s "${dir}/corpus.spm.${SRC}.gz" ||
pigz -dc "${corpus_src}" |
@@ -41,42 +50,35 @@ test -s "${dir}/corpus.spm.${TRG}.gz" ||
pigz >"${dir}/corpus.spm.${TRG}.gz"

echo "### Creating merged corpus"
test -s "${dir}/corpus.aln.gz" || test -s "${dir}/corpus" ||
test -s "${output_dir}/corpus.aln.gz" || test -s "${dir}/corpus" ||
paste <(pigz -dc "${dir}/corpus.spm.${SRC}.gz") <(pigz -dc "${dir}/corpus.spm.${TRG}.gz") |
sed 's/\t/ ||| /' >"${dir}/corpus"

echo "### Training alignments"
test -s "${dir}/corpus.aln.gz" ||
test -s "${dir}/align.s2t.gz" ||
test -s "${output_dir}/corpus.aln.gz" || test -s "${dir}/align.s2t.gz" ||
"${BIN}/fast_align" -vod -i "${dir}/corpus" |
pigz >"${dir}/align.s2t.gz"
test -s "${dir}/corpus.aln.gz" ||
test -s "${dir}/align.t2s.gz" ||
test -s "${output_dir}/corpus.aln.gz" || test -s "${dir}/align.t2s.gz" ||
"${BIN}/fast_align" -vodr -i "${dir}/corpus" |
pigz >"${dir}/align.t2s.gz"
test -s "${dir}/corpus" && rm "${dir}/corpus"

echo "### Symmetrizing alignments"
test -s "${dir}/corpus.aln.gz" || pigz -d "${dir}/align.s2t.gz" "${dir}/align.t2s.gz"
test -s "${dir}/corpus.aln.gz" ||
test -s "${output_dir}/corpus.aln.gz" || test -s "${dir}/align.t2s" ||
pigz -d "${dir}/align.s2t.gz" "${dir}/align.t2s.gz"
test -s "${output_dir}/corpus.aln.gz" ||
"${BIN}/atools" -i "${dir}/align.s2t" -j "${dir}/align.t2s" -c grow-diag-final-and |
pigz >"${dir}/corpus.aln.gz"
test -s "${dir}/align.s2t" && rm "${dir}"/align.???
pigz >"${output_dir}/corpus.aln.gz"

echo "### Creating shortlist"
test -s "${dir}/lex.s2t.gz" ||
"${BIN}/extract_lex" \
"${dir}/corpus.spm.${TRG}.gz" \
"${dir}/corpus.spm.${SRC}.gz" \
"${dir}/corpus.aln.gz" \
"${output_dir}/corpus.aln.gz" \
"${dir}/lex.s2t" \
"${dir}/lex.t2s"
test -s "${dir}/lex.s2t" && pigz "${dir}/lex.s2t"

echo "### Cleaning"
test -s "${output_dir}/corpus.aln.gz" || rsync "${dir}/corpus.aln.gz" "${output_dir}/corpus.aln.gz"
test -e "${dir}/lex.t2s" && rm "${dir}/lex.t2s"

echo "### Shortlist pruning"
test -s "${dir}/vocab.txt" ||
"${MARIAN}/spm_export_vocab" --model="${vocab_path}" --output="${dir}/vocab.txt"
60 changes: 60 additions & 0 deletions pipeline/clean/bicleaner.sh
@@ -0,0 +1,60 @@
#!/bin/bash
##
# Cleans corpus using bicleaner-ai or bicleaner
#
# Usage:
# bash bicleaner.sh corpus_prefix output_prefix
#

set -x
set -euo pipefail

echo "###### Bicleaner filtering"

test -v SRC
test -v TRG
test -v CLEAN_TOOLS
test -v BICLEANER_THRESHOLD

corpus_prefix=$1
output_prefix=$2

output_dir=$(dirname "${output_prefix}")
tmp_dir="${output_dir}/tmp"
mkdir -p "${tmp_dir}"

source "${WORKDIR}/pipeline/setup/activate-python.sh"

# bicleaner and bicleaner-ai have conflicting dependencies. installing on demand
if [ ! -e "${output_prefix}.${SRC}.gz" ]; then
  if bash "${CLEAN_TOOLS}/download-bicleaner-pack.sh" "${tmp_dir}" "bicleaner-ai"; then
    echo "### Using bicleaner-ai"
    pip install bicleaner-ai==1.0.1
    cmd=bicleaner-ai-classify
  elif bash "${CLEAN_TOOLS}/download-bicleaner-pack.sh" "${tmp_dir}" "bicleaner"; then
    echo "### Using bicleaner"
    pip install bicleaner==0.14
    cmd=bicleaner-classify
  else
    echo "### Bicleaner language pack is not supported, skipping."
    cp "${corpus_prefix}.${SRC}.gz" "${output_prefix}.${SRC}.gz"
    cp "${corpus_prefix}.${TRG}.gz" "${output_prefix}.${TRG}.gz"
    exit 0
  fi
fi

echo "### Classifying and filtering"
test -s "${output_prefix}.${SRC}.gz" || test -s "${tmp_dir}/best.gz" ||
paste <(pigz -dc "${corpus_prefix}.${SRC}.gz") <(pigz -dc "${corpus_prefix}.${TRG}.gz") |
${cmd} --scol 1 --tcol 1 - - "${tmp_dir}"/*.yaml |
awk -v threshold=${BICLEANER_THRESHOLD} '{if ($3>threshold) {print $0}}' |
pigz >"${tmp_dir}/best.gz"

echo "### Writing output corpus"
test -s "${output_prefix}.${SRC}.gz" || pigz -dc "${tmp_dir}/best.gz" | cut -f1 | pigz >"${output_prefix}.${SRC}.gz"
test -s "${output_prefix}.${TRG}.gz" || pigz -dc "${tmp_dir}/best.gz" | cut -f2 | pigz >"${output_prefix}.${TRG}.gz"

echo "### Cleaning files"
rm -rf "${tmp_dir}"

echo "###### Done: Bicleaner filtering"
10 changes: 9 additions & 1 deletion pipeline/clean/ce-filter.sh
@@ -21,15 +21,23 @@ model_dir=$1
corpus_prefix=$2
output_prefix=$3

if [ -e "${output_prefix}.${TRG}.gz" ]; then
echo "### Dataset already exists, skipping"
echo "###### Done: Cross entropy filtering"
exit 0
fi

# Part of the data to be removed (0.05 is 5%)
remove=0.05
model="${model_dir}/model.npz.best-ce-mean-words.npz"
vocab="${model_dir}/vocab.spm"
dir="${TMP}/scored"
output_dir=$(dirname "${output_prefix}")
dir="${output_dir}/scored"
mkdir -p "${output_dir}"
mkdir -p "${dir}"

source "${WORKDIR}/pipeline/setup/activate-python.sh"

echo "### Decompressing corpus"
test -s "${dir}/corpus.${TRG}" || pigz -dc "${corpus_prefix}.${TRG}.gz" >"${dir}/corpus.${TRG}"
test -s "${dir}/corpus.${SRC}" || pigz -dc "${corpus_prefix}.${SRC}.gz" >"${dir}/corpus.${SRC}"
12 changes: 9 additions & 3 deletions pipeline/clean/clean-corpus.sh
@@ -19,12 +19,16 @@ test -v CLEAN_TOOLS
data=$1
output=$2

mkdir -p "$(dirname "${output}")"
dir="$(dirname "${output}")"
tmp="${dir}/tmp"
mkdir -p "${tmp}"

# Check if files exist
test -s "${data}.${SRC}.gz" || exit 1
test -s "${data}.${TRG}.gz" || exit 1

source "${WORKDIR}/pipeline/setup/activate-python.sh"

echo "### CLeaning ${data}"

######################################################################
@@ -41,7 +45,7 @@ done
echo "### Deduplication"
test -s "${output}.${SRC}.gz" || test -s "${output}.${SRC}${TRG}.nrm.uniq.gz" ||
paste <(pigz -dc "${output}.${SRC}.nrm.gz") <(pigz -dc "${output}.${TRG}.nrm.gz") |
LC_ALL=C sort -S 10G |
LC_ALL=C sort -S 10G -T "${tmp}" |
uniq |
pigz >"${output}.${SRC}${TRG}.nrm.uniq.gz"

@@ -58,7 +62,8 @@ test -s "${output}.${SRC}.gz" || test -s "${output}.${SRC}${TRG}.rule-based.gz"
echo "### Language identification"
test -s "${output}.${SRC}.gz" || test -s "${output}.${SRC}${TRG}.langid.gz" ||
pigz -dc "${output}.${SRC}${TRG}.rule-based.gz" |
parallel --no-notice --pipe -k -j "$(nproc)" --block 50M \
# memory intensive
parallel --no-notice --pipe -k -j "$(echo "$(nproc)"/4 | bc)" --block 50M \
"python3 -Wi ${CLEAN_TOOLS}/langid_fasttext.py -f 1 | python3 -Wi ${CLEAN_TOOLS}/langid_fasttext.py -f 1" |
grep -P "^${SRC}\t${TRG}\t" |
cut -f3,4 |
@@ -84,6 +89,7 @@ test -s "${output}.${TRG}.gz" || exit 1

echo "### Remove ${data} from intermediate steps"
rm -f "${output}".*.nrm.gz "${output}".*.nrm.uniq.gz "${output}".*.langid.gz "${output}".*.rule-based.gz
rm -rf "${tmp}"

echo "### Clean data is written to ${output}"

