forked from lisskor/firefox-translations-training
Commit
- Flores dataset importer
- custom dataset importer
- ability to use a pre-trained backward model
- save experiment config on start
- stubs for dataset caching (decided to sync implementation with workflow manager integration)
- use best bleu models instead of best ce-mean-words
- fix linting warnings
Showing 15 changed files with 182 additions and 47 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
#!/bin/bash | ||
## | ||
# Use custom dataset that is already downloaded to a local disk | ||
# Local path prefix without `.<lang_code>.gz` should be specified as a "dataset" parameter | ||
# | ||
# Usage: | ||
# bash custom-corpus.sh source target dir dataset | ||
# | ||
|
||
set -x | ||
set -euo pipefail | ||
|
||
echo "###### Copying custom corpus" | ||
|
||
src=$1 | ||
trg=$2 | ||
dir=$3 | ||
dataset=$4 | ||
|
||
cp "${dataset}.${src}.gz" "${dir}/" | ||
cp "${dataset}.${trg}.gz" "${dir}/" | ||
|
||
|
||
echo "###### Done: Copying custom corpus" |
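
The header comment says the "dataset" argument is a local path prefix without `.<lang_code>.gz`. A minimal sketch of what that means in practice follows. The scratch directory, the `mycorpus` prefix, the `en`/`es` pair, and the `copy_custom_corpus` wrapper are all hypothetical; the wrapper only repeats the two `cp` lines from the script so the example runs on its own.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical scratch layout; none of these paths come from the commit.
work=$(mktemp -d)
mkdir -p "${work}/data" "${work}/out"

# A tiny parallel corpus stored as <prefix>.<lang_code>.gz, one file per side.
printf 'hello\n' | gzip > "${work}/data/mycorpus.en.gz"
printf 'hola\n' | gzip > "${work}/data/mycorpus.es.gz"

# Same copy step as the script above, wrapped in a function for illustration.
copy_custom_corpus() {
  local src=$1 trg=$2 dir=$3 dataset=$4
  cp "${dataset}.${src}.gz" "${dir}/"
  cp "${dataset}.${trg}.gz" "${dir}/"
}

# The "dataset" argument is the prefix only: no language code, no .gz suffix.
copy_custom_corpus en es "${work}/out" "${work}/data/mycorpus"

# The copies keep their original base names inside the target directory.
ls "${work}/out"
```

Because the destination is a directory, the files land under their original base names; any downstream step would need to expect `mycorpus.<lang>.gz` there.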
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
#!/bin/bash | ||
## | ||
# Downloads flores dataset | ||
# Dataset type can be "dev" or "devtest" | ||
# | ||
# Usage: | ||
# bash flores.sh source target dir dataset | ||
# | ||
|
||
set -x | ||
set -euo pipefail | ||
|
||
echo "###### Downloading flores corpus" | ||
|
||
src=$1 | ||
trg=$2 | ||
dir=$3 | ||
dataset=$4 | ||
|
||
tmp="${dir}/flores" | ||
mkdir -p "${tmp}" | ||
|
||
test -s "${tmp}/flores101_dataset.tar.gz" || | ||
wget -O "${tmp}/flores101_dataset.tar.gz" "https://dl.fbaipublicfiles.com/flores101/dataset/flores101_dataset.tar.gz" | ||
|
||
tar -xzf "${tmp}/flores101_dataset.tar.gz" -C "${tmp}" --no-same-owner | ||
|
||
source "${WORKDIR}/pipeline/setup/activate-python.sh" | ||
|
||
flores_code() { | ||
code=$1 | ||
|
||
if [ "${code}" == "zh" ] || [ "${code}" == "zh-Hans" ]; then | ||
flores_code="zho_simpl" | ||
elif [ "${code}" == "zh-Hant" ]; then | ||
flores_code="zho_trad" | ||
else | ||
flores_code=$(python -c "from mtdata.iso import iso3_code; print(iso3_code('${code}', fail_error=True))") | ||
fi | ||
|
||
echo "${flores_code}" | ||
} | ||
|
||
src_flores=$(flores_code "${src}") | ||
trg_flores=$(flores_code "${trg}") | ||
|
||
cp "${tmp}/flores101_dataset/${dataset}/${src_flores}.${dataset}" "${dir}/flores.${src}" | ||
cp "${tmp}/flores101_dataset/${dataset}/${trg_flores}.${dataset}" "${dir}/flores.${trg}" | ||
|
||
rm -rf "${tmp}" | ||
|
||
echo "###### Done: Downloading flores corpus" |
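
The script maps pipeline language codes to Flores file names: Chinese variants are special-cased, and everything else goes through `mtdata.iso.iso3_code`. The sketch below illustrates only the mapping and the resulting file paths. It replaces the `mtdata` lookup with a small hard-coded table, which is an assumption made so the example runs without Python; the `en`, `ru`, and `de` entries are illustrative ISO 639-3 codes, not taken from the commit.

```shell
#!/bin/bash
set -euo pipefail

# Stand-in for the mtdata-based lookup (assumption): a tiny hard-coded table.
flores_code() {
  local code=$1
  case "${code}" in
    zh | zh-Hans) echo "zho_simpl" ;;
    zh-Hant) echo "zho_trad" ;;
    en) echo "eng" ;;
    ru) echo "rus" ;;
    de) echo "deu" ;;
    *)
      echo "unsupported language code: ${code}" >&2
      return 1
      ;;
  esac
}

# Dataset type is "dev" or "devtest", as in the script's header comment.
dataset="dev"

# Show which file inside the extracted archive each language would resolve to.
for lang in en ru zh zh-Hant; do
  code=$(flores_code "${lang}")
  echo "${lang} -> flores101_dataset/${dataset}/${code}.${dataset}"
done
```

Using `local` inside the function keeps `code` from leaking into the caller's scope, and returning a non-zero status for unknown codes lets `set -e` stop the run early, much as `fail_error=True` is meant to do in the original lookup.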
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
#!/bin/bash | ||
## | ||
# Use custom monolingual dataset that is already downloaded to a local disk | ||
# Local path prefix without `.<lang_code>.gz` should be specified as a "dataset" parameter | ||
# | ||
# Usage: | ||
# bash custom-mono.sh lang output_prefix dataset | ||
# | ||
|
||
set -x | ||
set -euo pipefail | ||
|
||
echo "###### Copying custom monolingual dataset" | ||
|
||
lang=$1 | ||
output_prefix=$2 | ||
dataset=$3 | ||
|
||
cp "${dataset}.${lang}.gz" "${output_prefix}.${lang}.gz" | ||
|
||
|
||
echo "###### Done: Copying custom monolingual dataset" |
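
Unlike the parallel-corpus importer, the monolingual importer copies to `output_prefix`, so the file is renamed on the way. A short sketch under assumed names: the scratch directory, the `news` prefix, the `custom-mono` output prefix, and the `copy_custom_mono` wrapper are hypothetical, and the wrapper only repeats the script's single `cp` line.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical scratch layout; none of these paths come from the commit.
scratch=$(mktemp -d)
mkdir -p "${scratch}/local" "${scratch}/mono"
printf 'guten tag\n' | gzip > "${scratch}/local/news.de.gz"

# Same copy step as the script above, wrapped in a function for illustration.
copy_custom_mono() {
  local lang=$1 output_prefix=$2 dataset=$3
  cp "${dataset}.${lang}.gz" "${output_prefix}.${lang}.gz"
}

# Both prefixes omit the `.<lang_code>.gz` suffix; the script appends it.
copy_custom_mono de "${scratch}/mono/custom-mono" "${scratch}/local/news"

# The copy now lives under the output prefix, not the original base name.
gzip -dc "${scratch}/mono/custom-mono.de.gz"
```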