- Wikipedia text data if training from scratch. We also release our training data for English (en), Spanish (es), Hindi (hi), and Russian (ru) here.

We first describe the basic usage of our scripts.
`init.sh` clones the official BERT repo and creates a `test_data_folder` directory with dummy text.
`preprocess_corpus.py` takes in a text file and tokenizes it; an additional parameter can be passed to control whether the output should be converted into a fake language.
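How exactly the fake language is produced is defined inside `preprocess_corpus.py`; the snippet below is only a rough, hypothetical sketch, assuming the common trick of shifting each character's Unicode code point by a fixed offset, which keeps the corpus statistics intact while making the vocabulary disjoint from real English. The offset value and function name are illustrative, not taken from the script.

```python
# Hypothetical sketch: build a "fake" language by shifting each character's
# Unicode code point by a fixed offset, so the text keeps its structure and
# statistics but shares no vocabulary with the original language.
FAKE_OFFSET = 0x10000  # illustrative offset into an otherwise unused range


def make_fake(line: str, offset: int = FAKE_OFFSET) -> str:
    """Map every non-whitespace character to a shifted code point."""
    return "".join(ch if ch.isspace() else chr(ord(ch) + offset) for ch in line)


if __name__ == "__main__":
    # Token boundaries are preserved; characters are disjoint from English.
    print(make_fake("hello world"))
```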
`run.sh` shards the text files, creates a vocabulary for them, builds BERT-readable TensorFlow records, and uploads everything to Google Cloud Storage.
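The full pipeline is implemented in `run.sh` itself; as an illustration of just the sharding step, the hypothetical sketch below deals whole documents (separated by blank lines, which is how the BERT data scripts mark document boundaries) out across a fixed number of shard files. Vocabulary creation, TFRecord generation, and the upload are left to the real script, and all names and paths here are illustrative.

```python
def shard_by_document(path, num_shards=8, prefix="shard"):
    """Hypothetical sketch of the sharding step only: documents (blocks of
    lines separated by a blank line) are dealt out round-robin across
    shard files."""
    with open(path, encoding="utf-8") as f:
        documents = [d.strip() for d in f.read().split("\n\n") if d.strip()]
    shards = [[] for _ in range(num_shards)]
    for i, doc in enumerate(documents):
        shards[i % num_shards].append(doc)
    for i, docs in enumerate(shards):
        out_path = f"{prefix}-{i:05d}-of-{num_shards:05d}.txt"
        with open(out_path, "w", encoding="utf-8") as out:
            out.write("\n\n".join(docs) + "\n")


# shard_by_document("test_data_folder/txt/en.txt", num_shards=8)
```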
`create_pretraining_data_permutation.py` creates pre-training data with permuted sentences; the permutation probability and method can be freely chosen.
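The supported permutation methods and their flags are defined in the script itself; the snippet below is only a minimal sketch of the idea, assuming a per-document permute probability and two hypothetical methods (a full shuffle and adjacent swaps). All names and defaults here are illustrative.

```python
import random


def maybe_permute(sentences, permute_prob=0.5, method="shuffle", rng=None):
    """Hypothetical sketch: with probability `permute_prob`, reorder the
    sentences of one document. "shuffle" permutes all sentences uniformly;
    "swap" exchanges random adjacent pairs."""
    rng = rng or random.Random()
    sentences = list(sentences)
    if rng.random() >= permute_prob:
        return sentences  # leave this document untouched
    if method == "shuffle":
        rng.shuffle(sentences)
    elif method == "swap":
        for i in range(0, len(sentences) - 1, 2):
            if rng.random() < 0.5:
                sentences[i], sentences[i + 1] = sentences[i + 1], sentences[i]
    return sentences


doc = ["First sentence.", "Second sentence.", "Third sentence."]
print(maybe_permute(doc, permute_prob=1.0, rng=random.Random(0)))
```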
`frequency_based_shuffle.py` takes in a text corpus and shuffles it so that every word is replaced by a random word sampled from the vocabulary's frequency distribution.
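A minimal sketch of this transformation, using hypothetical function and variable names: count word frequencies over the corpus, then replace each token with a word sampled from that unigram distribution.

```python
import collections
import random


def frequency_based_shuffle(lines, seed=0):
    """Sketch: replace every token with a word drawn at random from the
    corpus-wide unigram (frequency) distribution of the vocabulary."""
    counts = collections.Counter(tok for line in lines for tok in line.split())
    vocab = list(counts)
    weights = [counts[w] for w in vocab]
    rng = random.Random(seed)
    return [
        " ".join(rng.choices(vocab, weights=weights, k=len(line.split())))
        for line in lines
    ]


corpus = ["the cat sat on the mat", "the dog barked"]
print(frequency_based_shuffle(corpus))
```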
An example run that creates data containing English and Fake English:
```bash
./init.sh

python preprocess_corpus.py \
  --corpus test_data_folder/raw_text/test.txt \
  --output test_data_folder/txt/en.txt

python preprocess_corpus.py \
  --corpus test_data_folder/raw_text/test.txt \
  --output test_data_folder/txt/en-fake.txt \
  --make_fake

./run.sh
```
`run.sh` requires a valid Google Cloud Storage bucket to which the data is uploaded, and `gsutil` must be installed to copy the files to the bucket.