Training script #5

lparam · 2017-11-26T15:16:52Z

@bheinzerling
Could you provide training script?I want to train with my own data.

alejandrojcastaneira · 2019-02-20T10:56:58Z

Hello. Thanks for the great work! again. I'm also interested in training a BPEmb embeddings on my custom data. There would be way or an example of how to apply this feature.

Best Regards

Danil328 · 2019-04-10T08:50:33Z

+1

bheinzerling · 2019-04-10T09:31:31Z

Most of my original training script deals with training many different embeddings for all languages on a cluster (not sure how much sense it makes to share this), but the basic procedure is quite simple:

Preprocess corpus.
Learn BPE model on corpus, using SentencePiece.
Encode corpus with BPE model, again using SentencePiece.
Learn embeddings on encoded corpus, using GloVe.

$sentencepiece_dir=/install/sentencepiece/and/set/this/path
$glove_dir=/install/glove/and/set/this/path

$corpus=corpus.txt
$corpus_preproc=corpus_preproc.txt
$vocab_size=100000
$emb_dim=100
$model_type=bpe
$model_prefix=${corpus_preproc}.${model_type}.${vocab_size}
$emb_out=$model_prefix.d${emb_dim}

# preprocessing
# you probably want to lowercase everything and replace all digits with 0
# the preprocessing I used is quite specific to Wikipedia, depending on your corpus you can do something much simpler

# remove wikipedia section header === and article title ''' markers, silly sentence split on "  " and remove initial whitespace
sed "s/===\+/\n/g;s/'''//g;s/  /\n/g" $corpus | perl -C -pe 's/\x{200B}|\x{200C}|\x{200D}|\x{200E}|\x{202C}|\x{96}//g' | tr -s [[:blank:]] " " | sed -re 's/\xc2\x91\|\xc2\x92\|\xc2\xa0\|\xe2\x80\x8e//g;s#(https?://[^">< ]+)#🔗#g;s/[0-9]/0/g;s/^ \+//'  | grep ".\{100\}" | sed "s/^ //" > $corpus_preproc

# train SentencePiece model
$sentencepiece_dir/bin/spm_train --split_by_whitespace true --input $corpus_preproc --model_prefix $model_prefix --vocab_size $vocab_size --model_type $model_type

# encode preprocessed corpus with the trained SentencePiece model
$model_file=${model_prefix}.model
$corpus_encoded=corpus_encoded.txt
# encoding to numerical IDs (--output_format=id) saves you headaches if your corpus contains weird whitespace characters that might get treated differently between SentencePiece and Glove. You can leave this out if your corpus is quite clean.
cat $corpus_preproc | $sentencepiece_dir/bin/spm_encode --model $model_file --output $corpus_encoded --extra_options=bos:eos # --output_format=id

# train BPE embeddings with GloVe
$glove_dir/run.sh $corpus_encoded $emb_out $emb_dim

This will give you BPE embeddings in GloVe format ${emb_out}.glove.txt

I copy&pasted this from my actual scripts, let me know if this works for you.

Finally, the embeddings in glove format are in different order than the subwords in the BPE vocabulary, so the last step is to reorder them. If the above works for you I can think of a way to properly add this to the repo (not just a comment) and maybe can make it into a push-button solution.

Danil328 · 2019-04-10T12:53:37Z

Thank you very much!

alejandrojcastaneira · 2019-06-06T14:45:51Z

Hello, I manage to train my own embeddings using glove based on your spm then I try to load them into bpemb as you commented in #23 by using:

from bpemb.util import sentencepiece_load, load_word2vec_file

bpemb = BPEmb(lang='en')
bpemb.spm = sentencepiece_load('/some/folder/en.wiki.bpe.vs200000.model')
bpemb.emb = load_word2vec_file('/some/folder/my_byte_pair_emb.w2v.bin')

but I still didn't make the reorder of the vectors, could you help me with an insight on this?

Best regards

bheinzerling · 2019-06-10T02:36:30Z

Assuming you have a SentencePiece .vocab file for your model, let's first write a helper function for loading this:

def get_vocab(vocab_file, vocab_size):
    with vocab_file.open(encoding="utf8") as f:
        # read lines, ignoring fun characters such as 'LINE SEPARATOR' (U+2028)
        # which Python treats as line breaks when reading files
        # with the ususal 'for line in f' pattern
        vocab_lines = f.read().split("\n")[:-1]
    assert len(vocab_lines) == vocab_size
    vocab, ranks = zip(*map(lambda l: l.split("\t"), vocab_lines))
    return vocab

Now the function for converting from GloVe order embeddings to the proper order:

from gensim.models import keyedvectors
from dougu import to_from_idx  # https://github.com/bheinzerling/dougu/blob/d90e6c0ba92e61378c3c03df78ce5ba020f65ff8/dougu/iters.py#L70
import numpy as np

def convert_emb(glove_order_vocab_file, glove_order_emb_file):
    glove_order_vocab = get_vocab(glove_order_vocab_file)
    piece2id, id2piece = to_from_idx(vocab)
    glove_order_emb = keyedvectors.load_word2vec_format(glove_order_emb_file)
    v = glove_order_emb.vectors
    # sample embeddings for symbols that didn't occur in the training
    # data from normal distribution with same mean and variance
    new_v = v.std() * np.random.randn(len(glove_order_vocab), v.shape[1]) + v.mean()
    new_vocab = {}
    # go through all entries (piece) in the vocabulary with their corresponding id
    for id, piece in id2piece.items():
        try:
            new_v[id] = glove_order_emb[str(id)]  # str(id) assumes you used '--output_format=id', as described here https://github.com/bheinzerling/bpemb/issues/5#issuecomment-481616023
        except KeyError:
            pass
        # gensim sorts embeddings by -count when saving
        # set count to -id to preserve sentencepiece order
        assert piece not in new_vocab
        new_vocab[piece] = keyedvectors.Vocab(count=-id, index=id)

    proper_order_emb.index2word = id2piece
    proper_order_emb.vocab = new_vocab
    proper_order_emb.vectors = new_v
    return proper_order_emb

Copied this together from my actual scripts, let me know if this works for you.

stefan-it · 2019-09-03T11:21:54Z

@bheinzerling I would be awesome if the training routine could be added here (I'm currently training bpemb's for historic texts).

Currently, I'm using the default parameters as provided in the GloVe demo script (I only adjusted dimenstion size to 300) 🤗

bheinzerling · 2019-09-04T02:04:26Z

@stefan-it The main difference to the demo script is setting VOCAB_MIN_COUNT=0 which creates embeddings for all byte-pair symbols, not just frequent ones.

#! /usr/bin/env bash
set -eou pipefail

# set this to something else if you want to keep GloVe co-occurrence files permanently,
# say, to create embeddings of the same corpus with different dimensions
TMP=/tmp
mkdir -p $TMP

# need to set this
BUILDDIR=/SET/THIS/TO/PATH/OF/glove/build

# set this to something appropriate for your system
NUM_THREADS=24

# path of single plain text file containing the byte-pair encoded corpus
CORPUS=$1
# where the GloVe files should be saved
OUT=$2
# GloVe embedding dim
VECTOR_SIZE=$3

FNAME=$(echo $CORPUS | sed "s#/#_#g")
SAVE_FILE=$OUT.glove
VERBOSE=2
MEMORY=64.0

# we want embeddings for *all* BPE symbols
VOCAB_MIN_COUNT=0

MAX_ITER=50
WINDOW_SIZE=15
BINARY=0
X_MAX=10


# this part is probably not necessary unless you create lots of embeddings
VOCAB_FILE=$TMP/$FNAME.vocab.txt
COOCCURRENCE_FILE=$TMP/$FNAME.cooccurrence.bin
COOCCURRENCE_SHUF_FILE=$TMP/$FNAME.cooccurrence.shuf.bin
# random filenames for overflow and tempshuf files to prevent naming clashes
OVERFLOW=$TMP/${FNAME}.overflow_$(echo $RANDOM $RANDOM $RANDOM $RANDOM $RANDOM | md5sum | cut -c -8)
TEMPSHUF=$TMP/${FNAME}.tempshuf_$(echo $RANDOM $RANDOM $RANDOM $RANDOM $RANDOM | md5sum | cut -c -8)
# create vocab and cooccurrence files only once
if [ ! -f $VOCAB_FILE ]; then
	echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
	$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
fi
if [ ! -f $COOCCURRENCE_FILE ]; then
	echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
	$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE -overflow-file $OVERFLOW < $CORPUS > $COOCCURRENCE_FILE
	if [ -f $OVERFLOW ]; then
		rm $OVERFLOW
	fi
fi
if [ ! -f $COOCCURRENCE_SHUF_FILE ]; then
	echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE -temp-file $TEMPSHUF < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
	$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE -temp-file $TEMPSHUF < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
	if [ -f $TEMPSHUF ]; then
		rm $TEMPSHUF
	fi
fi

# print the command we're running
echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE -write-header 1 -alpha 0.75 -eta 0.03"

# the actual command
# GloVe will cause a segmentation fault for some combinations of large vocabulary sizes and large vector sizes.
# In those cases, changing  alpha and eta slightly fixes the problem ‾\_(ツ)_/‾
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE -write-header 1 -alpha 0.75 -eta 0.03

# delete the <unk> embedding, assumes that <unk> doesn't occur as part of some BPE symbol
sed -i "/<unk>/d" ${SAVE_FILE}.txt

stephantul · 2020-03-26T07:22:37Z

For those interested: I created a python script that creates a sentencepiece model on a training corpus, after which it segments the corpus, and trains BPE embeddings. The end result is an embedding space which is aligned with the sentencepiece model. It doesn't use glove though.

See here: https://github.com/stephantul/piecelearn

shantanu778 · 2020-06-18T13:12:04Z

@bheinzerling I want to use BPEmb, but in your training script you used sentencePiece for training and encoding .
How can I use BPEmb model for data preprocessing?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Training script #5

Training script #5

lparam commented Nov 26, 2017 •

edited

Loading

alejandrojcastaneira commented Feb 20, 2019

Danil328 commented Apr 10, 2019

bheinzerling commented Apr 10, 2019

Danil328 commented Apr 10, 2019

alejandrojcastaneira commented Jun 6, 2019 •

edited

Loading

bheinzerling commented Jun 10, 2019 •

edited

Loading

stefan-it commented Sep 3, 2019

bheinzerling commented Sep 4, 2019 •

edited

Loading

stephantul commented Mar 26, 2020

shantanu778 commented Jun 18, 2020

Training script #5

Training script #5

Comments

lparam commented Nov 26, 2017 • edited Loading

alejandrojcastaneira commented Feb 20, 2019

Danil328 commented Apr 10, 2019

bheinzerling commented Apr 10, 2019

Danil328 commented Apr 10, 2019

alejandrojcastaneira commented Jun 6, 2019 • edited Loading

bheinzerling commented Jun 10, 2019 • edited Loading

stefan-it commented Sep 3, 2019

bheinzerling commented Sep 4, 2019 • edited Loading

stephantul commented Mar 26, 2020

shantanu778 commented Jun 18, 2020

lparam commented Nov 26, 2017 •

edited

Loading

alejandrojcastaneira commented Jun 6, 2019 •

edited

Loading

bheinzerling commented Jun 10, 2019 •

edited

Loading

bheinzerling commented Sep 4, 2019 •

edited

Loading