We are publishing pre-trained word vectors for the Russian language. These vectors were trained on a joint corpus of Russian Wikipedia and Lenta.ru.
All vectors are 300-dimensional. We trained them with fastText skip-gram (see Bojanowski et al. (2016)) using several preprocessing options (see below).
The vectors are available in binary (bin) and text (vec) formats, for both fastText and GloVe.
The pre-trained word vectors are distributed under the Apache License 2.0.
The models can be downloaded from:
Model | Preprocessing | Vectors |
---|---|---|
fastText (skipgram) | tokenize (nltk word_tokenize), lemmatize (pymorphy2) | bin, vec |
fastText (skipgram) | tokenize (nltk word_tokenize), lowercasing | bin, vec |
fastText (skipgram) | tokenize (nltk wordpunct_tokenize) | bin, vec |
fastText (skipgram) | tokenize (nltk word_tokenize) | bin, vec |
fastText (skipgram) | tokenize (nltk word_tokenize), remove stopwords | bin, vec |
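
A downloaded `vec` file can be read without any extra libraries, because the fastText text format is a header line with the word count and the dimension, followed by one word and its values per line. The following is a minimal sketch under that assumption; the function name `read_vec` and the file name `model.vec` are hypothetical and not part of this release.

```python
import io
from typing import Dict, List, Optional, TextIO, Tuple


def read_vec(stream: TextIO, limit: Optional[int] = None) -> Tuple[int, Dict[str, List[float]]]:
    """Parse fastText text-format vectors from an open text stream.

    Returns the dimension from the header and a word -> vector mapping.
    `limit` caps the number of words read, which helps with large files.
    """
    header = stream.readline().split()
    dim = int(header[1])  # header is "<word count> <dimension>"
    vectors: Dict[str, List[float]] = {}
    for line in stream:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip rows that do not match the declared dimension
        vectors[word] = [float(v) for v in values]
    return dim, vectors


# Tiny in-memory sample in the same layout as a real file.
sample = "2 3\nкошка 0.1 0.2 0.3 \nсобака 0.4 0.5 0.6 \n"
dim, vectors = read_vec(io.StringIO(sample))

# For a real download (hypothetical file name):
# with open("model.vec", encoding="utf-8") as f:
#     dim, vectors = read_vec(f, limit=100_000)
```

Query words should be preprocessed the same way as the chosen model (for example, lemmatized or lowercased), otherwise many lookups are likely to miss.
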
These word vectors were trained with the following parameters (a value in [...] is the default):
- lr [0.1]
- lrUpdateRate [100]
- dim 300
- ws [5]
- epoch [5]
- neg [5]
- loss [softmax]
- pretrainedVectors []
- saveOutput [0]
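
To make the list above concrete, the parameters can be mapped onto a fastText command line. This is only a sketch of how such a run might be assembled, not the exact command used for these models: the helper `build_command` and the paths `corpus.txt` and `model` are hypothetical, and `pretrainedVectors` and `saveOutput` are omitted because they were left empty or at 0.

```python
from typing import Dict, List

# Values copied from the parameter list above.
PARAMS: Dict[str, str] = {
    "lr": "0.1",
    "lrUpdateRate": "100",
    "dim": "300",
    "ws": "5",
    "epoch": "5",
    "neg": "5",
    "loss": "softmax",
}


def build_command(input_path: str, output_prefix: str,
                  params: Dict[str, str] = PARAMS) -> List[str]:
    """Build an argument list for the `fasttext skipgram` CLI."""
    cmd = ["fasttext", "skipgram", "-input", input_path, "-output", output_prefix]
    for name, value in params.items():
        cmd += [f"-{name}", value]
    return cmd


# Hypothetical paths; the corpus is assumed to be preprocessed already.
command = build_command("corpus.txt", "model")
print(" ".join(command))

# To actually train (requires the fasttext binary on PATH):
# import subprocess
# subprocess.run(command, check=True)
```

With an output prefix of `model`, fastText writes `model.bin` and `model.vec`, which correspond to the two formats listed in the table.
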