Support for fastText, word2vec, and text embeddings
The largest change in this release is support for reading fastText, word2vec, and text embeddings, in addition to finalfusion embeddings.
- Add support for reading fastText (`Embeddings.read_fasttext()`), text (`Embeddings.read_text()`), textdims (`Embeddings.read_text_dims()`), and word2vec (`Embeddings.read_word2vec()`) formats.
- The reader for each of these newly supported formats accepts a keyword argument `lossy`. If set, the embeddings are read lossily, permitting invalid UTF-8 in words.
- Add the `embedding_similarity` method, which looks up words that are similar to a given embedding. The method for traditional word-based lookups has been renamed from `similarity` to `word_similarity`.
- Iteration over embeddings returned tuples `(word, embedding)` in previous releases. Now instances of the `Embedding` class are returned, which provide `word`, `embedding`, and `norm` properties. `norm` is the norm that the embedding had before it was normalized by its l2 norm.
- Add support for memory mapping quantized embedding matrices.
- Add the `ngram_indices` and `subword_indices` methods to the `Vocab` class. These methods return the subword indices for a given word, which can be used to retrieve the subword embeddings individually. The `ngram_indices` method returns each subword with its index, whereas `subword_indices` returns only the indices.
- Update to pyo3 0.8.
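
The new readers and the `Embedding` objects returned by iteration could be used roughly as follows. This is a minimal sketch, not part of the release itself. It assumes the package is importable as `finalfusion`, that the readers are named `read_fasttext`, `read_text`, `read_text_dims`, and `read_word2vec`, and that each takes the file path as its first argument. The helpers `load_embeddings` and `unnormalized` are hypothetical names introduced only for illustration.

```python
# Reader method per format. The names `read_text_dims` and `read_word2vec`
# are assumptions based on the format names.
READERS = {
    "fasttext": "read_fasttext",
    "text": "read_text",
    "textdims": "read_text_dims",
    "word2vec": "read_word2vec",
}


def load_embeddings(path, fmt, lossy=False):
    """Read embeddings in one of the newly supported formats (hypothetical helper)."""
    if fmt not in READERS:
        raise ValueError(f"unsupported format: {fmt!r}")
    # Imported lazily so this helper can be defined without the package installed.
    import finalfusion

    reader = getattr(finalfusion.Embeddings, READERS[fmt])
    # lossy=True permits invalid UTF-8 in words.
    return reader(str(path), lossy=lossy)


def unnormalized(embeddings):
    """Yield (word, vector) pairs scaled back to their pre-normalization length."""
    for entry in embeddings:
        # entry.norm is the l2 norm the vector had before it was normalized.
        yield entry.word, [x * entry.norm for x in entry.embedding]
```

With a word2vec file at a hypothetical path `vectors.bin`, `load_embeddings("vectors.bin", "word2vec", lossy=True)` would read it while tolerating invalid UTF-8, and `unnormalized(...)` would iterate over the result using the `word`, `embedding`, and `norm` properties in place of the old `(word, embedding)` tuples.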