Word2Vec model to dict: adding word2vec to the production pipeline #1269
Labels

- `difficulty easy` — Easy issue: requires a small fix
- `feature` — Issue describes a new feature
- `good first issue` — Issue for new contributors (does not require gensim understanding; very simple)
A lot of users use their trained word2vec model in production environments to get `most_similar` words to (for example) words in a user's entered query, or words in complete documents, on the fly. In such cases, querying the word2vec model becomes very cumbersome, often taking the largest share of time in the pipeline. [1]

What I propose is a `model_to_dict` method, to be used right at the end of the word2vec pipeline. It would find and store, in preproduction, the most similar words to every word in the trained vocabulary. The most similar words can come from a custom user list, as in #1229, and we can allow the user to define a custom preprocessing function that all most-similar words are passed through before being stored. Being a dict, query time will be minimal, which is great for this purpose! Since the dict stores only words, its size should be comparable to a multiple of the size of the vocab. [2]
At the end of this, we expect a dictionary with word2vec vocab words as keys and their `most_similar` words as values. Vocab words whose `most_similar` result is empty will not be stored in the dict; this will happen often when a custom results list and a pass function for result words are applied on top of the most-similar cutoff.
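A minimal sketch of what such a `model_to_dict` could look like, written as a standalone function. The names `build_similarity_dict`, `restrict_to`, and `pass_fn` are hypothetical, not existing gensim API; `most_similar` is any callable with the shape of gensim's `most_similar(word, topn=...)` that returns `(word, score)` pairs.

```python
def build_similarity_dict(vocab, most_similar, topn=10,
                          restrict_to=None, pass_fn=None):
    """Precompute most-similar neighbours for every word in vocab.

    Hypothetical sketch: `most_similar` stands in for a trained model's
    most_similar(word, topn=...) callable returning (word, score) pairs.
    """
    sim_dict = {}
    for word in vocab:
        neighbours = [w for w, _score in most_similar(word, topn=topn)]
        if restrict_to is not None:      # custom results list (cf. #1229)
            neighbours = [w for w in neighbours if w in restrict_to]
        if pass_fn is not None:          # user-defined pass function
            neighbours = [w for w in neighbours if pass_fn(w)]
        if neighbours:                   # words with empty results are skipped
            sim_dict[word] = neighbours
    return sim_dict
```

For example, with a toy two-word "model", `build_similarity_dict(fake_vocab, fake_most_similar, pass_fn=lambda w: w != "man")` would keep `"queen"` for `"king"` but drop `"man"`, and omit any vocab word whose filtered neighbour list ends up empty.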
[1] This is because we always calculate cosine distances from the query word to all words in the vocab before returning the topn most similar words. I can't think of a better way to do that, yet.
[2] Albeit a large multiple if users do not provide a proper preprocessing pass function, or use a small similarity cutoff. Maybe we can warn them about this.
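To illustrate the cost described in [1]: a most-similar lookup is a dot product of the query vector against every row of the (normalised) embedding matrix, i.e. O(vocab × dim) per query. A rough sketch (the function and argument names here are illustrative, not gensim internals):

```python
import numpy as np

def most_similar_bruteforce(vectors, index2word, query, topn=3):
    # Normalise every vector, then take one dot product per vocab row.
    # This full scan is what makes on-the-fly queries expensive and is
    # exactly the work the proposed precomputed dict would avoid.
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = index2word.index(query)
    sims = normed @ normed[q]
    order = np.argsort(-sims)  # highest similarity first
    return [(index2word[i], float(sims[i])) for i in order if i != q][:topn]
```

With the dict in place, the same lookup is a single hash-table access, independent of vocab size.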