Word2Vec model to dict: adding word2vec to the production pipeline #1269
Labels

- `difficulty easy` — Easy issue: requires a small fix
- `feature` — Issue describes a new feature
- `good first issue` — Issue for new contributors (does not require gensim understanding; very simple)
A lot of users use their trained word2vec model in production environments to get `most_similar` words to (for example) words in a user's entered query, or words in complete documents, on the fly. In such cases, querying the word2vec model becomes very cumbersome, often taking the largest share of time in the pipeline. [1]

What I propose is a `model_to_dict` method, to be used right at the end of the word2vec pipeline. It would find and store, in preproduction, the most similar words to every word in the trained vocabulary. The most similar words can come from a custom user list, as in #1229, and we can allow the user to define a custom preprocessing function that all most-similar words are passed through before being stored. Being a dict, query time will be minimal, which is great for this purpose! Since the dict stores only words, its size should be comparable to a multiple of the size of the vocab. [2]
At the end of this, we expect a dictionary with word2vec vocab words as keys and their `most_similar` words as values. Vocab words whose `most_similar` result is empty will not be stored in the dict; this will happen often when a custom results list and a pass function for result words are applied on top of the most-similar cutoff.
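A minimal sketch of what such a `model_to_dict` could look like, written as a standalone function. The names `build_similarity_dict`, `restrict_to`, and `pass_fn` are hypothetical, not existing gensim API; `most_similar` is any callable with the shape of gensim's `most_similar(word, topn=...)` that returns `(word, score)` pairs.

```python
def build_similarity_dict(vocab, most_similar, topn=10,
                          restrict_to=None, pass_fn=None):
    """Precompute most-similar neighbours for every word in vocab.

    Hypothetical sketch: `most_similar` stands in for a trained model's
    most_similar(word, topn=...) callable returning (word, score) pairs.
    """
    sim_dict = {}
    for word in vocab:
        neighbours = [w for w, _score in most_similar(word, topn=topn)]
        if restrict_to is not None:      # custom results list (cf. #1229)
            neighbours = [w for w in neighbours if w in restrict_to]
        if pass_fn is not None:          # user-defined pass function
            neighbours = [w for w in neighbours if pass_fn(w)]
        if neighbours:                   # words with empty results are skipped
            sim_dict[word] = neighbours
    return sim_dict
```

For example, with a toy two-word "model", `build_similarity_dict(fake_vocab, fake_most_similar, pass_fn=lambda w: w != "man")` would keep `"queen"` for `"king"` but drop `"man"`, and omit any vocab word whose filtered neighbour list ends up empty.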
[1] This is because we always calculate cosine distances from the query word to all words in the vocab before returning the topn most similar words. I can't think of a better way to do that, yet.
[2] Albeit a large multiple if users do not provide a proper preprocessing pass function, or use a small similarity cutoff. Maybe we can warn them about this.
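To illustrate the cost described in [1]: a most-similar lookup is a dot product of the query vector against every row of the (normalised) embedding matrix, i.e. O(vocab × dim) per query. A rough sketch (the function and argument names here are illustrative, not gensim internals):

```python
import numpy as np

def most_similar_bruteforce(vectors, index2word, query, topn=3):
    # Normalise every vector, then take one dot product per vocab row.
    # This full scan is what makes on-the-fly queries expensive and is
    # exactly the work the proposed precomputed dict would avoid.
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = index2word.index(query)
    sims = normed @ normed[q]
    order = np.argsort(-sims)  # highest similarity first
    return [(index2word[i], float(sims[i])) for i in order if i != q][:topn]
```

With the dict in place, the same lookup is a single hash-table access, independent of vocab size.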