[WIP] Added method to restrict vocab of Word2Vec most similar search #481
Nice, that sounds really useful! I'm thinking this should even be the "default" (promoted) way of using the It's also easy to simulate "int" by means of "list of words", but not the other way round, so "list of words" is more flexible. We probably don't need an extra
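To illustrate the point about simulating "int" with "list of words": a minimal sketch, assuming the integer form means "only the first N words in vocabulary order". The helper `as_word_list` and the plain list `index2word` are hypothetical names used for illustration, not code from this PR.

```python
def as_word_list(restrict_vocab, index2word):
    """Normalize restrict_vocab to an explicit list of words.

    Assumption: an int N means "the first N words in vocabulary order".
    """
    if isinstance(restrict_vocab, int):
        # The int form is just a prefix slice of the vocabulary.
        return index2word[:restrict_vocab]
    # Any other iterable is taken as the explicit word list.
    return list(restrict_vocab)


# Toy vocabulary in frequency order, invented for illustration.
index2word = ["the", "of", "king", "queen"]

first_two = as_word_list(2, index2word)
explicit = as_word_list(["queen"], index2word)
```

The reverse direction does not work in general: an arbitrary word list is rarely a prefix of the vocabulary, which is why the list form is the more flexible of the two.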
Cool, if you like it then there is no reason to have both methods. I made it separate so I could test against the original and make sure they matched. I'll rename
@jimgoo Talking about testing... Could you please add some for the new feature to /test/test_word2vec.py#L175?
@jimgoo Please fix the Python 3 syntax issues, and add a CHANGELOG entry and a test.
Hey @jimgoo, please post another commit to trigger the Travis build. And ignore the AppVeyor test failures for now; we are working to fix them. But I would expect the Travis unix tests to be green after the next commit.
+1 @jimgoo, I think it's just a print statement.
It seems that this PR is not incorporated in the latest gensim. Is there any update?
Note that the current implementation is very inefficient if used with large lists of eligible words. (It takes the argument and converts it to a set, then a list. Then it calculates all distances, then does linear probes against the restricted list to test whether every word is in the list.) Also, the PR seems to include other unrelated code for other features. I would suggest splitting such functionality off into a different method, perhaps
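The cost of the linear probes described above can be avoided by testing membership against a set instead of a list. The following is a minimal sketch with a hypothetical helper `filter_ranked` and invented toy scores; it is not the PR's code.

```python
def filter_ranked(ranked, allowed_words, topn=10):
    """Keep the first topn (word, score) pairs whose word is allowed.

    `ranked` is assumed to be sorted by descending similarity already.
    """
    # Build the set once; each membership test is then O(1) on average,
    # instead of scanning a list for every candidate word.
    allowed = set(allowed_words)
    kept = []
    for word, score in ranked:
        if word in allowed:
            kept.append((word, score))
            if len(kept) == topn:
                break  # stop early once enough results are collected
    return kept


# Toy ranked output, invented for illustration.
ranked = [("queen", 0.99), ("prince", 0.95), ("pear", 0.30), ("apple", 0.22)]
filtered = filter_ranked(ranked, ["apple", "queen"], topn=2)
```

This still computes every distance first, so it only addresses the membership-test part of the concern.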
I had built this out for a recent project. Should I complete this issue?
Duplicate of #1229
I've added a method to `gensim.models.word2vec.py` which allows `restrict_vocab` to be a list containing words to restrict the search over.

For example, these are the top 10 most similar results using the original `most_similar` method:

And we can restrict the search to a list of words with the new `most_similar_in_list` method:

Passing an integer for `restrict_vocab` has the same behavior as the original.

For large vocabularies, there is some benefit to reducing the number of rows in `limited` when you're only interested in a subset of words:

The number of rows is `len(restrict_vocab)` rather than the total number of words in the vocab.
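The idea of scoring only `len(restrict_vocab)` rows can be sketched in plain Python. This is an illustration under stated assumptions, not the implementation from this PR: the function body, the `vectors` dict, and the toy 2-d vectors are all invented here, and only the name `most_similar_in_list` comes from the description above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def most_similar_in_list(vectors, word, restrict_vocab, topn=10):
    """Rank only the words in restrict_vocab by similarity to `word`."""
    query = vectors[word]
    # Deduplicate while keeping order; skip the query word and unknown words.
    candidates = [
        w for w in dict.fromkeys(restrict_vocab) if w != word and w in vectors
    ]
    # Only len(candidates) similarities are computed, not one per vocab entry.
    scored = [(w, cosine(query, vectors[w])) for w in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy 2-d vectors, invented for illustration.
vectors = {
    "king": [0.9, 0.1],
    "queen": [0.8, 0.2],
    "apple": [0.1, 0.9],
    "pear": [0.2, 0.8],
}
result = most_similar_in_list(vectors, "king", ["apple", "queen"], topn=2)
```

Here "pear" is never scored at all, which is where the saving comes from when the restricted list is much smaller than the vocabulary.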