[WIP] Add new restrict_vocab functionality, most_similar_among #1229
Conversation
Thank you! Some code comments attached. most_similar_among() should also get a unit test.
gensim/models/keyedvectors.py
Outdated
@@ -272,6 +273,15 @@ def word_vec(self, word, use_norm=False):
        else:
            raise KeyError("word '%s' not in vocabulary" % word)

    def get_words_from_vocab(self):
Isn't the property index2word list better than this – it already exists, and matches the order of word-vectors in syn0?
Oh, yes! Will change! I'm still keeping get_words_from_vocab() because, when I started out with gensim, it was initially confusing to me what the w2v.wv.vocab keys and items were for!
def get_words_from_vocab(self):
    """
    returns the words in the current vocabulary as a list.
    """
    if(not self.index2word):
        raise ValueError("Vocabulary needs to be built before calling this function")
    return self.index2word
Since people might not find index2word, I can see the benefit of an accessor method. However, given that this is now a generic KeyedVectors class, the name shouldn't be 'words'-centric. I'd suggest ordered_keys.

Also, the truthiness-check of self.index2word seems unnecessary to me. Simply returning self.index2word for the caller to draw their own conclusions (from an empty list) should be plenty.
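For illustration, a minimal sketch of that accessor under the suggested (not yet agreed) name ordered_keys:

    def ordered_keys(self):
        """Return the vocabulary keys as a list, in the same order as the vectors in syn0."""
        return self.index2word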
gensim/models/keyedvectors.py
Outdated
@@ -336,6 +346,95 @@ def most_similar(self, positive=[], negative=[], topn=10, restrict_vocab=None, indexer=None):
        result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words]
        return result[:topn]

    def most_similar_among(self, positive=[], negative=[], topn=10, restrict_vocab=None, indexer=None,
To avoid confusion with the restrict_vocab int parameter, which has a different interpretation, this parameter should have a different name. Maybe among_words?
For lack of imagination: words_list!
gensim/models/keyedvectors.py
Outdated
            if(not suppress_warning):
                toWarn = "The following words are not in trained vocabulary : "
                toWarn += str(restrict_vocab.difference(vocabulary_words))
                warnings.warn(toWarn, UserWarning)
Seems this will log an attention-grabbing WARNING even if all words are present. Preferable to only log when problems occur.
Ok, done. Thanks for the catch.
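A minimal sketch of that guard, reusing the names from the excerpt above (here restrict_vocab is still the set parameter and vocabulary_words the set of known words):

    missing_words = restrict_vocab.difference(vocabulary_words)
    if missing_words:  # warn only when something is actually missing
        warnings.warn("The following words are not in trained vocabulary : %s" % missing_words, UserWarning)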
gensim/models/keyedvectors.py
Outdated
raise ValueError("None of the words in restrict_vocab exist in current vocabulary") | ||
|
||
restrict_vocab_indices = [self.vocab[word].index for word in words_to_use] | ||
limited = self.syn0norm[restrict_vocab_indices] #syn0norm is an ndarray |
I think this creates a new array (rather than just a view), which might be a significant memory cost, for large models and when the words-of-interest are a large subset of all words. Maybe this is unavoidable, but might there be any way to do the calculation against the indexes in the original syn0norm?
Sure, done. Not storing limited anymore...
            words_list_indices = [self.vocab[word].index for word in words_to_use]
            # limited = self.syn0norm[words_list_indices] #syn0norm is an ndarray; storing limited might add a huge memory overhead
        else:
            raise ValueError("words_list must be set/list of words. Please read doc string")

        dists = dot(self.syn0norm[words_list_indices], mean)
        result = []
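For context, a hedged sketch of how the restricted distances might then be mapped back to words inside the method: positions in dists index into the restricted list, not into index2word. The name input_words (standing for the query words themselves) is an assumption, not the PR's actual variable:

    words_to_use = list(words_to_use)  # fix an ordering before building indices
    words_list_indices = [self.vocab[word].index for word in words_to_use]
    dists = dot(self.syn0norm[words_list_indices], mean)
    best = matutils.argsort(dists, topn=topn + len(input_words), reverse=True)
    result = [(words_to_use[sim], float(dists[sim]))
              for sim in best if words_to_use[sim] not in input_words]
    return result[:topn]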
gensim/models/word2vec.py
Outdated
@@ -1180,9 +1180,16 @@ def intersect_word2vec_format(self, fname, lockf=0.0, binary=False, encoding='ut
        self.wv.syn0[self.wv.vocab[word].index] = weights
        logger.info("merged %d vectors into %s matrix from %s" % (overlap_count, self.wv.syn0.shape, fname))

    def get_words_from_vocab(self):
For new KeyedVectors functionality, I don't think we need to add forwarding methods to Word2Vec – those exist to ease the way for older code; any new code can access the KeyedVectors methods directly. (This goes for this method and most_similar_among().)
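A short usage sketch of the direct access being described, with most_similar_among being the method this PR proposes (the toy sentences and parameters are illustrative only):

    from gensim.models import Word2Vec

    sentences = [['human', 'interface', 'computer'], ['graph', 'trees', 'minors']]
    model = Word2Vec(sentences, size=10, min_count=1)

    # New functionality is called on the KeyedVectors instance, no forwarding needed:
    candidates = model.wv.index2word[:4]
    results = model.wv.most_similar_among(positive=['graph'], words_list=candidates)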
Hmm, I was confused at first myself, but then I just did it to maintain the coding style of word2vec. Do you think it might be confusing for newcomers if we don't have the forwarding methods? It took me quite a while to figure out that KeyedVectors holds all the after-training functionality of word2vec!

Also, most of the word2vec functions don't have docstrings, and one has to read the docstrings of w2v.wv to figure things out. That might throw people off. Would you like me to add docstrings for the forwarding functions?
I have added only this method to class Word2Vec in order to solve the original "people might not find index2word" problem. All other forwarding functions have been removed.

Thanks for the feedback! Yes, having more docstrings is very useful: explaining what functions do and forwarding to KeyedVectors for more detail.
Unit testing is going to be comprehensive (CBOW, SG, cosmul?). I'll commit the changes made so far. Adding unit tests soon.
Please avoid adding new forwarding functions. KeyedVectors is the way to go.
I've only added docstrings in the first commit that failed tests. Any idea what broke? Travis says:

I haven't changed anything though, but I found this within the test_word2vec.py file (not the test that broke):

    if not hasattr(TestWord2VecModel, 'assertLess'):
        # workaround for python 2.6
        def assertLess(self, a, b, msg=None):
            self.assertTrue(a < b, msg="%s is not less than %s" % (a, b))
        setattr(TestWord2VecModel, 'assertLess', assertLess)

Do you think something similar is needed?
Thanks for reporting this. Please change the test in
Ok, could you please also check #1233. If appropriate, I'll add that fix here as well.
Added fix for #1233.
This is done. When you're free, could you please look this over, @gojomo? Thanks!
gensim/models/keyedvectors.py
Outdated
@@ -336,6 +346,106 @@ def most_similar(self, positive=[], negative=[], topn=10, restrict_vocab=None, indexer=None):
        result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words]
        return result[:topn]

    def most_similar_among(self, positive=[], negative=[], topn=10, words_list=None, indexer=None,
                           suppress_warnings=False):
I think suppress_warnings adds extra complication for little benefit.
Actually, calculating the set difference (which is needed to generate this warning) can be an unnecessary overhead for people using this code in a production environment. I have added a logging.info call to inform users of the same.

Sometimes it is not possible to remove all non-vocabulary words from the words_list, for example if words_list is generated at runtime. Here using suppress_warnings would be ideal.
gensim/models/keyedvectors.py
Outdated
        if not suppress_warnings:
            missing_words = words_list.difference(vocabulary_words)
            if(not missing_words): # missing_words is empty
Not usual Python if style.
Changed for every occurrence.
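The style fix in question, applied to the line from the excerpt (bodies elided):

    # before
    if(not missing_words):
        pass
    # after: no parentheses around the condition
    if not missing_words:
        pass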
gensim/models/keyedvectors.py
Outdated
            else:
                toWarn = "The following words are not in trained vocabulary : "
                toWarn += str(missing_words)
                warnings.warn(toWarn, UserWarning)
Seems more like something to log than use warnings.
Done. Yes, you're right. The module uses logging throughout; the unnecessary import and use of warnings was not a good choice.
gensim/models/word2vec.py
Outdated
@@ -355,6 +355,8 @@ class Word2Vec(utils.SaveLoad):
    """
    Class for training, using and evaluating neural networks described in https://code.google.com/p/word2vec/

    If you're finished training a model (=no more updates, only querying), then switch to the :mod:`gensim.models.KeyedVectors` instance in wv
Line-length, here and in many following comments.
Tried following PEP 8; please do check now.
gensim/models/word2vec.py
Outdated
@@ -1181,33 +1187,83 @@ def intersect_word2vec_format(self, fname, lockf=0.0, binary=False, encoding='ut
        logger.info("merged %d vectors into %s matrix from %s" % (overlap_count, self.wv.syn0.shape, fname))

    def most_similar(self, positive=[], negative=[], topn=10, restrict_vocab=None, indexer=None):
        """
        Please refer to the documentation for `gensim.models.KeyedVectors.most_similar`
        In the future please try and use the `gensim.models.KeyedVectors` instance in wv
Enough to say "In the future please use..." (no "try and"). If this is a strong recommendation, perhaps the method should be marked deprecated. Also, does this sort of comment-cleanup belong in a PR with the new most_similar_among feature?
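If the forwarding method were formally deprecated, one conventional pattern would be the following sketch (not what the PR actually does):

    import warnings

    def most_similar(self, *args, **kwargs):
        """Deprecated. Use the `gensim.models.KeyedVectors` instance in `wv` instead."""
        warnings.warn(
            "Word2Vec.most_similar() is deprecated; use model.wv.most_similar() instead.",
            DeprecationWarning
        )
        return self.wv.most_similar(*args, **kwargs)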
Changed this and split into a separate PR.
@@ -105,7 +105,7 @@ def testPipeline(self):
        text_lda = Pipeline((('features', model,), ('classifier', clf)))
        text_lda.fit(corpus, data.target)
        score = text_lda.score(corpus, data.target)
-       self.assertGreater(score, 0.50)
+       self.assertGreater(score, 0.40)
I know @tmylk requested this change but it seems it should go in its own fixup PR by someone who understands that test.
Split into a separate PR. Thank you though! Taught me a lot about git and got me my first commit :)
gensim/test/test_word2vec.py
Outdated
@@ -461,6 +462,141 @@ def test_cbow_neg(self):
                                     min_count=5, iter=10, workers=2, sample=0)
        self.model_sanity(model)

    def test_most_similar_among_CBOW(self):
Because most_similar_among() functionality is merely on a set of vectors, it's not dependent on the mode-of-training – so there's no need for multiple models built different ways. Testing against one vector set, trained if necessary in the most simple/default way, would be sufficient. (In fact, it'd ultimately make even more sense to be inside a set of KeyedVectors tests, rather than Word2Vec tests, but since its sibling methods are tested here it's not necessary to make that big change yet.)
gensim/test/test_word2vec.py
Outdated
        self.assertRaises(ValueError, model.wv.most_similar_among, positive=['graph'])

        res_voc = model.wv.index2word[:5] #Gives first 5 words in model vocab
        #Testing Warnings
As noted previously, I think the suppress_warnings toggle adds unnecessary complication. Also, that logging may be more appropriate than warnings.
Logging is used. suppress_warnings becomes useful in case words_list is generated on the go and the user does not want to know about missing words.
gensim/test/test_word2vec.py
Outdated
        model = word2vec.Word2Vec(sentences, size=2, sg=0, min_count=1, hs=1, negative=0)
        self.assertRaises(ValueError, model.wv.most_similar_among, positive=['graph'])

        res_voc = model.wv.index2word[:5] #Gives first 5 words in model vocab
Usual Python style is two spaces before an end-of-line #-comment, and a space after the # before the text. (Applies many other places as well.)
Changed all, thank you for pointing it out.
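For reference, the line from the excerpt in the suggested style:

    res_voc = model.wv.index2word[:5]  # Gives first 5 words in model vocab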
Thank you for the comprehensive review @gojomo! I will work on this asap. Edit: asap being the 31st! Sorry, I have mid-semester exams going on here.
@shubhvachher Please continue this PR instead of opening a new one in #1255.
Ah, sorry. I am working on develop here; will remember not to do that in the future! So, changes made: moved drive-by fixes into separate PRs and replied to the comments above. Rebased up to now.
There is a lot of code duplication between most_similar and most_similar_among. Suggest adding a new words_list param to the most_similar function. The difference is just

    dists = dot(self.syn0norm[words_list_indices], mean)

vs

    dists = dot(self.syn0norm, mean)

Please raise an exception if both restrict_vocab and words_list are passed.
        if topn is False:
            pass
        else:
            if suppress_warnings is False:
This must be an exception, not a warning. Incorrect input can't be suppressed.
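A sketch of raising instead of warning, keeping the names used in the surrounding excerpts:

    missing_words = words_list.difference(vocabulary_words)
    if missing_words:
        raise ValueError("words_list contains words not in the vocabulary: %s" % missing_words)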
logger.info("This warning is expensive to calculate, " \ | ||
"especially for large words_list. " \ | ||
"If you would rather not remove the missing_words " \ | ||
"from words_list please set the " \ |
Better message is "Please intersect with vocabulary: words_to_use = vocabulary_words.intersection(words_list) prior to calling most_similar_among".

Please remove the suppress_warnings flag.
            words_list_indices = [self.vocab[word].index for word in words_to_use]
            # limited = self.syn0norm[words_list_indices]
            # Storing 'limited' might add a huge memory overhead so we avoid doing that
Please memory profile this code to provide foundation for this statement.
Please remove commented out code.
        """

        if isinstance(words_list, int):
Please use a single check and a single raise ValueError. The most_similar function doesn't take a list of ints, so it should not be mentioned here.
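A sketch of the single check and single raise being asked for; the accepted types here are an assumption:

    if not isinstance(words_list, (set, frozenset, list, tuple)):
        raise ValueError("words_list must be a set or list of vocabulary words")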
Ping @shubhvachher. It is useful functionality; it would be good to finish this PR.
Yup, sorry! Finishing up exams here. On this asap.
Ping @shubhvachher, what is the status of this PR? Will you finish it soon?
Yes @menshikh-iv; my bad. Been travelling and working. I do have access to my laptop though. I'll get on this coming weekend if that's ok!
Ok @shubhvachher, we will wait.
Ping @shubhvachher
I'm closing the current PR because it looks abandoned.
This is surely my fault, for lack of time due to my summer work and travels; I will open this again as soon as I get the chance, if it's still unresolved. Apologies.
@shubhvachher please do. The (semi-automated) closing of stale tickets is a maintenance thing from our side, not an indication we're not interested in the PR!
Fix #481.
I have looked at the suggestions there and implemented this fix in a new function, most_similar_among. @gojomo, I have tested the results and they work well.
The topn variable there has new functionality that became necessary due to the restricted vocabulary.
Also added the ability to get the vocabulary as a list from a trained word2vec model.
After the first review, I will work on making the other _among functions.