[WIP] Add new restrict_vocab functionality, most_similar_among #1229
Conversation
Thank you! Some code comments attached. most_similar_among() should also get a unit test.
gensim/models/keyedvectors.py
Outdated
@@ -272,6 +273,15 @@ def word_vec(self, word, use_norm=False):
        else:
            raise KeyError("word '%s' not in vocabulary" % word)

    def get_words_from_vocab(self):
Isn't the property index2word list better than this – it already exists, and matches the order of word-vectors in syn0?
Oh, yes! Will change! I'm still keeping get_words_from_vocab() because, when I started out with gensim, it was initially confusing to me what the w2v.wv.vocab keys and items were for!
def get_words_from_vocab(self):
    """
    returns the words in the current vocabulary as a list.
    """
    if(not self.index2word):
        raise ValueError("Vocabulary needs to be built before calling this function")
    return self.index2word
Since people might not find index2word, I can see the benefit of an accessor method. However, given that this is now a generic KeyedVectors class, the name shouldn't be 'words'-centric. I'd suggest ordered_keys.

Also, the truthiness-check of self.index2word seems unnecessary to me. Simply returning self.index2word for the caller to draw their own conclusions (from an empty list) should be plenty.
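For illustration, a minimal sketch of that accessor under the suggested (not yet agreed) name ordered_keys:

    def ordered_keys(self):
        """Return the vocabulary keys as a list, in the same order as the vectors in syn0."""
        return self.index2word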
gensim/models/keyedvectors.py
Outdated
@@ -336,6 +346,95 @@ def most_similar(self, positive=[], negative=[], topn=10, restrict_vocab=None, indexer=None):
        result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words]
        return result[:topn]

    def most_similar_among(self, positive=[], negative=[], topn=10, restrict_vocab=None, indexer=None,
To avoid confusion with the restrict_vocab int parameter, which has a different interpretation, this parameter should have a different name. Maybe among_words?
For lack of imagination: words_list!
gensim/models/keyedvectors.py
Outdated
            if(not suppress_warning):
                toWarn = "The following words are not in trained vocabulary : "
                toWarn += str(restrict_vocab.difference(vocabulary_words))
                warnings.warn(toWarn, UserWarning)
Seems this will log an attention-grabbing WARNING even if all words are present. Preferable to only log when problems occur.
Ok, done. Thanks for the catch.
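A minimal sketch of that guard, reusing the names from the excerpt above (here restrict_vocab is still the set parameter and vocabulary_words the set of known words):

    missing_words = restrict_vocab.difference(vocabulary_words)
    if missing_words:  # warn only when something is actually missing
        warnings.warn("The following words are not in trained vocabulary : %s" % missing_words, UserWarning)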
gensim/models/keyedvectors.py
Outdated
raise ValueError("None of the words in restrict_vocab exist in current vocabulary") | ||
|
||
restrict_vocab_indices = [self.vocab[word].index for word in words_to_use] | ||
limited = self.syn0norm[restrict_vocab_indices] #syn0norm is an ndarray |
I think this creates a new array (rather than just a view), which might be a significant memory cost, for large models and when the words-of-interest are a large subset of all words. Maybe this is unavoidable, but might there be any way to do the calculation against the indexes in the original syn0norm?
Sure, done. Not storing limited anymore...
            words_list_indices = [self.vocab[word].index for word in words_to_use]
            # limited = self.syn0norm[words_list_indices] #syn0norm is an ndarray; storing limited might add a huge memory overhead
        else:
            raise ValueError("words_list must be set/list of words. Please read doc string")

        dists = dot(self.syn0norm[words_list_indices], mean)
        result = []
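For context, a hedged sketch of how the restricted distances might then be mapped back to words inside the method: positions in dists index into the restricted list, not into index2word. The name input_words (standing for the query words themselves) is an assumption, not the PR's actual variable:

    words_to_use = list(words_to_use)  # fix an ordering before building indices
    words_list_indices = [self.vocab[word].index for word in words_to_use]
    dists = dot(self.syn0norm[words_list_indices], mean)
    best = matutils.argsort(dists, topn=topn + len(input_words), reverse=True)
    result = [(words_to_use[sim], float(dists[sim]))
              for sim in best if words_to_use[sim] not in input_words]
    return result[:topn]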
gensim/models/word2vec.py
Outdated
@@ -1180,9 +1180,16 @@ def intersect_word2vec_format(self, fname, lockf=0.0, binary=False, encoding='ut
        self.wv.syn0[self.wv.vocab[word].index] = weights
        logger.info("merged %d vectors into %s matrix from %s" % (overlap_count, self.wv.syn0.shape, fname))

    def get_words_from_vocab(self):
For new KeyedVectors functionality, I don't think we need to add forwarding methods to Word2Vec – those exist to ease the way for older code; any new code can access the KeyedVectors methods directly. (This goes for this method and most_similar_among().)
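A short usage sketch of the direct access being described, with most_similar_among being the method this PR proposes (the toy sentences and parameters are illustrative only):

    from gensim.models import Word2Vec

    sentences = [['human', 'interface', 'computer'], ['graph', 'trees', 'minors']]
    model = Word2Vec(sentences, size=10, min_count=1)

    # New functionality is called on the KeyedVectors instance, no forwarding needed:
    candidates = model.wv.index2word[:4]
    results = model.wv.most_similar_among(positive=['graph'], words_list=candidates)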
Hmm, I was confused at first myself, but then I just did it to maintain the coding style of word2vec. Do you think it might be confusing for newcomers if we don't have the forwarding methods? It took me quite a while to figure out that KeyedVectors holds all the after-training functionality of word2vec!

Also, most of the word2vec functions don't have docstrings, and one has to read the docstrings of w2v.wv to figure things out. That might throw people off. Would you like me to add docstrings for the forwarding functions?
I have added only this method to class Word2Vec in order to solve the original "people might not find index2word" problem. All other forwarding functions have been removed.

Thanks for the feedback! Yes, having more docstrings is very useful: explaining what functions do and forwarding to KeyedVectors for more detail.
Unit testing is going to be comprehensive (CBOW, SG, cosmul?). I'll commit the changes made so far. Adding unit tests soon.
Please avoid adding new forwarding functions. KeyedVectors is the way to go.
I've only added docstrings in the first commit that failed tests. Any idea what broke? Travis says:

I haven't changed anything though, but I found this within the test_word2vec.py file (not the test that broke):

    if not hasattr(TestWord2VecModel, 'assertLess'):
        # workaround for python 2.6
        def assertLess(self, a, b, msg=None):
            self.assertTrue(a < b, msg="%s is not less than %s" % (a, b))
        setattr(TestWord2VecModel, 'assertLess', assertLess)

Do you think something similar is needed?
Thanks for reporting this. Please change the test in
Ok, could you please also check #1233. If appropriate, I'll add that fix here as well.
Added fix for #1233.
This is done. When you're free, could you please look this over, @gojomo? Thanks!
gensim/models/keyedvectors.py
Outdated
@@ -336,6 +346,106 @@ def most_similar(self, positive=[], negative=[], topn=10, restrict_vocab=None, indexer=None):
        result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words]
        return result[:topn]

    def most_similar_among(self, positive=[], negative=[], topn=10, words_list=None, indexer=None,
                           suppress_warnings=False):
I think suppress_warnings adds extra complication for little benefit.
Actually, calculating the set difference (which is needed to generate this warning) can be an unnecessary overhead for people using this code in a production environment. I have added a logging.info call to inform users of the same.

Sometimes it is not possible to remove all non-vocabulary words from the words_list, for example if words_list is generated at runtime. Here using suppress_warnings would be ideal.
gensim/models/keyedvectors.py
Outdated
        if not suppress_warnings:
            missing_words = words_list.difference(vocabulary_words)
            if(not missing_words): # missing_words is empty
Not usual Python if style.
Changed for every occurrence.
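The style fix in question, applied to the line from the excerpt (bodies elided):

    # before
    if(not missing_words):
        pass
    # after: no parentheses around the condition
    if not missing_words:
        pass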
gensim/models/keyedvectors.py
Outdated
            else:
                toWarn = "The following words are not in trained vocabulary : "
                toWarn += str(missing_words)
                warnings.warn(toWarn, UserWarning)
Seems more like something to log than use warnings.
Done. Yes, you're right. The module uses logging throughout; the unnecessary import and use of warnings was not a good choice.
gensim/models/word2vec.py
Outdated
@@ -355,6 +355,8 @@ class Word2Vec(utils.SaveLoad):
    """
    Class for training, using and evaluating neural networks described in https://code.google.com/p/word2vec/

    If you're finished training a model (=no more updates, only querying), then switch to the :mod:`gensim.models.KeyedVectors` instance in wv
Line-length, here and in many following comments.
Tried following PEP 8; please do check now.
gensim/models/word2vec.py
Outdated
@@ -1181,33 +1187,83 @@ def intersect_word2vec_format(self, fname, lockf=0.0, binary=False, encoding='ut
        logger.info("merged %d vectors into %s matrix from %s" % (overlap_count, self.wv.syn0.shape, fname))

    def most_similar(self, positive=[], negative=[], topn=10, restrict_vocab=None, indexer=None):
        """
        Please refer to the documentation for `gensim.models.KeyedVectors.most_similar`
        In the future please try and use the `gensim.models.KeyedVectors` instance in wv
Enough to say "In the future please use..." (no "try and"). If this is a strong recommendation, perhaps the method should be marked deprecated. Also, does this sort of comment-cleanup belong in a PR with the new most_similar_among feature?
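If the forwarding method were formally deprecated, one conventional pattern would be the following sketch (not what the PR actually does):

    import warnings

    def most_similar(self, *args, **kwargs):
        """Deprecated. Use the `gensim.models.KeyedVectors` instance in `wv` instead."""
        warnings.warn(
            "Word2Vec.most_similar() is deprecated; use model.wv.most_similar() instead.",
            DeprecationWarning
        )
        return self.wv.most_similar(*args, **kwargs)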
Changed this and split into a separate PR.
@@ -105,7 +105,7 @@ def testPipeline(self):
        text_lda = Pipeline((('features', model,), ('classifier', clf)))
        text_lda.fit(corpus, data.target)
        score = text_lda.score(corpus, data.target)
-       self.assertGreater(score, 0.50)
+       self.assertGreater(score, 0.40)
I know @tmylk requested this change but it seems it should go in its own fixup PR by someone who understands that test.
Split into a separate PR. Thank you though! Taught me a lot about git and got me my first commit :)
gensim/test/test_word2vec.py
Outdated
@@ -461,6 +462,141 @@ def test_cbow_neg(self):
                                     min_count=5, iter=10, workers=2, sample=0)
        self.model_sanity(model)

    def test_most_similar_among_CBOW(self):
Because most_similar_among() functionality is merely on a set of vectors, it's not dependent on the mode-of-training – so there's no need for multiple models built different ways. Testing against one vector set, trained if necessary in the most simple/default way, would be sufficient. (In fact, it'd ultimately make even more sense to be inside a set of KeyedVectors tests, rather than Word2Vec tests, but since its sibling methods are tested here it's not necessary to make that big change yet.)
gensim/test/test_word2vec.py
Outdated
        self.assertRaises(ValueError, model.wv.most_similar_among, positive=['graph'])

        res_voc = model.wv.index2word[:5] #Gives first 5 words in model vocab
        #Testing Warnings
As noted previously, I think the suppress_warnings toggle adds unnecessary complication. Also, that logging may be more appropriate than warnings.
Logging is used. suppress_warnings becomes useful in case words_list is generated on the go and the user does not want to know about missing words.
gensim/test/test_word2vec.py
Outdated
        model = word2vec.Word2Vec(sentences, size=2, sg=0, min_count=1, hs=1, negative=0)
        self.assertRaises(ValueError, model.wv.most_similar_among, positive=['graph'])

        res_voc = model.wv.index2word[:5] #Gives first 5 words in model vocab
Usual Python style is two spaces before an end-of-line #-comment, and a space after the # before the text. (Applies many other places as well.)
Changed all, thank you for pointing it out.
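For reference, the line from the excerpt in the suggested style:

    res_voc = model.wv.index2word[:5]  # Gives first 5 words in model vocab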
Thank you for the comprehensive review @gojomo! I will work on this asap. Edit: asap being the 31st! Sorry, I have mid-semester exams going on here.
@shubhvachher Please continue this PR instead of opening a new one in #1255.
Ah, sorry. I am working on develop here; will remember not to do that in the future! So, changes made: moved drive-by fixes into separate PRs and replied to the comments above. Rebased up to now.
There is a lot of code duplication between most_similar and most_similar_among. Suggest adding a new words_list param to the most_similar function. The difference is just

    dists = dot(self.syn0norm[words_list_indices], mean)

vs

    dists = dot(self.syn0norm, mean)

Please raise an exception if both restrict_vocab and words_list are passed.
        if topn is False:
            pass
        else:
            if suppress_warnings is False:
This must be an exception, not a warning. Incorrect input can't be suppressed.
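A sketch of raising instead of warning, keeping the names used in the surrounding excerpts:

    missing_words = words_list.difference(vocabulary_words)
    if missing_words:
        raise ValueError("words_list contains words not in the vocabulary: %s" % missing_words)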
logger.info("This warning is expensive to calculate, " \ | ||
"especially for large words_list. " \ | ||
"If you would rather not remove the missing_words " \ | ||
"from words_list please set the " \ |
Better message is "Please intersect with vocabulary: words_to_use = vocabulary_words.intersection(words_list) prior to calling most_similar_among".

Please remove the suppress_warnings flag.
            words_list_indices = [self.vocab[word].index for word in words_to_use]
            # limited = self.syn0norm[words_list_indices]
            # Storing 'limited' might add a huge memory overhead so we avoid doing that
Please memory profile this code to provide foundation for this statement.
Please remove commented out code.
        """

        if isinstance(words_list, int):
Please use a single check and a single raise ValueError. The most_similar function doesn't take a list of ints, so it should not be mentioned here.
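A sketch of the single check and single raise being asked for; the accepted types here are an assumption:

    if not isinstance(words_list, (set, frozenset, list, tuple)):
        raise ValueError("words_list must be a set or list of vocabulary words")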
Ping @shubhvachher. It is useful functionality; it would be good to finish this PR.
Yup, sorry! Finishing up exams here. On this asap.
Ping @shubhvachher, what is the status of this PR? Will you finish it soon?
Yes @menshikh-iv; my bad. Been travelling and working. I do have access to my laptop though. I'll get on this coming weekend if that's ok!
Ok @shubhvachher, we will wait.
Ping @shubhvachher
I'm closing the current PR because it looks abandoned.
This is surely my fault, for lack of time due to my summer work and travels; I will open this again as soon as I get the chance, if it's still unresolved. Apologies.
@shubhvachher please do. The (semi-automated) closing of stale tickets is a maintenance thing from our side, not an indication we're not interested in the PR!
Fix #481.
I have looked at the suggestions there and implemented this fix in a new function, most_similar_among. @gojomo, I have tested the results and they work well.
The topn variable there has new functionality that became necessary due to the restricted vocabulary.
Also added the ability to get the vocabulary as a list from a trained word2vec model.
After the first review, I will work on making the other _among functions.