subsampling #1

Open
Mayar2009 opened this issue Oct 3, 2019 · 1 comment

Mayar2009 commented Oct 3, 2019

In `gen_vocab(self)`, only the vocabulary items whose frequency is at least `self.min_count` are kept, like this:

```python
vocab, word2id, id2word = {}, {}, {}
index = 0
for item_id, freq in vocab_freq_dict.items():
    if freq < self.min_count:
        continue
    vocab[item_id] = freq
    word2id[item_id] = index
    id2word[index] = item_id
    index += 1
return vocab, word2id, id2word, total_word_count, total_sent_count
```

Could you please clarify the function `gen_subsample_table(self)`?

```python
def gen_subsample_table(self):
    """
    sub sampling rate, higher than that would be sub sampled using
        the word2vec paper using:    p(w_i) = 1 - sqrt(sub_sampling / freq)
        the word2vec code using:     p(w_i) = 1 - (sqrt(sub_sampling / freq) + sub_sampling / freq)
    we use word2vec code sub sampling method here.
    :return: {word_id: sample_score}
    """
    def sub_sampling(_freq):
        return (self.sub_sampling_t / 1.0 / _freq) ** 0.5 + self.sub_sampling_t / 1.0 / _freq
    # word freq count to word freq ratio
    sub_sample_tbl = {item: freq / 1.0 / self.total_word_count
                      for item, freq in self.vocab.items()
                      if freq / 1.0 / self.total_word_count > self.sub_sampling_t}
    # freq to score
    sub_sample_tbl = {item: sub_sampling(_freq) for item, _freq in sub_sample_tbl.items()}
    # word to id
    sub_sample_tbl = {self.word2id[i]: j for i, j in sub_sample_tbl.items() if j < 1}
    return sub_sample_tbl
```
Line 9, `def sub_sampling(_freq):`, appears to return `p(w_i) = sqrt(sub_sampling / freq) + sub_sampling / freq`, not `p(w_i) = 1 - (sqrt(sub_sampling / freq) + sub_sampling / freq)` as the docstring says. Is that right?
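To make the comparison concrete, here is a minimal sketch of the two quantities side by side. It assumes the value returned by `sub_sampling` is meant as a keep probability and the docstring formula as a discard probability, which matches the original word2vec C code as far as I can tell; the names `keep_score` and `discard_prob` are hypothetical and not from this repository.

```python
import math


def keep_score(freq_ratio, t):
    """Value computed by sub_sampling(): sqrt(t / f) + t / f."""
    return math.sqrt(t / freq_ratio) + t / freq_ratio


def discard_prob(freq_ratio, t):
    """Docstring formula: 1 - (sqrt(t / f) + t / f)."""
    return 1.0 - keep_score(freq_ratio, t)


# Toy values: a word making up 1% of the corpus, threshold t = 1e-3.
f, t = 0.01, 1e-3
print("keep score:   ", keep_score(f, t))
print("discard prob: ", discard_prob(f, t))
```

Under that assumption the two expressions are complements of each other, so the function and the docstring would describe the same rule from opposite sides.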

Why is line 14 needed,
`if freq / 1.0 / self.total_word_count > self.sub_sampling_t}`
when `gen_vocab(self)` already filters by frequency on line 4 of the first snippet,
`if freq < self.min_count:`?
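For what it is worth, the two conditions do not test the same thing: `min_count` compares an absolute count, while line 14 compares a relative frequency against the threshold. The sketch below uses toy counts, a toy threshold, and hypothetical helper names (`min_count_filter`, `frequent_filter`), and it assumes the total word count includes the words that `min_count` later removes.

```python
# Toy corpus counts; values are made up for illustration.
counts = {"the": 500, "cat": 40, "zyzzyva": 2}
total_word_count = sum(counts.values())  # assumption: counted before filtering
min_count = 5
sub_sampling_t = 0.1  # deliberately large so the toy data shows an effect


def min_count_filter(counts, min_count):
    """Drop rare words: keep items whose absolute count is >= min_count."""
    return {w: c for w, c in counts.items() if c >= min_count}


def frequent_filter(vocab, total, t):
    """Select frequent words: keep items whose count / total exceeds t."""
    return {w: c / total for w, c in vocab.items() if c / total > t}


vocab = min_count_filter(counts, min_count)
frequent = frequent_filter(vocab, total_word_count, sub_sampling_t)
print(sorted(vocab))     # words that survive min_count
print(sorted(frequent))  # words that are candidates for sub-sampling
```

In this toy case the first filter removes only the rare word, and the second selects only the very frequent one, so the two filters act on opposite ends of the frequency range.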

What is the meaning of line 18?

`sub_sample_tbl = {self.word2id[i]: j for i, j in sub_sample_tbl.items() if j < 1}`
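One plausible reading, sketched below, is that a score of 1 or more means the word would always be kept, so storing it in the table would be pointless. This is my interpretation rather than something the repository states; `sub_sampling_score` and the toy frequency ratios are hypothetical.

```python
def sub_sampling_score(freq_ratio, t):
    """Same expression as sub_sampling(): sqrt(t / f) + t / f."""
    return (t / freq_ratio) ** 0.5 + t / freq_ratio


t = 1e-3
# Toy frequency ratios, both above the threshold t.
ratios = {"barely_frequent": 2e-3, "very_frequent": 1e-2}

scores = {w: sub_sampling_score(f, t) for w, f in ratios.items()}
# Mirror of the `if j < 1` filter: keep only scores usable as probabilities.
table = {w: s for w, s in scores.items() if s < 1}

print(scores)
print(sorted(table))
```

Solving sqrt(t / f) + t / f = 1 gives f ≈ 2.618 t, so a word whose frequency ratio lies between t and roughly 2.618 t passes the check on line 14 but still gets a score of at least 1, and the `j < 1` condition drops it.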
    

Thank you!

Mayar2009 (Author) commented

One more question, please: where is `sub_sampling_table` used? I cannot find it used anywhere in the code, which seems strange.
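For context, this is roughly how I would expect such a table to be applied while generating training samples. It is a sketch under my own assumptions, not code from this repository: `subsample_sentence` is a hypothetical helper, and the table is assumed to map word ids to keep probabilities.

```python
import random


def subsample_sentence(word_ids, sub_sample_tbl, rng):
    """Randomly drop frequent words from one sentence.

    Words missing from the table are always kept; words in the table are
    kept only when a uniform draw falls below their score.
    """
    kept = []
    for word_id in word_ids:
        score = sub_sample_tbl.get(word_id)
        if score is None or rng.random() < score:
            kept.append(word_id)
    return kept


# Toy table: id 0 is never kept, id 1 is always kept, id 2 is not listed.
toy_table = {0: 0.0, 1: 1.0}
sentence = [0, 2, 0, 1, 2]
print(subsample_sentence(sentence, toy_table, random.Random(0)))
```

Passing the random generator in as an argument keeps the sketch reproducible; the extreme scores 0.0 and 1.0 are used only so the outcome does not depend on the random draws.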
