subsampling #1

Mayar2009 · 2019-10-03T08:39:18Z

In `gen_vocab(self)`, we keep only the vocabulary items whose frequency is >= `self.min_count`, like this:

```
vocab, word2id, id2word = {}, {}, {}
index = 0
for item_id, freq in vocab_freq_dict.items():
    if freq < self.min_count:
        continue
    vocab[item_id] = freq
    word2id[item_id] = index
    id2word[index] = item_id
    index += 1
return vocab, word2id, id2word, total_word_count, total_sent_count
```

Can you please clarify the function `gen_subsample_table(self)`?

```
def gen_subsample_table(self):
    """
    sub sampling rate, higher than that would be sub sampled using
        the word2vec paper using:    p(w_i) = 1 - sqrt(sub_sampling / freq)
        the word2vec code using:     p(w_i) = 1 - (sqrt(sub_sampling / freq) + sub_sampling / freq)
    we use word2vec code sub sampling method here.
    :return: {word_id: sample_score}
    """
    def sub_sampling(_freq):
        return (self.sub_sampling_t / 1.0 / _freq) ** 0.5 + self.sub_sampling_t / 1.0 / _freq
    # word freq count to word freq ratio
    sub_sample_tbl = {item: freq / 1.0 / self.total_word_count
                      for item, freq in self.vocab.items()
                      if freq / 1.0 / self.total_word_count > self.sub_sampling_t}
    # freq to score
    sub_sample_tbl = {item: sub_sampling(_freq) for item, _freq in sub_sample_tbl.items()}
    # word to id
    sub_sample_tbl = {self.word2id[i]: j for i, j in sub_sample_tbl.items() if j < 1}
    return sub_sample_tbl
```
Line 9, `def sub_sampling(_freq):` looks like it returns `sqrt(sub_sampling / freq) + sub_sampling / freq`, not `1 - (sqrt(sub_sampling / freq) + sub_sampling / freq)` as the docstring says, right?

Why is line 14 needed?

    if freq / 1.0 / self.total_word_count > self.sub_sampling_t}

We already filtered by frequency on line 4 of `gen_vocab(self)` (`if freq < self.min_count:`), shown in the first part of the question.

What is the meaning of this line?

    sub_sample_tbl = {self.word2id[i]: j for i, j in sub_sample_tbl.items() if j < 1}
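
To make the three questions concrete, here is a small numeric sketch of what the two filters do. It assumes hypothetical values for `self.sub_sampling_t` and for the word frequency ratios; `keep_score` is only a stand-in name for the nested `sub_sampling` helper, and reading its result as "1 minus the docstring's p(w_i)" is my interpretation, not something the code states.

```python
import math

def keep_score(freq_ratio, t):
    """Same formula as the nested sub_sampling helper: sqrt(t / f) + t / f."""
    x = t / freq_ratio
    return math.sqrt(x) + x

t = 1e-3  # hypothetical value for self.sub_sampling_t
for f in (0.0015, 0.002, 0.004, 0.1):  # hypothetical frequency ratios, all > t
    score = keep_score(f, t)
    # Entries with score >= 1 are the ones the `if j < 1` filter drops.
    print(f"f={f}: score={score:.3f}, 1 - score={1 - score:.3f}, in_table={score < 1}")

# sqrt(x) + x equals 1 at x = (3 - sqrt(5)) / 2, so score < 1 only when
# f > t / x, which is roughly 2.618 * t.
cutoff = t / ((3 - math.sqrt(5)) / 2)
print(f"score < 1 only for f > {cutoff:.6f}")
```

If this reading is right, the `> self.sub_sampling_t` test works on frequency ratios rather than raw counts, so it is a different kind of filter from `min_count`, and `if j < 1` removes words whose ratio is above the threshold but still too low for the score to fall below 1.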

thank you!


Mayar2009 · 2019-10-06T18:19:45Z

One more question, please: where is `sub_sampling_table` used? I can't find it used anywhere in the code, which seems strange.
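
For reference, this is how I would expect a table like the one returned by `gen_subsample_table` to be applied. It is only a sketch under my own assumptions: `subsample_sentence` is a hypothetical helper that is not in the repository, and it treats each score as the probability of keeping the word.

```python
import random

def subsample_sentence(word_ids, sub_sample_tbl, rng=random):
    """Randomly drop frequent words; ids missing from the table are always kept.

    Hypothetical helper: assumes sub_sample_tbl maps word_id -> keep probability.
    """
    kept = []
    for wid in word_ids:
        score = sub_sample_tbl.get(wid)
        # Keep the word if it has no score, or with probability `score` otherwise.
        if score is None or rng.random() < score:
            kept.append(wid)
    return kept

# Hypothetical usage: word id 0 is very frequent, so it is kept about 11% of the time.
table = {0: 0.11}
sentence = [0, 5, 0, 7, 0, 9]
print(subsample_sentence(sentence, table, random.Random(0)))
```

If the table is really never read during training, then no sub-sampling is happening at all, which is why I am asking.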
