Conversation

@denizs (Contributor) commented May 8, 2017

I just went through the NLP tutorial - which is awesome, btw - and got stuck on the CBOW exercise. I'm fairly new to the topic, so excuse me if I'm missing a crucial point here.

Right now, word_to_ix is derived from raw_text, which still contains duplicate words, so e.g. the word 'computer' ends up indexed as 58.

If I understand correctly, the goal is to predict the word in the center of a window, given 2 context words on each side, so the probability distribution should contain len(set(raw_text)) values.

Someone like me who is new to this topic will go ahead and follow the same approach as previously shown in the NGram example:

# ...
self.embeddings = nn.Embedding(len(word_to_ix), embedding_dimension)
# ...

This will cause the code to break as soon as you hit a context vector containing a word (such as 'computer') whose index is higher than the length of word_to_ix (in this case 49).
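
To illustrate, here is a minimal reproduction with a toy corpus, assuming the current boilerplate builds word_to_ix as {word: i for i, word in enumerate(raw_text)}:

import torch
import torch.autograd as autograd
import torch.nn as nn

toy_text = "the cat sat on the mat near the cat".split()

# Buggy: each word ends up with the index of its *last* position in the
# duplicate-containing list, so indices can exceed the number of unique words.
word_to_ix = {word: i for i, word in enumerate(toy_text)}
print(word_to_ix['cat'], len(word_to_ix))  # prints 8 and 6

embeddings = nn.Embedding(len(word_to_ix), 10)  # only 6 rows in the table
lookup = autograd.Variable(torch.LongTensor([word_to_ix['cat']]))
embeddings(lookup)  # raises an index-out-of-range error, same as with 'computer'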

Hence, I propose the following boilerplate:

import torch
import torch.autograd as autograd
import torch.nn as nn

CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By retrieving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


class CBOW(nn.Module):

    def __init__(self):
        pass

    def forward(self, inputs):
        pass

# create your model and train.  here are some functions to help you make
# the data ready for use by your module


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    tensor = torch.LongTensor(idxs)
    return autograd.Variable(tensor)

make_context_vector(data[0][0], word_to_ix)  # example
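
For reference, building on the boilerplate above, here is one possible way to fill in the exercise - just a sketch, not part of this PR. It sums the context embeddings and projects them to log-probabilities over the vocabulary; EMBEDDING_DIM and the training loop are my own assumptions, not something the tutorial prescribes.

import torch.nn.functional as F
import torch.optim as optim

EMBEDDING_DIM = 10  # assumed value; the exercise leaves this open


class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        # inputs: LongTensor of context word indices, shape (2 * CONTEXT_SIZE,)
        embeds = self.embeddings(inputs).sum(dim=0)  # (embedding_dim,)
        out = self.linear(embeds.view(1, -1))        # (1, vocab_size)
        return F.log_softmax(out, dim=1)


model = CBOW(vocab_size, EMBEDDING_DIM)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    for context, target in data:
        context_var = make_context_vector(context, word_to_ix)
        target_var = autograd.Variable(torch.LongTensor([word_to_ix[target]]))

        model.zero_grad()
        log_probs = model(context_var)
        loss = loss_function(log_probs, target_var)
        loss.backward()
        optimizer.step()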

Introduced a variable called `vocab` with a value of `set(raw_text)`, and `vocab_size`, which holds the length of `vocab`.
@chsasank (Contributor) commented:

Hey @denizs,

I'm really sorry for not replying for 20 days. I got caught up in something else. Thanks for catching the bug.

@chsasank chsasank merged commit fe09e37 into pytorch:master May 27, 2017
@denizs (Contributor, Author) commented May 27, 2017

No worries 🙂
