Conversation

@denizs (Contributor) commented May 8, 2017

I just went through the NLP tutorial - which is awesome, btw - and got stuck on the CBOW exercise. I'm fairly new to the topic, so excuse me if I'm missing a crucial point here.

Right now, word_to_ix is derived from raw_text, which still contains duplicate words, so e.g. the word 'computer' ends up indexed as 58.

If I understand correctly, the goal is to predict the word in the center of a window, given 2 context words on each side, so the probability distribution should contain len(set(raw_text)) values.

Someone like me who is new to this topic will go ahead and follow the same approach as previously shown in the NGram example:

# ...
self.embeddings = nn.Embedding(len(word_to_ix), embedding_dimension)
# ...

This will cause the code to break as soon as you hit a context vector containing a word (such as 'computer') whose index is higher than the length of word_to_ix (in this case 49).
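
To illustrate, here is a minimal reproduction with a toy corpus, assuming the current boilerplate builds word_to_ix as {word: i for i, word in enumerate(raw_text)}:

import torch
import torch.autograd as autograd
import torch.nn as nn

toy_text = "the cat sat on the mat near the cat".split()

# Buggy: each word ends up with the index of its *last* position in the
# duplicate-containing list, so indices can exceed the number of unique words.
word_to_ix = {word: i for i, word in enumerate(toy_text)}
print(word_to_ix['cat'], len(word_to_ix))  # prints 8 and 6

embeddings = nn.Embedding(len(word_to_ix), 10)  # only 6 rows in the table
lookup = autograd.Variable(torch.LongTensor([word_to_ix['cat']]))
embeddings(lookup)  # raises an index-out-of-range error, same as with 'computer'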

Hence, I propose the following boilerplate:

import torch
import torch.autograd as autograd
import torch.nn as nn

CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By retrieving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


class CBOW(nn.Module):

    def __init__(self):
        pass

    def forward(self, inputs):
        pass

# create your model and train.  here are some functions to help you make
# the data ready for use by your module


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    tensor = torch.LongTensor(idxs)
    return autograd.Variable(tensor)

make_context_vector(data[0][0], word_to_ix)  # example
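
For reference, building on the boilerplate above, here is one possible way to fill in the exercise - just a sketch, not part of this PR. It sums the context embeddings and projects them to log-probabilities over the vocabulary; EMBEDDING_DIM and the training loop are my own assumptions, not something the tutorial prescribes.

import torch.nn.functional as F
import torch.optim as optim

EMBEDDING_DIM = 10  # assumed value; the exercise leaves this open


class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        # inputs: LongTensor of context word indices, shape (2 * CONTEXT_SIZE,)
        embeds = self.embeddings(inputs).sum(dim=0)  # (embedding_dim,)
        out = self.linear(embeds.view(1, -1))        # (1, vocab_size)
        return F.log_softmax(out, dim=1)


model = CBOW(vocab_size, EMBEDDING_DIM)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    for context, target in data:
        context_var = make_context_vector(context, word_to_ix)
        target_var = autograd.Variable(torch.LongTensor([word_to_ix[target]]))

        model.zero_grad()
        log_probs = model(context_var)
        loss = loss_function(log_probs, target_var)
        loss.backward()
        optimizer.step()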

Introduced a variable called `vocab` with a value of `set(raw_text)`, and `vocab_size`, which holds the length of `vocab`.
@chsasank (Contributor) commented:

Hey @denizs,

I'm really sorry for not replying for 20 days. I got caught up in something else. Thanks for catching the bug.

@chsasank chsasank merged commit fe09e37 into pytorch:master May 27, 2017
@denizs (Contributor, Author) commented May 27, 2017

No worries 🙂
