
Many-to-many variable-length sequence labeling (such as POS) #3916

Closed
jayinai opened this issue Sep 29, 2016 · 5 comments

Comments

@jayinai

jayinai commented Sep 29, 2016

I've been following some related threads, such as #395, #2654, and #2403, but still cannot sort out how to get this to work. The Keras API docs are quite dated, so they aren't very helpful for this issue.

I want to use pretrained word2vec word representations + a Keras LSTM to do POS tagging.

My first question: is there a better way to feed in the pretrained vector representations than the embedding_weights method mentioned in #853?

Say we embed using the method mentioned in #853 and get an (M+2) by N embedding matrix. We also pad the variable-length sentences. Then we have

X_pad.shape = (M, N)
y_pad.shape = (M, N)

where M is the number of sentences in the corpus (in my case 18421) and N is the padded sentence length (original lengths vary from 15 to 140, so here N = 140).
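For reference, here is roughly how I build the embedding matrix (just a sketch; w2v stands for the loaded pretrained word2vec model and word_index maps each word to its row index, both placeholder names):

  import numpy as np

  # sketch: rows stay zero for padding/unknown words, pretrained vectors elsewhere
  embedding_matrix = np.zeros((vocab_size, embed_size))
  for word, idx in word_index.items():
      if word in w2v:
          embedding_matrix[idx] = w2v[word]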

Here is how I initialized the model:

  model = Sequential()

  # first embedding layer
  model.add(Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=N, mask_zero=True, weights=[embedding_matrix]))

  # hidden layer
  model.add(LSTM(output_dim=hidden_dim, return_sequences=True))

  # output layer
  model.add(TimeDistributed(Dense(num_class, activation='softmax')))

  # compile
  model.compile(loss='categorical_crossentropy', optimizer='adam')

When I run model.fit(X_pad, y_pad), I get this error:

Exception: Error when checking model target: expected timedistributed_1 to have 3 dimensions, but got array with shape (18421, 140)

I've been stuck here for a while. Any suggestions are appreciated!

@dieuwkehupkes
Contributor

I ran across this problem as well. I am still not sure why this happens or whether it is the desired behaviour, but I did manage to get around it by putting each of my output values in its own array, i.e.:

X = [[1, 2]]
X_padded = keras.preprocessing.sequence.pad_sequences(X, dtype='float32', maxlen=3)
Y = [[[1], [2]]]  # note: one array per timestep, rather than [[1, 2]]
Y_padded = keras.preprocessing.sequence.pad_sequences(Y, maxlen=3, dtype='float32')

See also #3855, which is about a different variable-length sequence-to-sequence learning problem, but also mentions this issue.

@jayinai
Author

jayinai commented Sep 30, 2016

@dieuwkehupkes thanks for the hint! Turns out one-hot encoding is needed.

For people with similar issues: you can solve the problem by creating a 3-D y_pad_one_hot and feeding it into the model above:

import numpy as np
from keras.utils.np_utils import to_categorical

# y_pad_one_hot.shape: (M, N, nb_classes)
y_pad_one_hot = np.array([to_categorical(sent_label, nb_classes=nb_classes) for sent_label in y_pad])
model.fit(X_pad, y_pad_one_hot)

I still need to find the best way to mask the padding, though.
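One option I am looking at (a rough sketch only, not something I have verified): compile with sample_weight_mode='temporal' and pass a per-timestep weight matrix that zeroes out the padded positions, e.g.

# sketch: give zero weight to padded timesteps (assumes 0 is the padding index)
sample_weights = (X_pad != 0).astype('float32')  # shape (M, N)

model.compile(loss='categorical_crossentropy', optimizer='adam',
              sample_weight_mode='temporal')
model.fit(X_pad, y_pad_one_hot, sample_weight=sample_weights)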

@jayinai jayinai closed this as completed Sep 30, 2016
@yangxiufengsia

@shuaiw can you provide the exact values of "nb_classes" and "num_class"? I encountered the same problem, please help!

@neingeist

@yangxiufengsia num_class/nb_classes is the number of classes.
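For POS tagging it is simply the size of the tagset, e.g. (illustration only; tag_sequences is a placeholder for your list of tag sequences):

# illustration: number of distinct POS tags in the training data
nb_classes = len(set(tag for sent in tag_sequences for tag in sent))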

@parunach

@shuaiw If the output is a sequence of words, num_class becomes the vocab_size. Assuming I am expecting an output of 20 words, the one-hot encoded Y becomes [vocab_size, max_output_words]. Is this correct?
