How to train a many-to-many sequence labeling model using LSTM and BLSTM respectively? #2654
Comments
Are you padding your inputs, using an embedding layer, etc.? A basic structure:

xin = Input(batch_shape=(batch, timesteps), dtype='int32')
xemb = Embedding(vocab_size, embedding_size)(xin)  # 3dim (batch, time, feat)
seq = LSTM(seq_size, return_sequences=True)(xemb)
mlp = TimeDistributed(Dense(mlp_size, activation='softmax'))(seq)
model = Model(input=xin, output=mlp)
model.compile(optimizer='Adam', loss='categorical_crossentropy')

An example of how you could serve your sentence/sequence data:

def serve_sentence(self, data):
    for data_i in np.random.choice(len(data), len(data), replace=False):
        in_X = np.zeros(self.max_sequence_len)
        out_Y = np.zeros(self.max_sequence_len, dtype=np.int32)
        bigram_data = zip(data[data_i][0:-1], data[data_i][1:])
        for datum_j, (datum_in, datum_out) in enumerate(bigram_data):
            in_X[datum_j] = datum_in
            out_Y[datum_j] = datum_out
        yield in_X, out_Y

def serve_batch(self, data):
    dataiter = self.serve_sentence(data)
    V = self.vocab_size
    S = self.max_sequence_len
    B = self.batch_size
    while dataiter:
        in_X = np.zeros((B, S), dtype=np.int32)
        out_Y = np.zeros((B, S, V), dtype=np.int32)
        next_batch = list(itertools.islice(dataiter, 0, self.batch_size))
        if len(next_batch) < self.batch_size:
            raise StopIteration
        for d_i, (d_X, d_Y) in enumerate(next_batch):
            in_X[d_i] = d_X
            out_Y[d_i] = to_categorical(d_Y, V)
        yield in_X, out_Y
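A minimal usage sketch, assuming `loader` is an instance of the class that defines serve_batch above, `train_data` is its training data, and the counts are placeholders (Keras 1.x fit_generator API):

def batch_forever(loader, data):
    # fit_generator expects a generator that can keep yielding batches,
    # so restart serve_batch over the data indefinitely.
    while True:
        for in_X, out_Y in loader.serve_batch(data):
            yield in_X, out_Y

model.fit_generator(batch_forever(loader, train_data),
                    samples_per_epoch=10000,  # placeholder: sequences per epoch
                    nb_epoch=10)              # placeholder epoch count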
@braingineer Thank you for your help.

model = Sequential()
model.add(Embedding(DICT_SIZE, EMBED_SIZE, input_length=MAX_SENTENCE_LEN))
model.add(LSTM(HIDDEN_SIZE, return_sequences=True))
model.add(TimeDistributed(Dense(NUM_CLASS, activation='softmax')))
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

But I got a poor result on my test set: a low F1 score (if you haven't heard of it, you can treat it as accuracy) of 41%.

def gen_onefile(indexfile, sen_len=150):
    with open(indexfile, 'r') as inf:
        inputs = []
        for line in inf:
            inputs += [int(num) for num in line.split()]
    ult_in = []
    for i in range(len(inputs) // sen_len):  # integer division
        ult_in.append(inputs[i*sen_len: (i+1)*sen_len])
    inputs = np.array(ult_in).astype(np.int32)
    return inputs

def get_X_y(indexfile, labelfile, sen_len=150):
    inputs = gen_onefile(indexfile, sen_len)
    labels = gen_onefile(labelfile, sen_len)
    assert len(inputs) == len(labels), "not equal"
    return inputs, labels

The index file is like X = [[123, 2, 3], [4, 5, 22, 10, 2], [1, 5]], so inputs will be [[123, 2, 3], [4, 5, 22], [10, 2, 1]] (with sen_len=3 for illustration). Have I done something wrong somewhere?
lol. I'm in academia. I hear about it too often.

Why? Because the method you described doesn't make any sense. What you're basically telling your learner by concatenating all of your data is "hey, sequences that are 150 words long and run into each other are important, so learn to predict this". But there's no real pattern that follows that, or at least there isn't until you get infinite data. Padding gives you a way to input sentences so that they don't confuse each other.

Padding does not affect you statistically. Internally, masked elements are ignored (in the RNNs, they are passed on, and in the loss measurements, they are used to adjust the loss). Though, there is some care to be taken if you calculate perplexity. I've posted on here about it, but I haven't had time to submit a proper PR for it.

As far as making your model better: it's a very weak model. That may also be the issue. Though, if your data is simple enough, it shouldn't be too much of a problem. I made a simple model with Keras a couple of weeks ago and got very near to the state of the art on the language modeling version of the WSJ corpus (sections 0-20 for training, 21-22 for dev, 23-24 for test). I've pasted it below. I also initialized my embeddings with 300d GloVe vectors (you can get these from the Stanford website).

B = self.igor.batch_size
R = self.igor.rnn_size
S = self.igor.max_sequence_len
V = self.igor.vocab_size
E = self.igor.embedding_size
emb_W = self.igor.embeddings.astype(theano.config.floatX)
## dropout parameters
p_emb = self.igor.p_emb_dropout
p_W = self.igor.p_W_dropout
p_U = self.igor.p_U_dropout
p_dense = self.igor.p_dense_dropout
w_decay = self.igor.weight_decay
M = Sequential()
M.add(Embedding(V, E, batch_input_shape=(B, S),
                W_regularizer=l2(w_decay),
                weights=[emb_W], mask_zero=True, dropout=p_emb))
#for i in range(self.igor.num_lstms):
M.add(LSTM(R, return_sequences=True, dropout_W=p_W, dropout_U=p_U,
           U_regularizer=l2(w_decay), W_regularizer=l2(w_decay)))
M.add(Dropout(p_dense))
M.add(LSTM(R*int(1/p_dense), return_sequences=True, dropout_W=p_W, dropout_U=p_U))
M.add(Dropout(p_dense))
M.add(TimeDistributed(Dense(V, activation='softmax',
                            W_regularizer=l2(w_decay), b_regularizer=l2(w_decay))))
print("compiling")
optimizer = Adam(self.igor.LR, clipnorm=self.igor.max_grad_norm,
                 clipvalue=5.0)
#optimizer = SGD(lr=0.01, momentum=0.5, decay=0.0, nesterov=True)
M.compile(loss='categorical_crossentropy', optimizer=optimizer,
          metrics=['accuracy', 'perplexity'])
print("compiled")
self.model = M
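Note that 'perplexity' is not a built-in Keras metric, so the snippet above presumably relies on a custom metric (the masking-aware version the author alludes to is not shown). A minimal, unmasked sketch of such a metric, assuming one-hot targets and using only Keras backend ops, might look like:

from keras import backend as K

def perplexity(y_true, y_pred):
    # Clip predictions to avoid log(0), average the per-timestep
    # cross-entropy, and exponentiate.  This naive version does not
    # exclude padded/masked timesteps from the average.
    y_pred = K.clip(y_pred, 1e-7, 1.0 - 1e-7)
    cross_entropy = -K.sum(y_true * K.log(y_pred), axis=-1)
    return K.exp(K.mean(cross_entropy))

# then: M.compile(..., metrics=['accuracy', perplexity])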
@braingineer Your metaphor is very interesting, and thank you for your explanation.

I can't figure out whether the masked elements (i.e. 0) are ignored or not. Actually, before I used the method I mentioned above, I had trained another model in which I padded or truncated every sentence in my training set to a fixed length (e.g. 150) via

model.add(Embedding(DICT_SIZE, EMBED_SIZE, input_length=MAX_SENTENCE_LEN, mask_zero=True))

However, I feel cheated by the high accuracy in the training stage, because I subsequently got a low F1 score of 41% on my test set. The high accuracy may be because my training data labels are {0, 1, 2} and most of them (originally 85%) are labelled 0; padding adds more 0s to my labels, so I got 95% accuracy. Meanwhile, my inputs start from 1, so I assumed the 0 inputs would be masked and the added 0 labels would be masked too, but it seems like nothing is being masked. Besides, the test stage is a little complex, because I must handle the padded and truncated sentences carefully. I think it is very important to prepare the inputs and output labels of the LSTM correctly.
You can try to use 0 only for padding and transform your labels by just adding one to each. I mean: your label 0 should become 1, your label 1 should become 2, and your label 2 should become 3. Then you can recover the originals by subtracting 1. I think this makes your accuracy more appropriate to the problem. I think all your 0 labels are being ignored.
They are. See the code here. The calculations there may appear a bit off, but if you work through the math, you'll notice that they come out correct.

My first post shows exactly how I pad my sequences: I create a matrix/tensor of zeros of the max size and then fill it in with each datapoint accordingly.

@jllombart is also correct about the labeling. You should reserve the 0 label as the mask. So, for instance, if you're making your word-to-index dictionary:

word2idx = {'<MASK>': 0}
word2idx.update({word: i for i, word in enumerate(set(words_in_my_dataset), 1)})
idx2word = {i: word for word, i in word2idx.items()}

This would work. I personally have a class written for this. You can see it here.
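To make that advice concrete, here is a minimal sketch (not from the original comments) of padding the variable-length sequences from this thread and shifting the labels so that 0 is reserved for the mask; MAX_LEN and NUM_CLASS are placeholder values:

import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

X = [[123, 2, 3], [4, 5, 22, 10, 2], [1, 5]]
y = [[0, 0, 2], [0, 1, 0, 0, 2], [0, 1]]

MAX_LEN = 5
NUM_CLASS = 3 + 1   # the 3 original classes plus the reserved 0/mask class

# Word indices are assumed to start at 1, so 0 is free to act as the pad/mask.
X_pad = pad_sequences(X, maxlen=MAX_LEN, padding='post')            # (3, 5)
y_shift = [[label + 1 for label in seq] for seq in y]               # labels 0..2 -> 1..3
y_pad = pad_sequences(y_shift, maxlen=MAX_LEN, padding='post')      # padded steps stay 0
Y = np.array([to_categorical(seq, NUM_CLASS) for seq in y_pad])     # (3, 5, 4)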
@braingineer @jllombart Thanks for your responses; you have been a great help. My task is punctuation prediction, and I treat it as a sequence labeling task.

Then I get a bad F1 score, 41%.

Then I get an F1 score of 69.62%. This is very interesting, right? I wrote a BLSTM model as follows:

model1 = Sequential()
model1.add(Embedding(DICT_SIZE, EMBED_SIZE, input_shape=(1,)))
model1.add(LSTM(HIDDEN_SIZE, return_sequences=True))
model2 = Sequential()
model2.add(Embedding(DICT_SIZE, EMBED_SIZE, input_shape=(1,)))
model2.add(LSTM(HIDDEN_SIZE, return_sequences=True, go_backwards=True))
model = Sequential()
model.add(Merge([model1, model2], mode='concat'))
model.add(TimeDistributed(Dense(NUM_CLASS, activation='softmax')))

But I only get an F1 score of 69.52%, which is less than the LSTM's F1 score. I'm wondering whether this model is right or not.
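One thing worth checking in the two-branch construction above (this note is not from the thread): with go_backwards=True the LSTM returns its outputs in reversed time order, so concatenating them directly with the forward outputs can misalign timesteps. If your Keras version includes the Bidirectional wrapper, a sketch like the following handles the reversal internally (reusing the variable names from the snippet above):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense
from keras.layers.wrappers import Bidirectional

model = Sequential()
model.add(Embedding(DICT_SIZE, EMBED_SIZE, input_length=MAX_SENTENCE_LEN,
                    mask_zero=True))
# One forward and one backward LSTM over the same embeddings; the backward
# outputs are re-reversed before the two are concatenated per timestep.
model.add(Bidirectional(LSTM(HIDDEN_SIZE, return_sequences=True)))
model.add(TimeDistributed(Dense(NUM_CLASS, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])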
You either have a typo here or in your stuff: you have a 2 label on the END symbol. The LSTM would then arbitrarily learn to predict a period for the END symbol. The BLSTM wouldn't be any better, because it could, at best, only learn that the FRONT of the backwards sequence is a period. Shouldn't it be

Is there prior work on this topic? What is the state of the art?
@braingineer Earlier in this conversation you mentioned this was a weak model. What exactly do you mean by that? Also, what kind of parameters (i.e. values) were you using to initialize the model you presented above (self.igor.*)? And what was your main motivation for picking the ones you did? Thanks!
Hi viksit, I meant that it was fairly shallow. The model had Embedding -> LSTM -> Dense predictions. I've found that

Also, having dropout between the LSTMs, as in this paper, works very well too. Finally, employing dropout on your Embedding (it is one of the constructor parameters) and on the LSTM matrices (also in the constructor) has worked well for me in the past, and has some good justifications (by the way, with that last link, all of that has been integrated into Keras already). Especially the Embedding: I've found major differences with and without embedding dropout.

For the parameters on the earlier model, I had:

edit: whoops, thought you were the OP. Edited for pronoun.
@braingineer Ah, thanks for the pointer. For your empirical findings about those two kinds of models being effective: do you mean that for many-to-many sequences specifically, or for any text-based model that could be using a many-to-one prediction as well (e.g. classification)? Have you found a difference between feeding pre-initialized embeddings such as GloVe directly into an LSTM, versus using them as initial weights in a Keras Embedding layer? Also, you talk about masking above. I've got a fundamental question: when/where are masking layers effective, and why use them versus, say, padding?

Is "Embedding" just for inputs of type "int"? How can I pad the input if I want to use float vectors as input, such as word vectors?
re: performance

Though, there is a trade-off. If you increase it too much, then there are too many parameters for your data, and the model will thrash because there's not enough evidence and there are too many "free variables", representation-wise. I find that this happens if I increase the number of stacked LSTMs. So basically, you want to give your model the right amount of representational complexity and capacity. Too complex, and the optimizer can't find a good gradient to follow given the data it's being fed. Too simple, and the limited capacity means it has to generalize over things that were important in order to minimize its loss. If you're doing a simple problem, then these considerations play out differently, because you're asking less of the sequence modeling. But basically, you should just be trying different configurations and getting a feel for what works. The cs231n class has a great page on this.

re: pre-initialized embeddings.

re: masking and padding

The sequences are all made the same length because you want to put them into the same tensor, so they can all be fed into your model. But when the model is computing things, it is agnostic to which elements were padded, so it will go ahead and produce values for the spots you padded. To counteract this, you construct a mask that puts your data into the correct form. With two sequences of length 7 and 10, it will do nothing to the sequence of length 10, but a good mask should zero out the last 3 values of the 7-length sequence vector. This is important, because if you're evaluating your algorithm as a classification at each time step and you don't inform it with a mask that the last 3 values of the 7-length sequence are not real values, then it will check the predictions made there and accumulate them into the metric.
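As a concrete illustration of that 7-vs-10 example (a sketch, not part of the original comment), padding with zeros and checking which positions a mask would ignore could look like this; an Embedding layer with mask_zero=True builds the equivalent mask internally:

import numpy as np
from keras.preprocessing.sequence import pad_sequences

# Two sequences of length 7 and 10 (dummy token ids starting at 1).
seqs = [list(range(1, 8)), list(range(1, 11))]

padded = pad_sequences(seqs, maxlen=10, padding='post')   # shape (2, 10)
mask = (padded != 0)                                      # False at the 3 padded positions

print(padded)
print(mask.astype(int))   # [[1 1 1 1 1 1 1 0 0 0], [1 1 1 1 1 1 1 1 1 1]]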
An embedding is just a weight matrix whose rows represent the discrete items in your input data; it is a trick to convert discrete items into continuous representations. If you already have continuous representations, then you have no need for an embedding matrix. Though, if you view the problem as fitting representations to the loss function for your task, you can increase the flexibility by adding a dense matrix on top of your existing continuous representations. This allows the network to adjust the representations to better fit the loss function.
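Relatedly, for the earlier question about float word vectors as input: a minimal sketch (not from the thread) of skipping the Embedding layer, padding with all-zero vectors, and letting a Masking layer ignore the padded timesteps; the sizes are placeholders:

from keras.models import Sequential
from keras.layers import Masking, LSTM, TimeDistributed, Dense

MAX_LEN, WORD_DIM, HIDDEN, NUM_CLASS = 50, 300, 128, 4   # placeholder sizes

model = Sequential()
# Timesteps whose entire feature vector equals mask_value are skipped downstream.
model.add(Masking(mask_value=0.0, input_shape=(MAX_LEN, WORD_DIM)))
model.add(LSTM(HIDDEN, return_sequences=True))
model.add(TimeDistributed(Dense(NUM_CLASS, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')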
@braingineer
@braingineer Thanks for the super comprehensive answer! Some follow-ups. Re: the embeddings, that makes sense. My question hinged on a slightly different note: when using embeddings, you can have two ways of doing things.

Personally, I've found that the second option gives better performance, but I was curious to see if you've seen similar results in your own experiments.

Re: your other point about clean datasets, that makes sense as well. Do you actually think it is better to tokenize, clean punctuation, and so on before training, as well as at the prediction stage? Or do you try to take those features into account as well? I guess it would depend on what you're trying to do with the architecture, classification versus generation, of course.

Lastly, intuitively the size of an LSTM should depend purely on the max length of the sequence rather than on the size of the dataset (hundreds of examples versus tens of thousands). But if you're using stateful LSTMs, I have a suspicion that the size of the dataset would matter too.
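For reference, a minimal sketch (not from the thread) of the second approach discussed above, i.e. using pre-trained vectors as the initial weights of a Keras Embedding layer; `embedding_matrix`, `vocab_size`, and MAX_SENTENCE_LEN are assumed to be prepared beforehand:

from keras.layers import Embedding

# embedding_matrix: (vocab_size, 300) array built from the GloVe file, with
# row 0 reserved for the mask/padding index.
embedding = Embedding(vocab_size, 300,
                      weights=[embedding_matrix],
                      input_length=MAX_SENTENCE_LEN,
                      mask_zero=True,
                      trainable=False)   # set True to fine-tune the vectors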
@kaituoxu @braingineer @jllombart @viksit @liangmin0020 Thanks for a great discussion! I am implementing an LSTM POS tagger but still cannot get it to work. Here is my situation:

X_pad.shape = (M, N)
y_pad.shape = (M, N)

where M is the number of sentences in the corpus (18421), and N is the padded sentence length (originals vary from 15 to 140, so in this case N=140). Here is how I initialized the model:

model = Sequential()
# first embedding layer
model.add(Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=N, weights=[embedding_matrix]))
# hidden layer
model.add(LSTM(output_dim=hidden_dim, return_sequences=True))
# output layer
model.add(TimeDistributed(Dense(num_class, activation='softmax')))
# compile
model.compile(loss='categorical_crossentropy', optimizer='adam')

The error I got is:

Exception: Error when checking model target: expected timedistributed_1 to have 3 dimensions, but got array with shape (18421, 140)

I've been stuck here for a while. Any suggestion is appreciated!
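That error usually means the targets are still integer-encoded with shape (M, N) while the TimeDistributed softmax output expects one-hot targets of shape (M, N, num_class). A minimal sketch of one possible fix (not from the original comment; the batch size and epoch count are placeholders):

import numpy as np
from keras.utils.np_utils import to_categorical

# Expand integer tags (M, N) to one-hot (M, N, num_class) so they match the
# (batch, timesteps, num_class) shape produced by the TimeDistributed softmax.
Y = np.array([to_categorical(seq, num_class) for seq in y_pad])
print(Y.shape)   # (18421, 140, num_class)

model.fit(X_pad, Y, batch_size=32, nb_epoch=5)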
I am kind of new to LSTM configuration. Could you please let me know how to reshape my data for use in an LSTM? inputx: 10000 = number of sequences, 5 = max sequence length (each timestep is a vector of numbers). Example of one sequence:
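In general (this note is not from the original comment), a Keras LSTM expects input of shape (num_samples, timesteps, features), so data like that described above might be arranged as follows, assuming each timestep is a FEAT_DIM-dimensional vector and shorter sequences have been zero-padded to 5 steps:

import numpy as np

FEAT_DIM = 8   # placeholder feature size per timestep
# `sequences` stands in for 10000 arrays of shape (5, FEAT_DIM).
sequences = [np.zeros((5, FEAT_DIM)) for _ in range(10000)]
X = np.stack(sequences)   # shape (10000, 5, FEAT_DIM)
print(X.shape)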
@braingineer In your code, is the dense function (the set of weights between the LSTM and Dense layers) time-variant? If I understand it correctly, you are training a function for each time step, and as a result the labeling function F(X, t) depends on the time variable too, not just on the context of the input. In many cases, I think we want the function to be time-invariant (depending on the context, but not on its position in the sequence).
Maybe you need connectionist temporal classification (CTC), which can be used for end-to-end handwriting OCR.
@kaituoxu, can you please share your code? I am working on a similar problem.
Hi! I have a situation where my inputs are just like the example given: X = [[123, 2, 3], [4, 5, 22, 10, 2], [1, 5]]. They are just integers; no words are involved. So is there a simple way to implement this without an embedding?
I'm working on a sequence labeling task, and I want to use LSTM and BLSTM respectively to do this task.
I read some issues and docs, but I still get a poor result using LSTM.
My input and output are like below:
I have 3 samples, and each of them has a different length.
X = [[123, 2, 3], [4, 5, 22, 10, 2], [1, 5]]
y = [[0, 0, 2], [0, 1, 0, 0, 2], [0, 1]]
It's like the fifth architecture (from the left) in the picture.
Does anyone know how to implement this in Keras using LSTM and BLSTM respectively?