
how to train a many to many sequence labeling using LSTM and BLSTM respectively? #2654

Closed
kaituoxu opened this issue May 7, 2016 · 23 comments

Comments

@kaituoxu

kaituoxu commented May 7, 2016

I'm working on a sequence labeling task, and I want to try an LSTM and a BLSTM on it, respectively.
I have read some issues and docs, but I still get poor results with the LSTM.

My input and output look like this:
I have 3 samples, each with a different length.
X = [ [123, 2, 3], [4, 5, 22, 10, 2], [1, 5] ]
y = [ [0, 0, 2], [0, 1, 0, 0, 2], [0, 1] ]

It's like the fifth architecture (from left) in the picture.
Does anyone know how to implement it in Keras, using an LSTM and a BLSTM respectively?
[image: diagram of RNN input/output architectures]

@kaituoxu kaituoxu changed the title how to use LSTM or BLSTM to implement sequence labeling(like POS) in keras? how to train a many to many sequence labeling using LSTM and BLSTM respectively? May 7, 2016
@braingineer
Contributor

braingineer commented May 7, 2016

Are you padding your inputs, using an embedding layer, etc.?

A basic structure:

from keras.layers import Input, Embedding, LSTM, Dense, TimeDistributed
from keras.models import Model

xin = Input(batch_shape=(batch, timesteps), dtype='int32')
xemb = Embedding(vocab_size, embedding_size)(xin)  # 3 dim (batch, time, feat)
seq = LSTM(seq_size, return_sequences=True)(xemb)
mlp = TimeDistributed(Dense(mlp_size, activation='softmax'))(seq)
model = Model(input=xin, output=mlp)
model.compile(optimizer='adam', loss='categorical_crossentropy')

an example of how you could serve your sentence/sequence data:

    # (methods of a data-serving class; assumes numpy as np, itertools, and
    # keras.utils.np_utils.to_categorical are imported at module level)
    def serve_sentence(self, data):
        for data_i in np.random.choice(len(data), len(data), replace=False):
            in_X = np.zeros(self.max_sequence_len)
            out_Y = np.zeros(self.max_sequence_len, dtype=np.int32)
            bigram_data = zip(data[data_i][0:-1], data[data_i][1:])
            for datum_j,(datum_in, datum_out) in enumerate(bigram_data):
                in_X[datum_j] = datum_in
                out_Y[datum_j] = datum_out
            yield in_X, out_Y

    def serve_batch(self, data):
        dataiter = self.serve_sentence(data)
        V = self.vocab_size
        S = self.max_sequence_len
        B = self.batch_size

        while True:  # a generator is always truthy, so loop until the data runs out
            in_X = np.zeros((B, S), dtype=np.int32)
            out_Y = np.zeros((B, S, V), dtype=np.int32)
            next_batch = list(itertools.islice(dataiter, 0, self.batch_size))
            if len(next_batch) < self.batch_size:
                return  # raising StopIteration inside a generator is an error in Python 3.7+
            for d_i, (d_X, d_Y) in enumerate(next_batch):
                in_X[d_i] = d_X
                out_Y[d_i] = to_categorical(d_Y, V)

            yield in_X, out_Y

@kaituoxu
Author

kaituoxu commented May 7, 2016

@braingineer Thank you for your help.
I have trained a model which is quite similar to your architecture:

model = Sequential()
model.add(Embedding(DICT_SIZE, EMBED_SIZE, input_length=MAX_SENTENCE_LEN))
model.add(LSTM(HIDDEN_SIZE, return_sequences=True))
model.add(TimeDistributed(Dense(NUM_CLASS, activation='softmax')))

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

But I got poor results on my test set: a low F1 score (if you haven't heard of it, you can treat it as accuracy) of 41%.
I prepared my training set, including input_X and output_label, in this way:
Each sentence in my training set has a different length, and I don't want to pad my training set to a fixed sentence length, so I concatenate all sentences into one long sequence and then split it every LEN (e.g. 150) words. Python code as follows:

def gen_onefile(indexfile, sen_len=150):
    with open(indexfile, 'r') as inf:
        inputs = []
        for line in inf:
            inputs += [int(num) for num in line.split()]
        ult_in = []
        for i in range(len(inputs) // sen_len):  # integer division (Python 3 safe)
            ult_in.append(inputs[i*sen_len: (i+1)*sen_len])
        inputs = np.array(ult_in).astype(np.int32)
    return inputs

def get_X_y(indexfile, labelfile, sen_len=150):
    inputs = gen_onefile(indexfile, sen_len)
    labels = gen_onefile(labelfile, sen_len)
    assert len(inputs) == len(labels), "not equal"
    return inputs, labels

indexfile like: X = [ [123, 2, 3], [4, 5, 22, 10, 2], [1, 5] ]
labelfile like: y = [ [0, 0, 2], [0, 1, 0, 0, 2], [0, 1] ], and I use to_categorical to expand each label, e.g. 2 to [0 0 1].

inputs will be [[123, 2, 3], [4, 5, 22], [10, 2, 1]]
labels will be [[[1 0 0] [1 0 0] [0 0 1]], [[1 0 0] [0 1 0] [ 1 0 0]], [[1 0 0] [0 0 1] [1 0 0]]]

Have I done something wrong somewhere?
How can I fix it?

@braingineer
Contributor

braingineer commented May 7, 2016

I got a low F1 score (if you haven't heard of it, you can treat it as accuracy) of 41%

lol. I'm in academia. i hear about it too often.

don't want to pad

why? the method you described doesn't make any sense.

what you're basically telling your learner by concatenating all of your data is "hey, sequences that are 150 words long and run into each other are important, so learn to predict this". but there are no real patterns that follow that, or at least there aren't until you get infinite data.

padding gives you a way to input sentences so that they don't confuse each other. Padding does not affect you statistically. internally, masked elements are ignored (in the RNNs, they are passed on, and in the loss measurements, they are used to adjust the loss). Though, there is some care to be taken if you calculate perplexity. I've written about it on here, but I haven't had time to submit a proper PR for it.
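
For instance (a minimal editorial sketch, not from the original comment), padding to a fixed length with Keras' built-in helper, so that 0 can be reserved as the mask index:

from keras.preprocessing.sequence import pad_sequences

X = [[123, 2, 3], [4, 5, 22, 10, 2], [1, 5]]
# pad (with 0) or truncate every sequence to the same length; the 0s are then
# ignored downstream via mask_zero=True in the Embedding layer
X_pad = pad_sequences(X, maxlen=5, padding='post')
# [[123   2   3   0   0]
#  [  4   5  22  10   2]
#  [  1   5   0   0   0]]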

As far as making your model better: it's a very weak model. That may also be the issue. Though, if your data is simple enough, it shouldn't be too much of a problem. I made a simple model with Keras a couple of weeks ago and got very near to the state of the art on the language modeling version of the WSJ corpus (sections 0-20 for training, 21-22 for dev, 23-24 for test). I've pasted it below. I also initialized my embeddings with 300d GloVe (you can get these from the Stanford website).

        B = self.igor.batch_size
        R = self.igor.rnn_size
        S = self.igor.max_sequence_len
        V = self.igor.vocab_size
        E = self.igor.embedding_size
        emb_W = self.igor.embeddings.astype(theano.config.floatX)

        ## dropout parameters
        p_emb = self.igor.p_emb_dropout
        p_W = self.igor.p_W_dropout
        p_U = self.igor.p_U_dropout
        p_dense = self.igor.p_dense_dropout
        w_decay = self.igor.weight_decay



        M = Sequential()
        M.add(Embedding(V, E, batch_input_shape=(B,S), 
                        W_regularizer=l2(w_decay),
                        weights=[emb_W], mask_zero=True, dropout=p_emb))

        #for i in range(self.igor.num_lstms):
        M.add(LSTM(R, return_sequences=True, dropout_W=p_W, dropout_U=p_U, 
                      U_regularizer=l2(w_decay), W_regularizer=l2(w_decay)))

        M.add(Dropout(p_dense))

        M.add(LSTM(R*int(1/p_dense), return_sequences=True, dropout_W=p_W, dropout_U=p_U))

        M.add(Dropout(p_dense))

        M.add(TimeDistributed(Dense(V, activation='softmax', 
                                       W_regularizer=l2(w_decay), b_regularizer=l2(w_decay))))



        print("compiling")
        optimizer = Adam(self.igor.LR, clipnorm=self.igor.max_grad_norm, 
                                       clipvalue=5.0)
        #optimizer = SGD(lr=0.01, momentum=0.5, decay=0.0, nesterov=True)
        M.compile(loss='categorical_crossentropy', optimizer=optimizer, 
                                                   metrics=['accuracy', 'perplexity'])
        print("compiled")
        self.model = M

@kaituoxu
Author

kaituoxu commented May 7, 2016

@braingineer Your metaphor is very interesting and thank you for your explanation.

internally, masked elements are ignored (in the RNNs, they are passed on, and in the loss measurements, they are used to adjust the loss).

I can't figure out whether the masked elements (i.e. 0) are ignored or not.

Actually, before I used the method I mentioned above, I had trained another model in which I padded or truncated every sentence in my training set to a fixed length (e.g. 150) with sequence.pad_sequences(X_train, maxlen=150), and I built that model using mask_zero=True as follows.

model.add(Embedding(DICT_SIZE, EMBED_SIZE, input_length=MAX_SENTENCE_LEN, mask_zero=True))

However, I feel cheated by the high accuracy in the training stage, because I subsequently got a low F1 score of 41% on my test set. (The high accuracy may be because my training labels are {0, 1, 2} and most of them, originally 85%, are 0. Padding adds even more 0s to my label set, so I got 95% accuracy. Meanwhile, my inputs start from 1, so I thought the 0 inputs would be masked and the added 0 labels would be masked too, but it seems like nothing is being masked.) Besides, the test stage is a little complex because I must handle the padded and truncated sentences carefully.

I think it is very important to prepare the input and output labels of the LSTM correctly.
Can you share the code you use to prepare your input and output?
It would be even better if you could describe your original data format and how you prepare it for the Keras LSTM model in detail.

@jllombart

You can try using 0 only for padding and transforming your labels by adding one to each of them. I mean: your label 0 becomes 1, your label 1 becomes 2, and your label 2 becomes 3. Then you can recover the originals by subtracting 1. I think this will make your accuracy more representative of the problem; right now I suspect all your 0 labels are being ignored.
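
A minimal sketch of that label shift (purely illustrative, using the example sequences from the top of the thread):

import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

y = [[0, 0, 2], [0, 1, 0, 0, 2], [0, 1]]
y_shift = [[label + 1 for label in seq] for seq in y]     # labels become 1..3; 0 is reserved for padding
y_pad = pad_sequences(y_shift, maxlen=5, padding='post')  # padded slots stay 0
Y = np.array([to_categorical(seq, 4) for seq in y_pad])   # one-hot with NUM_CLASS + 1 classes (class 0 = padding)
# at prediction time, subtract 1 from the argmax to recover the original labels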

@braingineer
Contributor

I can't figure out whether the masked elements (i.e. 0) are ignored or not.

they are. see the code here. the calculations there may appear off a bit, but if you work through the math, you'll notice that the calculations come out correct.

my first post is exactly how I pad my sequences. I create a matrix/tensor of zeros that's the max size and then fill it in with each datapoint accordingly.

@jllombart is also correct about the labeling thing. You should reserve the 0 label as the mask.

so, for instance, if you're making your word-to-index dictionary:

word2idx = {'<MASK>': 0}
word2idx.update({word:i for i,word in enumerate(set(words_in_my_dataset), 1)})
idx2word = {i:word for word,i in word2idx.items()}

This would work. I personally have a class written for this. You can see it here.

@kaituoxu
Author

@braingineer @jllombart thanks for your responses, you have been a great help.

My task is punctuation prediction, and I treat it as a sequence labeling task.
It seems that I have got the Keras LSTM running correctly:

  1. I tag each word in the sentence with a number representing whether there is a punctuation mark following this word or not, e.g. 0 is no punctuation, 1 is a comma, 2 is a period.
e.g.:
text---->hi, github and keras.
onehot index----> [ 10 8 7 6]
label ---->[ 1 0 0 2 ]

Then I get a bad F1 score: 41%.
  2. I tag each word in the sentence with a number representing whether there is a punctuation mark prior to this word or not, e.g. 0 is no punctuation, 1 is a comma, 2 is a period, and I add <END> to the end of each sentence.

e.g.:
text---->hi, github and keras.
expand---->hi, github and keras. <END>
onehot index----> [ 10 8 7 6 100001]
label ---->[ 0 1 0 0 2 ]

Then I get an F1 score of 69.62%.

This is very interesting, right?

I wrote a BLSTM model as follows:

model1 = Sequential()
model1.add(Embedding(DICT_SIZE, EMBED_SIZE, input_shape=(1,)))
model1.add(LSTM(HIDDEN_SIZE, return_sequences=True))

model2 = Sequential()
model2.add(Embedding(DICT_SIZE, EMBED_SIZE, input_shape=(1,)))
model2.add(LSTM(HIDDEN_SIZE, return_sequences=True, go_backwards=True))

model = Sequential()
model.add(Merge([model1, model2], mode='concat'))
model.add(TimeDistributed(Dense(NUM_CLASS, activation='softmax')))

But I only get an F1 score of 69.52%, which is less than the LSTM's F1 score.

I'm wondering whether this model is right or not.
If you have any idea, please let me know. Thank you.
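
For reference, a minimal sketch (an editorial illustration, not code from this thread) of a bidirectional setup using the Bidirectional wrapper available in later Keras releases. One thing to check in the hand-rolled version above: with go_backwards=True and return_sequences=True, the backward LSTM emits its outputs in reversed time order, so concatenating them directly with the forward outputs may misalign the timesteps (the wrapper re-reverses the backward pass for you). Also note the Embedding layers above use input_shape=(1,), i.e. length-1 sequences, which may not be what you want for per-timestep labeling.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, TimeDistributed, Bidirectional

model = Sequential()
model.add(Embedding(DICT_SIZE, EMBED_SIZE, input_length=MAX_SENTENCE_LEN, mask_zero=True))
# one shared embedding feeding a forward and a backward LSTM, concatenated per timestep
model.add(Bidirectional(LSTM(HIDDEN_SIZE, return_sequences=True), merge_mode='concat'))
model.add(TimeDistributed(Dense(NUM_CLASS, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])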

@braingineer
Contributor

text---->hi, github and keras.
expand---->hi, github and keras. <END>
onehot index----> [ 10 8 7 6 100001]
label ---->[ 0 1 0 0 2 ]

This is very interesting, right?

You either have a typo here or in your setup. You have a 2 label on the <END> symbol. The LSTM would arbitrarily learn to predict a period for the <END> symbol, then. the BLSTM wouldn't be any better because it could, at best, only learn that the FRONT of the backwards sequence is a period.

Shouldn't it be
0 1 0 2 0?

Is there prior work on this topic? What is state of the art?

@viksit

viksit commented May 11, 2016

@braingineer earlier in this conversation you mentioned this was a weak model - what exactly do you mean by that?

Also - what kind of parameters (i.e. values) were you using to initialize the model you presented above (self.igor.*)? And what was the main motivation for picking the ones you did?

Thanks!

@braingineer
Contributor

braingineer commented May 12, 2016

Hi viksit,

I meant that it was fairly shallow. The model had Embedding -> LSTM -> Dense predictions.

I've found that

  • Embedding -> LSTM -> Dense -> Dense, or
  • Embedding -> LSTM -> LSTM -> Dense -> Dense
    work pretty well.

Also, having dropout between the LSTMs as in this paper works very well too. Finally, employing dropout on your Embedding (it is one of the constructor parameters) and on the LSTM matrices (also in the constructor) has worked well for me in the past, and has some good justifications (by the way, everything in that last link has already been integrated into Keras).

Especially the Embedding dropout, by the way. I've found major differences with and without it.
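
For illustration, a rough sketch of the second stack above, written in the Keras 1.x API used elsewhere in this thread (layer sizes and dropout rates are placeholders, not recommendations):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, TimeDistributed, Dropout

model = Sequential()
model.add(Embedding(vocab_size, 300, mask_zero=True, dropout=0.5))          # embedding dropout
model.add(LSTM(512, return_sequences=True, dropout_W=0.5, dropout_U=0.5))   # dropout on the LSTM matrices
model.add(Dropout(0.5))                                                     # dropout between the LSTMs
model.add(LSTM(512, return_sequences=True, dropout_W=0.5, dropout_U=0.5))
model.add(Dropout(0.5))
model.add(TimeDistributed(Dense(512, activation='relu')))
model.add(TimeDistributed(Dense(num_classes, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])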

For parameters on the earlier model, I had:

num_epochs: 1500
max_grad_norm: 10
LR: 0.0005
###################
## model parameters
###################
embedding_size: 300
rnn_size: 368
batch_size: 32
p_emb_dropout: 0.5
p_W_dropout: 0.5
p_U_dropout: 0.5
p_dense_dropout: 0.5
weight_decay: 1e-8

edit: woops. thought you were op. edited for pronoun.

@viksit

viksit commented May 15, 2016

@braingineer ah, thanks for the pointer. Regarding your empirical finding that those two kinds of models are effective - do you mean for many-to-many sequences specifically, or for any text-based model, including ones that use a many-to-one prediction (e.g. classification)?

Have you found a difference between using pre-initialized embeddings such as GloVe directly as input to an LSTM, vs using them as initial weights in a Keras Embedding layer?

Also - you talk about masking above. I've got a fundamental question - when/where are masking layers effective, and why use them vs, say, padding?

@liangmin0020

"Embedding" is just for input of the type of "int"? how can i pad the input if i want to use float vectors as input, such as word vector?

@braingineer
Contributor

@viksit

re: performance
it's just about capacity. you're trying to encode the regularities in the data. can they be encoded with the amount of representational space you provide, or do you need to increase it? when a vocabulary is on the order of 10s of 1000s, you need a larger representational capacity. Especially when it gets noisy. For instance, the PTB data that language modeling people use has some things abstracted away: numbers and other messy bits are replaced, everything is lowercase, and most punctuation is removed. I'm training a language model on the same data source, the Penn Treebank, but I'm linearizing it myself rather than using the super clean version previously mentioned. Because of the nature of my problem, I am leaving in punctuation and I'm being less vigorous about replacing things like numbers. Because of this, my model is only able to get to about 24% accuracy and 118 perplexity on the dev set. The exact same model reached 33% accuracy and 78 perplexity on the super clean PTB. So, this is a difference in noise. If I increased the size of my hidden layers, it might be able to learn more because the representational capacity would be increased.

Though, there is a trade-off. If you increase it too much, then the number of parameters is too large for your data and the model will thrash, because there's not enough evidence and there are too many "free variables", representation-wise. I find that this happens if I increase the number of stacked LSTMs. So basically, you want to give your model the right amount of representational complexity and capacity. Too complex, and the optimizer can't find a good gradient to follow given the data it's being fed. Too simple, and the limited capacity means it has to generalize over things that were important in order to minimize its loss.

If you're doing a simple problem, then these considerations play out differently because you're asking less of the sequence model. But basically, you should just try different configurations and get a feel for what works. the cs231n class has a great page on this.

re: pre-initialized embeddings.
totally. colloquial reports say that GloVe works best compared to the Levy et al. or word2vec embeddings. The reason probably has to do with the similarity between the objective used for learning the embeddings and your task. embeddings have also been called the 'sriracha' of NLP: whatever you add them to gets a little better.

re: masking and padding.
Masking is padding. Say you have two sequences, of length 7 and 10. You check your dataset and indeed, 10 is the longest you will ever see. So, you make all of your inputs length 10. But for those that aren't actually length 10, like the sequence of length 7, you just leave the rest of the 10-length array as zeros (you can view this either as taking a 7-length array and padding it with 0s to length 10, or as taking a 10-length array and setting only the first 7 values. I've found the second way to be conceptually simpler).

The sequences are all made the same length because you want to put them into the same tensor, so they can all be fed into your model. But when the model is computing things, it is agnostic to which positions were padded. So it'll go ahead and produce values for the spots you padded. To counteract this, you construct a mask that puts your data back into the correct form. For the two sequences of length 7 and 10, it will do nothing to the length-10 sequence, but a good mask should zero out the last 3 values of the length-7 sequence.

This is important, because if you're measuring the performance of your algorithm as a classification at each time step and you don't inform it, via a mask, that the last 3 values of the length-7 sequence are not real values, then it will check the predictions made there and accumulate them into the metric.
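
For illustration (an editorial sketch of one explicit way to do this, not necessarily how the internal masking works), padded timesteps can also be zero-weighted in the loss and metrics with temporal sample weights:

import numpy as np

model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'], sample_weight_mode='temporal')

# X_pad is (batch, maxlen) with 0 marking the padded positions
sample_weight = (X_pad != 0).astype('float32')   # 1.0 for real tokens, 0.0 for pads
model.fit(X_pad, Y, sample_weight=sample_weight, batch_size=32, nb_epoch=10)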

@braingineer
Contributor

@liangmin0020

an embedding is a weight matrix where the rows represent discrete items in your input data. it is a trick to convert discrete items into continuous representations. if you already have continuous representations, then you have no need for an embedding matrix.

though, to view the problem as fitting representations to the loss function for your task, you can increase flexibility by adding a dense matrix on top of your existing continuous representations. This allows the network to adjust the representations to better fit the loss function.
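
A minimal sketch for the float-vector case (an editorial illustration with hypothetical names, not code from this thread): pad the 3D input with all-zero vectors and mask them with a Masking layer instead of Embedding(mask_zero=True):

from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense, TimeDistributed

model = Sequential()
model.add(Masking(mask_value=0., input_shape=(max_len, word_vec_dim)))  # all-zero timesteps are masked out
# model.add(TimeDistributed(Dense(128)))  # optional trainable projection on top of the fixed vectors, as suggested above
model.add(LSTM(128, return_sequences=True))
model.add(TimeDistributed(Dense(num_classes, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam')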

@kaituoxu
Author

@braingineer
I'm sorry for responding to you so late.
Thank you anyway.
There is no prior work on punctuation detection using LSTMs in my language.
Maybe I chose the wrong metric to evaluate my model.

@viksit

viksit commented May 18, 2016

@braingineer thanks for the super comprehensive answer! Some follow ups,

re: the embeddings - that makes sense. My question hinged on a slightly different note - when using embeddings, you have two ways of doing things:

  • Rather than use an embedding layer, you simply use GloVe embeddings (which have shown me better performance than plain Google word2vec as well) and, say, feed them directly into an LSTM.
  • The second option is to use an embedding layer, where you pre-initialize the weights with GloVe and then update them on your dataset before feeding into said LSTM.

Personally, I've found that the second option performs better -- but I was curious whether you've seen similar results in your own experiments.
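
For illustration, a sketch of how the two options can be expressed (glove_matrix is a hypothetical (vocab_size, 300) numpy array of pre-trained GloVe vectors; depending on the Keras version, the trainable flag is set in the constructor or on the layer afterwards):

from keras.layers import Embedding

# option 1: keep the pre-trained vectors fixed
fixed_emb = Embedding(vocab_size, 300, weights=[glove_matrix], trainable=False)

# option 2: use them only as initialization and fine-tune them on your data
tuned_emb = Embedding(vocab_size, 300, weights=[glove_matrix], trainable=True)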

Re: your other point about clean datasets -- that makes sense as well. Do you actually think it is better for the system to tokenize, clean punctuation, etc. before training, as well as at the prediction stage? Or do you try to take those features into account as well? I guess it would depend on what you're trying to do with the architecture - classification vs generation, of course.

Lastly - intuitively, the size of an LSTM should depend purely on the maxlen of the sequence rather than on the size of the dataset (100s of examples vs 10000s). But if you're using stateful LSTMs, I have a suspicion that the size of the dataset would matter too.

@jayinai

jayinai commented Sep 29, 2016

@kaituoxu @braingineer @jllombart @viksit @liangmin0020

thanks for some great discussion! I am implementing an LSTM POS tagger but still cannot get it to work. Here is my situation:

X_pad.shape = (M, N)
y_pad.shape = (M, N)

where M is the number of sentences in the corpus (18421), and N is the padded sentence length (the original lengths vary from 15 to 140, so in this case N=140)

Here is how I initialized the model:

  model = Sequential()

  # first embedding layer
  model.add(Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=N, weights=[embedding_matrix]))

  # hidden layer
  model.add(LSTM(output_dim=hidden_dim, return_sequences=True))

  # output layer
  model.add(TimeDistributed(Dense(num_class, activation='softmax')))

  # compile
  model.compile(loss='categorical_crossentropy', optimizer='adam')

The error I get is:

Exception: Error when checking model target: expected timedistributed_1 to have 3 dimensions, but got array with shape (18421, 140)

I've been stuck here for a while. Any suggestions are appreciated!
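
One likely fix (an assumption based on the error message, in the spirit of the serve_batch example earlier in the thread): the TimeDistributed softmax expects 3D targets of shape (M, N, num_class), so the padded integer labels need to be one-hot encoded per timestep:

import numpy as np
from keras.utils.np_utils import to_categorical

Y = np.zeros((y_pad.shape[0], y_pad.shape[1], num_class), dtype=np.int32)
for i, seq in enumerate(y_pad):
    Y[i] = to_categorical(seq, num_class)   # (N,) integer labels -> (N, num_class) one-hot rows

model.fit(X_pad, Y, batch_size=32, nb_epoch=5)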

@Reihan-amn

I am kind of new to LSTM configuration.
I have sequences of words, and each word is a vector of floating point numbers (word2vec).
So my data is 3D: data = [sequences = [words = vectors of numbers]]

Could you please let me know how to reshape my data for use in an LSTM?

inputx: 10000 = number of sequences, 5 = max sequence length (each element is a vector of numbers),
50 = length of each point in the sequence.

Example of one sequence:
[[0, 0, 1, 1.3], [6, 3, 1, 1.5], [6, 4, 1.4, 4.5]] -> [1, 3, 4]

@ylmeng

ylmeng commented Aug 3, 2017

@braingineer In your code, is the dense function (the set of weights between the LSTM and Dense layers) time-variant? If I understand it correctly, you would be training a separate function for each time step, and as a result the labeling function F(X, t) would depend on the time variable too, not just on the context of the input. In many cases, I think we want the function to be time-invariant (depending on the context, but not on its position in the sequence).
Or maybe TimeDistributed(Dense(...)) actually uses time-invariant weights? I may need to look at the source code.
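
An editorial note (based on Keras' documented behaviour, not a reply from the thread): TimeDistributed applies the same wrapped layer, with a single shared set of weights, to every timestep, so the mapping is time-invariant; any position dependence comes only from the LSTM state feeding into it. A quick way to check is that the wrapped Dense holds one weight matrix with no time dimension:

from keras.models import Sequential
from keras.layers import Dense, TimeDistributed

m = Sequential()
m.add(TimeDistributed(Dense(8), input_shape=(150, 32)))   # 150 timesteps, 32 features each
print([w.shape for w in m.layers[0].get_weights()])       # [(32, 8), (8,)] -- shared across all 150 steps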

@doofin

doofin commented Mar 15, 2018

Maybe you need connectionist temporal classification (CTC), which can be used, for example, for end-to-end handwriting OCR.

@priyanksonis

@kaituoxu, can you please share your code? I am working on a similar problem.

@pasan9

pasan9 commented Mar 22, 2019

Hi! I have a situation where my inputs are just like the example given:

X = [ [123, 2, 3], [4, 5, 22, 10, 2], [1, 5] ]
y = [ [0, 0, 2], [0, 1, 0, 0, 2], [0, 1] ]

They are just integers; no words are involved. So is there a simple way to implement this without an embedding?
Unique numbers in the input - 49
Number of classes - 150
The class is decided by the value of the integer and its position in the sequence.
The sequences are of different lengths, so I think padding and masking might help?
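
One possible setup (an editorial sketch, not a definitive answer): an Embedding layer is just an integer-to-vector lookup table, so plain integer inputs work fine with it; alternatively, the 49 values could be one-hot encoded and fed through a Masking layer. Here 0 is reserved for padding, so the input values are assumed to be shifted into the range 1..49:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, TimeDistributed

model = Sequential()
model.add(Embedding(50, 16, mask_zero=True))                   # 49 symbols + padding index 0
model.add(LSTM(64, return_sequences=True))
model.add(TimeDistributed(Dense(150, activation='softmax')))   # 150 classes
model.compile(loss='categorical_crossentropy', optimizer='adam')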
