
Keras LSTM not converted correctly when preceding Embedding layer specifies mask_zero=True #1871

Open
q-ycong-p opened this issue Feb 28, 2022 · 17 comments

q-ycong-p commented Feb 28, 2022

Describe the bug
When a tf.keras.layers.Embedding layer with mask_zero=True precedes an LSTM layer, the LSTM is converted into a Loop instead of an LSTM op.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux
  • TensorFlow Version: 2.4.0
  • Python version: 3.8.12

To Reproduce

# minimal example of problematic keras model
import tensorflow as tf
import tf2onnx
import onnx

x = tf.keras.layers.Input(shape=(4,), dtype="int32")
e = tf.keras.layers.Embedding(5, 5, mask_zero=True)(x)
rnn = tf.keras.layers.LSTM(3, return_sequences=True)(e)[0]
model = tf.keras.Model(inputs=x, outputs=rnn)

# converted onnx will have loops instead of LSTM op
onnx_model, _ = tf2onnx.convert.from_keras(model)
onnx.save(onnx_model, "lstm_masking_zero.onnx")
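
As a quick sanity check (an editor's addition, not part of the original report), the node types of the converted graph can be listed to confirm whether an LSTM op or a Loop op was emitted:

# a healthy conversion should contain an LSTM node; the buggy one contains a Loop node
op_types = {node.op_type for node in onnx_model.graph.node}
print("LSTM" in op_types)  # expected True, observed False
print("Loop" in op_types)  # expected False, observed True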

Screenshots
[Screenshot: original Keras model to be converted]

[Screenshot: unsuccessful conversion of the LSTM into loops]

Additional context
I've tried modifying lstm_tf2_rewriter to accommodate the new pattern in the rewriter parsing. Although I can skip the extra SelectV2 pattern and get an LSTM op in the final ONNX model, I am not able to correctly handle the zero-masking information. My attempt produces incorrect inference results if 0 is contained in the input.

Below is my unsuccessful attempt: masking is ignored, which results in incorrect inference. Any suggestion on how masking should be handled? Thank you!

--- a/tf2onnx/rewriter/lstm_tf2_rewriter.py
+++ b/tf2onnx/rewriter/lstm_tf2_rewriter.py
@@ -56,21 +56,22 @@ def rewriter_lstm_tf2(g, ops):
             # extract output h_t
             ht_mul = match_result.get_op("ht")
             final_consumers = g.find_output_consumers(ht_mul.output[0])
-            select_ops = [n for n in final_consumers if n.type == "Select"]
+            select_ops = [n for n in final_consumers if n.type == "Select" or n.type == "SelectV2"]
             def has_tensor_list_consumer(n):
                 return any(c.type == "TensorListSetItem" for c in g.find_output_consumers(n.output[0]))
             select_ops = [n for n in select_ops if has_tensor_list_consumer(n)]
+
+            seq_len_idx = None
             if len(select_ops) == 1:
-                greater_eq = select_ops[0].inputs[0]
-                if greater_eq.type != "GreaterEqual":
-                    continue
-                seq_len = greater_eq.inputs[1]
-                if not seq_len.is_graph_input():
-                    continue
-                seq_len_idx = g.input_names.index(seq_len.output[0])
+                select_op_condition = select_ops[0].inputs[0]
+                if select_op_condition.type == "GreaterEqual": # has sequence length
+                    seq_len = select_op_condition.inputs[1]
+                    if not seq_len.is_graph_input():
+                        continue
+                    seq_len_idx = g.input_names.index(seq_len.output[0])
+                # if select op's condition doesn't come from GreaterEqual, we still extract
+                # output h_t from consumer, and seq_len remains empty
                 final_consumers = g.find_output_consumers(select_ops[0].output[0])
-            else:
-                seq_len_idx = None

Related links:

hwangdeyu added the keras and potential bug labels Mar 1, 2022
@hwangdeyu commented:

Related issue: Keras LSTM converted to loop instead of ONNX LSTM op #1851


q-ycong-p commented Mar 1, 2022

@hwangdeyu The reason it's converted to a loop in this case is that the SelectV2 pattern in the frozen graph is not handled by lstm_tf2_rewriter; the op itself can be skipped to produce an LSTM op (see the attempted fix above). But the key issue is that the SelectV2 is there because of masking, and the ONNX LSTM op does not support masking. If it is simply skipped, model inference will be incorrect for zero-containing inputs. This means that, for tf2onnx to convert to an LSTM op while producing correct masking behavior, the work needed goes beyond simply turning the Loop op into an LSTM op.

As far as I understand, the masking behavior is an intra-op operation. cjermain in this post gave an idea for a solution for padding (not general masking) that is already implemented in keras2onnx; could tf2onnx adopt the same approach?
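
For illustration, a minimal sketch of that padding idea (an editor's reconstruction, not keras2onnx's actual code): with post-padded inputs, each example's effective length can be recovered from the zero entries and wired into the optional sequence_lens input of the ONNX LSTM op, making the Select/SelectV2 masking subgraph unnecessary:

import tensorflow as tf

# hypothetical post-padded batch; 0 is the padding id masked by Embedding(mask_zero=True)
ids = tf.constant([[3, 1, 4, 0], [2, 0, 0, 0]], dtype=tf.int32)
# count the non-padding positions per example
seq_lens = tf.reduce_sum(tf.cast(tf.not_equal(ids, 0), tf.int32), axis=-1)
print(seq_lens.numpy())  # [3 1] -- values a rewriter could feed to LSTM's sequence_lens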

@q-ycong-p commented:

Hi @hwangdeyu and tf2onnx contributors,

keras2onnx has implemented this feature. I tested with the minimal example Keras model shared above: the keras2onnx implementation gives correct inference results on post-padded inputs. It does so by pre-processing the inputs before the LSTM.

Can the same be implemented here to support masking for post-padded inputs?

@hwangdeyu commented:

Thank you so much for such a detailed issue. I am not very familiar with this part; I guess keras2onnx is a good example to fix it. We need to do more investigation. Is there any model blocked by this issue?

hwangdeyu added the contribution welcome and need investigation labels Mar 17, 2022
@q-ycong-p commented:

Hi, thank you for following up. Yes, we have an older model trained with TF 2.4 that converts correctly with keras2onnx. With keras2onnx deprecated and TF upgraded to 2.8, we need to convert the model correctly with tf2onnx. The production model is tricky to share, but the minimal example above should reproduce the issue. Will the community be able to take a look soon?

@q-ycong-p commented:

Hi @hwangdeyu and tf2onnx community, is there any plan on picking up this work soon?

@buddhapuneeth commented:

@hwangdeyu we have an urgent request for this fix (07/01). Please let us know if you are able to pick this up sometime soon.


hwangdeyu commented Jun 17, 2022

> Hi @hwangdeyu and tf2onnx community, is there any plan on picking up this work soon?

Yeah, sorry for the late reply; we will make a plan to fix this soon.

hwangdeyu self-assigned this Jun 21, 2022
@q-ycong-p commented:

Hi @hwangdeyu, could you give us an update? We'll need to plan around the urgent request (07/01) on our end depending on your timeline for this fix. Please let us know.


hwangdeyu commented Jun 29, 2022

We were offsite for several days. We are working on this issue, but progress has been limited so far. I'm not sure it can be solved before July 1st.

@q-ycong-p commented:

@hwangdeyu Thanks for the update. Could you provide us with an ETA?

@q-ycong-p commented:

@hwangdeyu Given the request on our end, we need to work towards a fix too. Could you give us some pointers/direction so that we can contribute/collaborate?


hwangdeyu commented Jun 30, 2022

Just as onnx/onnx#2248 (comment) said, the first type has been implemented in keras2onnx. The second type may be better, but I don't think ONNX will change the LSTM op, since recent new models are less likely to use it. I can't think of a better way so far, so my plan is to sync with my colleague and try the same approach as keras2onnx to support masking for post-padded inputs. And thanks, I will ping you if I find anything that needs your contribution.
By the way, it would be better to have simple reproduction code to verify whether our change gives correct inference results. Actually, there was a Keras CI test showing that a model using mask_zero=True can be converted to ONNX and gives the same inference results as Keras.

@q-ycong-p commented:

Thanks for the pointer. Regarding the reproduction, I think the minimal example shared at the top of this post is sufficiently simple to verify a work-in-progress fix? We can assert that the inference results equal the TF results, and also assert that an LSTM op is produced. It seems the linked test only verifies that the inference results are equal, but doesn't verify that an LSTM op is produced (instead of a Loop)? Maybe I didn't understand your ask correctly?
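
For concreteness, such a test might look like the sketch below (an editor's sketch building on the reproduction at the top; the onnxruntime usage and tolerances are assumptions):

import numpy as np
import onnxruntime as ort

inp = np.array([[1, 2, 0, 0]], dtype=np.int32)  # post-padded input containing zeros
expected = model.predict(inp)                   # Keras reference output

sess = ort.InferenceSession(onnx_model.SerializeToString())
actual = sess.run(None, {sess.get_inputs()[0].name: inp})[0]

# assert inference parity with Keras...
np.testing.assert_allclose(expected, actual, rtol=1e-5, atol=1e-5)
# ...and assert an LSTM op was produced rather than a Loop
assert any(n.op_type == "LSTM" for n in onnx_model.graph.node)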

@hwangdeyu commented:

Hi @q-ycong-p, I have synced with my colleague and done more tests on this. As the CI tests and your example show, the inference results from the converted model's Loop op match the Keras inference results with mask_zero=True, whether or not 0 is contained in the input. It is also hard to support a masking LSTM op in ONNX, so I'm sorry that we won't spend more effort on this. I hope you can understand.

@q-ycong-p commented:

Hi @hwangdeyu, thanks for letting me know. Even if the inference results are correct, the LSTM being converted to a Loop op means important optimizations cannot be taken advantage of at inference time. I'm trying to work towards a fix for the post-padded masking scenario, using the example @cjermain from keras2onnx has provided. I will keep this thread posted and might ask for help if I get stuck.

fatcat-z added the need investigation label and removed the need investigation and potential bug labels Aug 10, 2022
fatcat-z removed the need investigation and keras labels Aug 30, 2022
hwangdeyu assigned q-ycong-p and unassigned hwangdeyu Aug 31, 2022
@AndreyOrb commented:

Hello,
Is there any update/ETA on this issue?
