Fix transformer_moe model has wrong logic in pre/postprocessing #1233
The transformer_moe model has a bug in its pre/postprocessing logic that prevents the training loss from decreasing and makes decoding always generate empty output.
By comparing the logic with transformer.py and common_attention.py, I found that 'dp_postprocess' should receive the input 'x' from before it is passed through 'dp_preprocess', so that the residual wiring matches the transformer model. I changed the logic accordingly in this commit and ran test data to confirm that the training loss decreases and decoding generates correct results (see the sketch below).
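For reference, a minimal sketch of the residual wiring the fix restores. The functions below are stand-ins, not the actual tensor2tensor code; only the call order of dp_preprocess/dp_postprocess follows the description above.

```python
# Minimal sketch (assumed stand-ins, not the actual T2T implementation) of the
# pre/postprocess residual pattern in transformer.py that this fix restores.
import numpy as np

def dp_preprocess(x):
    # Stand-in for layer_preprocess (e.g. layer norm before the sublayer).
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)

def dp_postprocess(x, y):
    # Stand-in for layer_postprocess: residual connection around the sublayer.
    return x + y

def sublayer(x):
    # Stand-in for the attention / MoE / feed-forward sublayer.
    return 0.5 * x

x = np.random.randn(2, 4, 8)

# Buggy wiring: the residual is taken from the *preprocessed* tensor, so the
# normalized value, rather than the running hidden state, is carried forward.
x_pre = dp_preprocess(x)
buggy = dp_postprocess(x_pre, sublayer(x_pre))

# Fixed wiring (matches transformer.py): preprocess feeds only the sublayer,
# while dp_postprocess receives the original x for the residual connection.
fixed = dp_postprocess(x, sublayer(dp_preprocess(x)))
```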
Unit Testing Result:
Before Fix:

After Fix:
