TE integration via full TransformerLayer #1297
Open
This is a sketch of using the attention picking mechanism ("global", "flash", and now also "te") to plug in the high-level TransformerLayer from TransformerEngine. It is more of a prototype to show that integration with DeepSpeed is possible and what performance to expect.
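For context, here is a minimal sketch of what routing the "te" attention setting to TransformerEngine's high-level layer could look like. `build_layer` and its argument names are hypothetical, not the identifiers used in this PR; the `TransformerLayer` arguments are the standard ones from `transformer_engine.pytorch`.

```python
# Illustrative sketch only: dispatching a "te" attention setting to
# TransformerEngine's high-level TransformerLayer. `build_layer` is a
# hypothetical helper, not code from this PR.
import torch
import transformer_engine.pytorch as te


def build_layer(attention_type, hidden_size, ffn_hidden_size, num_heads, layer_number):
    if attention_type == "te":
        # A single TE TransformerLayer replaces the whole LN -> attention -> LN -> MLP block.
        return te.TransformerLayer(
            hidden_size,
            ffn_hidden_size,
            num_heads,
            layer_number=layer_number,
            self_attn_mask_type="causal",
        )
    # "global" / "flash" would fall through to the existing layer implementation.
    raise NotImplementedError(attention_type)


layer = build_layer("te", 1024, 4096, 16, 1).cuda()
x = torch.randn(2048, 2, 1024, device="cuda")  # [seq, batch, hidden], TE's default sbhd layout
y = layer(x)  # output has the same shape as the input
```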
Things that work:
Many aspects are hardcoded; e.g. RoPE and activation checkpointing cannot be reconfigured from the config files. #1282 is much more elaborate in that it exposes TE layers at a much lower level. In the meantime, this PR could serve as a benchmark showing what is possible with TE on a classic GPT-2-style network.
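To illustrate the hardcoded RoPE wiring, a hedged sketch of how rotary embeddings are typically fed to the TE layer: `RotaryPositionEmbedding` and the `rotary_pos_emb` forward argument come from `transformer_engine.pytorch`, but the exact module path and keyword names can differ between TE versions.

```python
# Sketch under the assumption that RoPE frequencies are generated once and
# passed to every layer; names/paths may differ across TransformerEngine versions.
import torch
import transformer_engine.pytorch as te
from transformer_engine.pytorch.attention import RotaryPositionEmbedding

hidden_size, num_heads, seq_len = 1024, 16, 2048

rope = RotaryPositionEmbedding(hidden_size // num_heads)
rope_freqs = rope(seq_len).to("cuda")  # precomputed once, reused by all layers

layer = te.TransformerLayer(
    hidden_size, 4096, num_heads, self_attn_mask_type="causal"
).cuda()
x = torch.randn(seq_len, 2, hidden_size, device="cuda")  # [seq, batch, hidden]
y = layer(x, rotary_pos_emb=rope_freqs)
```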
I kept the implementation as minimal as possible, so there is room for further performance depending on the workload, e.g. through sequence parallelism and different memory layouts.
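As a rough indication of the knobs left on the table, a hedged sketch of the input-layout option; `attn_input_format` exists in newer TransformerEngine releases (older ones only accept the default `"sbhd"`), and enabling sequence parallelism would additionally need `sequence_parallel=True`, `set_parallel_mode=True`, and a tensor-parallel process group via `tp_group`.

```python
# Sketch of switching the activation memory layout; not part of this PR.
# Availability of `attn_input_format` depends on the TransformerEngine version.
import torch
import transformer_engine.pytorch as te

layer = te.TransformerLayer(
    1024, 4096, 16,
    self_attn_mask_type="causal",
    attn_input_format="bshd",  # batch-first activations instead of [seq, batch, hidden]
).cuda()

x = torch.randn(2, 2048, 1024, device="cuda")  # [batch, seq, hidden]
y = layer(x)
```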
The Dockerfile now uses a later NGC PyTorch container and installs a later DeepSpeed tag from source for compatibility.