
Functorch nvfuser revisions #363

Closed: wants to merge 27 commits
Changes from 1 commit

Commits (27)
9474fc3
added nvfuser implementation, benchmark for biasReluDropout
Jul 7, 2022
5ea028e
reformatted fuse pattern
Jul 8, 2022
8453069
revised benchmarking, nvfused patterns
Jul 11, 2022
fdd6b16
adds BiasDropoutRes and BiasDropoutResLayernorm patterns, minor edits
Jul 13, 2022
291f439
unit testing for all fused patterns, minor edits
Jul 19, 2022
5004562
benchmarking for all nvfused patterns
Jul 19, 2022
ea85ea4
mypy wip
Jul 19, 2022
568c09a
benchmarking nvfuser patterns, adding plots, minor testing changes
Jul 22, 2022
7c7f6de
fixing mypy errors
Jul 25, 2022
8c59bb9
fixed benchmarking bug, minor test change
Jul 25, 2022
fd82a43
final benchmark plots, benchmark edits
Jul 25, 2022
bd4499a
nvfuser documentation, minor edits
Jul 26, 2022
b004d87
fixing functorch version error, documentation revisions
Jul 26, 2022
14cc332
Merge branch 'main' into op_fusion_functorch
yuanandonly Jul 26, 2022
9ea013a
fixing circleci functorch errors, mypy errors
Jul 26, 2022
c774755
circleci config wip
Jul 27, 2022
4f18220
circleci test wip
Jul 27, 2022
d5e0765
wip2
Jul 27, 2022
477c208
testing revisions, circleci fixes, minor changes
Jul 27, 2022
7d9d659
changelog changes, fixes functorch flag bug
Jul 27, 2022
339a556
circle-ci fix
Jul 27, 2022
5d8221d
circle-ci spacing fix
Jul 27, 2022
d9199f0
build error wip
Jul 27, 2022
bcf746e
revised documentation, reverted circleci config
Jul 27, 2022
bd5b799
Fix functorch errors, circleci issue, testing changes
yuanandonly Jul 27, 2022
a6f3221
updating changelog
yuanandonly Jul 28, 2022
33431d0
added mlp plots, mlp functionality to switch weights to nvfused mlp
yuanandonly Aug 11, 2022
fixing functorch version error, documentation revisions
Chris Yuan committed Jul 26, 2022
commit b004d871f1ecd5a9df90721a66810daed0f5f29c
6 changes: 3 additions & 3 deletions HOWTO.md
@@ -298,11 +298,11 @@ Note that the pattern here is not that sparse (half of the matrix is empty), the

## AOTAutograd and NVFuser

-AOT Autograd is a toolkit from [FuncTorch](https://pytorch.org/functorch/stable/) can be used to accelerate model training in xFormers. Broadly, it extracts a computational graph of the forward and backward passes of a model ahead of time. This allows for some joint graph optimizations as well as enables deep learning compilers such as [NVFuser](https://github.com/pytorch/pytorch/blob/release/1.12/torch/csrc/jit/codegen/cuda/README.md) to perform operator fusion. The [`memory_efficient_fusion`](https://pytorch.org/functorch/stable/generated/functorch.compile.memory_efficient_fusion.html#functorch.compile.memory_efficient_fusion) wrapper function provides a convenient way to leverage AOTAutograd and NVFuser on GPU.
+AOT Autograd is a toolkit from [FuncTorch](https://pytorch.org/functorch/stable/) which can be used to accelerate model training in xFormers. Broadly, it extracts a computational graph of the forward and backward passes of a model ahead of time. This allows for some joint graph optimizations and enables deep learning compilers such as [NVFuser](https://github.com/pytorch/pytorch/blob/release/1.12/torch/csrc/jit/codegen/cuda/README.md) to perform operator fusion. The [`memory_efficient_fusion`](https://pytorch.org/functorch/stable/generated/functorch.compile.memory_efficient_fusion.html#functorch.compile.memory_efficient_fusion) wrapper function provides a convenient way to leverage AOTAutograd and NVFuser on GPU.

-XFormers uses `memory_efficient_fusion` to combine sequences of fusable operations together into a single fused function layer. These parts can can be found [here](xformers/components/nvfuser). A notable example is [`NVFusedBiasActivationDropout`](xformers/components/nvfuser/bias_act_dropout.py), which is easily implementable inside the [`MLP`](xformers/components/feedforward/mlp.py) feedforward component.
+XFormers uses `memory_efficient_fusion` to combine sequences of fusable operations together into single fused function layers. These parts can be found [here](xformers/components/nvfuser). A notable example is [`NVFusedBiasActivationDropout`](xformers/components/nvfuser/bias_act_dropout.py), which is readily used inside the [`MLP`](xformers/components/feedforward/mlp.py) feedforward component.
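For intuition, the elementwise chain that a fused layer like `NVFusedBiasActivationDropout` collapses into one kernel can be sketched in plain NumPy. This is only an illustration of the math being fused (with ReLU chosen as the activation), not the actual fused CUDA implementation:

```python
import numpy as np

def bias_relu_dropout(x, bias, p, rng):
    # One logical layer: bias add -> activation -> dropout.
    # Eager execution launches one kernel per op; NVFuser can emit a
    # single fused kernel for this whole elementwise chain.
    z = np.maximum(x + bias, 0.0)       # bias add + ReLU
    mask = rng.random(z.shape) >= p     # dropout keep-mask
    return z * mask / (1.0 - p)         # inverted dropout scaling

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4))
bias = rng.standard_normal(4)
out = bias_relu_dropout(x, bias, p=0.0, rng=rng)  # p=0 makes dropout the identity
assert np.allclose(out, np.maximum(x + bias, 0.0))
```

In the real component, the whole chain is compiled once with `memory_efficient_fusion` and then called like an ordinary layer.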

-A benchmark of these fused patterns across some representative shapes shows significant speed increases compared to the unfused, Pytorch eager approach-- up to 3.5x speedup for the forward pass and 2.2x for the forward and backward passes together. We also see better overall performance against our implementation of fused Bias, Activation, and Dropout using Triton ([see](xformers/triton/dropout.py)) as well. Peak memory usage of fused patterns is also lower on average, although we see some infrequent cases of up to 0.6x higher peak memory usage on larger shapes. Full benchmark plots can be found [here](docs/plots/nvfuser/).
+A benchmark of these fused patterns across some representative shapes shows significant speed increases compared to the unfused PyTorch eager approach: up to a 3.5x speedup for the forward pass and 2.2x for the forward and backward passes together. On average, peak memory usage of the fused patterns is also lower, although we see some infrequent cases of up to 1.6x the PyTorch peak memory usage on larger shapes. We also see better overall performance than our implementation of fused Bias, Activation, and Dropout using Triton ([see](xformers/triton/dropout.py)). Full benchmark plots can be found [here](docs/plots/nvfuser/).

Below is a simple example use case of AOT Autograd.

16 changes: 12 additions & 4 deletions docs/source/tutorials/aotautograd_nvfuser.rst
@@ -1,19 +1,25 @@
AOTAutograd and NVFuser
==========================

-AOT Autograd is a toolkit from FuncTorch_ can be used to accelerate model training in xFormers. Broadly, it extracts a computational graph of the forward and backward passes of a model ahead of time. This allows for some joint graph optimizations as well as enables deep learning compilers such as NVFuser_ to perform operator fusion. The `memory_efficient_fusion`_ wrapper function provides a convenient way to leverage AOTAutograd and NVFuser on GPU.
+AOT Autograd is a toolkit from FuncTorch_ which can be used to accelerate model training in xFormers.
+Broadly, it extracts a computational graph of the forward and backward passes of a model ahead of time.
+This allows for some joint graph optimizations and enables deep learning compilers such as NVFuser_ to perform operator fusion.
+The `memory_efficient_fusion`_ wrapper function provides a convenient way to leverage AOTAutograd and NVFuser on GPU.

.. _FuncTorch: https://pytorch.org/functorch/stable/
.. _NVFuser: https://github.com/pytorch/pytorch/blob/release/1.12/torch/csrc/jit/codegen/cuda/README.md
.. _memory_efficient_fusion: https://pytorch.org/functorch/stable/generated/functorch.compile.memory_efficient_fusion.html#functorch.compile.memory_efficient_fusion
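To make "extracts a computational graph of the forward and backward passes ahead of time" concrete, here is a NumPy sketch of the kind of forward/backward pair such tracing produces for a bias-plus-ReLU chain (dropout omitted for determinism; this illustrates the idea only, not the functorch tracing machinery):

```python
import numpy as np

def forward(x, b):
    # Forward graph: bias add followed by ReLU. z is returned alongside the
    # output, much as AOT Autograd saves intermediates for the backward pass.
    z = x + b
    return np.maximum(z, 0.0), z

def backward(z, grad_out):
    # Backward graph, derived ahead of time from the forward trace.
    grad_z = grad_out * (z > 0)     # ReLU gradient
    grad_x = grad_z                 # bias add passes gradients through
    grad_b = grad_z.sum(axis=0)     # bias gradient reduces over the batch
    return grad_x, grad_b

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
b = rng.standard_normal(3)
y, z = forward(x, b)
grad_x, grad_b = backward(z, np.ones_like(y))

# Sanity-check grad_b[0] against a finite-difference estimate.
eps = 1e-6
b_pert = b.copy()
b_pert[0] += eps
numeric = (forward(x, b_pert)[0].sum() - y.sum()) / eps
assert abs(numeric - grad_b[0]) < 1e-4
```

Because both passes exist as explicit graphs before execution, a compiler such as NVFuser can fuse each of them into a small number of kernels.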

-XFormers uses `memory_efficient_fusion` to combine sequences of fusable operations together into a single fused function layer. These parts can can be found in `xformers/components/nvfuser`_. A notable example is `NVFusedBiasActivationDropout`_, which is easily implementable inside the `MLP`_ feedforward component.
+XFormers uses `memory_efficient_fusion` to combine sequences of fusable operations together into single fused function layers.
+These parts can be found inside `xformers/components/nvfuser`_. A notable example is `NVFusedBiasActivationDropout`_, which is readily used inside the `MLP`_ feedforward component.

.. _xformers/components/nvfuser: https://github.com/facebookresearch/xformers/tree/main/xformers/components/nvfuser
.. _NVFusedBiasActivationDropout: https://github.com/facebookresearch/xformers/blob/main/xformers/components/nvfuser/bias_act_dropout.py
.. _MLP: https://github.com/facebookresearch/xformers/blob/main/xformers/components/feedforward/mlp.py

-A benchmark of these fused patterns across some representative shapes shows significant speed increases compared to the unfused, Pytorch eager approach-- up to 3.5x speedup for the forward pass and 2.2x for the forward and backward passes together. We also see better overall performance against our implementation of fused Bias, Activation, and Dropout using Triton (see_) as well. Peak memory usage of fused patterns is also lower on average, although we see some infrequent cases of up to 0.6x higher peak memory usage on larger shapes. Full benchmark plots can be found here_.
+A benchmark of these fused patterns across some representative shapes shows significant speed increases compared to the unfused PyTorch eager approach: up to a 3.5x speedup for the forward pass and 2.2x for the forward and backward passes together. On average, peak memory usage of the fused patterns is also lower,
+although we see some infrequent cases of up to 1.6x the PyTorch peak memory usage on larger shapes. We also see better overall performance than our implementation of fused Bias,
+Activation, and Dropout using Triton (see_). Full benchmark plots can be found here_.

.. _see: https://github.com/facebookresearch/xformers/blob/main/xformers/triton/dropout.py
.. _here: https://github.com/facebookresearch/xformers/tree/main/docs/plots/nvfuser
@@ -57,4 +63,6 @@ Below is a simple example use case of AOT Autograd.
assert torch.allclose(b.grad, c_b.grad)
assert torch.allclose(c.grad, c_c.grad)

-AOT Autograd offers a great deal a flexibility to the user, as `memory_efficient_fusion` can accept either a Python function or an entire `nn.Module` as input for fusion. Currently in xFormers, however, it is only used with Python function inputs because initial attempts with fusing xFormers layers and blocks have yielded memory issues and other CUDA errors. We are currently exploring further testing and benchmarking.
+AOT Autograd offers a great deal of flexibility to the user, as `memory_efficient_fusion` can accept either a Python function or an entire `nn.Module` as input for fusion.
+Currently in xFormers, however, it is only used with Python function inputs, because initial attempts at fusing xFormers layers and blocks have yielded memory issues and other CUDA errors.
+We are currently exploring further testing and benchmarking.
2 changes: 1 addition & 1 deletion requirements-test.txt
@@ -30,4 +30,4 @@ fairscale >= 0.4.5
triton == 2.0.0.dev20220403

# Dependency for fused layers, optional
-functorch == 0.2.0
+git+https://github.com/pytorch/functorch@v0.2.0