
Functorch nvfuser revisions #363

Closed: wants to merge 27 commits
Changes from 1 commit

Commits (27)
9474fc3
added nvfuser implementation, benchmark for biasReluDropout
Jul 7, 2022
5ea028e
reformatted fuse pattern
Jul 8, 2022
8453069
revised benchmarking, nvfused patterns
Jul 11, 2022
fdd6b16
adds BiasDropoutRes and BiasDropoutResLayernorm patterns, minor edits
Jul 13, 2022
291f439
unit testing for all fused patterns, minor edits
Jul 19, 2022
5004562
benchmarking for all nvfused patterns
Jul 19, 2022
ea85ea4
mypy wip
Jul 19, 2022
568c09a
benchmarking nvfuser patterns, adding plots, minor testing changes
Jul 22, 2022
7c7f6de
fixing mypy errors
Jul 25, 2022
8c59bb9
fixed benchmarking bug, minor test change
Jul 25, 2022
fd82a43
final benchmark plots, benchmark edits
Jul 25, 2022
bd4499a
nvfuser documentation, minor edits
Jul 26, 2022
b004d87
fixing functorch version error, documentation revisions
Jul 26, 2022
14cc332
Merge branch 'main' into op_fusion_functorch
yuanandonly Jul 26, 2022
9ea013a
fixing circleci functorch errors, mypy errors
Jul 26, 2022
c774755
circleci config wip
Jul 27, 2022
4f18220
circleci test wip
Jul 27, 2022
d5e0765
wip2
Jul 27, 2022
477c208
testing revisions, circleci fixes, minor changes
Jul 27, 2022
7d9d659
changelog changes, fixes functorch flag bug
Jul 27, 2022
339a556
circle-ci fix
Jul 27, 2022
5d8221d
circle-ci spacing fix
Jul 27, 2022
d9199f0
build error wip
Jul 27, 2022
bcf746e
revised documentation, reverted circleci config
Jul 27, 2022
bd5b799
Fix functorch errors, circleci issue, testing changes
yuanandonly Jul 27, 2022
a6f3221
updating changelog
yuanandonly Jul 28, 2022
33431d0
added mlp plots, mlp functionality to switch weights to nvfused mlp
yuanandonly Aug 11, 2022
fixing functorch version error, documentation revisions
Chris Yuan committed Jul 26, 2022
commit b004d871f1ecd5a9df90721a66810daed0f5f29c
6 changes: 3 additions & 3 deletions HOWTO.md
@@ -298,11 +298,11 @@ Note that the pattern here is not that sparse (half of the matrix is empty), the

## AOTAutograd and NVFuser

-AOT Autograd is a toolkit from [FuncTorch](https://pytorch.org/functorch/stable/) can be used to accelerate model training in xFormers. Broadly, it extracts a computational graph of the forward and backward passes of a model ahead of time. This allows for some joint graph optimizations as well as enables deep learning compilers such as [NVFuser](https://github.com/pytorch/pytorch/blob/release/1.12/torch/csrc/jit/codegen/cuda/README.md) to perform operator fusion. The [`memory_efficient_fusion`](https://pytorch.org/functorch/stable/generated/functorch.compile.memory_efficient_fusion.html#functorch.compile.memory_efficient_fusion) wrapper function provides a convenient way to leverage AOTAutograd and NVFuser on GPU.
+AOT Autograd is a toolkit from [FuncTorch](https://pytorch.org/functorch/stable/) which can be used to accelerate model training in xFormers. Broadly, it extracts a computational graph of the forward and backward passes of a model ahead of time. This allows for some joint graph optimizations and enables deep learning compilers such as [NVFuser](https://github.com/pytorch/pytorch/blob/release/1.12/torch/csrc/jit/codegen/cuda/README.md) to perform operator fusion. The [`memory_efficient_fusion`](https://pytorch.org/functorch/stable/generated/functorch.compile.memory_efficient_fusion.html#functorch.compile.memory_efficient_fusion) wrapper function provides a convenient way to leverage AOTAutograd and NVFuser on GPU.

-XFormers uses `memory_efficient_fusion` to combine sequences of fusable operations together into a single fused function layer. These parts can can be found [here](xformers/components/nvfuser). A notable example is [`NVFusedBiasActivationDropout`](xformers/components/nvfuser/bias_act_dropout.py), which is easily implementable inside the [`MLP`](xformers/components/feedforward/mlp.py) feedforward component.
+XFormers uses `memory_efficient_fusion` to combine sequences of fusable operations together into single fused function layers. These parts can be found [here](xformers/components/nvfuser). A notable example is [`NVFusedBiasActivationDropout`](xformers/components/nvfuser/bias_act_dropout.py), which is readily used inside the [`MLP`](xformers/components/feedforward/mlp.py) feedforward component.
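For intuition, the elementwise chain that a fused layer like `NVFusedBiasActivationDropout` collapses into one kernel can be sketched in plain NumPy. This is only an illustration of the math being fused (with ReLU chosen as the activation), not the actual fused CUDA implementation:

```python
import numpy as np

def bias_relu_dropout(x, bias, p, rng):
    # One logical layer: bias add -> activation -> dropout.
    # Eager execution launches one kernel per op; NVFuser can emit a
    # single fused kernel for this whole elementwise chain.
    z = np.maximum(x + bias, 0.0)       # bias add + ReLU
    mask = rng.random(z.shape) >= p     # dropout keep-mask
    return z * mask / (1.0 - p)         # inverted dropout scaling

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4))
bias = rng.standard_normal(4)
out = bias_relu_dropout(x, bias, p=0.0, rng=rng)  # p=0 makes dropout the identity
assert np.allclose(out, np.maximum(x + bias, 0.0))
```

In the real component, the whole chain is compiled once with `memory_efficient_fusion` and then called like an ordinary layer.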

-A benchmark of these fused patterns across some representative shapes shows significant speed increases compared to the unfused, Pytorch eager approach-- up to 3.5x speedup for the forward pass and 2.2x for the forward and backward passes together. We also see better overall performance against our implementation of fused Bias, Activation, and Dropout using Triton ([see](xformers/triton/dropout.py)) as well. Peak memory usage of fused patterns is also lower on average, although we see some infrequent cases of up to 0.6x higher peak memory usage on larger shapes. Full benchmark plots can be found [here](docs/plots/nvfuser/).
+A benchmark of these fused patterns across some representative shapes shows significant speed increases compared to the unfused PyTorch eager approach: up to a 3.5x speedup for the forward pass and 2.2x for the forward and backward passes together. On average, peak memory usage of the fused patterns is also lower, although we see some infrequent cases of up to 1.6x the PyTorch peak memory usage on larger shapes. We also see better overall performance than our implementation of fused Bias, Activation, and Dropout using Triton ([see](xformers/triton/dropout.py)). Full benchmark plots can be found [here](docs/plots/nvfuser/).

Below is a simple example use case of AOT Autograd.

16 changes: 12 additions & 4 deletions docs/source/tutorials/aotautograd_nvfuser.rst
@@ -1,19 +1,25 @@
AOTAutograd and NVFuser
==========================

-AOT Autograd is a toolkit from FuncTorch_ can be used to accelerate model training in xFormers. Broadly, it extracts a computational graph of the forward and backward passes of a model ahead of time. This allows for some joint graph optimizations as well as enables deep learning compilers such as NVFuser_ to perform operator fusion. The `memory_efficient_fusion`_ wrapper function provides a convenient way to leverage AOTAutograd and NVFuser on GPU.
+AOT Autograd is a toolkit from FuncTorch_ which can be used to accelerate model training in xFormers.
+Broadly, it extracts a computational graph of the forward and backward passes of a model ahead of time.
+This allows for some joint graph optimizations and enables deep learning compilers such as NVFuser_ to perform operator fusion.
+The `memory_efficient_fusion`_ wrapper function provides a convenient way to leverage AOTAutograd and NVFuser on GPU.

.. _FuncTorch: https://pytorch.org/functorch/stable/
.. _NVFuser: https://github.com/pytorch/pytorch/blob/release/1.12/torch/csrc/jit/codegen/cuda/README.md
.. _memory_efficient_fusion: https://pytorch.org/functorch/stable/generated/functorch.compile.memory_efficient_fusion.html#functorch.compile.memory_efficient_fusion
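To make "extracts a computational graph of the forward and backward passes ahead of time" concrete, here is a NumPy sketch of the kind of forward/backward pair such tracing produces for a bias-plus-ReLU chain (dropout omitted for determinism; this illustrates the idea only, not the functorch tracing machinery):

```python
import numpy as np

def forward(x, b):
    # Forward graph: bias add followed by ReLU. z is returned alongside the
    # output, much as AOT Autograd saves intermediates for the backward pass.
    z = x + b
    return np.maximum(z, 0.0), z

def backward(z, grad_out):
    # Backward graph, derived ahead of time from the forward trace.
    grad_z = grad_out * (z > 0)     # ReLU gradient
    grad_x = grad_z                 # bias add passes gradients through
    grad_b = grad_z.sum(axis=0)     # bias gradient reduces over the batch
    return grad_x, grad_b

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
b = rng.standard_normal(3)
y, z = forward(x, b)
grad_x, grad_b = backward(z, np.ones_like(y))

# Sanity-check grad_b[0] against a finite-difference estimate.
eps = 1e-6
b_pert = b.copy()
b_pert[0] += eps
numeric = (forward(x, b_pert)[0].sum() - y.sum()) / eps
assert abs(numeric - grad_b[0]) < 1e-4
```

Because both passes exist as explicit graphs before execution, a compiler such as NVFuser can fuse each of them into a small number of kernels.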

-XFormers uses `memory_efficient_fusion` to combine sequences of fusable operations together into a single fused function layer. These parts can can be found in `xformers/components/nvfuser`_. A notable example is `NVFusedBiasActivationDropout`_, which is easily implementable inside the `MLP`_ feedforward component.
+XFormers uses `memory_efficient_fusion` to combine sequences of fusable operations together into single fused function layers.
+These parts can be found inside `xformers/components/nvfuser`_. A notable example is `NVFusedBiasActivationDropout`_, which is readily used inside the `MLP`_ feedforward component.

.. _xformers/components/nvfuser: https://github.com/facebookresearch/xformers/tree/main/xformers/components/nvfuser
.. _NVFusedBiasActivationDropout: https://github.com/facebookresearch/xformers/blob/main/xformers/components/nvfuser/bias_act_dropout.py
.. _MLP: https://github.com/facebookresearch/xformers/blob/main/xformers/components/feedforward/mlp.py

-A benchmark of these fused patterns across some representative shapes shows significant speed increases compared to the unfused, Pytorch eager approach-- up to 3.5x speedup for the forward pass and 2.2x for the forward and backward passes together. We also see better overall performance against our implementation of fused Bias, Activation, and Dropout using Triton (see_) as well. Peak memory usage of fused patterns is also lower on average, although we see some infrequent cases of up to 0.6x higher peak memory usage on larger shapes. Full benchmark plots can be found here_.
+A benchmark of these fused patterns across some representative shapes shows significant speed increases compared to the unfused PyTorch eager approach: up to a 3.5x speedup for the forward pass and 2.2x for the forward and backward passes together. On average, peak memory usage of the fused patterns is also lower,
+although we see some infrequent cases of up to 1.6x the PyTorch peak memory usage on larger shapes. We also see better overall performance than our implementation of fused Bias,
+Activation, and Dropout using Triton (see_). Full benchmark plots can be found here_.

.. _see: https://github.com/facebookresearch/xformers/blob/main/xformers/triton/dropout.py
.. _here: https://github.com/facebookresearch/xformers/tree/main/docs/plots/nvfuser
@@ -57,4 +63,6 @@ Below is a simple example use case of AOT Autograd.
assert torch.allclose(b.grad, c_b.grad)
assert torch.allclose(c.grad, c_c.grad)

-AOT Autograd offers a great deal a flexibility to the user, as `memory_efficient_fusion` can accept either a Python function or an entire `nn.Module` as input for fusion. Currently in xFormers, however, it is only used with Python function inputs because initial attempts with fusing xFormers layers and blocks have yielded memory issues and other CUDA errors. We are currently exploring further testing and benchmarking.
+AOT Autograd offers a great deal of flexibility to the user, as `memory_efficient_fusion` can accept either a Python function or an entire `nn.Module` as input for fusion.
+Currently in xFormers, however, it is only used with Python function inputs, because initial attempts at fusing xFormers layers and blocks have yielded memory issues and other CUDA errors.
+We are currently exploring further testing and benchmarking.
2 changes: 1 addition & 1 deletion requirements-test.txt
@@ -30,4 +30,4 @@ fairscale >= 0.4.5
triton == 2.0.0.dev20220403

# Dependency for fused layers, optional
-functorch == 0.2.0
+git+https://github.com/pytorch/functorch@v0.2.0