This change optimizes the memory usage of `scan` by turning some copies into aliases. Specifically, it optimizes intermediate activations that are aliases of inputs, which is a common occurrence.
In principle, we could have `forward` return all the intermediate activations, including those that are aliases of an input tensor. However, those inputs would then be duplicated as part of the output of a `scan` call, because we want to save all activations during the forward pass of a `scan`. The XLA compiler can't optimize away this duplication, likely because the values are behind a DynamicSlice + DynamicUpdateSlice, so we end up doubling the memory usage from those inputs.
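A minimal sketch of the naive scheme described above, with plain Python floats standing in for tensors; `forward_naive` and the specific activations are illustrative assumptions, not the actual implementation:

```python
def forward_naive(carry, x):
    """One scan step that returns every intermediate activation.

    `hidden` is just an alias of the input `x`, yet it is included in
    the per-step activation output. When scan stacks these outputs
    across steps (via DynamicUpdateSlice), the aliased input is
    materialized a second time, roughly doubling its memory cost.
    """
    hidden = x                 # alias of an input tensor
    y = hidden * 2.0           # genuinely new intermediate value
    activations = (hidden, y)  # `hidden` duplicates the input here
    return carry + y, activations
```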
To reduce memory usage, we instead have `forward` return only the activations that don't alias inputs, called `partial_activations`. The autograd implementation of `scan` then calls `alias_input` to add back the activations that are aliases of input tensors outside of the scan, turning the partial activations back into full activations.