Tensor Rematerialization (a.k.a. DTR/Coop) #9861
Signed-off-by: daquexian <daquexian566@gmail.com>
…s not set so it is treated as 'in_memory' incorrectly
…ameters Signed-off-by: daquexian <daquexian566@gmail.com>
CI failed when running job: cuda-speed-test. PR label automerge has been removed.
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9861/
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.
This pull request introduces substantial changes aimed at enabling recomputation support for tensor operations. The core logic is outlined as follows:

- `Device` attribute: Tensors that support recomputation are now distinguished from those that do not by the device they reside on. Devices like `flow.device("cuda+remat")` are introduced.
- The `remat::Allocator` now incorporates logic to select the tensors with the lowest cost for eviction, optimizing memory layout and eviction strategies.
- `OpCallInstructionUtil::Compute` now implements logic to recompute tensors that were evicted but are subsequently needed.

Additionally, there are various peripheral changes aimed at improving overall functionality.
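To make the evict-then-recompute idea above concrete, here is a minimal pure-Python sketch. It is not the code in this PR: `RematTensor`, `RematPool`, and the scoring formula (recompute cost divided by size times staleness, a DTR-style heuristic) are hypothetical names and assumptions chosen for illustration, and it does not use OneFlow at all.

```python
import itertools


class RematTensor:
    """Toy tensor whose value can be dropped and recomputed from its inputs."""

    def __init__(self, name, fn, inputs, size, compute_cost):
        self.name = name
        self.fn = fn                      # recipe that produces the value
        self.inputs = inputs              # parent RematTensor objects
        self.size = size                  # memory footprint in toy units
        self.compute_cost = compute_cost  # cost of running fn again
        self.value = None                 # None means "evicted"
        self.last_access = 0
        self.compute_count = 0


class RematPool:
    """Hypothetical stand-in for an allocator with a fixed memory budget."""

    def __init__(self, budget):
        self.budget = budget
        self.used = 0
        self.tensors = []
        self._clock = itertools.count(1)

    def _score(self, t, now):
        # Assumed heuristic: cheap-to-recompute, large, stale tensors score lowest.
        staleness = now - t.last_access + 1
        return t.compute_cost / (t.size * staleness)

    def _evict_until_fits(self, needed, pinned):
        while self.used + needed > self.budget:
            now = next(self._clock)
            # Leaves (no inputs) cannot be recomputed, so they are never evicted.
            candidates = [t for t in self.tensors
                          if t.value is not None and t.inputs and t not in pinned]
            if not candidates:
                raise MemoryError("budget too small for the pinned working set")
            victim = min(candidates, key=lambda t: self._score(t, now))
            victim.value = None
            self.used -= victim.size

    def leaf(self, name, value, size=1):
        t = RematTensor(name, None, [], size, 0.0)
        self._evict_until_fits(size, set())
        t.value = value
        self.used += size
        t.last_access = next(self._clock)
        self.tensors.append(t)
        return t

    def op(self, name, fn, inputs, size=1, compute_cost=1.0):
        t = RematTensor(name, fn, inputs, size, compute_cost)
        self.tensors.append(t)
        self.get(t)  # compute eagerly, evicting other tensors if needed
        return t

    def get(self, t, _pinned=None):
        pinned = set() if _pinned is None else _pinned
        if t.value is None:
            # Recompute: materialize inputs first and pin them so they
            # are not evicted while this tensor is being produced.
            args, newly_pinned = [], []
            for p in t.inputs:
                args.append(self.get(p, pinned))
                if p not in pinned:
                    pinned.add(p)
                    newly_pinned.append(p)
            self._evict_until_fits(t.size, pinned)
            t.value = t.fn(*args)
            self.used += t.size
            t.compute_count += 1
            pinned.difference_update(newly_pinned)
        t.last_access = next(self._clock)
        return t.value


# A chain x -> a -> b -> c -> d under a budget of 3 units.
pool = RematPool(budget=3)
x = pool.leaf("x", 1)
a = pool.op("a", lambda v: v + 1, [x])
b = pool.op("b", lambda v: v * 2, [a])
c = pool.op("c", lambda v: v + 3, [b])
d = pool.op("d", lambda v: v * 2, [c])
print([t.name for t in pool.tensors if t.value is None])  # ['a', 'b']
print(pool.get(b))  # 4, after transparently recomputing a and then b
```

In this sketch the caller never checks whether a tensor is resident: `get` recomputes on demand, which mirrors the division of labor described above between the allocator (choosing what to evict) and the compute path (restoring what was evicted).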
Usage Example (Python):
A comprehensive example showcasing the practical usage of recomputation support in a deep learning context is provided below:
A portion of the general changes has already been merged in previous pull requests:
PR #9698
PR #9791
PR #9850
PR #9851