
Tensor Rematerialization (a.k.a. DTR/Coop) #9861

Merged 113 commits into master from dtr5 on Apr 3, 2023

Conversation


@daquexian (Contributor) commented Feb 13, 2023

This pull request introduces substantial changes that add recomputation (rematerialization) support for tensor operations. The core logic is as follows:

  1. Distinguishing tensors by their device attribute:

Tensors that do or do not support recomputation are distinguished by the device they reside on. New devices such as flow.device("cuda+remat") are introduced for this purpose.

  2. Tensor allocation and eviction in remat::Allocator:

When an allocation would exceed available memory, remat::Allocator selects the tensors with the lowest eviction cost, optimizing memory layout and the eviction strategy (an illustrative sketch of this kind of cost heuristic is given after this list).

  3. Recomputation logic in OpCallInstructionUtil::Compute:

OpCallInstructionUtil::Compute now recomputes tensors that were evicted but are needed again by a later operation.

In addition to the core logic above, there are various peripheral changes that support it.
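The exact cost formula is not spelled out in this description; the sketch below is a minimal Python illustration of the standard DTR-style heuristic (recompute cost divided by memory footprint times staleness) that cost-based eviction of this kind is typically built on. All names (TrackedTensor, eviction_cost, pick_victim) are hypothetical and do not correspond to the C++ code in remat::Allocator.

import time

class TrackedTensor:
    def __init__(self, nbytes, compute_time):
        self.nbytes = nbytes              # memory occupied by the tensor
        self.compute_time = compute_time  # time it took to produce the tensor (recompute-cost proxy)
        self.last_access = time.time()    # refreshed every time the tensor is used
        self.evictable = True             # pinned tensors (e.g. parameters) are never evicted

def eviction_cost(t, now):
    # DTR-style heuristic: cost = recompute_time / (memory * staleness).
    # Cheap-to-recompute, large, and long-unused tensors get the lowest cost.
    staleness = max(now - t.last_access, 1e-9)
    return t.compute_time / (t.nbytes * staleness)

def pick_victim(tracked_tensors):
    # Called when an allocation fails: evict the evictable tensor with the lowest cost.
    now = time.time()
    candidates = [t for t in tracked_tensors if t.evictable]
    return min(candidates, key=lambda t: eviction_cost(t, now), default=None)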

Usage Example (Python):

x1 = flow.ones(3).to('cuda+remat')  # Move to a device supporting recomputation
x2 = flow.ones(3).to('cuda')        # Move to a device not supporting recomputation
x3 = x1 + x2                        # Error: the two tensors are on different devices

An example showing the practical usage of recomputation support in a deep learning training step is provided below:

model = ResNet50()
model.to('cuda+remat')
data, label = dataloader()
data, label = data.to('cuda+remat'), label.to('cuda+remat')
loss = model(data)  # Automatically evicts tensors if GPU memory is full
loss.backward()     # Recomputes evicted tensors if needed in subsequent computations
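For completeness, a slightly fuller variant of the same example is sketched below; it is illustrative only. ResNet50 and dataloader are placeholders carried over from the snippet above, and the loss function and optimizer are added here using oneflow's PyTorch-style API.

import oneflow as flow

model = ResNet50().to('cuda+remat')        # placeholder model from the example above
loss_fn = flow.nn.CrossEntropyLoss()
optimizer = flow.optim.SGD(model.parameters(), lr=0.1)

for data, label in dataloader:             # placeholder dataloader from the example above
    data, label = data.to('cuda+remat'), label.to('cuda+remat')
    logits = model(data)        # activations may be evicted if GPU memory fills up
    loss = loss_fn(logits, label)
    loss.backward()             # evicted activations are recomputed when the backward pass needs them
    optimizer.step()
    optimizer.zero_grad()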

A portion of the general changes has already been merged in previous pull requests:

PR #9698
PR #9791
PR #9850
PR #9851

@github-actions

CI failed when running job: cuda-speed-test. PR label automerge has been removed

github-actions bot commented Apr 1, 2023

Speed stats:
GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 141.0ms (= 14104.3ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 142.7ms (= 14268.5ms / 100, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.01 (= 142.7ms / 141.0ms)

OneFlow resnet50 time: 82.1ms (= 8207.4ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 87.1ms (= 8709.4ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.06 (= 87.1ms / 82.1ms)

OneFlow resnet50 time: 51.0ms (= 10192.2ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 59.5ms (= 11895.6ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.17 (= 59.5ms / 51.0ms)

OneFlow resnet50 time: 34.0ms (= 6804.8ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 46.2ms (= 9231.6ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.36 (= 46.2ms / 34.0ms)

OneFlow resnet50 time: 26.0ms (= 5198.2ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 43.1ms (= 8612.4ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.66 (= 43.1ms / 26.0ms)

OneFlow swin dataloader time: 0.234s (= 46.861s / 200, num_workers=1)
PyTorch swin dataloader time: 0.152s (= 30.442s / 200, num_workers=1)
Relative speed: 0.650 (= 0.152s / 0.234s)

OneFlow swin dataloader time: 0.067s (= 13.434s / 200, num_workers=4)
PyTorch swin dataloader time: 0.043s (= 8.697s / 200, num_workers=4)
Relative speed: 0.647 (= 0.043s / 0.067s)

OneFlow swin dataloader time: 0.040s (= 8.069s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.367s / 200, num_workers=8)
Relative speed: 0.541 (= 0.022s / 0.040s)

❌ OneFlow resnet50 time: 153.0ms (= 15296.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 164.7ms (= 16473.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.08 (= 164.7ms / 153.0ms)

OneFlow resnet50 time: 93.4ms (= 9342.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 107.6ms (= 10755.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 107.6ms / 93.4ms)

OneFlow resnet50 time: 60.7ms (= 12137.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 81.2ms (= 16237.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.34 (= 81.2ms / 60.7ms)

OneFlow resnet50 time: 43.1ms (= 8619.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 76.4ms (= 15280.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.77 (= 76.4ms / 43.1ms)

OneFlow resnet50 time: 37.2ms (= 7442.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.5ms (= 13496.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.81 (= 67.5ms / 37.2ms)

github-actions bot commented Apr 2, 2023

Speed stats:
GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 141.1ms (= 14108.0ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 142.2ms (= 14216.0ms / 100, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.01 (= 142.2ms / 141.1ms)

OneFlow resnet50 time: 81.2ms (= 8123.7ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.8ms (= 8478.7ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.04 (= 84.8ms / 81.2ms)

OneFlow resnet50 time: 50.6ms (= 10114.8ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 55.7ms (= 11130.5ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.10 (= 55.7ms / 50.6ms)

OneFlow resnet50 time: 33.4ms (= 6685.5ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 43.4ms (= 8672.9ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.30 (= 43.4ms / 33.4ms)

OneFlow resnet50 time: 25.9ms (= 5188.5ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 39.8ms (= 7953.0ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.53 (= 39.8ms / 25.9ms)

OneFlow swin dataloader time: 0.238s (= 47.586s / 200, num_workers=1)
PyTorch swin dataloader time: 0.148s (= 29.583s / 200, num_workers=1)
Relative speed: 0.622 (= 0.148s / 0.238s)

OneFlow swin dataloader time: 0.066s (= 13.174s / 200, num_workers=4)
PyTorch swin dataloader time: 0.041s (= 8.218s / 200, num_workers=4)
Relative speed: 0.624 (= 0.041s / 0.066s)

OneFlow swin dataloader time: 0.043s (= 8.585s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.482s / 200, num_workers=8)
Relative speed: 0.522 (= 0.022s / 0.043s)

❌ OneFlow resnet50 time: 152.8ms (= 15280.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 165.4ms (= 16539.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.08 (= 165.4ms / 152.8ms)

OneFlow resnet50 time: 92.5ms (= 9247.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 104.7ms (= 10466.7ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.13 (= 104.7ms / 92.5ms)

OneFlow resnet50 time: 61.1ms (= 12221.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 80.4ms (= 16074.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 80.4ms / 61.1ms)

OneFlow resnet50 time: 42.7ms (= 8536.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.2ms (= 14237.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.67 (= 71.2ms / 42.7ms)

OneFlow resnet50 time: 37.0ms (= 7395.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.4ms (= 13672.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.85 (= 68.4ms / 37.0ms)

github-actions bot commented Apr 2, 2023

Speed stats:
GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 141.1ms (= 14111.4ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 145.2ms (= 14515.8ms / 100, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.03 (= 145.2ms / 141.1ms)

OneFlow resnet50 time: 82.0ms (= 8200.3ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 92.8ms (= 9275.8ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.13 (= 92.8ms / 82.0ms)

OneFlow resnet50 time: 51.4ms (= 10278.1ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 72.7ms (= 14536.9ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.41 (= 72.7ms / 51.4ms)

OneFlow resnet50 time: 33.6ms (= 6727.2ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 63.6ms (= 12726.1ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.89 (= 63.6ms / 33.6ms)

OneFlow resnet50 time: 26.9ms (= 5370.7ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 64.0ms (= 12807.6ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 2.38 (= 64.0ms / 26.9ms)

OneFlow swin dataloader time: 0.243s (= 48.693s / 200, num_workers=1)
PyTorch swin dataloader time: 0.158s (= 31.632s / 200, num_workers=1)
Relative speed: 0.650 (= 0.158s / 0.243s)

OneFlow swin dataloader time: 0.067s (= 13.368s / 200, num_workers=4)
PyTorch swin dataloader time: 0.042s (= 8.427s / 200, num_workers=4)
Relative speed: 0.630 (= 0.042s / 0.067s)

OneFlow swin dataloader time: 0.042s (= 8.380s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.442s / 200, num_workers=8)
Relative speed: 0.530 (= 0.022s / 0.042s)

❌ OneFlow resnet50 time: 152.8ms (= 15277.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 162.4ms (= 16238.3ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.06 (= 162.4ms / 152.8ms)

OneFlow resnet50 time: 93.8ms (= 9377.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 104.2ms (= 10420.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.11 (= 104.2ms / 93.8ms)

OneFlow resnet50 time: 61.2ms (= 12243.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 80.8ms (= 16154.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 80.8ms / 61.2ms)

OneFlow resnet50 time: 47.9ms (= 9581.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 73.6ms (= 14729.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.54 (= 73.6ms / 47.9ms)

OneFlow resnet50 time: 39.6ms (= 7911.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.0ms (= 14194.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.79 (= 71.0ms / 39.6ms)

daquexian and others added 2 commits April 2, 2023 21:06
github-actions bot commented Apr 2, 2023

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

@daquexian requested review from oneflow-ci-bot and removed the request for oneflow-ci-bot on April 2, 2023 at 13:23
github-actions bot commented Apr 2, 2023

Speed stats:
GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 141.1ms (= 14107.8ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 145.0ms (= 14495.9ms / 100, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.03 (= 145.0ms / 141.1ms)

OneFlow resnet50 time: 81.9ms (= 8188.8ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 93.1ms (= 9313.1ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.14 (= 93.1ms / 81.9ms)

OneFlow resnet50 time: 51.9ms (= 10382.5ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 71.0ms (= 14206.4ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.37 (= 71.0ms / 51.9ms)

OneFlow resnet50 time: 33.9ms (= 6778.4ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 58.6ms (= 11728.1ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.73 (= 58.6ms / 33.9ms)

OneFlow resnet50 time: 27.4ms (= 5478.7ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 63.6ms (= 12722.3ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 2.32 (= 63.6ms / 27.4ms)

OneFlow swin dataloader time: 0.239s (= 47.894s / 200, num_workers=1)
PyTorch swin dataloader time: 0.150s (= 30.027s / 200, num_workers=1)
Relative speed: 0.627 (= 0.150s / 0.239s)

OneFlow swin dataloader time: 0.071s (= 14.218s / 200, num_workers=4)
PyTorch swin dataloader time: 0.041s (= 8.116s / 200, num_workers=4)
Relative speed: 0.571 (= 0.041s / 0.071s)

OneFlow swin dataloader time: 0.038s (= 7.688s / 200, num_workers=8)
PyTorch swin dataloader time: 0.023s (= 4.552s / 200, num_workers=8)
Relative speed: 0.592 (= 0.023s / 0.038s)

❌ OneFlow resnet50 time: 152.9ms (= 15289.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 163.6ms (= 16360.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.07 (= 163.6ms / 152.9ms)

OneFlow resnet50 time: 92.7ms (= 9269.7ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 104.5ms (= 10452.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.13 (= 104.5ms / 92.7ms)

OneFlow resnet50 time: 66.0ms (= 13190.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 80.2ms (= 16038.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.22 (= 80.2ms / 66.0ms)

OneFlow resnet50 time: 46.7ms (= 9334.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.4ms (= 14289.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.53 (= 71.4ms / 46.7ms)

OneFlow resnet50 time: 43.3ms (= 8650.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 73.4ms (= 14681.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.70 (= 73.4ms / 43.3ms)

github-actions bot commented Apr 2, 2023

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9861/

mergify bot merged commit 86c82db into master on Apr 3, 2023
mergify bot deleted the dtr5 branch on April 3, 2023 at 02:06