[nn.Graph] Fix eager tensor not in job bug #8114

Merged
merged 17 commits into from
Apr 30, 2022

@BBuf BBuf commented Apr 28, 2022

If an eager free tensor is not in the nn.Graph's job (for example, the num_batches_tracked parameter in BatchNorm), compilation fails.

With the help of xuxiaoyu, I fixed this by adding a map lookup check for the variable in the job.
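
To illustrate the idea, here is a minimal Python sketch of a "look up before use" guard. It is not the code from this PR: the function `bind_free_tensors`, its arguments, and the example names are all hypothetical, and a plain dict and strings stand in for the job's variable map and the tensors.

```python
def bind_free_tensors(job_variables, free_eager_tensors):
    """Bind only the free eager tensors that have a variable in the job.

    job_variables: dict mapping variable name -> variable op (hypothetical stand-in).
    free_eager_tensors: dict mapping tensor name -> tensor (hypothetical stand-in).
    Returns (bound, skipped): bound maps name -> (tensor, variable op),
    skipped lists the names that were not found in the job.
    """
    bound = {}
    skipped = []
    for name, tensor in free_eager_tensors.items():
        # The map lookup check: a tensor that never made it into the job
        # is skipped instead of being accessed unconditionally.
        var_op = job_variables.get(name)
        if var_op is None:
            skipped.append(name)
            continue
        bound[name] = (tensor, var_op)
    return bound, skipped


# Hypothetical data: the job holds weight and bias, but not num_batches_tracked.
job_variables = {"bn.weight": "var_op_0", "bn.bias": "var_op_1"}
free_eager_tensors = {
    "bn.weight": "weight_tensor",
    "bn.bias": "bias_tensor",
    "bn.num_batches_tracked": "counter_tensor",
}
bound, skipped = bind_free_tensors(job_variables, free_eager_tensors)
```

Without the lookup, indexing `job_variables` directly with `"bn.num_batches_tracked"` would raise a `KeyError` in this sketch, which mirrors the kind of failure described above.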

@strint strint added graph graph mode bug labels Apr 28, 2022
@github-actions

CI failed when running job: Build cu102_xla. PR label automerge has been removed

@BBuf BBuf added the automerge label Apr 29, 2022
@github-actions

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8114/

@github-actions

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 129.2ms (= 12924.0ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 142.5ms (= 14248.4ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.10 (= 142.5ms / 129.2ms)

OneFlow resnet50 time: 82.5ms (= 8254.3ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 86.1ms (= 8612.2ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.04 (= 86.1ms / 82.5ms)

OneFlow resnet50 time: 55.3ms (= 11050.2ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 57.8ms (= 11553.6ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.05 (= 57.8ms / 55.3ms)

OneFlow resnet50 time: 43.1ms (= 8611.7ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 43.6ms (= 8712.2ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.01 (= 43.6ms / 43.1ms)

OneFlow resnet50 time: 37.7ms (= 7546.4ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 38.4ms (= 7675.6ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.02 (= 38.4ms / 37.7ms)

OneFlow swin dataloader time: 0.262s (= 52.386s / 200, num_workers=1)
PyTorch swin dataloader time: 0.149s (= 29.814s / 200, num_workers=1)
Relative speed: 0.569 (= 0.149s / 0.262s)

OneFlow swin dataloader time: 0.068s (= 13.614s / 200, num_workers=4)
PyTorch swin dataloader time: 0.040s (= 7.978s / 200, num_workers=4)
Relative speed: 0.586 (= 0.040s / 0.068s)

OneFlow swin dataloader time: 0.060s (= 11.943s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.378s / 200, num_workers=8)
Relative speed: 0.367 (= 0.022s / 0.060s)

❌ OneFlow resnet50 time: 144.7ms (= 14473.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 169.5ms (= 16951.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.17 (= 169.5ms / 144.7ms)

OneFlow resnet50 time: 100.8ms (= 10083.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 122.5ms (= 12253.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.22 (= 122.5ms / 100.8ms)

OneFlow resnet50 time: 74.9ms (= 14987.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 88.4ms (= 17682.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 88.4ms / 74.9ms)

OneFlow resnet50 time: 65.0ms (= 12996.5ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.6ms (= 14928.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.15 (= 74.6ms / 65.0ms)

OneFlow resnet50 time: 57.4ms (= 11478.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 75.2ms (= 15045.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.31 (= 75.2ms / 57.4ms)
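
For readers checking the figures, each entry above follows the same arithmetic: total time divided by the iteration count gives the per-iteration time, and the relative speed is the PyTorch per-iteration time divided by the OneFlow one (values above 1 mean OneFlow was faster). A small Python sketch of that calculation follows. The helper names are hypothetical and not part of the CI tooling. The log divides the already-rounded per-iteration times, so the last digit can occasionally differ from this unrounded version.

```python
def per_iteration(total, iterations):
    """Average time per iteration, in the same unit as total."""
    return total / iterations


def relative_speed(pytorch_total, oneflow_total, iterations):
    """PyTorch time over OneFlow time; > 1 means OneFlow was faster."""
    return per_iteration(pytorch_total, iterations) / per_iteration(
        oneflow_total, iterations
    )


# First resnet50 entry: input_shape=[16, 3, 224, 224], 100 iterations, times in ms.
resnet_ratio = relative_speed(14248.4, 12924.0, 100)

# First swin dataloader entry: num_workers=1, 200 iterations, times in seconds.
dataloader_ratio = relative_speed(29.814, 52.386, 200)

print(f"resnet50 relative speed: {resnet_ratio:.2f}")
print(f"swin dataloader relative speed: {dataloader_ratio:.3f}")
```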

@mergify mergify bot merged commit aecd923 into master Apr 30, 2022
@mergify mergify bot deleted the fix_eager_tensor_not_in_job_bug branch April 30, 2022 14:52

3 participants