cudagraphs dynamo backend #80566

Closed

ezyang wants to merge 30 commits

Conversation

@ezyang (Contributor) commented on Jun 29, 2022

Stack from ghstack (oldest at bottom):

This backend handles cases where the preexisting cuda graphs
implementation from dynamo is unsound/has errors.

Requires this functorch bug fix: pytorch/functorch#935

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

These tests demonstrate cases where the preexisting cuda graphs
implementation from dynamo is unsound/has errors.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Jun 29, 2022
These tests demonstrate cases where the preexisting cuda graphs
implementation from dynamo is unsound/has errors.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: 0e75258940714543465013512e07d877b6886e57
Pull Request resolved: #80566
@facebook-github-bot (Contributor) commented on Jun 29, 2022

✅ No Failures (0 Pending)

As of commit ac5b125 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI.
Please report bugs/suggestions to the (internal) Dr. CI Users group.

@ezyang (Author) commented on Jun 29, 2022

cc @ngimel @wconstab

@ezyang changed the title from "PoC tests for dynamo cudagraphs" to "PoC more robust cudagraphs dynamo backend" on Jul 1, 2022
This backend handles cases where the preexisting cuda graphs
implementation from dynamo is unsound/has errors.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
@ezyang (Author) commented on Jul 1, 2022

cc @SherlockNoMad

This backend handles cases where the preexisting cuda graphs
implementation from dynamo is unsound/has errors.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Jul 1, 2022
This backend handles cases where the preexisting cuda graphs
implementation from dynamo is unsound/has errors.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: 20158e76b4374cd5efb82a59bc72621b52e1f75c
Pull Request resolved: #80566
            return tree_map(cloner, self.static_outputs)
        else:
            # warmup
Contributor

what does warmup actually do for graph recording?

our docs suggest to warm up for "a few iterations" (in their example they use 3) but idk what for.
https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/#api-example

Contributor (Author)

It is for handling internal libraries which change what cuda kernels they call based on a cache. A simple example is cudnn benchmarking: the first run will trigger a bunch of benchmarking cuda kernels which you definitely don't want to record. According to the doc nccl needs something like 11 warmup iterations LOL

Collaborator

I think it's ddp, not nccl, but I'm not sure. Cudnn benchmarking throws an error if someone tries to capture it.

Contributor (Author)

oops you're right
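
To make the warmup discussion above concrete, here is a minimal warmup-then-capture sketch along the lines of the pattern in the linked blog post (the model, iteration count, and side-stream usage are illustrative, not taken from this PR):

import torch

# Hypothetical model/input pair, for illustration only.
model = torch.nn.Linear(64, 64).cuda()
static_input = torch.randn(8, 64, device="cuda")

# Warm up on a side stream so lazy initialization (benchmarking kernels,
# workspace allocations, etc.) happens before capture begins.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: kernels are recorded into the graph, not executed.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: copy new data into the static input buffer, then replay.
static_input.copy_(torch.randn(8, 64, device="cuda"))
g.replay()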

This backend handles cases where the preexisting cuda graphs
implementation from dynamo is unsound/has errors.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Jul 6, 2022
This backend handles cases where the preexisting cuda graphs
implementation from dynamo is unsound/has errors.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: 19a440b3dd9c5cb31f951fc2e575b62b6bb1e053
Pull Request resolved: #80566
@ezyang (Author) commented on Jul 6, 2022

Substantial refactor of many parts of the code; I also found a bunch of bugs which I will be filing issues for tomorrow.

@ezyang (Author) commented on Jul 6, 2022

The screamer bug is that somehow cuda graphs, during recording, just loses input mutations. Crazy.

@jansel (Contributor) commented on Jul 6, 2022

> The screamer bug is that somehow cuda graphs, during recording, just loses input mutations. Crazy.

Don't you copy all the inputs to invoke the cudagraphs? I'd expect the static_inputs to get mutated, then you need to copy those mutations to the real inputs.
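
For context, a sketch of the copy-in/replay/copy-out convention jansel is describing (the names static_inputs, static_outputs, and mutated_idxs are illustrative placeholders, not this PR's actual code):

def run_cudagraph(graph, static_inputs, static_outputs, mutated_idxs, real_inputs):
    # Copy the caller's tensors into the static buffers the graph was captured on.
    for static, real in zip(static_inputs, real_inputs):
        static.copy_(real)
    # Replay the recorded kernels; they only ever read/write the static buffers.
    graph.replay()
    # Propagate input mutations back to the caller's tensors.
    for i in mutated_idxs:
        real_inputs[i].copy_(static_inputs[i])
    # Return clones so later replays don't clobber values the caller holds onto.
    return [out.clone() for out in static_outputs]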

@ngimel (Collaborator) commented on Jul 6, 2022

I don't think that's a bug; graph recording doesn't execute the kernels, it only records the succession of launches.

In [1]: import torch

In [2]: a=torch.ones(4, device="cuda")

In [3]: graph = torch.cuda.CUDAGraph()

In [4]: with torch.cuda.graph(graph):
   ...:     for _ in range(4):
   ...:         a += 1
   ...: 

In [5]: a
Out[5]: tensor([1., 1., 1., 1.], device='cuda:0') # expected, nothing happened to a

In [7]: graph.replay()

In [8]: a
Out[8]: tensor([5., 5., 5., 5.], device='cuda:0') # expected, now a is modified

In [9]: graph = torch.cuda.CUDAGraph()
In [10]: with torch.cuda.graph(graph):
    ...:     for _ in range(4):
    ...:         b=2*a
    ...: 

In [11]: b
Out[11]: tensor([0., 0., 0., 0.], device='cuda:0') # expected, b is uninitialized
In [12]: graph.replay()

In [13]: b
Out[13]: tensor([10., 10., 10., 10.], device='cuda:0') #expected, b is computed

@ezyang (Author) commented on Jul 6, 2022

@ngimel is right. Need to update the docs to make this more clear lol

This backend handles cases where the preexisting cuda graphs
implementation from dynamo is unsound/has errors.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
@ezyang ezyang requested a review from Chillee July 6, 2022 14:56
        super().run(*args)
        return self.mutated_inputs

class ProxyTensorInterpreter(torch.fx.Interpreter):
Contributor (Author)

TODO: Move this into a shared area. @Chillee I wanted you to look over this

        mutated_inputs = FindInputMutations(submod)(*map(unwrap_elem, args))
        # smh the module didn't get transferred wut
        self.new_module.add_submodule(target, CudaGraphModule(submod, mutated_inputs))
        return wrap_output(out, torch.fx.Proxy(self.new_graph.call_module(target, tree_map(unwrap_proxy_node, args), tree_map(unwrap_proxy_node, kwargs)), self.tracer))
Contributor (Author)

This is kind of blegh

with FakeTensorMode.push() as mode:
    t.run(*map(mode.from_tensor, inputs))
model = t.new_module
model.recompile()
Contributor (Author)

This is a lot wordier than it should be

        # TODO: this is not compositional
        with FakeTensorMode.push() as mode:
            fake_args = [mode.from_tensor(a) for a in args]
            return super().run(*fake_args)
Contributor (Author)

This should be moved somewhere else

Contributor (Author)

This pass isn't sound: because we save fake tensors directly on nodes, if a graph has a metadata-changing operation like resize_, it will mutate the fake tensor.
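
A small illustration of the aliasing problem described above, using plain tensors for simplicity (the same thing happens when a fake tensor stored as node metadata is later resized):

import torch

t = torch.empty(4)
saved_on_node = t           # stored by reference, e.g. as node.meta["val"]
print(saved_on_node.shape)  # torch.Size([4])

t.resize_(8)                # a metadata-changing op later in the graph
print(saved_on_node.shape)  # torch.Size([8]): the saved "snapshot" changed too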

This backend handles cases where the preexisting cuda graphs
implementation from dynamo is unsound/has errors.

Requires this functorch bug fix: pytorch/functorch#935

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Jul 6, 2022
This backend handles cases where the preexisting cuda graphs
implementation from dynamo is unsound/has errors.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: 45b7ef1cd924b460857859435d8c188a3774d821
Pull Request resolved: #80566
This backend handles cases where the preexisting cuda graphs
implementation from dynamo is unsound/has errors.

Requires this functorch bug fix: pytorch/functorch#935

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
This backend handles cases where the preexisting cuda graphs
implementation from dynamo is unsound/has errors.

Requires this functorch bug fix: pytorch/functorch#935

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Jul 21, 2022
This backend handles cases where the preexisting cuda graphs
implementation from dynamo is unsound/has errors.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: 27084075f7c6d60ce0762ed7f3f94b92a7a6e9bd
Pull Request resolved: #80566
@ezyang (Author) commented on Jul 21, 2022

@pytorchbot merge -g

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a merge job. Check the current status here

@pytorchmergebot (Collaborator)

Merge failed due to Refusing to merge as mandatory check(s) Lint failed for rule superuser
Raised by https://github.com/pytorch/pytorch/actions/runs/2714874815

This backend handles cases where the preexisting cuda graphs
implementation from dynamo is unsound/has errors.

Requires this functorch bug fix: pytorch/functorch#935

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Jul 22, 2022
This backend handles cases where the preexisting cuda graphs
implementation from dynamo is unsound/has errors.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: aaac41e05840ee08ff955cf95bc813b7f9f9e8df
Pull Request resolved: #80566
@ezyang (Author) commented on Jul 22, 2022

@pytorchbot merge

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a merge job. Check the current status here

@github-actions (Contributor)

Hey @ezyang.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

def model(x, y):
    return (x + y) * y

with torchdynamo.optimize(aot_autograd_cudagraphs):
Member

n00b q: Is the optimization unrolled outside of the with scope? If you called torchdynamo.optimize() in a loop would the result be the same as calling it once?

Contributor (Author)

this is a little subtle

First, the context manager is misleading. It doesn't actually turn on optimization for the inside of the manager. Optimization only turns on when you hit a new frame (e.g., do a function call).

With that out of the way, what if you have the optimization inside or outside of a loop? It will depend. If the loop successfully unrolls, then you will get a compiled graph outside the loop that has the unrolled graph. But let's say there's some reason we can't compile the outer frame. Then we will compile the inner function, and the two applications are equivalent.
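
A hedged sketch of what that means in practice, using the torchdynamo API of this era (my_compiler and the inputs are placeholders, not this PR's backend):

import torch
import torchdynamo

def my_compiler(gm, example_inputs):
    # Called once per captured frame; return any callable with the same signature.
    return gm.forward

def model(x, y):
    return (x + y) * y

x = torch.randn(4)
y = torch.randn(4)

with torchdynamo.optimize(my_compiler):
    # Code written directly in the with-body is not itself compiled; dynamo only
    # kicks in when a new frame starts, i.e. on the function call below.
    out = model(x, y)

# Entering the context manager again (e.g. inside a loop) reuses the cached
# compiled frame for model rather than recompiling it each time.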

loss = model(x, y).sum()
loss.backward()

@patch("functorch._src.config.use_functionalize", True)
@msaroufim (Member) commented on Jul 22, 2022

noob q: I've seen the terminology functionalize a few times and my understanding is you remove stateful operations to ship to compilers that can't represent aliasing. Is that the majority of compilers? Some compilers we really care about?

EDIT: nvm saw Jason's comment about how CUDA graphs don't support input mutation

Contributor (Author)

Functionalize gives you a graph that doesn't have mutating operations in it. CUDA graphs actually doesn't want functionalization, but we need it because there are passes like the partitioner we use here which are unsound in the presence of mutation.
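
Roughly, functionalization rewrites in-place ops into out-of-place ops, with any input mutation re-applied at the graph boundary; a schematic before/after (illustrative, not the exact aten IR):

# Before functionalization: the graph contains a mutation.
def f(x):
    x.add_(1)               # in-place; visible to whoever aliases x
    return x * 2

# After functionalization: same semantics, but no mutating ops inside the graph.
def f_functional(x):
    x_updated = x.add(1)    # out-of-place
    out = x_updated * 2
    # The caller-visible mutation (x.copy_(x_updated)) is replayed by a runtime
    # wrapper at the boundary, so downstream passes see a purely functional graph.
    return out, x_updated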

    # NB: we override __call__ as we don't need any nn.Module machinery
    # and to reduce overhead
    def __call__(self, *args):
        # TODO: once we've recorded here, we'd like to replace the __call__
Member

Is this comment the general PT 2.0 strategy where you do graph surgery by swapping graph nodes for compiled code?

Contributor (Author)

Yes, this is a general pattern for doing compilation on an FX graph directly.
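
A minimal sketch of that pattern on an fx.GraphModule (compile_submodule is a placeholder for whatever produces the optimized replacement):

import torch
import torch.fx as fx

def compile_submodule(submod: torch.nn.Module) -> torch.nn.Module:
    # Placeholder: return a wrapper that records/replays a CUDA graph,
    # runs generated code, etc.
    return submod

def swap_in_compiled(gm: fx.GraphModule) -> fx.GraphModule:
    # Graph surgery: the graph's call_module nodes stay put; only the modules
    # they point at are swapped for compiled replacements.
    for node in gm.graph.nodes:
        if node.op == "call_module":
            submod = gm.get_submodule(node.target)
            gm.delete_submodule(node.target)
            gm.add_submodule(node.target, compile_submodule(submod))
    gm.recompile()
    return gm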

    # and to reduce overhead
    def __call__(self, *args):
        # TODO: once we've recorded here, we'd like to replace the __call__
        # implementation with compiled bytecode that copies into static, replays
Member

is graph capture NVIDIA's term for what we call tracing?

Contributor (Author)

sort of. But our tracing (proxy tensor) operates at a different level than cuda graph capture. Our tracing captures aten ops; cuda graph captures cuda kernel launches.
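
A side-by-side sketch of the two capture levels (make_fx is used here just as one example of proxy-tensor tracing; the import path varies by PyTorch version):

import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(x):
    return torch.relu(x) + 1

x = torch.randn(8, device="cuda")

# Proxy-tensor tracing records aten ops into an FX graph.
gm = make_fx(f)(x)
gm.graph.print_tabular()    # rows like aten.relu, aten.add

# CUDA graph capture records the underlying kernel launches for later replay.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    out = f(x)
g.replay()                  # re-launches the recorded kernels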

@facebook-github-bot facebook-github-bot deleted the gh/ezyang/1228/head branch July 25, 2022 14:18
facebook-github-bot pushed a commit that referenced this pull request Jul 26, 2022
Summary:
This backend handles cases where the preexisting cuda graphs
implementation from dynamo is unsound/has errors.

Requires this functorch bug fix: pytorch/functorch#935

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: #80566
Approved by: https://github.com/ngimel, https://github.com/wconstab

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/3c2c2cc9474b46238bf2f517762ab853b84bbf4d

Reviewed By: osalpekar

Differential Revision: D38114100

Pulled By: ezyang

fbshipit-source-id: 3fb056e599cef605792cea9d794de701c596a9d8
ezyang added a commit to ezyang/torchdynamo that referenced this pull request Aug 9, 2022
Previously it was in pytorch/pytorch but it depends on torchdynamo
code more closely, so this seems like the logical place.

Previously at pytorch/pytorch#80566

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
ezyang added a commit to pytorch/torchdynamo that referenced this pull request Aug 10, 2022
* Move aot_cudagraphs backend here

Previously it was in pytorch/pytorch but it depends on torchdynamo
code more closely, so this seems like the logical place.

Previously at pytorch/pytorch#80566

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
8 participants