Support for pytorch 2.0 #94

Merged: 14 commits, Apr 18, 2023

Conversation

RaulPPelaez (Contributor) commented Mar 17, 2023

This PR starts the work of making NNPOps compatible with pytorch 2.0 and torch.compile:

  • Ensure the workflows and tests work for pytorch 2.0 (which requires CUDA 11.8).
  • Make operations compatible with torch.compile() (see the sketch after this list).
    Hopefully this will not require a lot of work, but reducing the number of graph breaks torch.compile introduces will probably be more challenging. I reckon avoiding graph breaks is quite similar to making code compatible with CUDA graphs.
  • Write tests for the compiled versions.
    In principle, compiled functions/models should be completely equivalent to the uncompiled versions, but I have seen cases where they are not (granted, torch 2 was still in beta).
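
As a rough illustration of detecting graph breaks, here is a minimal sketch (not NNPOps code; ToyModule is a hypothetical stand-in). The fullgraph=True flag of torch.compile turns any graph break into a hard error instead of silently splitting execution:

import torch

# Hypothetical stand-in for an NNPOps-backed model.
class ToyModule(torch.nn.Module):
    def forward(self, x):
        return (x * x).sum()

model = ToyModule()
x = torch.randn(10, 3)

# fullgraph=True makes torch.compile raise on any graph break instead of
# silently falling back to eager execution for part of the model.
compiled = torch.compile(model, fullgraph=True)
assert torch.allclose(compiled(x), model(x))  # compiled should match eager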

RaulPPelaez (Contributor, Author) commented Apr 13, 2023

Torchani cannot be installed with pytorch 2, which forces us to skip some tests.
EDIT: Torchani made a new torch2-compatible release.

RaulPPelaez (Contributor, Author) commented:

All tests pass and the CI works for an installation with pytorch 2. I believe this should be merged now, and work on compile() compatibility can be done in another PR.
A new release could be made now so that users can install NNPOps alongside pytorch 2.

RaulPPelaez (Contributor, Author) commented:

In case you have some experience with torch 2's compile: this test fails miserably in CUDA mode:

import pytest
import torch as pt
from NNPOps.neighbors import getNeighborPairs

@pytest.mark.parametrize('device', ['cpu', 'cuda'])
@pytest.mark.parametrize('dtype', [pt.float32, pt.float64])
def test_torch_compile_compatible(device, dtype):

    class ForceModule(pt.nn.Module):

        def forward(self, positions):
            # Build the neighbor list and compute a simple energy-like scalar.
            neighbors, deltas, distances = getNeighborPairs(positions, cutoff=1.0)
            mask = pt.isnan(distances)  # padded (non-existent) pairs come back as NaN
            distances = distances[~mask]
            return pt.sum(distances**2)

    original_model = ForceModule()
    num_atoms = 10
    positions = (20 * pt.randn((num_atoms, 3), device=device, dtype=dtype)) - 10
    original_model(positions)            # eager execution works
    model = pt.compile(original_model)   # compile with torch 2's default backend
    model(positions)                     # compiled execution fails on CUDA

It yields a really verbose error about something called FakeTensor that makes the most obscure gcc recursive template error look clear and informative:

TestNeighbors.py::test_torch_compile_compatible[dtype1-cuda] FAILED

================================================================================= FAILURES ==================================================================================
________________________________________________________________ test_torch_compile_compatible[dtype0-cuda] _________________________________________________________________
                                                                                                                                                                             
output_graph = <torch._dynamo.output_graph.OutputGraph object at 0x7fc3e7d81fc0>, node = get_neighbor_pairs                                                                  
args = (FakeTensor(FakeTensor(..., device='meta', size=(10, 3)), cuda:0), 1.0, -1, FakeTensor(FakeTensor(..., device='meta', size=(0, 0)), cuda:0)), kwargs = {}             
nnmodule = None                                                                                                                                                              
                                                                                                                                                                             
    def run_node(output_graph, node, args, kwargs, nnmodule):                                                                                                                
        """                                                                                                                                                                  
        Runs a given node, with the given args and kwargs.
     
        Behavior is dicatated by a node's op.
     
        run_node is useful for extracting real values out of nodes.
        See get_real_value for more info on common usage.
     
        Note: The output_graph arg is only used for 'get_attr' ops
        Note: The nnmodule arg is only used for 'call_module' ops
     
        Nodes that are not call_function, call_method, call_module, or get_attr will
        raise an AssertionError.
        """
        op = node.op
        try:
            if op == "call_function":
>               return node.target(*args, **kwargs)

../../../mambaforge/envs/nnpops-torch2-nvidia/lib/python3.10/site-packages/torch/_dynamo/utils.py:1194: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <OpOverloadPacket(op='neighbors.getNeighborPairs')>
args = (FakeTensor(FakeTensor(..., device='meta', size=(10, 3)), cuda:0), 1.0, -1, FakeTensor(FakeTensor(..., device='meta', size=(0, 0)), cuda:0)), kwargs = {}

    def __call__(self, *args, **kwargs):
        # overloading __call__ to ensure torch.ops.foo.bar()
        # is still callable from JIT
        # We save the function ptr as the `op` attribute on
        # OpOverloadPacket to access it here.
>       return self._op(*args, **kwargs or {})
E       RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.

../../../mambaforge/envs/nnpops-torch2-nvidia/lib/python3.10/site-packages/torch/_ops.py:502: RuntimeError

and it goes on like this for a gazillion lines.

I have not been able to solve this. From what I have gathered, this should not happen and it is a bug in torch (there are a lot of issues describing problems like this: pytorch/pytorch#96742, pytorch/pytorch#95791).
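
For context on what a FakeTensor is, here is a minimal sketch (assuming torch >= 2.0; FakeTensorMode lives in the private torch._subclasses module, so details may vary between versions). torch.compile traces models with fake, meta-device tensors that carry shape and dtype but no data, and a custom op with no fake/meta implementation ends up trying to touch storage that was never allocated:

import torch
from torch._subclasses.fake_tensor import FakeTensorMode

# Inside FakeTensorMode, factory functions produce FakeTensors: tensors that
# propagate shapes and dtypes but never allocate real storage.
with FakeTensorMode():
    x = torch.empty(10, 3)   # no data is allocated
    y = (x * x).sum()        # built-in ops know how to propagate shapes
    print(y.shape, y.dtype)  # torch.Size([]) torch.float32

# A custom op like neighbors.getNeighborPairs has no such shape-propagation
# rule registered, so tracing it with FakeTensors fails.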

raimis (Contributor) commented Apr 14, 2023

Yes, we can skip the compile feature for now.
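
One way to skip it, as a minimal sketch (standard pytest machinery; the xfail reason string is an assumption), is to mark the CUDA case of the compile test as an expected failure:

import pytest
import torch as pt

@pytest.mark.parametrize('device', [
    'cpu',
    pytest.param('cuda', marks=pytest.mark.xfail(
        reason='torch.compile FakeTensor bug, see pytorch/pytorch#96742')),
])
@pytest.mark.parametrize('dtype', [pt.float32, pt.float64])
def test_torch_compile_compatible(device, dtype):
    ...  # body as in the test above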

raimis requested review from raimis and sef43 and removed the request for sef43, Apr 14, 2023 12:24
RaulPPelaez (Contributor, Author) commented:

OK, I think this is done now.

raimis (Contributor) commented Apr 17, 2023

@RaulPPelaez can I merge?

RaulPPelaez (Contributor, Author) commented:

Yes, thanks. @raimis

raimis merged commit b63fc70 into openmm:master on Apr 18, 2023