[BUG] CUDA out of memory with only 1600 atoms when using the PyTorch model with spin #3969

@shiruosong

Description

Bug summary

I trained a PyTorch version of the BiFeO3 model with DPSPIN. When I use the model to run a minimization with only 1600 atoms, I get a CUDA out-of-memory error. The machine type is c12_m92_1 * NVIDIA V100.

I had previously run DPLR with 10,000-20,000 atoms without problems, and the plain DP TensorFlow model with even more atoms. For DPSPIN-tf, 1,600 atoms is also far below the limit, but with DPSPIN-pytorch it no longer works.

The error is listed below:
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.0001
terminate called after throwing an instance of 'std::runtime_error'
what(): The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/torch/deepmd/pt/model/model/transform_output.py", line 154, in forward_lower
    vvi = split_vv1[_44]
    svvi = split_svv1[_44]
    _45 = _36(vvi, svvi, coord_ext, do_virial, do_atomic_virial, )
          ~~~ <--- HERE
    ffi, aviri, = _45
    ffi0 = torch.unsqueeze(ffi, -2)
  File "code/torch/deepmd/pt/model/model/transform_output.py", line 201, in task_deriv_one
    extended_virial0 = torch.matmul(_53, torch.unsqueeze(extended_coord, -2))
    if do_atomic_virial:
      extended_virial_corr = _50(extended_coord, atom_energy, )
                             ~~~ <--- HERE
      extended_virial2 = torch.add(extended_virial0, extended_virial_corr)
      extended_virial1 = extended_virial2
  File "code/torch/deepmd/pt/model/model/transform_output.py", line 234, in atomic_virial_corr
    ops.prim.RaiseException("AssertionError: ")
    extended_virial_corr00 = _55
    _61 = torch.autograd.grad([sumce1], [extended_coord], lst, None, True)
          ~~~~~~~~~~~~~~~~~~~ <--- HERE
    extended_virial_corr1 = _61[0]
    _62 = torch.isnot(extended_virial_corr1, None)

Traceback of TorchScript, original code (most recent call last):
  File "/opt/mamba/envs/DeepSpin_devel/lib/python3.9/site-packages/deepmd/pt/model/model/transform_output.py", line 120, in forward_lower
    for vvi, svvi in zip(split_vv1, split_svv1):
        # nf x nloc x 3, nf x nloc x 9
        ffi, aviri = task_deriv_one(
                     ~~~~~~~~~~~~~~ <--- HERE
            vvi,
            svvi,
  File "/opt/mamba/envs/DeepSpin_devel/lib/python3.9/site-packages/deepmd/pt/model/model/transform_output.py", line 76, in task_deriv_one
        # the correction sums to zero, which does not contribute to global virial
        if do_atomic_virial:
            extended_virial_corr = atomic_virial_corr(extended_coord, atom_energy)
                                   ~~~~~~~~~~~~~~~~~~ <--- HERE
            extended_virial = extended_virial + extended_virial_corr
        # to [...,3,3] -> [...,9]
  File "/opt/mamba/envs/DeepSpin_devel/lib/python3.9/site-packages/deepmd/pt/model/model/transform_output.py", line 39, in atomic_virial_corr
    )[0]
    assert extended_virial_corr0 is not None
    extended_virial_corr1 = torch.autograd.grad(
                            ~~~~~~~~~~~~~~~~~~~ <--- HERE
        [sumce1], [extended_coord], grad_outputs=lst, create_graph=True
    )[0]
RuntimeError: CUDA out of memory. Tried to allocate 220.00 MiB. GPU 0 has a total capacty of 31.74 GiB of which 202.12 MiB is free. Process 19403 has 31.54 GiB memory in use. Of the allocated memory 30.12 GiB is allocated by PyTorch, and 425.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
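
The failing call is the atomic virial correction: atomic_virial_corr runs a second torch.autograd.grad over the extended coordinates with create_graph=True, so the full backward graph is kept alive for further differentiation. For reference on the autograd pattern only (a self-contained sketch, not DeePMD-kit code; the tensor names and sizes below are placeholders), the shape of the call is roughly:

import torch

# Stand-ins for the quantities in the traceback: n_ext is a hypothetical
# extended-atom count, energy a placeholder for the model output.
n_ext = 3200
coord = torch.randn(n_ext, 3, requires_grad=True)
energy = (coord ** 3).sum()

# First derivative with create_graph=True: the backward graph is retained
# so it can be differentiated again.
force = torch.autograd.grad(energy, coord, create_graph=True)[0]

# Second autograd.grad over the same coordinates, mirroring the call at
# transform_output.py line 39; create_graph=True again keeps the graphs
# alive, which is where the extra memory goes.
sumce = force * coord
lst = torch.ones_like(sumce)
corr = torch.autograd.grad([sumce], [coord], grad_outputs=[lst],
                           create_graph=True)[0]
print(corr.shape)  # torch.Size([3200, 3])

Note that the max_split_size_mb / PYTORCH_CUDA_ALLOC_CONF hint at the end of the message only mitigates fragmentation; with 30.12 GiB of the 31.74 GiB already allocated by PyTorch, the retained graphs themselves appear to be what fills the card.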

DeePMD-kit Version

DeePMD-kit v3.0.0a1.dev107+ga26b6803.d20240430

Backend and its version

torch v2.1.0+cu118

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

test.zip

Steps to Reproduce

lmp_mpi -i input.lammps

Further Information, Files, and Links

No response

Metadata

Labels

bug, reproduced (This bug has been reproduced by developers)

Status

Done
