Description
Version
23.01
Which installation method(s) does this occur on?
Conda
Describe the bug.
We get a CUDA out-of-memory error during inference. I'm not able to make it through 24 hours of data before running out of GPU memory; it appears that there is not enough memory on a single GPU after loading our models.
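As a possible mitigation we have been sketching on our side (not Morpheus API; `run_chunked_inference` and the chunk size below are hypothetical names/values for illustration), the idea is to split each inference window into smaller row chunks before calling the autoencoder, so that no single `get_results()` call has to hold activations for the full 24-hour DataFrame:

```python
import pandas as pd
import torch


def run_chunked_inference(model, df: pd.DataFrame, chunk_rows: int = 50_000) -> pd.DataFrame:
    """Hypothetical sketch: run the autoencoder over row chunks instead of the
    whole window, so activations for 24 h of data are never resident on the
    GPU at once. The exact arguments to get_results() may differ from what
    dfp_inference_stage.py passes."""
    outputs = []
    for start in range(0, len(df), chunk_rows):
        chunk = df.iloc[start:start + chunk_rows]
        with torch.no_grad():                # inference only, skip autograd buffers
            outputs.append(model.get_results(chunk))
        torch.cuda.empty_cache()             # release cached blocks between chunks
    return pd.concat(outputs, ignore_index=True)
```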
Minimum reproducible example
No response
Relevant log output
Input data rate: 88758777 messages [12:38, 117715.93 message
E20230220 15:37:11.462033 726762 context.cpp:124] linear_segment_0/dfp-inference-7; rank: 0; size: 1; tid: 139759150155520: set_exception issued; issuing kill to current runnable. Exception msg: RuntimeError: CUDA out of memory. Tried to allocate 7.04 GiB (GPU 0; 79.35 GiB total capacity; 20.52 GiB already allocated; 3.85 GiB free; 21.84 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
At: /home/a6005028/aml-cyber-fraud/python/morpheus_env/lib/python3.8/site-packages/dfencoder/autoencoder.py(132): forward
/home/a6005028/aml-cyber-fraud/python/morpheus_env/lib/python3.8/site-packages/torch/nn/modules/module.py(1130): _call_impl
/home/a6005028/aml-cyber-fraud/python/morpheus_env/lib/python3.8/site-packages/dfencoder/autoencoder.py(546): encode
/home/a6005028/aml-cyber-fraud/python/morpheus_env/lib/python3.8/site-packages/dfencoder/autoencoder.py(1075): get_results
/opt/morpheus/install/morpheus/morpheus/dfp/stages/dfp_inference_stage.py(86): on_data
E20230220 15:37:11.484414 726055 runner.cpp:189] Runner::await_join - an exception was caught while awaiting on one or more contexts/instances - rethrowing
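For what it's worth, the error message above suggests the max_split_size_mb allocator option. A minimal sketch of applying it, assuming the variable can be set before the first CUDA allocation in the worker process (the 512 MB value is just an example, not a recommended setting):

```python
import os

# Must be set before PyTorch makes its first CUDA allocation in this process;
# 512 is an example value, not a recommendation from the Morpheus docs.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # imported after the env var so the allocator picks it up
```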
Full env printout
No response
Other/Misc.
No response
Code of Conduct
- I agree to follow Morpheus' Code of Conduct
- I have searched the open bugs and have found no duplicates for this bug report