
[BUG]: Production DFP: CUDA out of memory on inference #724

Open

Description

Version

23.01

Which installation method(s) does this occur on?

Conda

Describe the bug.

We get a CUDA out-of-memory error during inference. I'm not able to make it through 24 hours of data before running out of GPU memory. It appears that there is not enough memory on a single GPU after loading our models.

Minimum reproducible example

No response

Relevant log output

Input data rate: 88758777 messages [12:38, 117715.93 message
E20230220 15:37:11.462033 726762 context.cpp:124] linear_segment_0/dfp-inference-7; rank: 0; size: 1; tid: 139759150155520: set_exception issued; issuing kill to current runnable. Exception msg: RuntimeError: CUDA out of memory. Tried to allocate 7.04 GiB (GPU 0; 79.35 GiB total capacity; 20.52 GiB already allocated; 3.85 GiB free; 21.84 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
At:
  /home/a6005028/aml-cyber-fraud/python/morpheus_env/lib/python3.8/site-packages/dfencoder/autoencoder.py(132): forward
  /home/a6005028/aml-cyber-fraud/python/morpheus_env/lib/python3.8/site-packages/torch/nn/modules/module.py(1130): _call_impl
  /home/a6005028/aml-cyber-fraud/python/morpheus_env/lib/python3.8/site-packages/dfencoder/autoencoder.py(546): encode
  /home/a6005028/aml-cyber-fraud/python/morpheus_env/lib/python3.8/site-packages/dfencoder/autoencoder.py(1075): get_results
  /opt/morpheus/install/morpheus/morpheus/dfp/stages/dfp_inference_stage.py(86): on_data
E20230220 15:37:11.484414 726055 runner.cpp:189] Runner::await_join - an exception was caught while awaiting on one or more contexts/instances - rethrowing
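
For reference, here is a rough reading of the numbers in the error message, as a small Python sketch. The figures are copied from the log above; the variable names are only illustrative. The assumption is that whatever is neither free nor reserved by PyTorch is held by something else on GPU 0 (other libraries, the CUDA context, or other processes; the log does not say which).

```python
# Figures copied from the "CUDA out of memory" message above, in GiB.
total_capacity = 79.35   # GPU 0 total capacity
allocated = 20.52        # already allocated by PyTorch tensors
reserved = 21.84         # reserved in total by PyTorch's caching allocator
free = 3.85              # free on the device when the allocation failed
requested = 7.04         # size of the allocation that failed

# Memory that is neither free nor reserved by PyTorch.
# Assumption: this is held outside PyTorch's allocator; the log does not say by what.
held_outside_pytorch = total_capacity - reserved - free

# Reserved-but-unallocated memory inside PyTorch's cache.
# The max_split_size_mb hint applies when this gap is large.
cached_slack = reserved - allocated

# How far the failed request exceeded the free memory.
shortfall = requested - free

print(f"held outside PyTorch's allocator: {held_outside_pytorch:.2f} GiB")
print(f"reserved but not allocated:       {cached_slack:.2f} GiB")
print(f"request exceeds free memory by:   {shortfall:.2f} GiB")
```

If this reading is right, most of the 79.35 GiB is in use outside PyTorch's allocator, and the gap between reserved and allocated memory is small, so the `max_split_size_mb` suggestion in the message may not help much here.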

Full env printout

No response

Other/Misc.

No response

Code of Conduct

  • I agree to follow Morpheus' Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report

Metadata

Assignees

No one assigned

    Labels

    bug: Something isn't working
    dfp: [Workflow] Related to the Digital Fingerprinting (DFP) workflow
