Description
Version
23.01
Which installation method(s) does this occur on?
Conda
Describe the bug.
We get a CUDA out-of-memory error during inference. I'm not able to make it through 24 hours of data before running out of GPU memory; it appears that there is not enough memory on a single GPU after loading our models.
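As a possible mitigation we have been sketching on our side (not Morpheus API; `run_chunked_inference` and the chunk size below are hypothetical names/values for illustration), the idea is to split each inference window into smaller row chunks before calling the autoencoder, so that no single `get_results()` call has to hold activations for the full 24-hour DataFrame:

```python
import pandas as pd
import torch


def run_chunked_inference(model, df: pd.DataFrame, chunk_rows: int = 50_000) -> pd.DataFrame:
    """Hypothetical sketch: run the autoencoder over row chunks instead of the
    whole window, so activations for 24 h of data are never resident on the
    GPU at once. The exact arguments to get_results() may differ from what
    dfp_inference_stage.py passes."""
    outputs = []
    for start in range(0, len(df), chunk_rows):
        chunk = df.iloc[start:start + chunk_rows]
        with torch.no_grad():                # inference only, skip autograd buffers
            outputs.append(model.get_results(chunk))
        torch.cuda.empty_cache()             # release cached blocks between chunks
    return pd.concat(outputs, ignore_index=True)
```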
Minimum reproducible example
No response
Relevant log output
Input data rate: 88758777 messages [12:38, 117715.93 message
E20230220 15:37:11.462033 726762 context.cpp:124] linear_segment_0/dfp-inference-7; rank: 0; size: 1; tid: 139759150155520: set_exception issued; issuing kill to current runnable. Exception msg: RuntimeError: CUDA out of memory. Tried to allocate 7.04 GiB (GPU 0; 79.35 GiB total capacity; 20.52 GiB already allocated; 3.85 GiB free; 21.84 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
At: /home/a6005028/aml-cyber-fraud/python/morpheus_env/lib/python3.8/site-packages/dfencoder/autoencoder.py(132): forward
/home/a6005028/aml-cyber-fraud/python/morpheus_env/lib/python3.8/site-packages/torch/nn/modules/module.py(1130): _call_impl
/home/a6005028/aml-cyber-fraud/python/morpheus_env/lib/python3.8/site-packages/dfencoder/autoencoder.py(546): encode
/home/a6005028/aml-cyber-fraud/python/morpheus_env/lib/python3.8/site-packages/dfencoder/autoencoder.py(1075): get_results
/opt/morpheus/install/morpheus/morpheus/dfp/stages/dfp_inference_stage.py(86): on_data
E20230220 15:37:11.484414 726055 runner.cpp:189] Runner::await_join - an exception was caught while awaiting on one or more contexts/instances - rethrowing
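For what it's worth, the error message above suggests the max_split_size_mb allocator option. A minimal sketch of applying it, assuming the variable can be set before the first CUDA allocation in the worker process (the 512 MB value is just an example, not a recommended setting):

```python
import os

# Must be set before PyTorch makes its first CUDA allocation in this process;
# 512 is an example value, not a recommendation from the Morpheus docs.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # imported after the env var so the allocator picks it up
```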
Full env printout
No response
Other/Misc.
No response
Code of Conduct
- I agree to follow Morpheus' Code of Conduct
- I have searched the open bugs and have found no duplicates for this bug report