Webserver fails with Cuda OutOfMemory #294

@JB91451

Dear DiffDock developers,

Currently, the web server on Hugging Face fails on every run with a CUDA out-of-memory error. Could you please take a look?

Below is an example of the typical error report.

Best regards,
Juergen

Standard out:
Generating ESM language model embeddings
Processing 1 of 1 batches (1 sequences)

Standard error:

libgomp: Invalid value for environment variable OMP_NUM_THREADS
/home/appuser/micromamba/envs/diffdock/lib/python3.9/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
  warnings.warn(
[2025-Sep-01 14:57:08 UTC] [inference.py:153] INFO - DiffDock will run on cuda
[2025-Sep-01 14:57:19 UTC] [inference.py:184] INFO - Confidence model uses different type of graphs than the score model. Loading (or creating if not existing) the data for the confidence model now.
[2025-Sep-01 14:57:28 UTC] [inference.py:223] INFO - Size of test dataset: 1

0it [00:00, ?it/s][2025-Sep-01 14:57:40 UTC] [process_mols.py:309] DEBUG - rdkit coords could not be generated. trying again 1.
[2025-Sep-01 14:57:52 UTC] [process_mols.py:309] DEBUG - rdkit coords could not be generated. trying again 2.
[2025-Sep-01 14:58:04 UTC] [process_mols.py:313] INFO - rdkit coords could not be generated without using random coords. using random coords now.
/home/appuser/DiffDock/datasets/parse_chi.py:91: RuntimeWarning: invalid value encountered in cast
  Y = indices.astype(int)
[2025-Sep-01 14:59:44 UTC] [process_mols.py:309] DEBUG - rdkit coords could not be generated. trying again 1.
[2025-Sep-01 14:59:56 UTC] [process_mols.py:309] DEBUG - rdkit coords could not be generated. trying again 2.
[2025-Sep-01 15:00:08 UTC] [process_mols.py:313] INFO - rdkit coords could not be generated without using random coords. using random coords now.
--- Logging error ---
Traceback (most recent call last):
  File "/home/appuser/DiffDock/inference.py", line 260, in main
    data_list, confidence = sampling(data_list=data_list, model=model,
  File "/home/appuser/DiffDock/utils/sampling.py", line 116, in sampling
    tr_score, rot_score, tor_score = model(mod_complex_graph_batch)[:3]
  File "/home/appuser/micromamba/envs/diffdock/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/appuser/DiffDock/models/cg_model.py", line 345, in forward
    node_attr = self.conv_layers[l](node_attr, edge_index, edge_attr_, edge_sh, edge_weight=edge_weight)
  File "/home/appuser/micromamba/envs/diffdock/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/appuser/DiffDock/models/tensor_layers.py", line 321, in forward
    out = tp_scatter_multigroup(self.tp, self.fc, node_attr, edge_index, edge_attr, edge_sh,
  File "/home/appuser/DiffDock/models/tensor_layers.py", line 211, in tp_scatter_multigroup
    cur_out_irreps = cur_fc(edge_attr_groups[ii])
  File "/home/appuser/micromamba/envs/diffdock/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/appuser/micromamba/envs/diffdock/lib/python3.9/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/appuser/micromamba/envs/diffdock/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/appuser/micromamba/envs/diffdock/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.96 GiB (GPU 0; 14.74 GiB total capacity; 2.62 GiB already allocated; 11.61 GiB free; 2.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/appuser/micromamba/envs/diffdock/lib/python3.9/logging/__init__.py", line 1083, in emit
    msg = self.format(record)
  File "/home/appuser/micromamba/envs/diffdock/lib/python3.9/logging/__init__.py", line 927, in format
    return fmt.format(record)
  File "/home/appuser/micromamba/envs/diffdock/lib/python3.9/logging/__init__.py", line 663, in format
    record.message = record.getMessage()
  File "/home/appuser/micromamba/envs/diffdock/lib/python3.9/logging/__init__.py", line 367, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/home/appuser/DiffDock/inference.py", line 318, in <module>
    main(_args)
  File "/home/appuser/DiffDock/inference.py", line 302, in main
    logger.warning("Failed on", orig_complex_graph["name"], e)
Message: 'Failed on'
Arguments: (['complex_0'], OutOfMemoryError('CUDA out of memory. Tried to allocate 14.96 GiB (GPU 0; 14.74 GiB total capacity; 2.62 GiB already allocated; 11.61 GiB free; 2.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF'))

1it [03:16, 196.77s/it]
1it [03:16, 196.77s/it]
[2025-Sep-01 15:00:45 UTC] [inference.py:310] WARNING - 
    Failed for 1 / 1 complexes.
    Skipped 0 / 1 complexes.

[2025-Sep-01 15:00:45 UTC] [inference.py:313] INFO - Results saved in /tmp/tmp_ykwbabp
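
A side note on the `--- Logging error ---` block in the report above: it appears to be a secondary problem rather than part of the out-of-memory failure itself. The call `logger.warning("Failed on", orig_complex_graph["name"], e)` passes two arguments to a message that has no `%s` placeholders, so `logging` raises `TypeError: not all arguments converted during string formatting` while building the record. A minimal sketch of a call that formats cleanly is shown below. The helper name `log_failure`, the logger name, and the sample values are hypothetical stand-ins and not taken from the DiffDock code.

```python
import io
import logging


def log_failure(logger, name, exc):
    """Log a failed complex with one %s placeholder per argument."""
    # logging interpolates the arguments lazily via `msg % args`,
    # so the number of placeholders must match the number of arguments.
    logger.warning("Failed on %s: %s", name, exc)


# Capture the output in memory so the formatted message can be inspected.
stream = io.StringIO()
logger = logging.getLogger("diffdock_logging_example")  # hypothetical name
logger.setLevel(logging.WARNING)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))

# Sample values modelled on the report; RuntimeError stands in for the
# torch.cuda.OutOfMemoryError so the sketch runs without PyTorch.
log_failure(logger, ["complex_0"], RuntimeError("CUDA out of memory."))
captured = stream.getvalue()
print(captured, end="")
```

With a fix along these lines, the original exception would be reported in a single warning line instead of being buried under a second traceback.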

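Regarding the numbers in the CUDA error message: the failing call requests 14.96 GiB in a single allocation, which is more than the 14.74 GiB total capacity of the GPU. A request of that size cannot succeed on this device regardless of allocator settings, so the `max_split_size_mb` hint in the message is unlikely to help here. The sketch below extracts the figures from the message and makes that comparison explicit. The function name `summarize_oom` is hypothetical, and the pattern is an assumption that only matches the message wording shown in this report; other PyTorch versions may phrase it differently.

```python
import re

# Matches the message layout seen in this report (sizes given in GiB).
OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<requested>[\d.]+) GiB "
    r"\(GPU \d+; (?P<total>[\d.]+) GiB total capacity; "
    r"(?P<allocated>[\d.]+) GiB already allocated; "
    r"(?P<free>[\d.]+) GiB free"
)


def summarize_oom(message):
    """Return the sizes from an OOM message in GiB, or None if it does not match."""
    match = OOM_PATTERN.search(message)
    if match is None:
        return None
    sizes = {key: float(value) for key, value in match.groupdict().items()}
    # A single request larger than the whole device can never be satisfied.
    sizes["exceeds_total"] = sizes["requested"] > sizes["total"]
    return sizes


message = (
    "CUDA out of memory. Tried to allocate 14.96 GiB (GPU 0; 14.74 GiB total "
    "capacity; 2.62 GiB already allocated; 11.61 GiB free; 2.76 GiB reserved "
    "in total by PyTorch)"
)
summary = summarize_oom(message)
print(summary)
```

For the message in this report, `exceeds_total` is true, which suggests the problem lies in the size of the tensor being built for this input rather than in memory fragmentation.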