
Trainer.predict on really large dataset cause CPU out-of-memory #15656

Closed
@junwang-wish


Bug description

Trainer.predict(model, datamodule) on a sufficiently large dataset causes CPU out-of-memory because results are appended to a list during prediction (this happens even when return_predictions=False is set): https://github.com/Lightning-AI/lightning/blob/4e8cf85b0cd5128adcec3f3ad0f2254f417ae1ee/src/pytorch_lightning/loops/dataloader/prediction_loop.py#L103

What is the correct way of running prediction on a dataset that is orders of magnitude larger than CPU memory?
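A commonly suggested pattern (a sketch only, not a confirmed fix for the accumulation linked above) is to write each batch's output to disk from a BasePredictionWriter callback and call predict with return_predictions=False, so nothing needs to be collected in memory. The DiskPredictionWriter class and the predictions/ output directory below are illustrative names; model and datamodule are assumed to be the ones from this report.

import os
import torch
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import BasePredictionWriter

class DiskPredictionWriter(BasePredictionWriter):
    # Illustrative callback: persist each batch's predictions to disk so
    # they never accumulate in a Python list in CPU memory.
    def __init__(self, output_dir):
        super().__init__(write_interval="batch")
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)

    def write_on_batch_end(self, trainer, pl_module, prediction, batch_indices,
                           batch, batch_idx, dataloader_idx):
        torch.save(prediction, os.path.join(self.output_dir,
                                            f"batch_{dataloader_idx}_{batch_idx}.pt"))

# `model` and `datamodule` are the ones from this report.
trainer = Trainer(callbacks=[DiskPredictionWriter("predictions/")])
trainer.predict(model, datamodule=datamodule, return_predictions=False)

Note that with this pattern predict_step should return the batch's output so the writer receives it; the writer then holds each result only long enough to save it.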

How to reproduce the bug

# Always return `None` in `predict_step` and track your memory usage:
import objgraph

def predict_step(self, batch, batch_idx):
    # Print the three object types whose instance count grew the most
    objgraph.show_growth(limit=3)
    return None
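For completeness, a minimal self-contained sketch that exercises the snippet above (RandomDataset and MemoryProbeModel are made-up names for illustration; any sufficiently large dataset reproduces the growth):

import torch
from torch.utils.data import DataLoader, Dataset
import objgraph
from pytorch_lightning import LightningModule, Trainer

class RandomDataset(Dataset):
    # Synthetic data so the script has no external dependencies.
    def __len__(self):
        return 100_000

    def __getitem__(self, idx):
        return torch.randn(32)

class MemoryProbeModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def predict_step(self, batch, batch_idx):
        # Report the fastest-growing object types at each step.
        objgraph.show_growth(limit=3)
        return None

model = MemoryProbeModel()
trainer = Trainer()
trainer.predict(model, DataLoader(RandomDataset(), batch_size=64),
                return_predictions=False)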

Error messages and logs


# You will see the instance count for type `list` increment at every prediction step, like below:
list    11320        +1

Environment


#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response
