GPU Memory Leak on Loading Pre-Trained Checkpoint #6515

@bilzard

Description

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training

Bug

Training YOLOv5 from a saved checkpoint (*.pt) consumes more GPU memory than training from one of the official pre-trained weights (e.g. yolov5l).

Environment

  • YOLOv5: latest (how do I check the YOLOv5 version?)
  • CUDA: 11.6 (Tesla T4, 15360MiB)
  • OS: Ubuntu 18.04.6 LTS (Bionic Beaver)
  • Python: 3.8.12

Minimal Reproducible Example

In the training commands below, case 2 requires more GPU memory than case 1.

# 1. train from pre-trained weights
python train.py ... --weights yolov5l

# 2. train from a saved checkpoint
python train.py ... --weights pre_trained_checkpoint.pt
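
To quantify the difference, one can print the allocated and reserved CUDA memory right after the weights are loaded in both cases. A minimal sketch (the helper name and where it is called are my own, not part of train.py):

import torch

def log_cuda_memory(tag: str, device: int = 0) -> None:
    """Print allocated/reserved CUDA memory in MiB for a before/after comparison."""
    allocated = torch.cuda.memory_allocated(device) / 2**20
    reserved = torch.cuda.memory_reserved(device) / 2**20
    print(f"[{tag}] allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")

# e.g. call log_cuda_memory("after checkpoint load") right after the torch.load() call in train.py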

Additional

As reported on the PyTorch forums [1], loading a state dict directly onto the CUDA device can cause a GPU memory leak. We should load it into CPU memory instead:

# map_location here keeps every tensor on the CPU instead of restoring it to its saved CUDA device
state_dict = torch.load(directory, map_location=lambda storage, loc: storage)
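
Applied to a YOLOv5-style checkpoint (a dict whose 'model' key holds the saved model, per the repo's saving convention), the fix would look roughly like this; the file path and variable names are illustrative, not the actual train.py code:

import torch

device = torch.device("cuda:0")

# Deserialize the checkpoint into CPU memory so no extra CUDA allocations happen during loading
ckpt = torch.load("pre_trained_checkpoint.pt", map_location="cpu")

# Move only the model onto the GPU, explicitly and exactly once
model = ckpt["model"].float().to(device)

Either map_location="cpu" or the lambda storage, loc: storage form above pins the deserialized tensors to CPU; the weights then occupy GPU memory only when moved with .to(device).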

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!

Metadata

Labels: bug (Something isn't working)