Move inf_or_nan_tracker to cpu for cpu offload #5826

Merged
merged 6 commits into from
Aug 16, 2024
move inf_or_nan_tracker to cpu for cpu offload
Change-Id: Ib395ddc91605abd593ad070ab7b6f453e982174f
BacharL committed Aug 5, 2024
commit 89dc96339da51ecd383af1fee3a6834ea1896980
10 changes: 4 additions & 6 deletions deepspeed/runtime/zero/stage3.py
@@ -215,14 +215,12 @@ def __init__(
  self.module = module
  self.elastic_checkpoint = elastic_checkpoint

- self.inf_or_nan_tracker: Tensor = torch.zeros(1,
- dtype=torch.bool,
- device=get_accelerator().current_device_name(),
- requires_grad=False)
+ self.device = get_accelerator().current_device_name() if not self.offload_optimizer else OffloadDeviceEnum.cpu
+
+ self.inf_or_nan_tracker: Tensor = torch.zeros(1, dtype=torch.bool, device=self.device, requires_grad=False)

  self.deepspeed_adam_offload = (self.offload_optimizer and type(init_optimizer) == DeepSpeedCPUAdam)

- self.device = get_accelerator().current_device_name() if not self.offload_optimizer else OffloadDeviceEnum.cpu
  ### streams used for overlapping computation with communication
  self.reduce_and_partition_stream = None if get_accelerator().is_synchronized_device() else get_accelerator(
  ).Stream() if overlap_comm else get_accelerator().default_stream()
@@ -2146,7 +2144,7 @@ def has_overflow(self, partition_gradients=True):
  self.inf_or_nan_tracker += torch.isnan(self.grad_partitions_flat_buffer).any()
  self.inf_or_nan_tracker = self.inf_or_nan_tracker > 0

- overflow_gpu = self.inf_or_nan_tracker.clone().to(torch.uint8)
+ overflow_gpu = self.inf_or_nan_tracker.clone().to(get_accelerator().current_device()).to(torch.uint8)
  self.inf_or_nan_tracker.zero_()

  if not get_accelerator().resolves_data_dependency():
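For readers following the change outside the DeepSpeed codebase, here is a minimal standalone sketch of the pattern the two hunks appear to implement: allocate the overflow tracker on the offload device, fold the inf/nan checks into it, then copy only the resulting flag to the accelerator. The helper names `tracker_device` and `check_overflow` and the `comm_device` parameter are hypothetical and are not DeepSpeed APIs. The sketch also assumes that the flat gradient buffer lives on the CPU when the optimizer is offloaded, which the diff itself does not show.

```python
import torch


def tracker_device(offload_optimizer: bool, accelerator_device: str) -> str:
    """Pick where the overflow tracker lives (hypothetical helper)."""
    # Assumption: with optimizer offload the flat gradient buffer is on the CPU,
    # so the tracker is kept on the CPU as well.
    return "cpu" if offload_optimizer else accelerator_device


def check_overflow(grad_buffer: torch.Tensor, tracker: torch.Tensor, comm_device: str) -> torch.Tensor:
    """Fold inf/nan checks into the tracker and return a uint8 flag on comm_device."""
    tracker.logical_or_(torch.isinf(grad_buffer).any())
    tracker.logical_or_(torch.isnan(grad_buffer).any())
    # Only the one-element flag is moved to the device used for communication.
    flag = tracker.clone().to(comm_device).to(torch.uint8)
    # Reset the tracker so the next step starts clean.
    tracker.zero_()
    return flag


# Falls back to the CPU when no CUDA device is present, so the sketch runs anywhere.
accelerator_device = "cuda" if torch.cuda.is_available() else "cpu"
device = tracker_device(offload_optimizer=True, accelerator_device=accelerator_device)
tracker = torch.zeros(1, dtype=torch.bool, device=device, requires_grad=False)
grads = torch.tensor([1.0, float("inf"), 0.5], device=device)
flag = check_overflow(grads, tracker, comm_device=accelerator_device)
```

In this sketch the tracker and the gradient buffer share a device, so the in-place `logical_or_` calls need no cross-device transfer. Only the single-element clone is copied to the accelerator.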