Hi, I have a problem integrating DeepSpeed and PyG, possibly similar to issue #8426.
With precision set to 32 on the Lightning Trainer, on a single Quadro RTX 6000 GPU, everything works fine.
But after switching to 16-bit precision, I get the following traceback when calling Trainer.fit() (even after calling torch.Tensor.half() on the model, on the input, or on both).
Traceback (most recent call last):
File "/projects/pyg/user/project/main_git.py", line 73, in <module>
trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=test_loader)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
return function(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1103, in _run
results = self._run_stage()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1182, in _run_stage
self._run_train()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run_train
self._run_sanity_check()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1267, in _run_sanity_check
val_loop.run()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
output = self._evaluation_step(**kwargs)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1485, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/strategies/deepspeed.py", line 917, in validation_step
return self.model(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1836, in forward
loss = self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1148, in _call_impl
result = forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/overrides/base.py", line 110, in forward
return self._forward_module.validation_step(*inputs, **kwargs)
File "/projects/pyg/user/project/GraphNN.py", line 125, in validation_step
return self.step(batch)
File "/projects/pyg/user/project/GraphNN.py", line 139, in step
net_outputs = self(x, batch.edge_index, batch.edge_attr, batch.batch)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1148, in _call_impl
result = forward_call(*input, **kwargs)
File "/projects/pyg/user/project/GraphNN.py", line 103, in forward
x = f(x, edge_index, edge_attr)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1148, in _call_impl
result = forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch_geometric_temporal/nn/recurrent/gconv_gru.py", line 166, in forward
Z = self._calculate_update_gate(X, edge_index, edge_weight, H, lambda_max)
File "/usr/local/lib/python3.8/dist-packages/torch_geometric_temporal/nn/recurrent/gconv_gru.py", line 120, in _calculate_update_gate
Z = self.conv_x_z(X, edge_index, edge_weight, lambda_max=lambda_max)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1148, in _call_impl
result = forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch_geometric/nn/conv/cheb_conv.py", line 170, in forward
out = self.lins[0](Tx_0)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1148, in _call_impl
result = forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch_geometric/nn/dense/linear.py", line 136, in forward
return F.linear(x, self.weight, self.bias)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/linear.py", line 116, in zero3_linear_wrap
return LinearFunctionForZeroStage3.apply(input, weight)
File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/autocast_mode.py", line 110, in decorate_fwd
return fwd(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/linear.py", line 61, in forward
output = input.matmul(weight.t())
RuntimeError: expected scalar type Float but found Half
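The last frame fails on `input.matmul(weight.t())`, which suggests that the input and the weight reach the linear layer in different precisions (one float32, one float16). A small helper along these lines could help narrow down which tensors are in which precision. This is only a sketch: `group_by_dtype` and `has_mixed_dtypes` are hypothetical names, not part of PyTorch, Lightning, or DeepSpeed, and the stand-in objects exist only so the snippet runs without torch or a GPU.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_dtype(named_tensors):
    """Group names by the string form of each object's ``dtype`` attribute."""
    groups = defaultdict(list)
    for name, tensor in named_tensors:
        groups[str(tensor.dtype)].append(name)
    return dict(groups)


def has_mixed_dtypes(named_tensors):
    """Return True when more than one dtype is present."""
    return len(group_by_dtype(named_tensors)) > 1


# Stand-in objects (assumption, for illustration only). With a real model,
# pass model.named_parameters() and the batch tensors instead.
fake_tensors = [
    ("lins.0.weight", SimpleNamespace(dtype="torch.float32")),
    ("lins.0.bias", SimpleNamespace(dtype="torch.float32")),
    ("batch.x", SimpleNamespace(dtype="torch.float16")),
]

report = group_by_dtype(fake_tensors)
print(report)
```

With a real model, something like `group_by_dtype(model.named_parameters())` called inside `validation_step` would show whether the weights were actually converted to half precision by the time the sanity check runs.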
Environment
System info:
OS: Ubuntu 20.04.4 LTS
GPU count and types: 1 x Quadro RTX 6000
Python version: Python 3.8.10
Pip installed libraries:
torch==1.12.1+cu116
torch-cluster==1.6.0+pt112cu116
torch-geometric==2.2.0
torch-geometric-temporal==0.54.0
torch-scatter==2.1.0+pt112cu116
torch-sparse==0.6.16+pt112cu116
torch-spline-conv==1.2.1+pt112cu116
torchaudio==0.12.1+cu116
torchfile==0.1.0
torchmetrics==0.9.3
torchvision==0.13.1+cu116
deepspeed==0.8.0
pytorch-lightning==1.9.0
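The report does not include the reproduction script, so the following is only a guess at the Trainer setup that matches the traceback and the versions listed above. The `deepspeed_stage_3` strategy is an assumption inferred from the `zero3_linear_wrap` frame. `trainer_kwargs` and `build_trainer` are hypothetical helper names.

```python
def trainer_kwargs(precision):
    """Keyword arguments for pytorch_lightning.Trainer (assumed, not taken from the report)."""
    return {
        "accelerator": "gpu",
        "devices": 1,                     # single Quadro RTX 6000
        "strategy": "deepspeed_stage_3",  # assumption, based on the zero3_linear_wrap frame
        "precision": precision,           # 32 works, 16 raises the RuntimeError
    }


def build_trainer(precision):
    """Create the Trainer; imported lazily so the kwargs can be inspected without Lightning."""
    import pytorch_lightning as pl

    return pl.Trainer(**trainer_kwargs(precision))


working = trainer_kwargs(32)
failing = trainer_kwargs(16)

# The only difference between the working and failing runs is the precision flag.
changed = [key for key in working if working[key] != failing[key]]
print(changed)
```

Calling `build_trainer(16).fit(model, train_dataloaders=train_loader, val_dataloaders=test_loader)` would then correspond to the failing call at the top of the traceback, assuming the guessed strategy is correct.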
More info
No response
cc @awaelchli