Hi,
Not sure if you have plans to upgrade the DLRM code to ipex 1.10. I tried to upgrade the DLRM code to ipex 1.10 based on the patch from https://github.com/intel/intel-extension-for-pytorch/blob/0.2/torch_patches/models/0001-enable-dlrm-distributed-training-for-cpu.patch and noticed a performance regression.
A micro-benchmark shows that all_to_all is about 2x slower after upgrading to ipex 1.10. Any ideas?
system config:
- torch ccl 1.10, pytorch 1.10, ipex 1.10
- single node, 2 ranks per node
all_to_all profile with ipex v0.2:
all_to_all profile with ipex 1.10:
test code:
import torch
import os
import extend_distributed as ext_dist

if __name__ == "__main__":
    ext_dist.init_distributed(backend='ccl')

    inputs = []
    tensor1 = torch.ones(262144, 16, dtype=torch.bfloat16)
    tensor2 = torch.ones(262144, 16, dtype=torch.bfloat16)
    inputs.append(tensor1)
    inputs.append(tensor2)

    with torch.autograd.profiler.profile(True) as prof:
        for _ in range(10):
            a2a_req = ext_dist.alltoall(inputs, None)
            ly_sparse = a2a_req.wait()

    print(prof.key_averages().table(sort_by="cpu_time_total"))
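A stripped-down variant that calls torch.distributed.all_to_all directly could help narrow down whether the slowdown comes from the extend_distributed wrapper or from torch-ccl itself. Below is a rough sketch of that idea, not the exact code I ran; the torch_ccl import name and the launcher-provided environment variables (rank, world size, master address) are assumptions and may need adjusting for your setup:

import torch
import torch.distributed as dist

# torch-ccl registers the "ccl" backend with torch.distributed on import.
# The module name below matches the torch-ccl 1.10 release; newer releases
# use oneccl_bindings_for_pytorch instead.
import torch_ccl  # noqa: F401

if __name__ == "__main__":
    # Rank, world size, and master address/port are expected to come from
    # the launcher (e.g. mpirun or torch.distributed.launch), the same way
    # they do for the test code above.
    dist.init_process_group(backend="ccl")
    world_size = dist.get_world_size()

    # Same total payload as the two (262144, 16) bfloat16 tensors above,
    # split evenly across ranks for all_to_all.
    payload = torch.ones(2 * 262144, 16, dtype=torch.bfloat16)
    input_split = list(payload.chunk(world_size))
    output_split = [torch.empty_like(c) for c in input_split]

    with torch.autograd.profiler.profile(True) as prof:
        for _ in range(10):
            dist.all_to_all(output_split, input_split)

    if dist.get_rank() == 0:
        print(prof.key_averages().table(sort_by="cpu_time_total"))

If this direct version shows the same 2x gap between the two setups, the regression would point at torch-ccl/oneCCL rather than at the DLRM patch itself.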
Thanks