Description
Hello, there. I am using the newly released torchrec for our model training. I use VBE (variable batch embedding) to reduce data duplication in the embedding lookup and the communication volume in the forward pass.
Specifically, I enabled VBE in my code and ran it on 8 GPUs (rank0-rank7). The forward pass consistently fails: rank 1 always returns the correct lookup results, while the other ranks always raise the error below. To reproduce the error in a minimal setting, I reduced the model size to fit on 2 GPUs; the error then shows up on rank 0, while rank 1 still returns the correct lookup results.
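To make the setup concrete, here is a toy illustration of the deduplication idea I rely on (the values are made up, and I use plain torch.unique instead of the torchrec internals): only the unique rows are looked up, and inverse indices expand the result back to the full batch.

import torch

# Toy example with made-up ids: 5 samples but only 3 unique values.
values = torch.tensor([5, 7, 5, 9, 7])
unique_vals, inverse = torch.unique(values, return_inverse=True)
embeddings = torch.randn(len(unique_vals), 4)   # stand-in for the embedding lookup result
expanded = embeddings[inverse]                  # expand back to the per-sample batch
assert expanded.shape == (5, 4)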
Here is the error:
" Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f59c88cd897 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f59c887db25 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f59c89a5718 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f59c9ba3e36 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f59c9ba7f38 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f59c9bad5ac in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f59c9bae31c in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd3b75 (0x7f5a1565cb75 in /opt/conda/envs/ptca/bin/../lib/libstdc++.so.6)
frame #8: + 0x8609 (0x7f5a17e91609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f5a17c5c353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'"
I build the "feature_inverse_indices" as shown below.
...................................................................................................
# `batch.batch_data[0]` is a polars DataFrame; `feature` is the sparse feature column.
sparse_feature_df = batch.batch_data[0].select(feature)
# Build a string key per row so duplicate feature value lists can be identified.
sparse_feature_df = sparse_feature_df.with_columns(
    sparse_feature_df[feature].apply(lambda x: "".join([str(i) for i in x])).alias("id")
)
# Keep only the unique rows, preserving first-occurrence order.
unique_df = sparse_feature_df.unique(maintain_order=True)
print(f"{feature}, sparse_feature_df: {sparse_feature_df.shape}, unique_df: {unique_df.shape}")
# Map every original row to the row number of its unique counterpart.
df_with_index = sparse_feature_df.join(unique_df.with_row_count(), on="id", how="left")
inverse_indices = df_with_index["row_nr"].to_numpy(zero_copy_only=True)
sparse_feature_inverse_indices.append(inverse_indices)
.........................................................................................................
# Stack the per-feature inverse indices into a [num_features, batch_size] int64 tensor.
sparse_feature_inverse_indices = torch.tensor(np.stack(sparse_feature_inverse_indices), dtype=torch.int64)
..........................................................................................................
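For completeness, this is roughly how the stacked inverse indices get attached to the input KeyedJaggedTensor. It is a simplified sketch: the keys, values, lengths, and per-rank strides below are placeholders rather than my real features; only the inverse_indices wiring mirrors my code.

import numpy as np
import torch
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

# Placeholder VBE input for two features "f1"/"f2" on a single rank.
sparse_feature_inverse_indices = torch.tensor(
    np.stack([[0, 1, 0], [0, 0, 1]]), dtype=torch.int64  # [num_features, full batch size]
)
kjt = KeyedJaggedTensor(
    keys=["f1", "f2"],
    values=torch.tensor([10, 11, 12, 13]),
    lengths=torch.tensor([1, 1, 1, 1]),       # one id per deduplicated sample
    stride_per_key_per_rank=[[2], [2]],       # deduplicated batch size per key per rank
    inverse_indices=(["f1", "f2"], sparse_feature_inverse_indices),
)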