Description
Hello, there. I am using the newly released torchrec for our model training. I use VBE (variable batch embedding) to reduce data duplication in the embedding lookup and the communication volume in the forward pass.
Specifically, I enabled VBE in my code and ran it on 8 GPUs (rank0-rank7). The forward pass consistently fails: rank 1 always returns the correct lookup results, while the other ranks always raise the error below. To reproduce the error in a minimal setting, I reduced the model size to fit on 2 GPUs; the error then shows up on rank 0, while rank 1 still returns the correct lookup results.
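To make the setup concrete, here is a toy illustration of the deduplication idea I rely on (the values are made up, and I use plain torch.unique instead of the torchrec internals): only the unique rows are looked up, and inverse indices expand the result back to the full batch.

import torch

# Toy example with made-up ids: 5 samples but only 3 unique values.
values = torch.tensor([5, 7, 5, 9, 7])
unique_vals, inverse = torch.unique(values, return_inverse=True)
embeddings = torch.randn(len(unique_vals), 4)   # stand-in for the embedding lookup result
expanded = embeddings[inverse]                  # expand back to the per-sample batch
assert expanded.shape == (5, 4)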
Here is the error:
" Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f59c88cd897 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f59c887db25 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f59c89a5718 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f59c9ba3e36 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f59c9ba7f38 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f59c9bad5ac in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f59c9bae31c in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd3b75 (0x7f5a1565cb75 in /opt/conda/envs/ptca/bin/../lib/libstdc++.so.6)
frame #8: + 0x8609 (0x7f5a17e91609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f5a17c5c353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'"
I build the "feature_inverse_indices" as shown below.
...................................................................................................
# `batch.batch_data[0]` is a polars DataFrame; `feature` is the sparse feature column.
sparse_feature_df = batch.batch_data[0].select(feature)
# Build a string key per row so duplicate feature value lists can be identified.
sparse_feature_df = sparse_feature_df.with_columns(
    sparse_feature_df[feature].apply(lambda x: "".join([str(i) for i in x])).alias("id")
)
# Keep only the unique rows, preserving first-occurrence order.
unique_df = sparse_feature_df.unique(maintain_order=True)
print(f"{feature}, sparse_feature_df: {sparse_feature_df.shape}, unique_df: {unique_df.shape}")
# Map every original row to the row number of its unique counterpart.
df_with_index = sparse_feature_df.join(unique_df.with_row_count(), on="id", how="left")
inverse_indices = df_with_index["row_nr"].to_numpy(zero_copy_only=True)
sparse_feature_inverse_indices.append(inverse_indices)
.........................................................................................................
# Stack the per-feature inverse indices into a [num_features, batch_size] int64 tensor.
sparse_feature_inverse_indices = torch.tensor(np.stack(sparse_feature_inverse_indices), dtype=torch.int64)
..........................................................................................................
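For completeness, this is roughly how the stacked inverse indices get attached to the input KeyedJaggedTensor. It is a simplified sketch: the keys, values, lengths, and per-rank strides below are placeholders rather than my real features; only the inverse_indices wiring mirrors my code.

import numpy as np
import torch
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

# Placeholder VBE input for two features "f1"/"f2" on a single rank.
sparse_feature_inverse_indices = torch.tensor(
    np.stack([[0, 1, 0], [0, 0, 1]]), dtype=torch.int64  # [num_features, full batch size]
)
kjt = KeyedJaggedTensor(
    keys=["f1", "f2"],
    values=torch.tensor([10, 11, 12, 13]),
    lengths=torch.tensor([1, 1, 1, 1]),       # one id per deduplicated sample
    stride_per_key_per_rank=[[2], [2]],       # deduplicated batch size per key per rank
    inverse_indices=(["f1", "f2"], sparse_feature_inverse_indices),
)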