
CUDA kernel error when using VBE #2502

Closed
@xiexbing

Description


Hello there. I am using the newly released torchrec for our model training, with VBE (variable batch embedding) to reduce data duplication in the embedding lookup and communication during the forward pass.
Specifically, I enabled VBE in my code and ran it on 8 GPUs (rank 0 - rank 7). The forward pass consistently failed: rank 1 always returned correct lookup results, while the other ranks always raised the error below. To reproduce the error in a minimal setting, I reduced the model size to fit on 2 GPUs; the error then shows up on rank 0, while rank 1 still returns correct lookup results.
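
For context, the VBE input I construct has roughly this shape (a simplified, hypothetical single-rank sketch; the feature names, values, and batch sizes below are made up and not from the actual model, so please double-check the argument names against your torchrec version):

import torch
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

# Hypothetical minimal VBE input on a single rank: two features whose
# deduplicated batch sizes differ ("f1" keeps 2 rows, "f2" keeps 3),
# while the full (non-deduplicated) batch size is 4.
kjt = KeyedJaggedTensor(
    keys=["f1", "f2"],
    values=torch.tensor([10, 11, 20, 21, 22], dtype=torch.int64),
    lengths=torch.tensor([1, 1, 1, 1, 1], dtype=torch.int64),
    stride_per_key_per_rank=[[2], [3]],  # per-feature, per-rank batch sizes
    inverse_indices=(
        ["f1", "f2"],
        # maps each of the 4 original samples back to its deduplicated row
        torch.tensor([[0, 1, 0, 1], [0, 1, 2, 0]], dtype=torch.int64),
    ),
)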

Here is the error:
" Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f59c88cd897 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f59c887db25 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f59c89a5718 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f59c9ba3e36 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f59c9ba7f38 in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f59c9bad5ac in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f59c9bae31c in /opt/conda/envs/ptca/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd3b75 (0x7f5a1565cb75 in /opt/conda/envs/ptca/bin/../lib/libstdc++.so.6)
frame #8: + 0x8609 (0x7f5a17e91609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f5a17c5c353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'"

I add the "feature_inverse_indices" as below.
...................................................................................................
# polars (pl), numpy (np) and torch are imported at module level in the real code
sparse_feature_df = batch.batch_data[0].select(feature)
# build a per-row string "id" so that multi-valued features can be deduplicated
sparse_feature_df = sparse_feature_df.with_columns(
    sparse_feature_df[feature].apply(lambda x: "".join(str(i) for i in x)).alias("id")
)
# deduplicated rows, keeping first-seen order
unique_df = sparse_feature_df.unique(maintain_order=True)
print(f"{feature}, sparse_feature_df: {sparse_feature_df.shape}, unique_df: {unique_df.shape}")
# join back on "id": "row_nr" (added by with_row_count) is the deduplicated row
# that each original row maps to
df_with_index = sparse_feature_df.join(unique_df.with_row_count(), on="id", how="left")
inverse_indices = df_with_index["row_nr"].to_numpy(zero_copy_only=True)
sparse_feature_inverse_indices.append(inverse_indices)
.........................................................................................................
# stack per-feature index arrays into a [num_features, batch_size] int64 tensor
sparse_feature_inverse_indices = torch.tensor(np.stack(sparse_feature_inverse_indices), dtype=torch.int64)
..........................................................................................................
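
As a sanity check (a hypothetical helper, not part of the failing code), I can verify that every inverse index points at an existing deduplicated row, since an out-of-range index is one common cause of illegal-memory-access errors during the lookup:

import torch

# Hypothetical check: unique_counts[i] is unique_df.shape[0] for feature i,
# collected in the same order as sparse_feature_inverse_indices.
def check_inverse_indices(inverse_indices: torch.Tensor, unique_counts: list) -> None:
    assert inverse_indices.dtype == torch.int64
    for row, n_unique in zip(inverse_indices, unique_counts):
        # every index must refer to a row of the deduplicated table;
        # anything outside [0, n_unique) would read out of bounds on the GPU
        assert row.min().item() >= 0
        assert row.max().item() < n_unique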
