[TMVA] cuDNN LSTM backpropagation test fails on ubuntu2404 cuda 12.6.1 #16790

Open
1 task done
hageboeck opened this issue Oct 30, 2024 · 1 comment

Check duplicate issues.

  • Checked for duplicates

Description

The test TMVA-DNN-LSTM-BackpropagationCudnn crashes on ubuntu2404 with CUDA 12.6.1 and cuDNN, producing the following stack trace:

   0x00007fda7f0b5540 in <unknown> from /usr/lib64/libcuda.so.1
   0x00007fda7ed1491e in <unknown> from /usr/lib64/libcuda.so.1
   0x00007fda7f08f040 in <unknown> from /usr/lib64/libcuda.so.1
   0x00007fda7ed0ef22 in <unknown> from /usr/lib64/libcuda.so.1
   0x00007fda7eed2bae in <unknown> from /usr/lib64/libcuda.so.1
   0x00007fdaaa248b01 in <unknown> from /usr/local/cuda-12.6/targets/x86_64-linux/lib/libcudart.so.12
   0x00007fdaaa218baa in <unknown> from /usr/local/cuda-12.6/targets/x86_64-linux/lib/libcudart.so.12
   0x00007fdaaa270721 in cudaMemcpy + 0x211 from /usr/local/cuda-12.6/targets/x86_64-linux/lib/libcudart.so.12
   0x000055d25af29e37 in bool testLSTMBackpropagation<TMVA::DNN::TCudnn<double> >(unsigned long, unsigned long, unsigned long, unsigned long, TMVA::DNN::TCudnn<double>::Scalar_t, std::vector<bool, std::allocator<bool> >, bool) + 0x4d37 from /github/home/ROOT-CI/build/tmva/tmva/test/DNN/LSTM/testLSTMBackpropagationCudnn

Specifically, the crash occurs at the assignment in this loop:

   for (size_t i = 0; i < batchSize; ++i) {
      auto mat = XArch[i];
      for (size_t l = 0; l < (size_t) XArch[i].GetNrows(); ++l) {
         for (size_t m = 0; m < (size_t) XArch[i].GetNcols(); ++m) {
            // This element assignment is where the crash occurs.
            mat(l, m) = gRandom->Uniform(-1,1);
            //XArch[i](0, 0) = 0.5;
            //XArch[i](1, 0) = 0.5;
         }
      }
   }

Each assignment triggers a cudaMemcpy to the GPU, and the crash happens somewhere inside the CUDA library. Other cuDNN tests pass, so the problem is not necessarily a broken installation.
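
To illustrate why a plain element assignment can end up in cudaMemcpy, here is a minimal CPU-only sketch. It assumes the element accessor returns a proxy object whose assignment operator performs one host-to-device copy per element. All names in it (MockDeviceMatrix, ElementRef, copyFromHost, fillElementWise, fillBulk) are hypothetical and not TMVA API, and std::memcpy stands in for cudaMemcpy so that it runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <random>
#include <vector>

// Hypothetical stand-in for a GPU matrix. The "device" buffer is ordinary
// host memory here, so the sketch needs neither CUDA nor a GPU.
struct MockDeviceMatrix {
    std::size_t rows, cols;
    std::vector<double> device;  // pretend this lives on the GPU
    std::size_t copies = 0;      // number of simulated host-to-device transfers

    MockDeviceMatrix(std::size_t r, std::size_t c)
        : rows(r), cols(c), device(r * c, 0.0) {}

    // Proxy returned by operator(): assigning to it performs one transfer.
    struct ElementRef {
        MockDeviceMatrix &owner;
        std::size_t index;
        ElementRef &operator=(double value) {
            // Stands in for a cudaMemcpy of a single element.
            std::memcpy(&owner.device[index], &value, sizeof(double));
            ++owner.copies;
            return *this;
        }
    };

    ElementRef operator()(std::size_t i, std::size_t j) {
        return ElementRef{*this, i * cols + j};
    }

    // Bulk alternative: one transfer for the whole matrix.
    void copyFromHost(const std::vector<double> &host) {
        assert(host.size() == device.size());
        std::memcpy(device.data(), host.data(), host.size() * sizeof(double));
        ++copies;
    }
};

// Mirrors the loop in the test: one simulated transfer per element.
std::size_t fillElementWise(MockDeviceMatrix &m, std::mt19937 &rng) {
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    for (std::size_t i = 0; i < m.rows; ++i)
        for (std::size_t j = 0; j < m.cols; ++j)
            m(i, j) = dist(rng);
    return m.copies;
}

// Fills a host buffer first, then performs a single simulated transfer.
std::size_t fillBulk(MockDeviceMatrix &m, std::mt19937 &rng) {
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    std::vector<double> host(m.rows * m.cols);
    for (double &v : host)
        v = dist(rng);
    m.copyFromHost(host);
    return m.copies;
}
```

For a 3x4 matrix, fillElementWise reports 12 simulated transfers while fillBulk reports 1. If the real accessor behaves like this proxy, filling a host-side buffer and copying it once might be a way to check whether the crash is tied to the many small copies, though that is untested here.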

Reproducer

cmake -Dtmva-gpu=On -Dtesting=On <src>
ctest -R TMVA-DNN-LSTM-BackpropagationCudnn

ROOT version

Master

Installation method

Source

Operating system

ubuntu24 docker container with cuda 12.6.1

Additional context

No response


lmoneta commented Nov 12, 2024

I could not reproduce this issue on my machine, ubuntu24 with cuda 12.
