Description
Tested on: Titan RTX, CUDA 11.0, driver 455.51.05
The system has abundant RAM (>100 GB) and an Intel Xeon processor.
I have been trying to run the tests provided in cu-matrix-test.cc and am interested in one particular test, UnitTestCuVectorAddRowSumMat. To run only that test, I commented out all the other tests in the "CudaMatrixUnitTest" function and modified the "main" function in the test file as follows:
int main() {
  SetVerboseLevel(1);
  int32 loop = 0;
#if HAVE_CUDA == 1
  for (loop = 1; loop < 2; loop++) {
    CuDevice::Instantiate().SetDebugStrideMode(true);
    if (loop == 0)
      CuDevice::Instantiate().SelectGpuId("no");
    else
      CuDevice::Instantiate().SelectGpuId("yes");
#endif
    kaldi::CudaMatrixUnitTest<double>();
    if (loop == 0)
      KALDI_LOG << "Tests without GPU use succeeded.";
    else
      KALDI_LOG << "Tests with GPU use (if available) succeeded.";
#if HAVE_CUDA == 1
  }  // No for loop if 'HAVE_CUDA != 1',
  CuDevice::Instantiate().PrintProfile();
#endif
  return 0;
}
As can be seen, I run the test only for double. In the test "UnitTestCuVectorAddRowSumMat", I set X=65000, Y=64360 (each well within the limits of int32) and observe a segmentation fault. For X=45000, Y=44550, the test runs successfully. Am I doing something wrong?
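One thing that may be worth checking: each dimension fits in int32, but the total element count X*Y is a different matter. For X=65000, Y=64360 the product is 4,183,400,000, which is above the int32 maximum of 2,147,483,647, while for X=45000, Y=44550 it is 2,004,750,000, which still fits. The sketch below reproduces that arithmetic. The helper names are hypothetical and are not part of Kaldi, and whether any index or size computation in the test path is actually done in 32 bits is an assumption I have not verified.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helpers for illustration only; these are not Kaldi functions.

// Number of elements in a rows x cols matrix, computed in 64 bits so the
// multiplication itself cannot overflow for int32 inputs.
inline std::int64_t NumElements(std::int32_t rows, std::int32_t cols) {
  return static_cast<std::int64_t>(rows) * static_cast<std::int64_t>(cols);
}

// True if rows * cols is representable as a signed 32-bit integer.
inline bool FitsInInt32(std::int32_t rows, std::int32_t cols) {
  return NumElements(rows, cols) <= std::numeric_limits<std::int32_t>::max();
}

// Storage in bytes for a dense matrix of doubles, ignoring stride padding.
inline std::int64_t BytesAsDouble(std::int32_t rows, std::int32_t cols) {
  return NumElements(rows, cols) * static_cast<std::int64_t>(sizeof(double));
}
```

With these helpers, FitsInInt32(65000, 64360) is false and FitsInInt32(45000, 44550) is true. BytesAsDouble(65000, 64360) gives 33,467,200,000 bytes for the failing size with double, which may also be relevant next to the GPU memory reported in the log further down.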
Sample output
$ ./cu-matrix-test
LOG ([5.5.854~1-403d]:SelectGpuId():cu-device.cc:172) Manually selected to compute on CPU.
Segmentation fault (core dumped)
I think the GPU code itself is running fine. The relevant output, obtained by setting loop=1 in the main shown earlier, is:
$ ./cu-matrix-test
WARNING ([5.5.854~1-403d]:SelectGpuId():cu-device.cc:247) Not in compute-exclusive mode. Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG ([5.5.854~1-403d]:SelectGpuIdAuto():cu-device.cc:446) Selecting from 1 GPUs
LOG ([5.5.854~1-403d]:SelectGpuIdAuto():cu-device.cc:461) cudaSetDevice(0): TITAN RTX free:24048M, used:172M, total:24220M, free/total:0.992899
LOG ([5.5.854~1-403d]:SelectGpuIdAuto():cu-device.cc:509) Device: 0, mem_ratio: 0.992899
LOG ([5.5.854~1-403d]:SelectGpuId():cu-device.cc:390) Trying to select device: 0
LOG ([5.5.854~1-403d]:SelectGpuIdAuto():cu-device.cc:519) Success selecting device 0 free mem ratio: 0.992899
LOG ([5.5.854~1-403d]:FinalizeActiveGpu():cu-device.cc:346) The active GPU is [0]: TITAN RTX free:23566M, used:654M, total:24220M, free/total:0.972998 version 7.5
Segmentation fault (core dumped)