
Segmentation fault (core dumped) in UnitTestCuVectorAddRowSumMat for large matrix dimensions #4458

Open
@nayakajay

Description


Tested on: Titan RTX, CUDA 11.0, driver 455.51.05
The system has abundant RAM (>100G) and an Intel Xeon processor.

I have been trying to run the tests provided in cu-matrix-test.cc. I am interested in one particular test, UnitTestCuVectorAddRowSumMat. To run only that test, I commented out all the other tests in the "CudaMatrixUnitTest" function and modified the "main" function in the test file as follows:

int main() {
  SetVerboseLevel(1);
  int32 loop = 0;

#if HAVE_CUDA == 1
  for (loop = 1; loop < 2; loop++) {
    CuDevice::Instantiate().SetDebugStrideMode(true);
    if (loop == 0)
      CuDevice::Instantiate().SelectGpuId("no");
    else
      CuDevice::Instantiate().SelectGpuId("yes");
#endif

    kaldi::CudaMatrixUnitTest<double>();

    if (loop == 0)
      KALDI_LOG << "Tests without GPU use succeeded.";
    else
      KALDI_LOG << "Tests with GPU use (if available) succeeded.";

#if HAVE_CUDA == 1
  } // No for loop if 'HAVE_CUDA != 1',
  CuDevice::Instantiate().PrintProfile();
#endif
  return 0;
}

As can be seen, I run the test only for double. In the test "UnitTestCuVectorAddRowSumMat", I set X=65000, Y=64360 (well within the limits of int32). With these values I observe a segmentation fault. For X=45000, Y=44550, the test runs successfully. Am I doing something wrong?

Sample output

$ ./cu-matrix-test
LOG ([5.5.854~1-403d]:SelectGpuId():cu-device.cc:172) Manually selected to compute on CPU.
Segmentation fault (core dumped)

I think the GPU code is running fine. The relevant output, obtained by setting loop=1 in the main shown earlier, is:

$ ./cu-matrix-test
WARNING ([5.5.854~1-403d]:SelectGpuId():cu-device.cc:247) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG ([5.5.854~1-403d]:SelectGpuIdAuto():cu-device.cc:446) Selecting from 1 GPUs
LOG ([5.5.854~1-403d]:SelectGpuIdAuto():cu-device.cc:461) cudaSetDevice(0): TITAN RTX   free:24048M, used:172M, total:24220M, free/total:0.992899
LOG ([5.5.854~1-403d]:SelectGpuIdAuto():cu-device.cc:509) Device: 0, mem_ratio: 0.992899
LOG ([5.5.854~1-403d]:SelectGpuId():cu-device.cc:390) Trying to select device: 0
LOG ([5.5.854~1-403d]:SelectGpuIdAuto():cu-device.cc:519) Success selecting device 0 free mem ratio: 0.992899
LOG ([5.5.854~1-403d]:FinalizeActiveGpu():cu-device.cc:346) The active GPU is [0]: TITAN RTX    free:23566M, used:654M, total:24220M, free/total:0.972998 version 7.5
Segmentation fault (core dumped)
