Describe the bug
This test failure seems to be rare, but we just saw it in #3589 in the CUDA 11 job: https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cuml/job/prb/job/cuml-gpu-test/CUDA=11.0,GPU_LABEL=gpu-a100,OS=ubuntu18.04,PYTHON=3.8/842/ (you may need to open the full log, since Jenkins didn't report the failure explicitly).
For reference, the failure is:
[ RUN ] BatchedMatrixTests/BatchedMatrixTestF.Result/8
../test/prims/batched/matrix.cu:471: Failure
Value of: raft::devArrMatchHost( res_h.data(), res_bM->raw_data(), res_h.size(), raft::CompareApprox<float>(params.tolerance), stream)
Actual: false (actual=2.826171875 != expected=2.8580241203308105 @1; actual=0.97265625 != expected=0.95727932453155518 @2; actual=0.73828125 != expected=0.74853849411010742 @4; actual=0.046875 != expected=0.033891439437866211 @9; actual=0.609375 != expected=0.62402057647705078 @10; actual=0.80859375 != expected=0.82160842418670654 @14; actual=1.0859375 != expected=1.1064639091491699 @15; )
Expected: true
[ FAILED ] BatchedMatrixTests/BatchedMatrixTestF.Result/8, where GetParam() = 36-byte object <06-00 00-00 06-00 00-00 11-00 00-00 11-00 00-00 01-00 00-00 01-00 00-00 00-00 00-00 00-00 00-00 0A-D7 23-3C> (2 ms)
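To make the log easier to read, here is a rough Python sketch that decodes the `GetParam()` bytes and measures how far each reported mismatch is from its expected value. The field layout (eight little-endian 32-bit integers followed by one float32) is an assumption inferred from the 36-byte dump, not taken from the test source, and the relative-error measure is one plausible choice rather than the exact rule used by `raft::CompareApprox`.

```python
import struct

# Hex dump copied from the GetParam() line of the log above.
PARAM_HEX = (
    "06-00 00-00 06-00 00-00 11-00 00-00 11-00 00-00 01-00 00-00 "
    "01-00 00-00 00-00 00-00 00-00 00-00 0A-D7 23-3C"
)
raw = bytes.fromhex(PARAM_HEX.replace("-", "").replace(" ", ""))

# ASSUMPTION: eight little-endian int32 fields, then one float32 tolerance.
# The real struct in matrix.cu may use different types (e.g. enums, bools).
*int_fields, tolerance = struct.unpack("<8if", raw)

# (index, actual, expected) triples copied from the "Actual:" line of the log.
mismatches = [
    (1, 2.826171875, 2.8580241203308105),
    (2, 0.97265625, 0.95727932453155518),
    (4, 0.73828125, 0.74853849411010742),
    (9, 0.046875, 0.033891439437866211),
    (10, 0.609375, 0.62402057647705078),
    (14, 0.80859375, 0.82160842418670654),
    (15, 1.0859375, 1.1064639091491699),
]

# Absolute and relative deviation for every reported element.
deviations = []
for idx, actual, expected in mismatches:
    abs_diff = abs(actual - expected)
    rel_diff = abs_diff / max(abs(actual), abs(expected))
    deviations.append((idx, abs_diff, rel_diff))

print("integer fields:", int_fields)
print("tolerance:", tolerance)
for idx, abs_diff, rel_diff in deviations:
    print(f"@{idx}: abs diff = {abs_diff:.6f}, rel diff = {rel_diff:.6f}")
```

Under that assumed layout, the last four bytes decode to a tolerance of about 0.01, and every reported element deviates from its expected value by more than that in both absolute and relative terms, so these are not borderline rounding misses.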
Expected behavior
The test should pass consistently: the device result should match the host reference within params.tolerance.
Environment details:
- Environment location: CUDA 11 PR CI instance
- Linux Distro/Architecture: Ubuntu 18.04
- GPU Model/Driver: A100 - 40GB
- CUDA: 11.0