[BUG] (Very) sporadic BatchedMatrixTests failure in CI  #3652

Open
@dantegd

Description

Describe the bug
This test failure seems to happen very rarely, but we just saw it in #3589 in the CUDA 11 job: https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cuml/job/prb/job/cuml-gpu-test/CUDA=11.0,GPU_LABEL=gpu-a100,OS=ubuntu18.04,PYTHON=3.8/842/ (you might need to look at the full log, since Jenkins didn't report the failure explicitly).

For reference, the failure is:

[ RUN      ] BatchedMatrixTests/BatchedMatrixTestF.Result/8
../test/prims/batched/matrix.cu:471: Failure
Value of: raft::devArrMatchHost( res_h.data(), res_bM->raw_data(), res_h.size(), raft::CompareApprox<float>(params.tolerance), stream)
  Actual: false (actual=2.826171875 != expected=2.8580241203308105 @1; actual=0.97265625 != expected=0.95727932453155518 @2; actual=0.73828125 != expected=0.74853849411010742 @4; actual=0.046875 != expected=0.033891439437866211 @9; actual=0.609375 != expected=0.62402057647705078 @10; actual=0.80859375 != expected=0.82160842418670654 @14; actual=1.0859375 != expected=1.1064639091491699 @15; )
Expected: true
[  FAILED  ] BatchedMatrixTests/BatchedMatrixTestF.Result/8, where GetParam() = 36-byte object <06-00 00-00 06-00 00-00 11-00 00-00 11-00 00-00 01-00 00-00 01-00 00-00 00-00 00-00 00-00 00-00 0A-D7 23-3C> (2 ms)
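
To help triage, here is a small sketch that decodes the `GetParam()` byte dump and measures how far each reported value is from its expected value. The field layout is an assumption (eight little-endian 32-bit integers followed by one 32-bit float); the actual parameter struct in `matrix.cu` is not shown in this issue, so the integer fields are left unnamed. The trailing float decodes to roughly 0.01, which would be consistent with it being `params.tolerance`. The error metrics below are plain absolute and relative errors, not necessarily what `raft::CompareApprox` computes internally.

```python
import struct

# Bytes copied verbatim from the GetParam() dump in the log above.
raw_hex = (
    "06-00 00-00 06-00 00-00 11-00 00-00 11-00 00-00 01-00 00-00 "
    "01-00 00-00 00-00 00-00 00-00 00-00 0A-D7 23-3C"
)
raw = bytes.fromhex(raw_hex.replace("-", "").replace(" ", ""))

# Assumed layout: 8 little-endian int32 fields, then 1 float32 (the tolerance).
*int_fields, tolerance = struct.unpack("<8if", raw)
print("integer fields:", int_fields)
print("trailing float (presumably the tolerance):", tolerance)

# (index, actual, expected) triples copied from the failure message.
mismatches = [
    (1, 2.826171875, 2.8580241203308105),
    (2, 0.97265625, 0.95727932453155518),
    (4, 0.73828125, 0.74853849411010742),
    (9, 0.046875, 0.033891439437866211),
    (10, 0.609375, 0.62402057647705078),
    (14, 0.80859375, 0.82160842418670654),
    (15, 1.0859375, 1.1064639091491699),
]

# Collect absolute and relative error per reported element.
errors = []
for idx, actual, expected in mismatches:
    abs_err = abs(actual - expected)
    rel_err = abs_err / max(abs(actual), abs(expected))
    errors.append((idx, abs_err, rel_err))
    print(f"@{idx}: abs_err={abs_err:.5f} rel_err={rel_err:.5f}")
```

Under that assumed layout, every reported element is off by more than the decoded tolerance in both absolute and relative terms, so this does not look like a borderline tolerance miss.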

Expected behavior
A clear and concise description of what you expected to happen.

Environment details:

  • Environment location: CUDA 11 PR CI instance
  • Linux Distro/Architecture: Ubuntu 18.04
  • GPU Model/Driver: A100 - 40GB
  • CUDA: 11.0
