[ML] anomaly detection Job failed without reason audit message when model snapshot is too old #80621

Closed
@wwang500

Steps to reproduce (easiest in a cloud environment):

  • Deploy a 7.8.1 cluster and start a real-time anomaly detection job.
  • Create an 8.0 cluster by restoring a snapshot of that 7.8.1 cluster.
  • After the 8.0 cluster has been created successfully, log in and check the AD job status:

The AD job is shown as failed in the UI, but no failure reason is reported, neither in the UI nor in the job messages (see 11:56 in the screenshot below).

Screen Shot 2021-11-10 at 9 50 03 AM

After stopping the datafeed and force-closing the job (16:01 in the screenshot above), restarting the job returned the root cause of the failure, which was an old model snapshot:

{
  "error": {
    "root_cause": [
      {
        "type": "exception",
        "reason": "[response_code_rates] job snapshot [1636472812] has min version before [7.0.0], please revert to a newer model snapshot or reset the job"
      }
    ],
    "type": "exception",
    "reason": "[response_code_rates] job snapshot [1636472812] has min version before [7.0.0], please revert to a newer model snapshot or reset the job"
  },
  "status": 500
}
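For anyone scripting around this, the reason string can be pulled out of an error body like the one above. This is only a minimal sketch: the helper name `root_cause_reasons` is made up for illustration, and it assumes the `error` field is a JSON object shaped like the response shown here.

```python
import json


def root_cause_reasons(body: str) -> list[str]:
    """Return the reason of each root cause in an Elasticsearch-style error body.

    Hypothetical helper, not part of any client library.
    """
    error = json.loads(body).get("error", {})
    if not isinstance(error, dict):
        # Some responses may carry a plain string instead of an object.
        return [str(error)]
    causes = error.get("root_cause", [])
    reasons = [c["reason"] for c in causes if "reason" in c]
    # Fall back to the top-level reason when no root cause is listed.
    if not reasons and "reason" in error:
        reasons = [error["reason"]]
    return reasons


# The response body from this issue, copied verbatim.
EXAMPLE = """
{
  "error": {
    "root_cause": [
      {
        "type": "exception",
        "reason": "[response_code_rates] job snapshot [1636472812] has min version before [7.0.0], please revert to a newer model snapshot or reset the job"
      }
    ],
    "type": "exception",
    "reason": "[response_code_rates] job snapshot [1636472812] has min version before [7.0.0], please revert to a newer model snapshot or reset the job"
  },
  "status": 500
}
"""

if __name__ == "__main__":
    for reason in root_cause_reasons(EXAMPLE):
        print(reason)
```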

Expected behaviour:
A clear failure message, stating the reason, in both the UI and the job messages.
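As a rough illustration of the kind of check that could surface this early, the sketch below flags model snapshots whose `min_version` is older than a given floor. It is not how Elasticsearch implements the check; the function names are hypothetical, the input is assumed to be a list of dicts carrying the `snapshot_id` and `min_version` fields as returned by the get model snapshots API, and the sample values are invented for the example.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse 'major.minor.patch', ignoring a qualifier such as '-SNAPSHOT'."""
    core = version.split("-", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return (major, minor, patch)


def too_old_snapshots(snapshots: list[dict], minimum: str = "7.0.0") -> list[str]:
    """Return the IDs of snapshots whose min_version is before `minimum`."""
    floor = parse_version(minimum)
    return [
        snap["snapshot_id"]
        for snap in snapshots
        if parse_version(snap["min_version"]) < floor
    ]


if __name__ == "__main__":
    # Illustrative data only: the issue does not state the real min_version.
    sample = [
        {"snapshot_id": "1636472812", "min_version": "6.4.0"},
        {"snapshot_id": "1636500000", "min_version": "7.8.0"},
    ]
    for snapshot_id in too_old_snapshots(sample):
        print(f"snapshot [{snapshot_id}] is older than 7.0.0; revert or reset the job")
```

Running such a check before the upgrade would give the user the chance to revert to a newer snapshot or reset the job, the two remedies named in the error message.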

Labels: :ml (Machine learning), >bug, Team:ML (Meta label for the ML team)
