This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Inform the user when jobs status change #5337

Open

Description

Motivation

Jobs may fail unexpectedly.
If users are notified when a job fails, they can handle the issue promptly.
This saves them from repeatedly checking job status.

The same applies to other status changes.

Background:

  • This feature should be configured per job rather than per user
  • The trigger event can be one of:
    • Start Running
    • Failed
    • Succeeded
    • WaitingTooLong
  • Notifications can be sent to users by email or webportal, and the method should be configurable
    • some notification methods may not be available if the admin has not enabled them

Design

Workflow:

  • Part 1: Job / User configuration
  • Part 2: monitor & trigger corresponding alerts
  • Part 3: alerts handling

Part 1: Job / User configuration

  • Which alerts to send is configured per job:
    • enable this feature in the job protocol, in the field extras -> jobStatusChangeNotification
    • the setting can still be modified after the job is submitted
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
  hivedScheduler:
    taskRoles:
      taskrole:
        skuNum: 1
        skuType: GENERIC-WORKER
  jobStatusChangeNotification: 
    running: false
    succeeded: true
    stopped: false
    failed: true
    retried: false
  • How alerts are sent is configured per user on the user-profile page, where the user can select from the available actions:
    • webportal notification
    • email notification: available only when 1) the user's email is not empty and 2) the email-user action is available in alert-handler
{
    "username": "gusui",
    "email": "gusui@microsoft.com",
    "extension": {
        "sshKeys": [],
        "getJobStatusChangeNotificationBy": {
            "email": true,
            "webportal": true
        }
    }
}
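
The two configuration levels above can be combined as in the following minimal sketch. It is illustrative only: the function names, the `email_action_enabled` flag, and the choice to treat missing keys as `false` are assumptions, not part of the proposal.

```python
from typing import Dict, List

# Event keys taken from the jobStatusChangeNotification example above.
EVENTS = ("running", "succeeded", "stopped", "failed", "retried")


def job_notification_config(protocol: dict) -> Dict[str, bool]:
    """Read extras.jobStatusChangeNotification from a job protocol dict.

    Assumption: an event that is not listed is treated as disabled.
    """
    raw = (protocol.get("extras") or {}).get("jobStatusChangeNotification") or {}
    return {event: bool(raw.get(event, False)) for event in EVENTS}


def channels_for(user: dict, event: str, job_config: Dict[str, bool],
                 email_action_enabled: bool) -> List[str]:
    """Return the channels on which `user` should be told about `event`."""
    if not job_config.get(event, False):
        return []  # the job did not ask for this alert
    prefs = (user.get("extension") or {}).get("getJobStatusChangeNotificationBy") or {}
    channels = []
    if prefs.get("webportal"):
        channels.append("webportal")
    # Email requires a non-empty address and the email-user action in alert-handler.
    if prefs.get("email") and user.get("email") and email_action_enabled:
        channels.append("email")
    return channels
```

With the job and user examples above, a `failed` event would resolve to both channels, and to webportal only if the admin has not enabled the email-user action.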

Part 2: monitor & trigger corresponding alerts

design with DB

  • add the following columns to the framework table in DB:
notificationAtRunning | BOOLEAN
notifiedAtRunning | BOOLEAN
notificationAtSucceeded | BOOLEAN
notifiedAtSucceeded | BOOLEAN
notificationAtFailed | BOOLEAN
notifiedAtFailed | BOOLEAN
notificationAtRetried | BOOLEAN
notifiedAtRetried | INTEGER (the Nth retry has been notified)

These columns store the job config (notificationAt*) and the state of the alerts already sent (notifiedAt*).

  • add a container framework-status-notification-poller in alert-manager, which
    • watches the framework table in DB
    • sends the alert when the config is enabled & the alert has not been sent
    • updates the framework table after successfully sending alerts to alert-manager
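
One polling pass of such a container could look like the sketch below. It uses an in-memory SQLite table as a stand-in for the real database, and the table name `frameworks`, the `state` and `retries` columns, and the `send_alert` callback are hypothetical names introduced for illustration. Only the `notificationAt*` / `notifiedAt*` columns come from the design above.

```python
import sqlite3
from typing import Callable, List, Tuple

STATES = ("Running", "Succeeded", "Failed")  # events that fire at most once


def create_table(conn: sqlite3.Connection) -> None:
    """Create a stand-in framework table with the proposed columns."""
    cols = ", ".join(
        f"notificationAt{s} BOOLEAN DEFAULT 0, notifiedAt{s} BOOLEAN DEFAULT 0"
        for s in STATES
    )
    conn.execute(
        "CREATE TABLE frameworks (name TEXT PRIMARY KEY, state TEXT, "
        "retries INTEGER DEFAULT 0, " + cols + ", "
        "notificationAtRetried BOOLEAN DEFAULT 0, "
        "notifiedAtRetried INTEGER DEFAULT 0)"
    )


def poll_once(conn: sqlite3.Connection,
              send_alert: Callable[[str, str], bool]) -> List[Tuple[str, str]]:
    """Send every enabled, not-yet-sent alert; mark a row only if sending succeeded."""
    sent = []
    for s in STATES:
        rows = conn.execute(
            f"SELECT name FROM frameworks WHERE state = ? "
            f"AND notificationAt{s} = 1 AND notifiedAt{s} = 0", (s,)
        ).fetchall()
        for (name,) in rows:
            if send_alert(name, s):
                conn.execute(
                    f"UPDATE frameworks SET notifiedAt{s} = 1 WHERE name = ?", (name,))
                sent.append((name, s))
    # Retries: notifiedAtRetried stores the last retry number already notified.
    rows = conn.execute(
        "SELECT name, retries FROM frameworks "
        "WHERE notificationAtRetried = 1 AND retries > notifiedAtRetried"
    ).fetchall()
    for name, retries in rows:
        if send_alert(name, "Retried"):
            conn.execute(
                "UPDATE frameworks SET notifiedAtRetried = ? WHERE name = ?",
                (retries, name))
            sent.append((name, "Retried"))
    conn.commit()
    return sent
```

Because a row is marked only after `send_alert` returns true, a failed delivery is retried on the next pass rather than lost, which is the property the DB-based design relies on.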

Part 3: alerts handling

  • src/alert-manager/deploy/alert-manager-configmap.yaml: add a new receiver and a new route
  • alert-handler: add an email template inform-user-job-status-change
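
For the new route to pick up these alerts, the poller has to post them with labels the route can match. The sketch below builds a request body in the shape accepted by Alertmanager's `POST /api/v2/alerts` endpoint. The alert name `PAIJobStatusChange` and the label and annotation choices are assumptions for illustration; the proposal does not fix them.

```python
import json
from datetime import datetime, timezone
from typing import List, Optional


def build_alerts(job_name: str, username: str, status: str,
                 now: Optional[datetime] = None) -> List[dict]:
    """Build the JSON body (a list of alerts) for one job status change."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "PAIJobStatusChange",  # hypothetical name a route could match on
            "job_name": job_name,
            "username": username,
            "status": status,
            "severity": "warn",
        },
        "annotations": {
            "summary": f"Job {job_name} is now {status}",
        },
        "startsAt": now.isoformat(),  # RFC 3339 timestamp
    }]


if __name__ == "__main__":
    print(json.dumps(build_alerts("demo_job", "demo_user", "failed"), indent=2))
```

The `username` label gives the email template enough information to look up the recipient, assuming alert-handler resolves the address from the user profile.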

Archive

Problem with watching the k8s Framework object: the watch is not stable and may miss certain status changes

Proposal 1

  • add a container framework-status-notification-poller in alert-manager, which
    • watches frameworks through the k8s API
    • sends the alert when a framework fails & this feature is enabled

Proposal 2

  • Job Exporter:

    • add a container that monitors Framework status & exports the following metric:
      • job_status(job_name="demo_job", username="demo_user",virtual_cluster="nni", status="running", pai_service_name="job-exporter", notification_status=["succeed", "failed"])
      • value: 0/1/2/3 (waiting/running/succeed/failed)
      • export the value only when the job status changes, instead of at a fixed frequency
  • Benefits: useful for averageWaitingTime, failingRate, & other statistics
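
The metric described above could be rendered as in the following sketch, which formats one sample in the Prometheus text exposition format and tracks whether a job's status has changed. Since Prometheus label values are plain strings, the list-valued `notification_status` is joined with commas here. That encoding, the helper names, and the omission of `pai_service_name` are assumptions.

```python
from typing import Dict, List

# Value mapping from the proposal: 0/1/2/3 (waiting/running/succeed/failed).
STATUS_VALUE = {"waiting": 0, "running": 1, "succeed": 2, "failed": 3}


def job_status_sample(job_name: str, username: str, virtual_cluster: str,
                      status: str, notification_status: List[str]) -> str:
    """Format one job_status sample line."""
    labels = {
        "job_name": job_name,
        "username": username,
        "virtual_cluster": virtual_cluster,
        "status": status,
        "notification_status": ",".join(notification_status),
    }
    body = ",".join(f'{key}="{value}"' for key, value in labels.items())
    return f"job_status{{{body}}} {STATUS_VALUE[status]}"


class StatusTracker:
    """Remember the last status per job so a sample is exported only on change."""

    def __init__(self) -> None:
        self._last: Dict[str, str] = {}

    def update(self, job_name: str, status: str) -> bool:
        changed = self._last.get(job_name) != status
        self._last[job_name] = status
        return changed
```

A caller would invoke `job_status_sample` only when `StatusTracker.update` returns true, matching the "export only at status changes" requirement.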

  • Prometheus:

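- alert: PAIJobSucceed
  # Assumes notification_status is exported as a single string label, e.g. "succeed,failed".
  expr: max by (job_name) (last_over_time(job_status{notification_status=~".*succeed.*"}[1m])) == 2
  labels:
    severity: warn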
# - alert: PAIJobFailed
#   expr: changes(job_status{failureNotification="true"}[1m]) > 0 and job_status == 3
#   labels: 
#     severity: warn