This repository has been archived by the owner on Jun 6, 2024. It is now read-only.
Inform the user when job status changes #5337
Open
Description
opened on Mar 2, 2021
Motivation
Some jobs may fail unexpectedly.
If the users can be informed when the jobs fail, the users will be able to handle the issue in time.
This will save the users from checking their job status all the time.
Similar for other status changes.
Background:
- This feature should be configured per job instead of per user
- The trigger event can be
  - Start Running
  - Failed
  - Succeeded
  - WaitingTooLong
- Notification can be sent to users by email / webportal, and this should be configurable
  - some notification methods may not be available if the admin doesn't enable them
Design
Workflow:
- Part 1: Job configuration
- Part 2: monitor & trigger corresponding alerts
- Part 3: alerts handling
Part 1: Job / User configuration
- What alerts to send is configured by job:
  - enable this feature in the job protocol, in the field `extras` -> `jobStatusChangeNotification`
  - support further modification after jobs get submitted
    extras:
      com.microsoft.pai.runtimeplugin:
        - plugin: ssh
          parameters:
            jobssh: true
      hivedScheduler:
        taskRoles:
          taskrole:
            skuNum: 1
            skuType: GENERIC-WORKER
      jobStatusChangeNotification:
        running: false
        succeeded: true
        stopped: false
        failed: true
        retried: false
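As a rough illustration of how the `jobStatusChangeNotification` field could be read from a submitted job protocol, here is a minimal Python sketch. The function name `read_notification_settings` and the choice to treat missing events as disabled are assumptions made for this example, not part of the proposal.

```python
from typing import Any, Dict

# Event names are taken from the example config; the default of False for
# missing events is an assumption of this sketch.
NOTIFICATION_EVENTS = ("running", "succeeded", "stopped", "failed", "retried")


def read_notification_settings(job_protocol: Dict[str, Any]) -> Dict[str, bool]:
    """Return one boolean per event, treating missing fields as disabled."""
    extras = job_protocol.get("extras") or {}
    configured = extras.get("jobStatusChangeNotification") or {}
    return {event: bool(configured.get(event, False)) for event in NOTIFICATION_EVENTS}


# A job that only asks for "succeeded" and "failed" notifications.
job = {
    "extras": {
        "jobStatusChangeNotification": {"succeeded": True, "failed": True},
    }
}
settings = read_notification_settings(job)
```

Jobs that omit the field entirely simply get every event disabled, so existing job protocols would keep working unchanged under this assumption.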
- How to send alerts is configured by user: set in the user-profile page, where the user can select from these available actions:
  - webportal notification
  - email notification: this action will only be available when: 1) the user email is not empty; 2) the `email-user` action is available in `alert-handler`
    {
      "username": "gusui",
      "email": "gusui@microsoft.com",
      "extension": {
        "sshKeys": [],
        "getJobStatusChangeNotificationBy": {
          "email": true,
          "webportal": true
        }
      }
    }
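The availability rules above (email only when the user has an email address and the `email-user` action is enabled) could be checked roughly as follows. This is a sketch only: `available_channels` and the `enabled_actions` set are hypothetical names, and how `alert-handler` actually reports its enabled actions is not specified in this issue.

```python
from typing import Any, Dict, List, Set


def available_channels(user: Dict[str, Any], enabled_actions: Set[str]) -> List[str]:
    """Return the notification channels that are both selected and usable."""
    extension = user.get("extension") or {}
    prefs = extension.get("getJobStatusChangeNotificationBy") or {}
    channels = []
    if prefs.get("webportal"):
        channels.append("webportal")
    # Email needs a non-empty address AND the admin-enabled email-user action.
    if prefs.get("email") and user.get("email") and "email-user" in enabled_actions:
        channels.append("email")
    return channels


user = {
    "username": "gusui",
    "email": "gusui@microsoft.com",
    "extension": {
        "sshKeys": [],
        "getJobStatusChangeNotificationBy": {"email": True, "webportal": True},
    },
}
with_email = available_channels(user, {"email-user"})
without_email = available_channels(user, set())
```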
Part 2: monitor & trigger corresponding alerts
design with DB
- add the following columns to the `framework` table in DB; these columns are used to save job config & alerts state:

| Column | Type |
| --- | --- |
| notificationAtRunning | BOOLEAN |
| notifiedAtRunning | BOOLEAN |
| notificationAtSucceeded | BOOLEAN |
| notifiedAtSucceeded | BOOLEAN |
| notificationAtFailed | BOOLEAN |
| notifiedAtFailed | BOOLEAN |
| notificationAtRetried | BOOLEAN |
| notifiedAtRetried | INTEGER (the Nth retry has been notified) |
- add a container `framework-status-notification-poller` in `alert-manager`, which
  - watches the DB `framework` table
  - sends the alert when the config is enabled & the alert has not been sent
  - updates the `framework` table after successfully sending alerts to `alert-manager`
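The poller's decision logic can be sketched in a few lines of Python. The column names come from the list above; the state strings (`"Running"`, `"Succeeded"`, `"Failed"`), the `retries` counter, and both function names are assumptions made for illustration, and the real poller would read and write the `framework` table rather than plain dictionaries.

```python
from typing import Any, Dict, List


def pending_notifications(row: Dict[str, Any], state: str, retries: int) -> List[str]:
    """List the events that are enabled for this row but not yet notified."""
    pending = []
    for event in ("Running", "Succeeded", "Failed"):
        enabled = row.get(f"notificationAt{event}")
        already_sent = row.get(f"notifiedAt{event}")
        if state == event and enabled and not already_sent:
            pending.append(event)
    # notifiedAtRetried stores the number of the last retry that was notified.
    if row.get("notificationAtRetried") and retries > (row.get("notifiedAtRetried") or 0):
        pending.append("Retried")
    return pending


def mark_notified(row: Dict[str, Any], event: str, retries: int) -> Dict[str, Any]:
    """Return a copy of the row updated after an alert was sent successfully."""
    updated = dict(row)
    if event == "Retried":
        updated["notifiedAtRetried"] = retries
    else:
        updated[f"notifiedAt{event}"] = True
    return updated


row = {"notificationAtFailed": True, "notifiedAtFailed": False,
       "notificationAtRetried": True, "notifiedAtRetried": 1}
first_pass = pending_notifications(row, "Failed", retries=2)
for event in first_pass:
    row = mark_notified(row, event, retries=2)
second_pass = pending_notifications(row, "Failed", retries=2)
```

Updating the row only after a successful send means a failed delivery is retried on the next poll, at the cost of a possible duplicate if the send succeeds but the update does not.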
Part 3: alerts handling
- `src/alert-manager/deploy/alert-manager-configmap.yaml`: add a new `receiver` and a new `route`
- `alert-handler`: add an email template `inform-user-job-status-change`
Archive
Problems of watching the k8s `Framework` object: not stable, may miss certain status changes
Proposal 1
- add a container `framework-status-notification-poller` in `alert-manager`, which
  - watches the framework through the k8s API
  - sends the alert when a framework fails & this feature is enabled
Proposal 2
- Job Exporter:
  - add a container, which monitors Framework status & exports the following metric:
    - job_status(job_name="demo_job", username="demo_user", virtual_cluster="nni", status="running", pai_service_name="job-exporter", notification_status=["succeed", "failed"])
    - value: 0/1/2/3 (waiting/running/succeed/failed)
    - export value only at job status changes instead of exporting with a fixed frequency
- Benefits: useful for `averageWaitingTime`, `failingRate`, & other statistics
- Prometheus:
    - alert: PAIJobFSucceed
      expr: max by (job_name) job_status{notification_status.includes("succeed")}[1m] == 2
      labels:
        severity: warn
    # - alert: PAIJobFailed
    #   expr: changes(job_status{failureNotification="true"}[1m]) > 0 and job_status == 3
    #   labels:
    #     severity: warn
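The "export only at status changes" behaviour described for the Job Exporter in Proposal 2 could look roughly like this in Python. The value mapping follows the 0/1/2/3 scheme above; the class `StatusChangeExporter` and the shape of the returned sample are hypothetical, and a real exporter would publish through a Prometheus client library instead of returning dictionaries.

```python
from typing import Dict, Optional

# Mapping taken from the proposal: waiting/running/succeed/failed -> 0/1/2/3.
STATUS_VALUES = {"waiting": 0, "running": 1, "succeed": 2, "failed": 3}


class StatusChangeExporter:
    """Emit a job_status sample only when a job's status actually changes."""

    def __init__(self) -> None:
        self._last: Dict[str, int] = {}

    def observe(self, job_name: str, status: str) -> Optional[dict]:
        value = STATUS_VALUES[status]
        if self._last.get(job_name) == value:
            return None  # unchanged status: nothing to export
        self._last[job_name] = value
        return {"metric": "job_status", "job_name": job_name, "value": value}


exporter = StatusChangeExporter()
samples = [
    exporter.observe("demo_job", "waiting"),
    exporter.observe("demo_job", "waiting"),
    exporter.observe("demo_job", "running"),
    exporter.observe("demo_job", "failed"),
]
```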