Improve Monitoring/Alerting/Metrics

# Story
As a provider, I want timely alerts raised from the metrics so that I can make informed decisions.

# Motivation
- MCM exposes a number of metrics, such as the number of API calls to the cloud provider and the freeze status of MCM (#189).
- Defining alerts based on these metrics will help Ops react in a timely manner whenever action is required.
- There are challenges with the Azure cloud provider during deletion of machines (#200).

# Acceptance Criteria
- [ ] Define alerts for the situations above so that the required action can be taken

# Definition of Done
- [ ] Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
- [ ] Unit tests are provided: Have you written automated unit tests?
- [ ] Integration tests are provided: Have you written automated integration tests?
- [ ] Minimum API exposure: If you have added/changed public API, was it really necessary/is it minimal?
- [ ] Operations guide: Have you updated the operations guide about ops-relevant changes?
- [ ] User documentation: Have you updated the READMEs/docs/how-tos about user-relevant changes?

*Possible metrics to add (rough work)*
- We could provide metrics on the number of machines in each status, so that filtering by status is possible (if not already exposed).
- Metrics on the time a machine takes to join could be added; this would show the overall average join time on any provider.
- When MCM scaled up or scaled down, and when CA did.
- Metrics that could help resolve typical DoD issues, such as a node not joining.
- How long each resource (VM, disk) took to be created, especially on Azure.

Improve Monitoring/Alerting/Metrics #211

PadmaB
opened on Jan 24, 2019
