Open
opened on Jan 24, 2019
Story
As a provider, I want timely alerts raised from the metrics so that I can make informed decisions.
Motivation
- MCM exposes a number of metrics, such as the number of API calls to the cloud provider and the freeze status of MCM (see "Enhance metrics endpoint" #189).
- Defining alerts based on these metrics will help Ops react in a timely manner whenever action is required.
- There are challenges with the Azure cloud provider during deletion of machines (see "Improve logic of VM-Deletion and Safey-controller for Azure" #200).
Acceptance Criteria
- Alerts are defined for the situations above so that the required action can be taken.
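
As a rough illustration of what an alert on such a metric could look like, the sketch below evaluates a threshold rule that fires only once its condition has held for a minimum duration. The names `AlertRule`, `evaluate`, and the metric name `mcm_frozen` are hypothetical and not taken from MCM; a real setup would more likely express this as a rule in the monitoring system itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AlertRule:
    """Threshold alert that fires only after the condition holds for `for_seconds`."""
    name: str
    metric: str                          # hypothetical metric name
    condition: Callable[[float], bool]   # returns True when the value is unhealthy
    for_seconds: float                   # how long the condition must hold


def evaluate(rule: AlertRule, samples: List[Tuple[float, float]]) -> bool:
    """Return True if the condition has held continuously, up to and including
    the latest sample, for at least `rule.for_seconds`.

    `samples` is a time-ordered list of (timestamp_seconds, value) pairs.
    """
    pending_since = None
    for timestamp, value in samples:
        if rule.condition(value):
            if pending_since is None:
                pending_since = timestamp  # condition just became true
        else:
            pending_since = None           # condition cleared, reset the timer
    if pending_since is None:
        return False
    return samples[-1][0] - pending_since >= rule.for_seconds


# Assumption: a gauge that is 1 while MCM is frozen and 0 otherwise.
frozen_rule = AlertRule(
    name="MCMFrozen",
    metric="mcm_frozen",
    condition=lambda v: v == 1,
    for_seconds=300,
)
```

Requiring the condition to hold for a period before firing avoids alerting Ops on short-lived states that clear on their own.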
Definition of Done
- Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
- Unit tests are provided: Have you written automated unit tests?
- Integration tests are provided: Have you written automated integration tests?
- Minimum API exposure: If you have added/changed public API, was it really necessary/is it minimal?
- Operations guide: Have you updated the operations guide about ops-relevant changes?
- User documentation: Have you updated the READMEs/docs/how-tos about user-relevant changes?
Possible metrics to add (rough work)
- We could provide metrics on the number of machines in each status, so that filtering by status is possible (if not already exposed).
- Metrics on the time taken for a machine to join could be added; this would show the overall average joining time on any provider.
- Metrics recording when MCM performed a scale-up or scale-down, and when the Cluster Autoscaler (CA) did.
- Metrics that could help resolve typical DoD issues, such as a node not joining.
- How long each resource, such as a VM or disk, took to be created, especially on Azure.
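
The first two ideas in the list above can be sketched as plain aggregations over machine records. This is a minimal sketch under stated assumptions: the `Machine` record, its fields, and the phase and provider strings are illustrative and do not reflect MCM's actual types or metric names.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Machine:
    """Hypothetical, simplified view of a machine for metric purposes."""
    name: str
    provider: str
    phase: str                         # e.g. "Pending", "Running", "Failed"
    created_at: float                  # seconds since epoch
    joined_at: Optional[float] = None  # None while the node has not joined


def count_by_phase(machines: Iterable[Machine]) -> Dict[str, int]:
    """Number of machines in each phase, suitable for a gauge labelled by phase."""
    return dict(Counter(m.phase for m in machines))


def average_join_seconds(machines: Iterable[Machine]) -> Dict[str, float]:
    """Average time from creation to node join, per provider.

    Machines that have not joined yet are skipped.
    """
    durations = defaultdict(list)
    for m in machines:
        if m.joined_at is not None:
            durations[m.provider].append(m.joined_at - m.created_at)
    return {provider: sum(d) / len(d) for provider, d in durations.items()}


machines = [
    Machine("m1", "azure", "Running", created_at=0, joined_at=120),
    Machine("m2", "azure", "Running", created_at=10, joined_at=250),
    Machine("m3", "aws", "Running", created_at=0, joined_at=90),
    Machine("m4", "azure", "Pending", created_at=20),
]
print(count_by_phase(machines))
print(average_join_seconds(machines))
```

An average hides outliers, so a histogram of join times per provider may be a better fit if slow joins on a single provider are the main concern.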
Labels
- Monitoring (including availability monitoring and alerting) related
- Effort for issue is around 1 month
- Enhancement, improvement, extension
- Nobody worked on this for 6 months (will further age)
- Needs (more) planning with other MCM maintainers
- Priority (lower number equals higher priority)
- Affects Seed clusters