
Add typed errors, add errors_total metric #74

Merged: 4 commits into kubernetes:master on May 19, 2017

Conversation

MaciekPytel (Contributor)

Ref: #59

MaciekPytel added this to the CA-0.6 milestone on May 16, 2017
MaciekPytel requested a review from mwielgus on May 16, 2017 15:08
k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on May 16, 2017
 	}
 	if len(readyNodes) == 0 {
 		glog.Error("No ready nodes in the cluster")
 		scaleDown.CleanUpUnneededNodes()
-		return
+		return nil
Contributor

You're returning nil because it doesn't look like an error?
Then, should we turn glog.Error into glog.Warning?

Contributor Author

Yes, I don't think having an empty cluster is an autoscaler error, so I'm returning nil here. I will update log levels accordingly.

So far I've only done enough in core/ to get this PR to compile, so I could quickly get feedback on the idea of introducing typed errors. We have a lot of places where a returned error signals the failure of some function, but not an actual error condition that should be reported as such (e.g. not scaling up because the cluster has reached its maximum size is expected autoscaler behavior). So I will have to make a lot of changes all over the codebase, and I wanted to get other opinions before I do so.
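
For readers following along, here is a minimal, illustrative sketch of what such a typed error could look like in Go. Only the general idea and the error-type strings visible elsewhere in this thread (e.g. cloudProviderError, apiServerError) come from the PR; every other name in this sketch is an assumption, not the actual implementation:

```go
// Sketch only: the real definitions live in this PR's new errors package.
package errors

// AutoscalerErrorType distinguishes genuine failures (cloud provider, API
// server, internal bugs) from expected conditions that merely end a loop
// iteration, such as the cluster already being at its maximum size.
type AutoscalerErrorType string

const (
	// Values reconstructed from the errors_total samples quoted later in
	// this thread; treat them as illustrative.
	CloudProviderError AutoscalerErrorType = "cloudProviderError"
	ApiServerError     AutoscalerErrorType = "apiServerError"
	InternalError      AutoscalerErrorType = "internalError"
	TransientError     AutoscalerErrorType = "transientError"
)

// AutoscalerError carries a message together with its type, so a caller can
// decide whether a failure should be logged as an error and counted in the
// errors_total metric, or treated as normal autoscaler behavior.
type AutoscalerError struct {
	errorType AutoscalerErrorType
	msg       string
}

// Error implements the standard error interface.
func (e AutoscalerError) Error() string { return e.msg }

// Type reports which category the error belongs to.
func (e AutoscalerError) Type() AutoscalerErrorType { return e.errorType }
```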

Contributor Author

done

 	}
 	if len(allNodes) == 0 {
 		glog.Error("No nodes in the cluster")
 		scaleDown.CleanUpUnneededNodes()
-		return
+		return nil
Contributor
Contributor Author

done

@@ -157,24 +158,24 @@ func (a *StaticAutoscaler) RunOnce(currentTime time.Time) {
 	fixedSomething, err := fixNodeGroupSize(autoscalingContext, time.Now())
 	if err != nil {
 		glog.Warningf("Failed to fix node group sizes: %v", err)
-		return
+		return errors.ToAutoscalerError(errors.CloudProviderError, err)
Contributor

If it is an error, should we log it as such? More concretely, by turning glog.Warningf into glog.Errorf.

Contributor Author

done
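
For reference, a hedged sketch of what errors.ToAutoscalerError from the hunk above might do, assuming the illustrative AutoscalerError type sketched earlier; a nil input passes through so the happy path needs no special handling:

```go
// Sketch only, building on the illustrative AutoscalerError type above.
// ToAutoscalerError tags a plain error with a category so that RunOnce can
// return it and have it counted per type in errors_total; nil stays nil.
func ToAutoscalerError(errorType AutoscalerErrorType, err error) *AutoscalerError {
	if err == nil {
		return nil
	}
	return &AutoscalerError{errorType: errorType, msg: err.Error()}
}
```

Used as in the hunk above: return errors.ToAutoscalerError(errors.CloudProviderError, err).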

@MaciekPytel MaciekPytel changed the title WIP: Add typed errors, add errors_total metric Add typed errors, add errors_total metric May 18, 2017
To keep the commit size reasonable, only top-level files use the
new errors. They will be added to the other files in the next commits.
Updated some of the functions called by scale up
to return the new errors as required.
MaciekPytel (Contributor Author) commented May 18, 2017

Since this PR changes quite a few places, this is how I'm testing it (a sketch of the errors_total metric follows the list):

  • do a few manual tests scaling a deployment up and down to see if it works as expected [passed]
  • use iptables to cut connectivity to the cloud provider, check that errors_total{type="cloudProviderError"} increases, then remove the iptables rule and check that everything works as expected and errors_total stops increasing [passed]
  • introduce a few "bugs" in the code resulting in internal errors, confirm errors_total grows as expected [passed]
  • hit the maximum cluster size, so that scale up returns TransientErrors; errors_total should not grow [passed]
  • introduce a "bug" that randomly fails requests to the API server with ~10% probability, observe that everything works as expected and errors_total{type="apiServerError"} increases at a reasonable rate (about one per 100s) [passed]
  • run the e2e test suite, confirm it passes and all metrics have reasonable values after it's done [passed]
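
As mentioned above, here is a sketch of the errors_total metric exercised by this test plan, using the Prometheus Go client. Only the metric name and the type label are taken from the samples quoted in the list (errors_total{type="cloudProviderError"}); the registration point, help string, and the choice to skip transient errors here rather than in the caller are assumptions:

```go
// Sketch only; the real metric is defined in this PR's metrics code.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var errorsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "errors_total",
		Help: "Number of autoscaler iterations that failed, partitioned by error type.",
	},
	[]string{"type"},
)

func init() {
	// Register with the default registry so the counter is exposed on /metrics.
	prometheus.MustRegister(errorsTotal)
}

// RegisterError bumps the counter for the given error type. Per the test plan
// above, transient conditions (e.g. the cluster is already at maximum size)
// must not grow errors_total, so they are filtered out here (an assumption
// about where that filtering happens).
func RegisterError(errorType string) {
	if errorType == "transientError" {
		return
	}
	errorsTotal.WithLabelValues(errorType).Inc()
}
```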

mwielgus (Contributor)

LGTM

mwielgus merged commit 9083c13 into kubernetes:master on May 19, 2017
Labels: area/cluster-autoscaler, cncf-cla: yes