
Add scalability good practices doc #740


Merged: 1 commit merged into kubernetes:master on Jul 6, 2017

Conversation

@gmarek (Contributor) commented Jun 20, 2017

No description provided.

@gmarek self-assigned this Jun 20, 2017
@k8s-ci-robot added the cncf-cla: yes label Jun 20, 2017
@gmarek force-pushed the throughput branch 4 times, most recently from c08d6aa to 79aaa10 on June 20, 2017 16:31
to profile a component is to create an ssh tunnel to the machine running it, and run `go tool pprof localhost:<your_tunnel_port>` locally
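For background, the endpoints that `go tool pprof` reads are served by Go's standard `net/http/pprof` package. A minimal sketch of a component exposing them (with `localhost:6060` as an assumed illustration port, not any actual Kubernetes flag) could look like this:

```go
// Minimal sketch of exposing pprof endpoints; the address is an assumption
// for illustration, Kubernetes components serve these on their own ports.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

With a tunnel in place (e.g. `ssh -L 6060:localhost:6060 <node>`, ports assumed), `go tool pprof http://localhost:6060/debug/pprof/profile` collects a CPU profile through it.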

## Summary
Summing it up, when writing code you should:

Member: Did you mean this to be a bulleted list?

Member: +1

@gmarek (Author): Done.

@thockin added the lgtm label Jul 1, 2017

@wojtek-t (Member) left a review:

Just very minor nits. Other than that LGTM.

By "breaking scalability" we mean causing performance SLO violation in one of our performance tests. Performance SLOs for Kubernetes are <sup>[2](#2)</sup>:
- 99th percentile of API call latencies <= 1s
- 99th percentile of e2e Pod startup, excluding image pulling, latencies <= 5s
Tests that we run are Density and Load, we invite everyone interested in details to read the code.

Member: Break the line here (see how it looks in md file).
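As a rough illustration of what checking such an SLO means (the `percentile` helper and sample data below are illustrative assumptions, not the actual Density/Load harness):

```go
// Rough sketch of an SLO check; percentile() and the sample below are
// illustrative assumptions, not the real test code.
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0 < p <= 100) of the sample,
// using the nearest-rank method on a sorted copy.
func percentile(sample []time.Duration, p float64) time.Duration {
	s := append([]time.Duration(nil), sample...)
	sort.Slice(s, func(i, j int) bool { return s[i] < s[j] })
	idx := int(math.Ceil(float64(len(s))*p/100.0)) - 1
	if idx < 0 {
		idx = 0
	}
	return s[idx]
}

func main() {
	apiLatencies := []time.Duration{ // stand-in for observed API call latencies
		50 * time.Millisecond, 120 * time.Millisecond, 900 * time.Millisecond,
	}
	if p99 := percentile(apiLatencies, 99); p99 > time.Second {
		fmt.Printf("API latency SLO violated: p99 = %v > 1s\n", p99)
	} else {
		fmt.Println("API latency SLO met")
	}
}
```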

On "Summing it up, when writing code you should:":

Member: +1

So for each `Secret`, we were periodically GETting its value and updating the underlying variables/volumes if necessary. We have the same logic for `ConfigMaps`. Everything was great until we turned on the `ServiceAccount` admission controller in our performance tests. Then everything went to hell, for a very simple reason: the `ServiceAccount` admission controller creates a `Secret` that is attached to every `Pod` (a different one in every `Namespace`, but this doesn't change anything). Multiply the above behavior by 150k. Given a refresh period of 60s, it meant an additional 2.5k QPS (150,000 GETs spread over 60 seconds) hitting the API server, which of course blew up.
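A sketch of the pattern described above, with made-up names standing in for the real kubelet code:

```go
// Sketch of the per-Pod polling pattern described above; all names and the
// getSecret stand-in are assumptions, not the actual kubelet source.
package main

import (
	"fmt"
	"time"
)

// getSecret stands in for a GET request to the API server.
func getSecret(key string) string {
	fmt.Println("GET", key)
	return "secret-data"
}

// refreshSecretForPod is what each Pod effectively ran: one GET per Pod per
// period, even when thousands of Pods reference the same Secret.
func refreshSecretForPod(secretKey string, period time.Duration) {
	for range time.Tick(period) {
		_ = getSecret(secretKey)
	}
}

func main() {
	// Two Pods in the same Namespace still poll the shared Secret separately.
	// Scaled to 150,000 Pods at a 60s period: 150000/60 = 2500 GETs/s.
	go refreshSecretForPod("ns1/default-token", 60*time.Second)
	go refreshSecretForPod("ns1/default-token", 60*time.Second)
	time.Sleep(3 * time.Minute)
}
```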

What we did to mitigate this issue was, in a way, to reimplement Informers using GETs instead of WATCHes. The current solution consists of a `Secret` cache shared between all `Pod`s. When a `Pod` wants to check whether a `Secret` changed, it looks into the cache. If the `Secret` stored in the cache is too old, the cache issues a GET request to the API server to get the current value. As `Pod`s within a single `Namespace` share the `Secret` for their `ServiceAccount`, this means that the Kubelet needs to refresh the `Secret` only once in a while per `Namespace`, not per `Pod`, as it was before. This is of course a stopgap, not a final solution, which is currently (as of early May 2017) being designed as "Bulk Watch".

Member: You can link the design at the end: #443
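A sketch of that stopgap cache, with assumed names and a `fetch` hook standing in for the real API call (the actual kubelet code differs):

```go
// Sketch of a TTL-based Secret cache shared by all Pods on a node;
// types, names, and the fetch hook are illustrative assumptions.
package main

import (
	"sync"
	"time"
)

type cachedSecret struct {
	value     string // stand-in for the real Secret object
	fetchedAt time.Time
}

type secretCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]*cachedSecret // keyed by namespace/name, shared by all Pods
	fetch   func(key string) string  // issues the actual GET to the API server
}

// Get returns the cached Secret, refreshing it with a single GET only when
// the cached copy is older than ttl. Pods sharing a Secret (e.g. the
// per-Namespace ServiceAccount Secret) therefore share one refresh.
func (c *secretCache) Get(key string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok || time.Since(e.fetchedAt) > c.ttl {
		e = &cachedSecret{value: c.fetch(key), fetchedAt: time.Now()}
		c.entries[key] = e
	}
	return e.value
}

func main() {
	c := &secretCache{
		ttl:     time.Minute,
		entries: map[string]*cachedSecret{},
		fetch:   func(key string) string { return "secret-data" },
	}
	// Many Pods in one Namespace asking for the same Secret hit one entry,
	// so only the first call within each TTL window triggers a GET.
	for i := 0; i < 1000; i++ {
		_ = c.Get("ns1/default-token")
	}
}
```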

@gmarek force-pushed the throughput branch 3 times, most recently from 3f030ba to 8724777 on July 6, 2017 10:01

@gmarek (Author) commented Jul 6, 2017:

Done. @wojtek-t PTAL


@wojtek-t (Member) commented Jul 6, 2017:

/lgtm

@wojtek-t merged commit b5c9eae into kubernetes:master Jul 6, 2017