Add scalability good practices doc #740
Conversation
Force-pushed from c08d6aa to 79aaa10.
to profile a component is to create an ssh tunnel to the machine running it, and run `go tool pprof localhost:<your_tunnel_port>` locally

## Summary
Summing it up, when writing code you should:
Did you mean this to be a bulleted list?
+1
Done.
Just very minor nits. Other than that LGTM.
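As a side note on the profiling tip quoted at the top of this thread: it assumes the component already serves Go's pprof endpoints. A minimal, illustrative sketch of wiring that up (the address and port here are made up, not what Kubernetes components actually bind to) could look like this:

```go
// A minimal sketch (not actual Kubernetes component code) of exposing Go's
// built-in pprof handlers so a process can be profiled over HTTP.
package main

import (
	"log"
	"net/http"
	// The blank import registers the /debug/pprof/* handlers on the
	// default HTTP mux.
	_ "net/http/pprof"
)

func main() {
	// Bind to localhost only; port 6060 is just an example.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

With an ssh tunnel forwarding a local port to that address (for example `ssh -L <your_tunnel_port>:localhost:6060 <machine>`), `go tool pprof` can then be pointed at `localhost:<your_tunnel_port>` as the quoted text describes.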
By "breaking scalability" we mean causing performance SLO violation in one of our performance tests. Performance SLOs for Kubernetes are <sup>[2](#2)</sup>: | ||
- 99th percentile of API call latencies <= 1s | ||
- 99th percentile of e2e Pod startup, excluding image pulling, latencies <= 5s | ||
Tests that we run are Density and Load, we invite everyone interested in details to read the code. |
Break the line here (see how it looks in the md file).
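To make the quoted SLO definitions concrete, here is a small, purely illustrative helper (not the real Density/Load test code) showing what checking a 99th-percentile latency against a threshold looks like:

```go
// Purely illustrative helper showing what it means to check a
// 99th-percentile latency against an SLO threshold.
package slo

import (
	"sort"
	"time"
)

// Percentile99 returns the 99th percentile (nearest-rank method) of the
// observed latencies, or 0 if there are none.
func Percentile99(latencies []time.Duration) time.Duration {
	if len(latencies) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), latencies...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := (99*len(sorted)+99)/100 - 1 // ceil(0.99 * n) - 1
	return sorted[idx]
}

// MeetsSLO reports whether the 99th percentile of the latencies is within
// the threshold, e.g. 1*time.Second for API calls or 5*time.Second for
// Pod startup.
func MeetsSLO(latencies []time.Duration, threshold time.Duration) bool {
	return Percentile99(latencies) <= threshold
}
```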
So for each `Secret` we were periodically GETting its value and updating the underlying variables/volumes if necessary. We have the same logic for `ConfigMaps`. Everything was great until we turned on the `ServiceAccount` admission controller in our performance tests. Then everything went to hell for a very simple reason: the `ServiceAccount` admission controller creates a `Secret` that is attached to every `Pod` (a different one in every `Namespace`, but that doesn't change anything). Multiply the behavior above by 150k. Given a refresh period of 60s, that meant an additional 2.5k QPS (150,000 GETs spread over 60 seconds) hitting the API server, which of course blew up.

What we did to mitigate this issue was, in a way, to reimplement Informers using GETs instead of WATCHes. The current solution consists of a `Secret` cache shared between all `Pod`s. When a `Pod` wants to check whether a `Secret` has changed, it looks into the cache. If the `Secret` stored in the cache is too old, the cache issues a GET request to the API server to fetch the current value. Because `Pod`s within a single `Namespace` share the `Secret` for their `ServiceAccount`, the Kubelet only needs to refresh that `Secret` once in a while per `Namespace`, not per `Pod`, as it did before. This is of course a stopgap, not a final solution, which is currently (as of early May 2017) being designed as a "Bulk Watch".
You can link the design at the end: #443
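For context, a hypothetical Go sketch of the TTL-cache approach described in the quoted paragraph might look like the following; the `Cache` type and the injected `fetch` function are made up for illustration, not the actual Kubelet code:

```go
// Hypothetical sketch of a TTL-based Secret cache: a GET is issued only
// when the cached copy is older than ttl, so Pods that share a Secret also
// share a single periodic refresh instead of one refresh per Pod.
package secretcache

import (
	"sync"
	"time"
)

type entry struct {
	value     string    // stand-in for the real Secret object
	fetchedAt time.Time // when the value was last GOT from the API server
}

// Cache serves Secret lookups for all Pods on a node.
type Cache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]entry        // keyed by "namespace/name"
	fetch   func(key string) string // performs the actual GET against the API server
}

func New(ttl time.Duration, fetch func(key string) string) *Cache {
	return &Cache{ttl: ttl, entries: map[string]entry{}, fetch: fetch}
}

// Get returns the cached value for key, refreshing it from the API server
// only if the cached copy is missing or older than ttl.
func (c *Cache) Get(key string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok || time.Since(e.fetchedAt) > c.ttl {
		e = entry{value: c.fetch(key), fetchedAt: time.Now()}
		c.entries[key] = e
	}
	return e.value
}
```

With far fewer Namespaces than Pods, this turns roughly one GET per Pod per refresh period into roughly one GET per Namespace per refresh period.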
Force-pushed from 3f030ba to 8724777.
Done. @wojtek-t PTAL
/lgtm
No description provided.