prr: start of pilot policy doc #4181

Merged

k8s-ci-robot merged 1 commit into kubernetes:master from johnbelamaric:prod-readiness

Oct 19, 2019

Member

johnbelamaric commented Oct 17, 2019

No description provided.

k8s-ci-robot added cncf-cla: yes size/M approved labels

k8s-ci-robot requested review from derekwaynecarr and dims

October 17, 2019 17:30

k8s-ci-robot added area/developer-guide sig/architecture labels

wojtek-t reviewed

contributors/devel/sig-architecture/production-readiness.md Outdated

		@@ -0,0 +1,51 @@
		# Production Readiness Review Process

Member

wojtek-t Oct 17, 2019

nit: I would move it to sig-architecture/

[there is already api-review process doc there]

Member Author

johnbelamaric Oct 17, 2019

it's already there?

Member Author

johnbelamaric Oct 17, 2019

oh, i see

Member

wojtek-t Oct 17, 2019

I meant not contributors/devel/sig-architecture, but simply sig-architecture.

So basically here:
https://github.com/kubernetes/community/tree/master/sig-architecture

contributors/devel/sig-architecture/production-readiness.md Outdated

+              # Production Readiness Review Process
+              Production readiness reviews are intended to ensure that features merging into
+              Kubernetes are observable and supportable, can be safely operated in production

Member

wojtek-t Oct 17, 2019

and scalable ?

Contributor

k8s-ci-robot commented Oct 17, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~sig-architecture/OWNERS~~ [johnbelamaric]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member

wojtek-t commented Oct 18, 2019

@johnbelamaric - please squash the commits and I will LGTM. I would like to merge quickly and iterate - the doc already makes it clear that it's "under development" and not fully figured out.


          Initial PRR pilot policy doc

f79dd3e

johnbelamaric force-pushed the prod-readiness branch from b867d56 to f79dd3e Compare

October 18, 2019 16:12

Member Author

johnbelamaric commented Oct 18, 2019

squashed

Member

wojtek-t commented Oct 19, 2019

Let's merge and iterate.

/lgtm

k8s-ci-robot assigned wojtek-t

k8s-ci-robot added the lgtm label

k8s-ci-robot merged commit eeec091 into kubernetes:master

deads2k reviewed

sig-architecture/production-readiness.md

+              ## Questionnaire
+              * Feature enablement and rollback
+                - How can this feature be enabled / disabled in a live cluster?

Contributor

deads2k Nov 6, 2019

I think this should be clarified to be a live-HA cluster.

deads2k reviewed

sig-architecture/production-readiness.md

+                  of a node?
+                - What happens if a cluster with this feature enabled is rolled back? What
+                  happens if it is subsequently upgraded again?
+                - Are there tests for this?

Contributor

deads2k Nov 6, 2019

Clarify "this". I suspect you mean, "are there tests for a disable, enable, disable, enable cycle", but you could also mean "upgrade, downgrade, upgrade" which seems pretty onerous at the moment.

deads2k reviewed

sig-architecture/production-readiness.md

+              * Dependencies
+                - Does this feature depend on any specific services running in the cluster
+                  (e.g., a metrics service)?
+                - How does this feature respond to complete failures of the services on

Contributor

deads2k Nov 6, 2019

I would be slightly more prescriptive here: "How would a cluster-admin know that this feature is failing because a particular service is degraded?" It could be two questions, but when I'm deploying, I want to know how to tell it's failing.

deads2k reviewed

sig-architecture/production-readiness.md

+                - How does this feature respond to degraded performance or high error rates
+                  from services on which it depends?
+              * Monitoring requirements
+                - How can an operator determine if the feature is in use by workloads?

Contributor

deads2k Nov 6, 2019

do we specifically care about workloads or just "in use"?

deads2k reviewed

Contributor

deads2k left a comment

sigh, github.

sig-architecture/production-readiness.md

+                  which it depends?
+                - How does this feature respond to degraded performance or high error rates
+                  from services on which it depends?
+              * Monitoring requirements

Contributor

deads2k Nov 6, 2019

I'd like to be slightly more prescriptive here. I want to ensure that any new binary comes with a secured health, ready, and metrics endpoint.

ericavonb reviewed

sig-architecture/production-readiness.md

+              * Feature enablement and rollback
+                - How can this feature be enabled / disabled in a live cluster?
+                - Can the feature be disabled once it has been enabled (i.e., can we roll

ericavonb Nov 6, 2019

It might be good to include impact on workloads as well, distinct from the control-plane/cluster-level considerations.
For example, some workload considerations might be:

- Does this feature change the behavior or performance characteristics of workloads running on a cluster?
- Will some workloads that could run successfully on the cluster before stop working or no longer be admissible once this feature is enabled?
- Do workloads need to be restarted to take advantage of this feature?
- How can workloads be migrated over to take advantage of this feature? Can it be selectively enabled (e.g. per-node/per-namespace, only to new workloads/objects, in a report-only or dry-run mode)? Will enabling/disabling the feature require downtime or make certain features temporarily unavailable for workloads running on the cluster?

Reviewers

wojtek-t left review comments

deads2k left review comments

ericavonb left review comments

Awaiting requested review from derekwaynecarr

Awaiting requested review from dims

Labels

approved area/developer-guide cncf-cla: yes lgtm sig/architecture size/M