Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-3130: kms observability #3133

Merged
merged 1 commit into from
Feb 3, 2022

Conversation

aramase
Copy link
Member

@aramase aramase commented Jan 13, 2022

Signed-off-by: Anish Ramasekar anish.ramasekar@gmail.com

  • One-line PR description: KMS Observability
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 13, 2022
@k8s-ci-robot k8s-ci-robot requested review from deads2k and enj January 13, 2022 08:26
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/auth Categorizes an issue or PR as relevant to SIG Auth. labels Jan 13, 2022
@aramase
Copy link
Member Author

aramase commented Jan 13, 2022

/cc @ritazh @enj

@k8s-ci-robot k8s-ci-robot requested a review from ritazh January 13, 2022 08:26

## Motivation

The only way to correlate a successful/failed envelope operation today is to use the approximate timestamp of the operation to check events in kube-apiserver, kms-plugin and KMS. There is no guarantee that the timestamp of the operation is the same as the timestamp of the corresponding event in KMS. This KEP proposes extending the signature of the kms-plugin interface to include the transaction ID (to be generated by the kube-apiserver), which kms-plugin could pass to KMS. This transaction ID will be logged with additional metadata such a secret name and namespace for the envelope operation. Similarly, the transaction ID will be logged in the kms-plugin and optionally passed to KMS.
Copy link
Member

@dgrisonnet dgrisonnet Jan 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only way to correlate a successful/failed envelope operation today

Stepping further back in the investigation process, there aren't any ways to easily detect envelope operation failures. So on top of wanting this correlation, one other motivation could be to improve the detection of failed operations.

This would allow us to introduce a new goal to this KEP which is to improve monitoring around this kind of incident. The solution as we've discussed in our meetings would be to introduce metrics at the apiserver level to detect failed operations and libraries at the kms provider level to provide a set of metrics in both the Prometheus and the OTel format that any implementation could reuse. Once we have that anyone could write alerts to be informed of failures.

- [ ] e2e Tests for all Beta API Operations (endpoints)
- [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With regard to Graduation criteria, since we are going with the option of generating a UID for the Alpha version of this KEP. We could have a criteria to switch to audit ID as a graduation criteria to go Beta if this is the direction we want to go with in the future.
Or we could phrase that in a more generic way such as "integrating with existing audit identification mechanisms".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're still investigating if integration with audit ID is possible but I'm not sure if we want to add that here as graduation criteria before knowing it would be feasible. If we can edit the graduation criteria at a later point then we can add it but if not, we should update it after investigation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to investigating if we can use audit ID, there's also the concern that not all calls to the KMS correspond to an audit log, which would make audit ID not ideal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

Comment on lines 196 to 198
###### How does this feature react if the API server and/or etcd is unavailable?

- Secret data encryption with external kms-plugin is unavailable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me this seems more appropriate to the KMS feature as a whole but if we talking about KMS observability only, I feel like the following questions would be more appropriate to elaborate on:

  • How does this feature react if the KMS provider / KMS is unavailable?
  • How does this feature react if encrypted resources are unavailable?

To answer that question we could elaborate on how to use the various information introduced by this KEP to troubleshoot these problems.

We considered using the AuditID from the kube-apiserver request that generated the envelope operation. This approach has the following drawbacks:

1. AuditID can be configured by the user with the `Audit-ID` header in the API server request. Multiple requests can be sent to the kube-apiserver with the same Audit-ID.
2. Not all API server requests will generate an envelope operation. The API server caches DEKs and for the DEK that's available in the cache, the kube-apiserver will not generate an envelope operation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that is an issue. When troubleshooting I don't think we will care about the operations that are not hitting the kms since they shouldn't fail in the first place and they won't have any impact on the kms.

That said, I don't think hitting the cache is worth logging, but if we think it is useful to keep track of how much the cache is hit, then a metric would be more appropriate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that is an issue. When troubleshooting I don't think we will care about the operations that are not hitting the kms since they shouldn't fail in the first place and they won't have any impact on the kms.

With using auditID and having audit logging, this is a drawback because there is no corresponding entry in kms for the UID. This is under the alternatives that we considered but not the actual proposal. Hope that clarifies it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, that is completely normal since the kms provider will not be part of this communication. The information that there is no corresponding entry in kms for the UID can be understood as kms can be ruled out from the potential cause of the problem and that the cache needs to be further investigated.

@aramase aramase force-pushed the 3130-kms-observability branch 2 times, most recently from 684910b to 87c0374 Compare January 19, 2022 18:27
Copy link
Member

@enj enj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor stuff

@@ -14,3 +14,9 @@ approvers:
- "@smarterclayton"
stage: alpha
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be the same as https://github.com/kubernetes/enhancements/pull/3133/files#diff-3c134424e5ca25c4b9b0c73ea74c09b970b5f1818809d744c5d585bd3ba00d2bR15. Should we change to stable and update PRR when we make it GA in 1.25?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we were just making this whole KEP start at GA / stable (because it is a minor change that should be backwards incompatible). We have the pointer field+feature gate as a temporary safe guard in case anyone has issues with it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! changed it to stable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we were just making this whole KEP start at GA / stable (because it is a minor change that should be backwards incompatible).

@enj Do you mean it should be backwards compatible?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we were just making this whole KEP start at GA / stable (because it is a minor change that should be backwards incompatible).

@enj Do you mean it should be backwards compatible?

Yes, typo on my part.

- [x] Other
- Describe the mechanism: The UID field will be added by default to the `EncryptRequest` and `DecryptRequest` structs in the kms-plugin interface. For older Kubernetes versions where UID isn't generated, the UID field will be empty.
- Feature gate
- Feature gate name: `KMSOperationUID` (enabled by default)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the specifics are:

FeatureSpec{
	Default: true,  // enabled by default
	LockToDefault: false, // can be disabled if needed
	PreRelease: featuregate.Deprecated, // gate will dropped in 1.25
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe PreRelease: featuregate.GA is the correct thing? I think this field refers to the feature itself and not the gate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, featuregate.GA seems to refer to the feature, so using that

@aramase aramase requested a review from enj January 21, 2022 19:56
@enj
Copy link
Member

enj commented Jan 24, 2022

/assign @smarterclayton @deads2k

Clayton PTAL at the KEP details. David PTAL at the PRR bits.

FeatureSpec{
Default: true, // enabled by default
LockToDefault: false, // can be disabled if needed
PreRelease: featuregate.GA, // gate will dropped in 1.25
Copy link
Contributor

@deads2k deads2k Feb 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain in the KEP why this feature should start in Stable instead of going through a normal feature lifecycle?

Copy link
Member Author

@aramase aramase Feb 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! I've summarized the discussion in the KEP for why we are starting as stable.

}
```

The introduction of the new UID field is backward compatible and considered a minor change. For this reason the feature gate is enabled by default and we are starting with Stable. The pointer field plus feature gate is added as a temporary safeguard in case there are issues with the inclusion of a new field.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being backwards compatibility isn't a sufficient reason to start as stable. All APIs are required to be backwards compatible. I think this should follow a normal feature lifecycle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little skeptical as well - it's a new field, and it's adding a new concept to a stable API. That's enough to default to normal lifecycle. The change to the behavior surface area of an API is usually the reason we go through normal alpha->stable lifecycle, because once it's done we can't take it back (because stable apis are stable).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little skeptical as well - it's a new field, and it's adding a new concept to a stable API.

aside: the KMS gRPC API is one of our perma beta APIs 😭

@kikisdeliveryservice kikisdeliveryservice changed the title kep-3130: kms observability KEP-3130: kms observability Feb 3, 2022
@smarterclayton
Copy link
Contributor

smarterclayton commented Feb 3, 2022

Please squash

Approved from my perspective (changes to kms)

Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>

Addressing review feedback and suggestions

Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>

add note for UID and metadata logging

Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>

Review feedback

Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>

add prod-readiness yaml

Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>

add feature gate

Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>

Review feedback from Mo

Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>

change state to stable

Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>

add comment for stable instead of normal lifecycle

Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>

change state to alpha

Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>
@aramase aramase force-pushed the 3130-kms-observability branch from 76053fe to e183b3a Compare February 3, 2022 04:10
@aramase
Copy link
Member Author

aramase commented Feb 3, 2022

Squashed to single commit!

@enj
Copy link
Member

enj commented Feb 3, 2022

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 3, 2022

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [x] Other (treat as last resort)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest updating this with a way to inspect the response code for secret queries to check for higher error rates on upgrade and elevated rates in general. It doesn't block alpha PRR review, but will come around again for beta.

@deads2k
Copy link
Contributor

deads2k commented Feb 3, 2022

PRR is good for alpha. I have a non-blocking suggestion for a followup.

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aramase, deads2k, enj

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 3, 2022
@k8s-ci-robot k8s-ci-robot merged commit e9814d8 into kubernetes:master Feb 3, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Feb 3, 2022
@ritazh ritazh mentioned this pull request May 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/auth Categorizes an issue or PR as relevant to SIG Auth. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

7 participants