Commit fc234c1

Merge pull request #5327 from ritazh/kep-5018-beta
KEP-5018: move to beta in 1.34
2 parents 54d8e48 + f059494 commit fc234c1

3 files changed: 88 additions, 48 deletions

keps/prod-readiness/sig-auth/5018.yaml

Lines changed: 2 additions & 0 deletions
@@ -4,3 +4,5 @@
 kep-number: 5018
 alpha:
   approver: "soltysh"
+beta:
+  approver: "soltysh"

keps/sig-auth/5018-dra-adminaccess/README.md

Lines changed: 84 additions & 46 deletions
@@ -44,13 +44,13 @@
 Items marked with (R) are required _prior to targeting to a milestone /
 release_.
 
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in
+- [x] (R) Enhancement issue in release milestone, which links to KEP dir in
 [kubernetes/enhancements] (not the initial KEP PR)
-- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [x] (R) KEP approvers have approved the KEP status as `implementable`
 - [x] (R) Design details are appropriately documented
-- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and
+- [x] (R) Test plan is in place, giving consideration to SIG Architecture and
 SIG Testing input (including test refactors)
-- [ ] e2e Tests for all Beta API Operations (endpoints)
+- [x] e2e Tests for all Beta API Operations (endpoints)
 - [ ] (R) Ensure GA e2e tests meet requirements for
 [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
@@ -59,12 +59,12 @@ release_.
 [all GA Endpoints](https://github.com/kubernetes/community/pull/1806)
 must be hit by
 [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Production readiness review completed
-- [ ] (R) Production readiness review approved
+- [x] (R) Production readiness review completed
+- [x] (R) Production readiness review approved
 - [x] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for
+- [x] User-facing documentation has been created in [kubernetes/website], for
 publication to [kubernetes.io]
-- [ ] Supporting documentation—e.g., additional design documents, links to
+- [x] Supporting documentation—e.g., additional design documents, links to
 mailing list discussions/SIG meetings, relevant PRs/issues, release notes
 
 <!--
@@ -179,9 +179,11 @@ objects as privileged. This feature includes:
 ```yaml
 metadata:
   labels:
-    resource.k8s.io/admin-access: "true"
+    resource.kubernetes.io/admin-access: "true"
 ```
 
+Note: This label has been updated from `resource.k8s.io/admin-access` while the feature was in alpha.
+
 Assumptions:
 
 - It is not important to subdivide admin access to different types of
@@ -194,7 +196,7 @@ objects as privileged. This feature includes:
 
 In the REST storage layer, validate requests to create and update
 `ResourceClaim` or `ResourceClaimTemplate` objects with `adminAccess: true`.
-Only authorize if namespace has the `resource.k8s.io/admin-access: "true"` label.
+Only authorize if namespace has the `resource.kubernetes.io/admin-access: "true"` label.
 
 1. Grants privileged access to the requested device:
 
@@ -212,7 +214,7 @@ objects as privileged. This feature includes:
 ### Workflow
 
 1. A cluster administrator labels an admin namespace with
-`resource.k8s.io/admin-access: "true"`.
+`resource.kubernetes.io/admin-access: "true"`.
 
 1. Users who are authorized to create `ResourceClaim` or `ResourceClaimTemplate`
 objects in this admin namespace can set `adminAccess: true` field if they
@@ -284,7 +286,7 @@ shouldn't have allowed unrestricted access.
 Starting in Kubernetes 1.33 (when this KEP was introduced), a validation has
 been added to the REST storage layer to only authorize `ResourceClaim` or
 `ResourceClaimTemplate` with `adminAccess: true` requests if their namespace has
-the `resource.k8s.io/admin-access: "true"` label to only allow it for users with
+the `resource.kubernetes.io/admin-access: "true"` label to only allow it for users with
 additional privileges.
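
For illustration, a namespace and claim that would be expected to pass this validation might look like the following. This is a minimal sketch, assuming the `resource.k8s.io/v1beta1` field layout; the names `dra-admin`, `admin-claim`, `admin-req`, and the device class `gpu.example.com` are hypothetical and do not come from the KEP.

```yaml
# Hypothetical admin namespace carrying the label checked by the validation.
apiVersion: v1
kind: Namespace
metadata:
  name: dra-admin                     # hypothetical name
  labels:
    resource.kubernetes.io/admin-access: "true"
---
# Hypothetical claim requesting admin access; assumes the v1beta1 API shape.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: admin-claim                   # hypothetical name
  namespace: dra-admin                # must be the labeled namespace
spec:
  devices:
    requests:
    - name: admin-req                 # hypothetical request name
      deviceClassName: gpu.example.com  # hypothetical device class
      allocationMode: All
      adminAccess: true
```

Creating the same claim in a namespace that lacks the label would be expected to be rejected by the REST storage layer.
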
 
 The below flowchart starts with `ResourceClaim` creation from
@@ -403,19 +405,15 @@ The scheduler plugin and resource claim controller are covered by the workloads
 in
 https://github.com/kubernetes/kubernetes/blob/master/test/integration/scheduler_perf/dra/performance-config.yaml
 
-Those tests run in:
+Additional test cases have been added to `test/integration/scheduler_perf` to
+ensure `ResourceClaim` or `ResourceClaimTemplate` with `adminAccess: true`
+requests are only authorized if their namespace has the
+`resource.kubernetes.io/admin-access: "true"` label as described in this KEP.
+
+These tests run as part of the following with the `DRAAdminAccess` feature gate enabled.
 
-- [pre-submit](https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-integration)
-and
-[periodic](https://testgrid.k8s.io/sig-release-master-blocking#integration-master)
-integration testing under
-`k8s.io/kubernetes/test/integration/scheduler_perf.scheduler_perf` and
-`k8s.io/kubernetes/test/integration/scheduler_perf.dra.dra` and the
-`DRAAdminAccess` feature gate is already enabled.
-- Additional test cases will be added to `test/integration/scheduler_perf` to
-ensure `ResourceClaim` or `ResourceClaimTemplate` with `adminAccess: true`
-requests are only authorized if their namespace has the
-`resource.k8s.io/admin-access: "true"` label as described in this KEP.
+- `k8s.io/kubernetes/test/integration/scheduler_perf.scheduler_perf`: [pre-submit](https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-integration&include-filter-by-regex=scheduler_perf.scheduler_perf), [periodic](https://testgrid.k8s.io/sig-release-master-blocking#integration-master&include-filter-by-regex=scheduler_perf.scheduler_perf), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=scheduler_perf)
+- `k8s.io/kubernetes/test/integration/scheduler_perf.dra.dra`: [pre-submit](https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-integration&include-filter-by-regex=scheduler_perf.dra.dra), [periodic](https://testgrid.k8s.io/sig-release-master-blocking#integration-master&include-filter-by-regex=scheduler_perf.dra.dra), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=scheduler_perf)
 
 ##### e2e tests
 
@@ -436,7 +434,7 @@ was developed as part of the overall DRA development effort. We have extended
 this test driver to enable `DRAAdminAccess` feature gate and added tests to
 ensure `ResourceClaim` or `ResourceClaimTemplate` with `adminAccess: true`
 requests are only authorized if their namespace has the
-`resource.k8s.io/admin-access: "true"` label as described in this KEP.
+`resource.kubernetes.io/admin-access: "true"` label as described in this KEP.
 
 Test links:
 
@@ -449,11 +447,8 @@ ResourceClaimTemplate and ResourceClaim for admin access
 [Feature:DRAAdminAccess] [FeatureGate:DRAAdminAccess] [Alpha]
 [FeatureGate:DynamicResourceAllocation] [Beta]
 
-- AdminAccess related tests in
-https://github.com/kubernetes/kubernetes/blob/69ab91a5c59617872c9f48737c64409a9dec2957/test/e2e/dra/dra.go#L976
-and
-https://github.com/kubernetes/kubernetes/blob/69ab91a5c59617872c9f48737c64409a9dec2957/test/e2e/dra/dra.go#L1095
-will be updated.
+- `cluster validate ResourceClaimTemplate and ResourceClaim for admin access`, [SIG Node](https://testgrid.k8s.io/sig-node-dynamic-resource-allocation#pull-kubernetes-kind-dra-all), [triage search](https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=admin%20access)
+- `cluster DaemonSet with admin access`, [SIG Node](https://testgrid.k8s.io/sig-node-dynamic-resource-allocation#pull-kubernetes-kind-dra-all), [triage search](https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=admin%20access)
 
 ### Graduation Criteria
 
@@ -464,13 +459,23 @@ ResourceClaimTemplate and ResourceClaim for admin access
 
 #### Beta
 
-- Gather feedback
+- Gather feedback from developers and surveys via implementations in the kubernetes-sigs/dra-example-driver: https://github.com/kubernetes-sigs/dra-example-driver/issues/97 and potentially other drivers
+- Complete feature AdminAccess
 - Additional tests are in Testgrid and linked in KEP
-- Implementations in the kubernetes-sigs/dra-example-driver
+- More rigorous forms of testing—e.g., downgrade tests and scalability tests
+- All functionality completed
+- All security enforcement completed
+- All monitoring requirements completed
+- All testing requirements completed
+- All known pre-release issues and gaps resolved
+**Note:** Beta criteria must include all functional, security, monitoring, and testing requirements along with resolving all issues and gaps identified
 
-#### GA
 
+#### GA
+- 1 example of real-world usage
 - Allowing time for feedback
+- All issues and gaps identified as feedback during beta are resolved
+**Note:** GA criteria must not include any functional, security, monitoring, or testing requirements. Those must be beta requirements.
 
 ### Upgrade / Downgrade Strategy
 
@@ -541,7 +546,12 @@ rollout. Similarly, consider large clusters and how enablement/disablement
 will rollout across nodes.
 -->
 
-Will be considered for beta.
+- kube-controller-manager: If the kube-controller-manager fails to create `ResourceClaim` objects from `ResourceClaimTemplate` due to misconfigurations or permission issues relating to `adminAccess`, then the associated Pods will remain in a pending state and won't be scheduled.
+- kube-scheduler: Bugs in the scheduler might lead to Pods not being scheduled even when resources are available, or to Pods being scheduled that shouldn't be because of unmet `adminAccess` requirements. Both cases should be handled by the generic scheduler backoff behavior. Running workloads are not affected.
+- Workloads without `ResourceClaims` will remain unaffected as the `adminAccess` feature doesn't interact with them. The new code paths introduced for `adminAccess` only engage when `ResourceClaims` are present in the Pod specification.
+- New Pods requiring `ResourceClaims` with `adminAccess` might remain unscheduled if the control plane components fail to process the claims correctly.
+- Existing Pods continue to run unaffected because the spec of `ResourceClaim` and `ResourceClaimTemplate` is immutable, so the `adminAccess` field cannot be altered.
+
 
 ###### What specific metrics should inform a rollback?
 
@@ -557,8 +567,6 @@ the `scheduler_pending_pods` metric in the kube-scheduler or an increase in the
 Further analysis by reviewing logs and pod events is needed to determine whether
 errors are related to this feature.
 
-Will provide more details for beta.
-
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
 <!--
@@ -567,15 +575,19 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->
 
-Will be considered for beta.
+This will be done manually before transition to beta by bringing up a cluster with kubeadm and changing the feature gate for individual components.
+
+Manual upgrade of the control plane to a version with the feature enabled will be tested. Existing pods not using the feature remained running. Creation of new pods and ResourceClaims that do not use the feature should be unaffected.
+
+Manual downgrade of the control plane to a version with the feature disabled was tested. Existing pods using the feature remained running. Creation of new pods and ResourceClaims that use the feature should be blocked.
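
One way to change the feature gate per component in a kubeadm cluster is through `extraArgs` in the cluster configuration. The fragment below is a minimal sketch, assuming the kubeadm `v1beta4` configuration API (where `extraArgs` is a list of name/value pairs); it is not the procedure recorded in the KEP, and which components need the gate is an assumption here.

```yaml
# Sketch only: sets DRAAdminAccess on the control plane components.
# Flip "true" to "false" on a single component to test a mixed configuration.
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
apiServer:
  extraArgs:
  - name: feature-gates
    value: "DRAAdminAccess=true"
controllerManager:
  extraArgs:
  - name: feature-gates
    value: "DRAAdminAccess=true"
scheduler:
  extraArgs:
  - name: feature-gates
    value: "DRAAdminAccess=true"
```

Keeping each component's gate on its own entry makes it possible to disable the feature for one component at a time during a manual upgrade or downgrade check.
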
 
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
 <!--
 Even if applying deprecation policies, they may still surprise some users.
 -->
 
-Will be considered for beta.
+No.
 
 ### Monitoring Requirements
 
@@ -586,7 +598,7 @@ For GA, this section is required: approvers should be able to confirm the
 previous answers based on experience in the field.
 -->
 
-Will be considered for beta.
+Metrics in kube-controller-manager about total (resourceclaim_controller_resource_claims_adminaccess) and allocated ResourceClaims with adminAccess (resourceclaim_controller_allocated_resource_claims_adminaccess).
 
 ###### How can an operator determine if the feature is in use by workloads?
 
@@ -596,7 +608,9 @@ checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->
 
-Will be considered for beta.
+".status.allocation.devices.results[*].adminAccess" will be set to true for a claim using adminAccess when needed by a pod.
+
+Metrics in kube-controller-manager about total (resourceclaim_controller_resource_claims_adminaccess) and allocated ResourceClaims with adminAccess (resourceclaim_controller_allocated_resource_claims_adminaccess).
 
 ###### How can someone using this feature know that it is working for their instance?
 
@@ -640,7 +654,7 @@ These goals will help you determine what you need to measure (SLIs) in the next
 question.
 -->
 
-Will be considered for beta.
+SLO: 100% of unauthorized access attempts are denied.
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
@@ -673,14 +687,17 @@ metric in scheduler will identify pods that are currently unschedulable because
 of the `DynamicResources` plugin or a misconfiguration of the `AdminAccess`
 field.
 
+Audit Policy can be created to ensure all create operations on ResourceClaim, ResourceClaimTemplate, and Namespace resources are logged at the metadata level to review successful and denied attempts to set the `AdminAccess`
+field.
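
An audit policy along these lines might look like the following. This is a minimal sketch using the `audit.k8s.io/v1` `Policy` type; the rule layout is an illustration and is not taken from the KEP.

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Record metadata for creation of DRA claims and claim templates.
- level: Metadata
  verbs: ["create"]
  resources:
  - group: "resource.k8s.io"
    resources: ["resourceclaims", "resourceclaimtemplates"]
# Record metadata for namespace creation (namespaces are in the core group).
- level: Metadata
  verbs: ["create"]
  resources:
  - group: ""
    resources: ["namespaces"]
```

The policy file is passed to kube-apiserver with `--audit-policy-file`. The `Metadata` level records the requesting user, the resource, and the response status, but not the request body, so a denied create shows up as a failed request rather than as the rejected `adminAccess` value itself.
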
+
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
 <!--
 Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
 implementation difficulties, etc.).
 -->
 
-Will be considered for beta.
+No
 
 ### Dependencies
 
@@ -705,7 +722,8 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
 - Impact of its degraded performance or high-error rates on the feature:
 -->
 
-Will be considered for beta.
+- The DynamicResourceAllocation feature gate must be enabled to create ResourceClaim and ResourceClaimTemplate objects. More details at [KEP-4381 - DRA Structured Parameters](https://github.com/kubernetes/enhancements/issues/4381)
+- A third-party DRA driver is required. The driver determines how to interpret the AdminAccess field to get access to device-specific resources without allocating them.
 
 ### Scalability
 
@@ -755,7 +773,7 @@ details). For now, we leave it here.
 
 ###### How does this feature react if the API server and/or etcd is unavailable?
 
-Will be considered for beta.
+The Kubernetes control plane will be down, so no new ResourceClaim or ResourceClaimTemplate will be created.
 
 ###### What are other known failure modes?
 
@@ -772,15 +790,35 @@ For each of them, fill in the following information by copying the below templat
 - Testing: Are there any tests for failure mode? If not, describe why.
 -->
 
-Will be considered for beta.
+- kube-scheduler cannot allocate ResourceClaims with AdminAccess.
+
+- Detection: When pods fail to get scheduled, kube-scheduler reports that
+through events and pod status. For DRA, messages include "cannot allocate
+all claims" (insufficient resources) and "ResourceClaim not created yet"
+(user or kube-controller-manager haven't created the ResourceClaim yet).
+The
+["unschedulable_pods"](https://github.com/kubernetes/kubernetes/blob/9fca4ec44afad4775c877971036b436eef1a1759/pkg/scheduler/metrics/metrics.go#L200-L206)
+metric will have pods counted under the "dynamicresources" plugin label.
+
+To troubleshoot, "kubectl describe" can be used on (in this order) Pod
+and ResourceClaim.
+
+- Mitigations: When ResourceClaims or ResourceClaimTemplates with the `AdminAccess`
+field don't get created, debugging should focus on the namespace labels. The kube-controller-manager logs should have more information.
+
+- Diagnostics: Audit Policy can be created to ensure all create operations on ResourceClaim, ResourceClaimTemplate, and Namespace resources are logged at the metadata level to review successful and denied attempts to set the `AdminAccess`
+field.
+
+- Testing: E2E testing covers scenarios that successfully created ResourceClaims and ResourceClaimTemplates with the `AdminAccess` field in admin namespace and denied attempts in non-admin namespace.
 
 ###### What steps should be taken if SLOs are not being met to determine the problem?
 
-Will be considered for beta.
+If the SLO is not being met, some unauthorized access attempts are not being denied. To determine the problem, review the namespace labels to verify that they are correct.
 
 ## Implementation History
 
 - Kubernetes 1.33: Alpha version of the KEP.
+- Kubernetes 1.34: Beta version of the KEP.
 
 ## Drawbacks
 
keps/sig-auth/5018-dra-adminaccess/kep.yaml

Lines changed: 2 additions & 2 deletions
@@ -17,12 +17,12 @@ see-also:
   - "/keps/sig-node/4381-dra-structured-parameters"
 
 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta
 
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.33"
+latest-milestone: "v1.34"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
