- `cluster validate ResourceClaimTemplate and ResourceClaim for admin access`, [SIG Node](https://testgrid.k8s.io/sig-node-dynamic-resource-allocation#pull-kubernetes-kind-dra-all), [triage search](https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=admin%20access)
451
+
- `cluster DaemonSet with admin access`, [SIG Node](https://testgrid.k8s.io/sig-node-dynamic-resource-allocation#pull-kubernetes-kind-dra-all), [triage search](https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=admin%20access)
457
452
458
453
### Graduation Criteria
459
454
@@ -464,13 +459,23 @@ ResourceClaimTemplate and ResourceClaim for admin access
464
459
465
460
#### Beta
466
461
467
-
- Gather feedback
462
+
- Gather feedback from developers and surveys via implementations in the kubernetes-sigs/dra-example-driver: https://github.com/kubernetes-sigs/dra-example-driver/issues/97 and potentially other drivers
463
+
- Complete feature AdminAccess
468
464
- Additional tests are in Testgrid and linked in KEP
469
-
- Implementations in the kubernetes-sigs/dra-example-driver
465
+
- More rigorous forms of testing, e.g., downgrade tests and scalability tests
466
+
- All functionality completed
467
+
- All security enforcement completed
468
+
- All monitoring requirements completed
469
+
- All testing requirements completed
470
+
- All known pre-release issues and gaps resolved
471
+
**Note:** Beta criteria must include all functional, security, monitoring, and testing requirements along with resolving all issues and gaps identified
470
472
471
-
#### GA
472
473
474
+
#### GA
475
+
- 1 example of real-world usage
473
476
- Allowing time for feedback
477
+
- All issues and gaps identified as feedback during beta are resolved
478
+
**Note:** GA criteria must not include any functional, security, monitoring, or testing requirements. Those must be beta requirements.
474
479
475
480
### Upgrade / Downgrade Strategy
476
481
@@ -541,7 +546,12 @@ rollout. Similarly, consider large clusters and how enablement/disablement
541
546
will rollout across nodes.
542
547
-->
543
548
544
-
Will be considered for beta.
549
+
- kube-controller-manager: If the kube-controller-manager fails to create `ResourceClaim` objects from `ResourceClaimTemplate` due to misconfigurations or permission issues relating to `adminAccess`, then the associated Pods will remain in a pending state and won't be scheduled.
550
+
- kube-scheduler: Bugs in the scheduler might lead to Pods not being scheduled even when resources are available, or to Pods being scheduled that shouldn't be because of unmet `adminAccess` requirements. All of this should be covered by the generic scheduler backoff behavior. It will not affect running workloads.
551
+
- Workloads without `ResourceClaims` will remain unaffected because the `adminAccess` feature doesn't interact with them. The new code paths introduced for `adminAccess` only engage when `ResourceClaims` are present in the Pod specification.
552
+
- New Pods requiring `ResourceClaims` with `adminAccess` might remain unscheduled if the control plane components fail to process the claims correctly.
553
+
- Existing Pods continue to run unaffected because the specs of `ResourceClaim` and `ResourceClaimTemplate` are immutable, so the `adminAccess` field cannot be altered.
554
+
545
555
546
556
###### What specific metrics should inform a rollback?
547
557
@@ -557,8 +567,6 @@ the `scheduler_pending_pods` metric in the kube-scheduler or an increase in the
557
567
Further analysis by reviewing logs and pod events is needed to determine whether
558
568
errors are related to this feature.
559
569
560
-
Will provide more details for beta.
561
-
562
570
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
563
571
564
572
<!--
@@ -567,15 +575,19 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
567
575
are missing a bunch of machinery and tooling and can't do that now.
568
576
-->
569
577
570
-
Will be considered for beta.
578
+
This will be done manually before transition to beta by bringing up a cluster with kubeadm and changing the feature gate for individual components.
579
+
580
+
Manual upgrade of the control plane to a version with the feature enabled will be tested. Existing pods not using the feature remained running. Creation of new pods and ResourceClaims that do not use the feature should be unaffected.
581
+
582
+
Manual downgrade of the control plane to a version with the feature disabled was tested. Existing pods using the feature remained running. Creation of new pods and ResourceClaims that use the feature should be blocked.
571
583
572
584
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
573
585
574
586
<!--
575
587
Even if applying deprecation policies, they may still surprise some users.
576
588
-->
577
589
578
-
Will be considered for beta.
590
+
No.
579
591
580
592
### Monitoring Requirements
581
593
@@ -586,7 +598,7 @@ For GA, this section is required: approvers should be able to confirm the
586
598
previous answers based on experience in the field.
587
599
-->
588
600
589
-
Will be considered for beta.
601
+
Metrics in kube-controller-manager report the total number of ResourceClaims with adminAccess (`resourceclaim_controller_resource_claims_adminaccess`) and the number of allocated ResourceClaims with adminAccess (`resourceclaim_controller_allocated_resource_claims_adminaccess`).
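
As a rough illustration, the two metrics named above could be pulled out of a kube-controller-manager metrics dump like this. This is a minimal sketch: it assumes the standard Prometheus text exposition format without trailing timestamps, the helper name `read_adminaccess_metrics` is hypothetical, and the sample values are made up.

```python
# Hypothetical helper: extract the two adminAccess metrics from a
# Prometheus text-format dump (assumes no trailing timestamps).
ADMINACCESS_METRICS = (
    "resourceclaim_controller_resource_claims_adminaccess",
    "resourceclaim_controller_allocated_resource_claims_adminaccess",
)


def read_adminaccess_metrics(metrics_text):
    """Return {metric name: value summed over all label sets}."""
    found = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The sample value is the last whitespace-separated token.
        name_part, _, value = line.rpartition(" ")
        # Drop any "{label=...}" suffix to get the bare metric name.
        name = name_part.split("{", 1)[0]
        if name in ADMINACCESS_METRICS:
            found[name] = found.get(name, 0.0) + float(value)
    return found


# Made-up sample input, for illustration only.
SAMPLE_METRICS = """\
# TYPE scheduler_pending_pods gauge
scheduler_pending_pods{queue="active"} 0
resourceclaim_controller_resource_claims_adminaccess 3
resourceclaim_controller_allocated_resource_claims_adminaccess 2
"""

if __name__ == "__main__":
    for metric, value in read_adminaccess_metrics(SAMPLE_METRICS).items():
        print(metric, value)
```

Summing over label sets is a simplification; if the metrics carry labels, an operator may prefer to keep them separate.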
590
602
591
603
###### How can an operator determine if the feature is in use by workloads?
592
604
@@ -596,7 +608,9 @@ checking if there are objects with field X set) may be a last resort. Avoid
596
608
logs or events for this purpose.
597
609
-->
598
610
599
-
Will be considered for beta.
611
+
`.status.allocation.devices.results[*].adminAccess` will be set to true for a claim using adminAccess when needed by a pod.
612
+
613
+
Metrics in kube-controller-manager report the total number of ResourceClaims with adminAccess (`resourceclaim_controller_resource_claims_adminaccess`) and the number of allocated ResourceClaims with adminAccess (`resourceclaim_controller_allocated_resource_claims_adminaccess`).
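
The status field mentioned above can also be checked programmatically. The sketch below assumes a ResourceClaim has already been loaded as a plain dictionary (for example from `kubectl get resourceclaim -o json`); the function name `claim_uses_admin_access` and the sample objects are hypothetical.

```python
# Hypothetical helper: report whether an allocated ResourceClaim has
# adminAccess set on any entry of .status.allocation.devices.results.
def claim_uses_admin_access(claim):
    """Return True if any allocation result has adminAccess set to true."""
    # Each level may be missing or null on an unallocated claim.
    allocation = (claim.get("status") or {}).get("allocation") or {}
    results = (allocation.get("devices") or {}).get("results") or []
    return any(result.get("adminAccess") is True for result in results)


# Made-up sample objects, trimmed to the fields the check reads.
ADMIN_CLAIM = {
    "kind": "ResourceClaim",
    "status": {"allocation": {"devices": {"results": [{"adminAccess": True}]}}},
}
REGULAR_CLAIM = {
    "kind": "ResourceClaim",
    "status": {"allocation": {"devices": {"results": [{}]}}},
}
UNALLOCATED_CLAIM = {"kind": "ResourceClaim", "status": {}}

if __name__ == "__main__":
    for claim in (ADMIN_CLAIM, REGULAR_CLAIM, UNALLOCATED_CLAIM):
        print(claim_uses_admin_access(claim))
```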
600
614
601
615
###### How can someone using this feature know that it is working for their instance?
602
616
@@ -640,7 +654,7 @@ These goals will help you determine what you need to measure (SLIs) in the next
640
654
question.
641
655
-->
642
656
643
-
Will be considered for beta.
657
+
SLO: 100% of unauthorized access attempts are denied.
644
658
645
659
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
646
660
@@ -673,14 +687,17 @@ metric in scheduler will identify pods that are currently unschedulable because
673
687
of the `DynamicResources` plugin or a misconfiguration of the `AdminAccess`
674
688
field.
675
689
690
+
Audit Policy can be created to ensure all create operations on ResourceClaim, ResourceClaimTemplate, and Namespace resources are logged at the metadata level to review successful and denied attempts to set the `AdminAccess`
691
+
field.
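
One possible shape for such an audit policy is sketched below. It is an illustration, not the policy the KEP authors use: it assumes the `audit.k8s.io/v1` Policy format and the `resource.k8s.io` API group for the DRA types, and the function name `build_admin_access_audit_policy` is hypothetical. The policy is emitted as JSON, which is also valid YAML, so the output could be saved to a file and passed to kube-apiserver via `--audit-policy-file`.

```python
import json


# Hypothetical helper: build an audit policy that logs create requests
# for the resources named above at the Metadata level.
def build_admin_access_audit_policy():
    return {
        "apiVersion": "audit.k8s.io/v1",
        "kind": "Policy",
        "rules": [
            {
                "level": "Metadata",
                "verbs": ["create"],
                "resources": [
                    # DRA types (API group assumed to be resource.k8s.io).
                    {
                        "group": "resource.k8s.io",
                        "resources": ["resourceclaims", "resourceclaimtemplates"],
                    },
                    # Namespaces live in the core ("") API group.
                    {"group": "", "resources": ["namespaces"]},
                ],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_admin_access_audit_policy(), indent=2))
```

A single rule keeps the policy small. Requests that match no rule are not logged, so a real cluster policy would likely need additional rules for other resources.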
692
+
676
693
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
677
694
678
695
<!--
679
696
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
680
697
implementation difficulties, etc.).
681
698
-->
682
699
683
-
Will be considered for beta.
700
+
No
684
701
685
702
### Dependencies
686
703
@@ -705,7 +722,8 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
705
722
- Impact of its degraded performance or high-error rates on the feature:
706
723
-->
707
724
708
-
Will be considered for beta.
725
+
- The DynamicResourceAllocation feature gate must be enabled to create ResourceClaim and ResourceClaimTemplate objects. More details at [KEP-4381 - DRA Structured Parameters](https://github.com/kubernetes/enhancements/issues/4381)
726
+
- A third-party DRA driver is required. The driver decides how to interpret the AdminAccess field to give access to device-specific resources without allocating them.
709
727
710
728
### Scalability
711
729
@@ -755,7 +773,7 @@ details). For now, we leave it here.
755
773
756
774
###### How does this feature react if the API server and/or etcd is unavailable?
757
775
758
-
Will be considered for beta.
776
+
The Kubernetes control plane will be down, so no new ResourceClaim or ResourceClaimTemplate will be created.
759
777
760
778
###### What are other known failure modes?
761
779
@@ -772,15 +790,35 @@ For each of them, fill in the following information by copying the below templat
772
790
- Testing: Are there any tests for failure mode? If not, describe why.
773
791
-->
774
792
775
-
Will be considered for beta.
793
+
- kube-scheduler cannot allocate ResourceClaims with AdminAccess.
794
+
795
+
- Detection: When pods fail to get scheduled, kube-scheduler reports that
796
+
through events and pod status. For DRA, messages include "cannot allocate
797
+
all claims" (insufficient resources) and "ResourceClaim not created yet"
798
+
(user or kube-controller-manager haven't created the ResourceClaim yet).
metric will have pods counted under the "dynamicresources" plugin label.
802
+
803
+
To troubleshoot, "kubectl describe" can be used on (in this order) Pod
804
+
and ResourceClaim.
805
+
806
+
- Mitigations: When ResourceClaims or ResourceClaimTemplates with the `AdminAccess`
807
+
field don't get created, debugging should focus on the namespace labels. The kube-controller-manager logs should have more information.
808
+
809
+
- Diagnostics: Audit Policy can be created to ensure all create operations on ResourceClaim, ResourceClaimTemplate, and Namespace resources are logged at the metadata level to review successful and denied attempts to set the `AdminAccess`
810
+
field.
811
+
812
+
- Testing: E2E testing covers successful creation of ResourceClaims and ResourceClaimTemplates with the `AdminAccess` field in an admin namespace and denied attempts in a non-admin namespace.
776
813
777
814
###### What steps should be taken if SLOs are not being met to determine the problem?
778
815
779
-
Will be considered for beta.
816
+
If the SLO is not being met, some unauthorized access attempts are not being denied. To determine the problem, review the namespace labels and verify that they are correct.