keps/sig-storage/3751-volume-attributes-class/README.md
69 additions & 26 deletions
@@ -82,15 +82,15 @@
 Items marked with (R) are required *prior to targeting to a milestone / release*.
 
 - [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
-- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [X] (R) KEP approvers have approved the KEP status as `implementable`
 - [X] (R) Design details are appropriately documented
-- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
-- [ ] e2e Tests for all Beta API Operations (endpoints)
-- [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
-- [ ] (R) Graduation criteria is in place
+- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [X] e2e Tests for all Beta API Operations (endpoints) - [dashboard](https://testgrid.k8s.io/presubmits-kubernetes-nonblocking#pull-kubernetes-e2e-kind-beta-features&width=90&include-filter-by-regex=VolumeAttributesClass)
+- [X] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [X] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [X] (R) Graduation criteria is in place
 - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Production readiness review completed
+- [X] (R) Production readiness review completed
 - [ ] (R) Production readiness review approved
 - [X] "Implementation History" section is up-to-date for milestone
 - [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
@@ -240,9 +240,9 @@ The CSI create request will be extended to add mutable parameters. A new Control
 
 #### Default VolumeAttributesClass
 
-A default VolumeAttributesClass can be specified for the Kubernetes cluster. This default VolumeAttributesClass is then used to dynamically provision storage for PersistentVolumeClaims that do not require any specific VolumeAttributesClass. A cluster admin can use annotation to manage default VolumeAttributesClass. The default VolumeAttributesClass has an annotation volumeattributesclass.kubernetes.io/is-default-class set to true. Any other value or absence of the annotation is interpreted as false.
+For GA, the VolumeAttributesClass feature does not support a default VolumeAttributesClass. This is because there is already a natural default for VolumeAttributesClass: no VolumeAttributesClass associated with the PersistentVolumeClaim. Furthermore, with a default, there would be added overhead for cluster operators in making sure a cluster's default StorageClass and default VolumeAttributesClass are compatible.
 
-Note: For Kubernetes versions ≤ v1.31, the VolumeAttributesClass feature does not support a default VolumeAttributesClass. This is because there is already a natural default for VolumeAttributesClass: no VolumeAttributesClass associated with the PersistentVolumeClaim. Furthermore, with a default, there would be added overhead for cluster operators in making sure a cluster's default StorageClass and default VolumeAttributesClass are compatible. Use-cases and support for Default VolumeAttributesClass will be re-evaluated during this feature's beta in Kubernetes v1.31.
+In a future design, a default VolumeAttributesClass could be specified for the Kubernetes cluster. This default VolumeAttributesClass would then be used to dynamically provision storage for PersistentVolumeClaims that do not require any specific VolumeAttributesClass.
 
 #### Pre-provisioned Volume
 
@@ -695,10 +695,10 @@ VolumeAttributesClass parameters can be considered as best-effort parameters, th
 
 * Basic unit tests for performance and quota system.
 * API conformance tests
-* E2E tests with happy tests in the [K8s storage framework](https://github.com/kubernetes/kubernetes/tree/master/test/e2e/storage/testsuites) for different drivers testing
-* E2E tests using mock driver to cause failure on create, update and recovering cases
-- VAC protection controller with large lists of PVCs (2000)
-- Creating a large amount of PVCs (2000) using the same VolumeAttributesClass
+- VAC protection controller with large lists of PVCs (500)
+- Creating a large amount of PVCs (500) using the same VolumeAttributesClass
+
+Stress test by EBS CSI Driver:
+
+Scale test: concurrently modified 500 volumes via VAC. Patched 5 PVCs per second with a new VAC and waited for all volumes to be modified.
+
+Tested against a resizer built from kubernetes-csi/external-resizer#487 and EBS CSI Driver v1.44 on EKS 1.33. Used the aws-ebs-csi-driver/hack/ebs-scale-test modification test.
+
+Resizer CPU peaked at 0.33 cores and memory at 43 MB.
+
+Additional metrics are in this gist: https://gist.github.com/AndrewSirenko/24ab0e9b3e66d279b3406e4e26264835
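The stress test above paces its patches at 5 PVCs per second across 500 volumes. As a rough illustration of that pacing (not the actual `ebs-scale-test` tooling), the Python sketch below computes when each patch would be issued. The function name `patch_schedule` and the `pvc-N` naming pattern are hypothetical.

```python
from typing import Iterator, Tuple


def patch_schedule(total_pvcs: int, patches_per_second: float) -> Iterator[Tuple[str, float]]:
    """Yield (pvc_name, offset_seconds) pairs for an evenly paced patch run.

    The PVC names are placeholders; a real run would list PVCs from the cluster.
    """
    if total_pvcs < 0 or patches_per_second <= 0:
        raise ValueError("total_pvcs must be >= 0 and patches_per_second > 0")
    interval = 1.0 / patches_per_second
    for i in range(total_pvcs):
        yield f"pvc-{i}", i * interval


# Numbers taken from the stress test description: 500 volumes, 5 patches per second.
schedule = list(patch_schedule(500, 5))
last_name, last_offset = schedule[-1]
```

At these numbers the last patch is issued just under 100 seconds after the first; the time the volumes need to finish modifying comes on top of that.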
@@ -767,7 +779,15 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
 
 - Beta in 1.31: Since this feature is an extension of the external-resizer/external-provisioner usage flow, we are going to move this to beta with enhanced e2e and test coverage. Test cases are covered in sessions above: ``e2e tests``, ``Integration tests`` etc. Controllers will handle VolumeAttributesClass feature gates being on by default, but beta API itself being disabled on cluster by default.
 - Involve 3 different CSI drivers to participate in testing
-- Stress test before GA
+- Rollback and stress test before GA
+- All functionality completed
+- All security enforcement completed
+- All monitoring requirements completed
+- All testing requirements completed
+- All known pre-release issues and gaps resolved
+- Resource quota with scope implementation soaked in 1.33 release
+- Added rollback support based on feedback
+- Bug fix for [event emission for a nonexistent VAC](https://github.com/kubernetes-csi/external-resizer/issues/427)
 
 #### GA
 
@@ -837,14 +857,37 @@ If the feature is rolled out partially on API servers, there will be no impact o
 be processed as if the feature is disabled, the external-provisioner/external-resizer is not acting on the event created yet - that means nothing happens and PVC
 will not be changed with the iops/throughput until external-provisioner/external-resizer is deployed.
 
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+In general, rollout / rollback should not fail since the feature needs to be explicitly set in the PVC.
 
 ###### What specific metrics should inform a rollback?
 
 A metric `controller_modify_volume_errors_total` will indicate a problem with the feature.
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
-TODO Upgrade and rollback will be tested when the feature gate will change to beta.
+Tested in Beta:
+
+1. Enable both the feature flag and the beta API in the api-server, create a PVC with VAC1, and then modify it to VAC2.
+2. Turn off the feature flag first, and then try to modify the PVC back to VAC1. This fails with the error:
+
+```
+The PersistentVolumeClaim "test-pvc" is invalid: spec.volumeAttributesClassName: Forbidden: update
+is forbidden when the VolumeAttributesClass feature gate is disabled
+```
+The pod and volume are both up and running.
+
+3. Turn off the beta API. This time ``kubectl get vac`` fails with the error:
+```
+Error from server (NotFound): Unable to list "storage.k8s.io/v1beta1, Resource=volumeattributesclasses":
+the server could not find the requested resource
+```
+
+The pod and volume are both up and running.
+
+4. Turn on both the feature flag and the beta API in the api-server again. ``kubectl get vac`` shows the VACs again. Change the PVC back to VAC1; the modification is applied.
+
 
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
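Steps 1, 2 and 4 of the upgrade and rollback test above all change the PVC's `spec.volumeAttributesClassName`. A minimal sketch of a JSON merge patch body that such a change could use is shown below. The helper name `vac_patch` is hypothetical, and `vac1` / `vac2` are illustrative lowercase stand-ins for the VAC1 and VAC2 objects named in the test.

```python
import json


def vac_patch(vac_name: str) -> str:
    """Return a JSON merge patch body that sets spec.volumeAttributesClassName."""
    if not vac_name:
        raise ValueError("vac_name must be non-empty")
    return json.dumps({"spec": {"volumeAttributesClassName": vac_name}})


# Illustrative names only: stand-ins for VAC1 and VAC2 from the test steps.
to_vac2 = vac_patch("vac2")
back_to_vac1 = vac_patch("vac1")
```

A body like this could be passed to `kubectl patch pvc test-pvc --type merge -p '<body>'`. With the feature gate disabled, the API server rejects the update with the Forbidden error quoted in step 2.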
@@ -923,7 +966,7 @@ previous answers based on experience in the field.
 
 ###### Will enabling / using this feature result in any new API calls?
 
-Yes. The VAC protection controller will be expensive because it needs to LIST PVCs/PVs but the call volume should be low.
+Yes. The VAC protection controller will be expensive because it needs to LIST PVCs/PVs, but the call rate should be low because it is triggered only when a user changes the VAC in the PVC.
 
 - API call type: PATCH PVC
 - estimated throughput: low, only once for PVCs that have
@@ -954,7 +997,7 @@ Using this feature may result in non-negligible increase of resource usage IF cu
 - external-resizer CPU and memory will see a non-negligible increase if users increased the number of concurrent operations via the `--workers` flag. We follow the strategy of sharing that limit between `ControllerExpandVolume` and `ControllerModifyVolume` RPCs, similar to how external-provisioner functions.
 - The API-Server may see a spike of CPU when processing relevant changes.
 
-Stress tests will determine increase in resource usage at varying amounts of concurrent volume modifications.
+Stress tests will determine increase in resource usage at varying amounts of concurrent volume modifications.
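The shared `--workers` limit mentioned above means that expand and modify operations draw from a single concurrency budget. The asyncio sketch below only illustrates that idea in Python; it is not the external-resizer's implementation. `run_ops`, the operation labels and the sleep standing in for an RPC are all hypothetical.

```python
import asyncio
from typing import List


async def run_ops(ops: List[str], workers: int) -> int:
    """Run all ops under one shared limit and return the peak concurrency seen."""
    limit = asyncio.Semaphore(workers)
    in_flight = 0
    peak = 0

    async def run_one(op: str) -> None:
        nonlocal in_flight, peak
        async with limit:  # expand and modify ops compete for the same slots
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the CSI RPC
            in_flight -= 1

    await asyncio.gather(*(run_one(op) for op in ops))
    return peak


# Hypothetical workload: 10 expand and 10 modify operations sharing 4 workers.
ops = ["expand"] * 10 + ["modify"] * 10
peak = asyncio.run(run_ops(ops, workers=4))
```

Because both operation types share one semaphore, raising the limit increases concurrency for both at once, which is why resource usage scales with the configured worker count.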