
never delete PVs while the PVC exists #507

Closed
pohly opened this issue Oct 12, 2020 · 23 comments · Fixed by kubernetes/kubernetes#95909

@pohly
Contributor

pohly commented Oct 12, 2020

When doing performance testing, the following was observed:

  • the CSI driver created PVs faster than the volume controller could bind them to the PVCs that triggered provisioning
  • at some point, external-provisioner started deleting those PVs and then re-created them because of the PVCs

The external-provisioner should never delete PVs that it just created.

@pohly
Contributor Author

pohly commented Oct 12, 2020

A new PV from external-provisioner has:

metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: pmem-csi.intel.com
  creationTimestamp: "2020-10-12T12:53:12Z"
  finalizers:
  - kubernetes.io/pv-protection
...
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 4Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: pmem-csi-pvc-ext4
    namespace: default
    resourceVersion: "40083"
    uid: 48c22dce-9577-424d-9a76-5acc8351e7dd
  csi:
    driver: pmem-csi.intel.com
    fsType: ext4
    volumeAttributes:
      eraseafter: "true"
      name: pvc-48c22dce-9577-424d-9a76-5acc8351e7dd
      storage.kubernetes.io/csiProvisionerIdentity: 1602506219672-8081-pmem-csi.intel.com
    volumeHandle: pvc-48-0e580796bb56418b3eb076d7c80b49de6a2e1edee84e181c1a2fa909
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: pmem-csi.intel.com/node
          operator: In
          values:
          - pmem-csi-pmem-govm-worker2
  persistentVolumeReclaimPolicy: Delete
  storageClassName: pmem-csi-sc-ext4
  volumeMode: Filesystem
status:
  phase: Pending

Once it is bound, the status changes to Phase: "Bound".

So at first glance it looks like external-provisioner can and should check the phase.
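
A minimal sketch of such a check, assuming the PV object is at hand when external-provisioner decides whether to delete (shouldDelete is a hypothetical helper, not actual external-provisioner code):

import v1 "k8s.io/api/core/v1"

// shouldDelete treats only PVs that the PV controller has already marked
// Released (or Failed) as deletion candidates; a freshly provisioned PV is
// still in phase Pending and must be left alone.
func shouldDelete(pv *v1.PersistentVolume) bool {
	return pv.Status.Phase == v1.VolumeReleased || pv.Status.Phase == v1.VolumeFailed
}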

@jsafrane
Contributor

I'd like to know why the PV controller marks the PV for deletion first, since it could delete PVs provisioned by in-tree plugins too. Can you create a small reproducer, for example with the hostpath or mock drivers?

@pohly
Contributor Author

pohly commented Oct 13, 2020

You can use this modified clusterloader2 test: kubernetes/perf-tests#1530

It should be possible to run this with the hostpath driver (not tested, though).

The invocation that I used to create 5000 volumes on a 50-node cluster was:

cd perf-tests/clusterloader2
go run cmd/clusterloader.go -v=5 --report-dir=results --kubeconfig=... --provider=local --nodes=50 --testconfig=testing/experimental/storage/pod-startup/config.yaml --testoverrides=testing/experimental/storage/pod-startup/volume-types/persistentvolume/override.yaml --testoverrides=testing/experimental/storage/pod-startup/volume_creation/override.yaml --testoverrides=pmem-csi.yaml --testoverrides=my-config.yaml

With pmem-csi.yaml:

USE_CSI: true
CSI_STORAGE_CLASS: pmem-csi-sc-ext4
VOL_SIZE: 1Mi

And my-config.yaml:

HAVE_METRICS: false

# To fill up the entire cluster, we have to create 100 volumes per node with 10Gi each
PODS_PER_NODE: 100
VOL_SIZE: 10Gi

# Create/delete items (not just pods) at this rate.
POD_THROUGHPUT: 10000

# Avoid depending on pv controller by setting the volume.beta.kubernetes.io/storage-provisioner annotation ourselves.
PROVISIONER: pmem-csi.intel.com

For hostpath, you need to change PROVISIONER and CSI_STORAGE_CLASS. Volume size should be irrelevant.
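
For example (hostpath.csi.k8s.io is the driver name that shows up later in this thread; the storage class name is an assumption and depends on how the hostpath driver was deployed):

PROVISIONER: hostpath.csi.k8s.io
CSI_STORAGE_CLASS: csi-hostpath-sc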

You can start with a much smaller number of volumes by lowering the --nodes parameter and/or PODS_PER_NODE.

What I observed was:

  • PVC creation was relatively fast.
  • PVs also got created quickly.
  • But then kubectl get pv | wc -l reported fewer than the expected number of volumes, and that number kept changing.
  • A log of the CSI driver and external-provisioner confirmed that volumes were getting deleted.

@pohly
Contributor Author

pohly commented Oct 13, 2020

I'd like to know why the PV controller marks the PV for deletion first.

I ran the following Go program in parallel to the test:

/*
Copyright 2020 The Kubernetes Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package main

import (
	"context"
	"os"
	"time"

	"github.com/google/go-cmp/cmp"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/klog/v2"
	"sigs.k8s.io/yaml"
)

func toYAML(obj interface{}) string {
	out, err := yaml.Marshal(obj)
	if err != nil {
		klog.Fatalf("marshal %+q: %v", obj, err)
	}
	return string(out)
}

func main() {
	ctx := context.Background()

	// get the KUBECONFIG from env if specified (useful for local/debug cluster)
	kubeconfigEnv := os.Getenv("KUBECONFIG")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfigEnv)
	if err != nil {
		klog.Fatalf("Failed to create config from KUBECONFIG=%s: %v", kubeconfigEnv, err)
	}

	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		klog.Fatalf("Failed to create client: %v", err)
	}

	factory := informers.NewSharedInformerFactory(clientset, time.Hour)
	claimInformer := factory.Core().V1().PersistentVolumeClaims().Informer()
	volumeInformer := factory.Core().V1().PersistentVolumes().Informer()

	claimHandler := cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			klog.Infof("PVC added:\n%s\n", toYAML(obj))
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			klog.Infof("PVC updated:\n%s\n%s\n",
				toYAML(newObj),
				cmp.Diff(oldObj, newObj),
			)
		},
		DeleteFunc: func(obj interface{}) {
			klog.Infof("PVC deleted:\n%s\n", toYAML(obj))
		},
	}
	claimInformer.AddEventHandlerWithResyncPeriod(claimHandler, time.Hour)

	volumeHandler := cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			klog.Infof("PV added:\n%s\n", toYAML(obj))
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			klog.Infof("PV updated:\n%s\n%s\n",
				toYAML(newObj),
				cmp.Diff(oldObj, newObj),
			)
		},
		DeleteFunc: func(obj interface{}) {
			klog.Infof("PV deleted:\n%s\n", toYAML(obj))
		},
	}
	volumeInformer.AddEventHandlerWithResyncPeriod(volumeHandler, time.Hour)

	factory.Start(ctx.Done())
	// Block forever; an empty for {} loop would needlessly spin the CPU.
	select {}
}

It shows that a PV that gets deleted prematurely goes through these changes:

I1013 14:06:54.183598   99568 watch-pvs.go:78] PV added:
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: pmem-csi.intel.com
  creationTimestamp: "2020-10-13T12:06:23Z"
  finalizers:
  - kubernetes.io/pv-protection
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:pv.kubernetes.io/provisioned-by: {}
      f:spec:
        f:accessModes: {}
        f:capacity:
          .: {}
          f:storage: {}
        f:claimRef:
          .: {}
          f:apiVersion: {}
          f:kind: {}
          f:name: {}
          f:namespace: {}
          f:resourceVersion: {}
          f:uid: {}
        f:csi:
          .: {}
          f:driver: {}
          f:fsType: {}
          f:volumeAttributes:
            .: {}
            f:eraseafter: {}
            f:name: {}
            f:storage.kubernetes.io/csiProvisionerIdentity: {}
          f:volumeHandle: {}
        f:nodeAffinity:
          .: {}
          f:required:
            .: {}
            f:nodeSelectorTerms: {}
        f:persistentVolumeReclaimPolicy: {}
        f:storageClassName: {}
        f:volumeMode: {}
      f:status:
        f:phase: {}
    manager: csi-provisioner
    operation: Update
    time: "2020-10-13T12:06:23Z"
  name: pvc-4c784920-8b35-48b3-a630-291c8273c4d3
  resourceVersion: "6877870"
  selfLink: /api/v1/persistentvolumes/pvc-4c784920-8b35-48b3-a630-291c8273c4d3
  uid: 638f91ed-38dc-4471-b2af-0d739d0c13f8
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: vol-1
    namespace: test-v5vl6e-7
    resourceVersion: "6877816"
    uid: 4c784920-8b35-48b3-a630-291c8273c4d3
  csi:
    driver: pmem-csi.intel.com
    fsType: ext4
    volumeAttributes:
      eraseafter: "true"
      name: pvc-4c784920-8b35-48b3-a630-291c8273c4d3
      storage.kubernetes.io/csiProvisionerIdentity: 1602238400625-8081-pmem-csi.intel.com
    volumeHandle: pvc-4c-483b6881907d7b351064dbd5a801cdb894cc43150b0b5b70a015da58
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: pmem-csi.intel.com/node
          operator: In
          values:
          - pmem-node-42
  persistentVolumeReclaimPolicy: Delete
  storageClassName: pmem-csi-sc-ext4
  volumeMode: Filesystem
status:
  phase: Pending
...
I1013 14:06:54.221803   99568 watch-pvs.go:81] PV updated:
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: pmem-csi.intel.com
  creationTimestamp: "2020-10-13T12:06:23Z"
  finalizers:
  - kubernetes.io/pv-protection
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:pv.kubernetes.io/provisioned-by: {}
      f:spec:
        f:accessModes: {}
        f:capacity:
          .: {}
          f:storage: {}
        f:claimRef:
          .: {}
          f:apiVersion: {}
          f:kind: {}
          f:name: {}
          f:namespace: {}
          f:resourceVersion: {}
          f:uid: {}
        f:csi:
          .: {}
          f:driver: {}
          f:fsType: {}
          f:volumeAttributes:
            .: {}
            f:eraseafter: {}
            f:name: {}
            f:storage.kubernetes.io/csiProvisionerIdentity: {}
          f:volumeHandle: {}
        f:nodeAffinity:
          .: {}
          f:required:
            .: {}
            f:nodeSelectorTerms: {}
        f:persistentVolumeReclaimPolicy: {}
        f:storageClassName: {}
        f:volumeMode: {}
    manager: csi-provisioner
    operation: Update
    time: "2020-10-13T12:06:23Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:phase: {}
    manager: kube-controller-manager
    operation: Update
    time: "2020-10-13T12:06:23Z"
  name: pvc-4c784920-8b35-48b3-a630-291c8273c4d3
  resourceVersion: "6877881"
  selfLink: /api/v1/persistentvolumes/pvc-4c784920-8b35-48b3-a630-291c8273c4d3
  uid: 638f91ed-38dc-4471-b2af-0d739d0c13f8
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: vol-1
    namespace: test-v5vl6e-7
    resourceVersion: "6877816"
    uid: 4c784920-8b35-48b3-a630-291c8273c4d3
  csi:
    driver: pmem-csi.intel.com
    fsType: ext4
    volumeAttributes:
      eraseafter: "true"
      name: pvc-4c784920-8b35-48b3-a630-291c8273c4d3
      storage.kubernetes.io/csiProvisionerIdentity: 1602238400625-8081-pmem-csi.intel.com
    volumeHandle: pvc-4c-483b6881907d7b351064dbd5a801cdb894cc43150b0b5b70a015da58
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: pmem-csi.intel.com/node
          operator: In
          values:
          - pmem-node-42
  persistentVolumeReclaimPolicy: Delete
  storageClassName: pmem-csi-sc-ext4
  volumeMode: Filesystem
status:
  phase: Released

  &v1.PersistentVolume{
  	TypeMeta: v1.TypeMeta{},
  	ObjectMeta: v1.ObjectMeta{
  		... // 3 identical fields
  		SelfLink:          "/api/v1/persistentvolumes/pvc-4c784920-8b35-48b3-a630-291c8273c4d3",
  		UID:               "638f91ed-38dc-4471-b2af-0d739d0c13f8",
- 		ResourceVersion:   "6877870",
+ 		ResourceVersion:   "6877881",
  		Generation:        0,
  		CreationTimestamp: v1.Time{Time: s"2020-10-13 14:06:23 +0200 CEST"},
  		... // 5 identical fields
  		Finalizers:  []string{"kubernetes.io/pv-protection"},
  		ClusterName: "",
  		ManagedFields: []v1.ManagedFieldsEntry{
  			{
  				... // 3 identical fields
  				Time:       s"2020-10-13 14:06:23 +0200 CEST",
  				FieldsType: "FieldsV1",
  				FieldsV1: &v1.FieldsV1{
  					Raw: bytes.Join({
  						... // 462 identical bytes
  						`,"f:required":{".":{},"f:nodeSelectorTerms":{}}},"f:persistentVo`,
  						`lumeReclaimPolicy":{},"f:storageClassName":{},"f:volumeMode":{}}`,
- 						`,"f:status":{"f:phase":{}}`,
  						"}",
  					}, ""),
  				},
  			},
+ 			{
+ 				Manager:    "kube-controller-manager",
+ 				Operation:  "Update",
+ 				APIVersion: "v1",
+ 				Time:       s"2020-10-13 14:06:23 +0200 CEST",
+ 				FieldsType: "FieldsV1",
+ 				FieldsV1:   s`{"f:status":{"f:phase":{}}}`,
+ 			},
  		},
  	},
  	Spec: v1.PersistentVolumeSpec{Capacity: v1.ResourceList{s"storage": {i: resource.int64Amount{value: 10737418240}, s: "10Gi", Format: "BinarySI"}}, PersistentVolumeSource: v1.PersistentVolumeSource{CSI: &v1.CSIPersistentVolumeSource{Driver: "pmem-csi.intel.com", VolumeHandle: "pvc-4c-483b6881907d7b351064dbd5a801cdb894cc43150b0b5b70a015da58", FSType: "ext4", VolumeAttributes: map[string]string{"eraseafter": "true", "name": "pvc-4c784920-8b35-48b3-a630-291c8273c4d3", "storage.kubernetes.io/csiProvisionerIdentity": "1602238400625-8081-pmem-csi.intel.com"}}}, AccessModes: []v1.PersistentVolumeAccessMode{"ReadWriteOnce"}, ClaimRef: &v1.ObjectReference{Kind: "PersistentVolumeClaim", Namespace: "test-v5vl6e-7", Name: "vol-1", UID: "4c784920-8b35-48b3-a630-291c8273c4d3", APIVersion: "v1", ResourceVersion: "6877816"}, PersistentVolumeReclaimPolicy: "Delete", StorageClassName: "pmem-csi-sc-ext4", VolumeMode: &"Filesystem", NodeAffinity: &v1.VolumeNodeAffinity{Required: &v1.NodeSelector{NodeSelectorTerms: []v1.NodeSelectorTerm{{MatchExpressions: []v1.NodeSelectorRequirement{{Key: "pmem-csi.intel.com/node", Operator: "In", Values: []string{"pmem-node-42"}}}}}}}},
  	Status: v1.PersistentVolumeStatus{
- 		Phase:   "Pending",
+ 		Phase:   "Released",
  		Message: "",
  		Reason:  "",
  	},
  }

@pohly
Contributor Author

pohly commented Oct 13, 2020

So yes, the status goes from "Pending" to "Released" without ever getting bound to the PVC.

@pohly
Contributor Author

pohly commented Oct 13, 2020

The PVC exists while this is happening.

Could it be that kube-controller-manager relies on a PVC informer cache to determine whether the PVC exists? Then it might see a PV before the PVC cache gets updated and incorrectly conclude that the PV was released.

@msau42
Collaborator

msau42 commented Oct 13, 2020

I seem to recall this issue and remember that we fixed it by doing a live lookup: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/persistentvolume/pv_controller.go#L585.
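
For context, that live lookup boils down to the following pattern: if the PVC is missing from the informer cache, it is fetched once more directly from the API server before the PV is treated as released. A paraphrased sketch (not the verbatim pv_controller code; lister and client stand in for the controller's informer-backed lister and API client):

import (
	"context"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// findClaim consults the (possibly stale) informer cache first and only falls
// back to a live GET against the API server when the cache says the claim
// does not exist.
func findClaim(ctx context.Context, lister corelisters.PersistentVolumeClaimLister,
	client kubernetes.Interface, ns, name string) (*v1.PersistentVolumeClaim, error) {
	claim, err := lister.PersistentVolumeClaims(ns).Get(name)
	if apierrors.IsNotFound(err) {
		// The cache may simply lag behind; double-check before concluding
		// that the claim is really gone.
		return client.CoreV1().PersistentVolumeClaims(ns).Get(ctx, name, metav1.GetOptions{})
	}
	return claim, err
}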

@pohly
Contributor Author

pohly commented Oct 14, 2020

That fix went into Kubernetes 1.12. I'm running 1.18, so this must be something else.

@pohly
Contributor Author

pohly commented Oct 14, 2020

Trying this with the CSI hostpath driver crashes the driver:

I1014 10:06:04.859602       1 server.go:118] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"topology.hostpath.csi/node":"127.0.0.1"}}],"requisite":[{"segments":{"topology.hostpath.csi/node":"127.0.0.1"}}]},"capacity_range":{"required_bytes":8589934592},"name":"pvc-0aa61cd5-330b-48eb-8b6f-73206674e38b","volume_capabilities":[{"AccessType":{"Mount":{}},"access_mode":{"mode":1}}]}
fatal error: concurrent map iteration and map write

goroutine 213 [running]:
runtime.throw(0x9be076, 0x26)
	/go/pkg/csiprow.XXXXogDpOo/go-1.13.3/src/runtime/panic.go:774 +0x72 fp=0xc0012fd510 sp=0xc0012fd4e0 pc=0x42da82
runtime.mapiternext(0xc0012fd5e0)
	/go/pkg/csiprow.XXXXogDpOo/go-1.13.3/src/runtime/map.go:858 +0x579 fp=0xc0012fd598 sp=0xc0012fd510 pc=0x40f009
github.com/kubernetes-csi/csi-driver-host-path/pkg/hostpath.getVolumeByName(0xc000248090, 0x28, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/workspace/pkg/hostpath/hostpath.go:184 +0xb2 fp=0xc0012fd720 sp=0xc0012fd598 pc=0x870852
github.com/kubernetes-csi/csi-driver-host-path/pkg/hostpath.(*controllerServer).CreateVolume(0xc000180990, 0xa68aa0, 0xc000180ff0, 0xc0002442a0, 0xc000180990, 0xd9895e, 0xc0002441c0)
	/workspace/pkg/hostpath/controllerserver.go:126 +0x190 fp=0xc0012fd9d0 sp=0xc0012fd720 pc=0x86a4d0
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler.func1(0xa68aa0, 0xc000180ff0, 0x983320, 0xc0002442a0, 0xc0005afaa0, 0x1, 0x1, 0x30)
	/go/pkg/mod/github.com/container-storage-interface/spec@v1.2.0/lib/go/csi/csi.pb.go:5146 +0x86 fp=0xc0012fda18 sp=0xc0012fd9d0 pc=0x84da36
github.com/kubernetes-csi/csi-driver-host-path/pkg/hostpath.logGRPC(0xa68aa0, 0xc000180ff0, 0x983320, 0xc0002442a0, 0xc00000c1e0, 0xc00000c200, 0xc000233b30, 0x56dc78, 0x959240, 0xc000180ff0)
	/workspace/pkg/hostpath/server.go:119 +0x155 fp=0xc0012fdad0 sp=0xc0012fda18 pc=0x878c65
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler(0x9709c0, 0xc000180990, 0xa68aa0, 0xc000180ff0, 0xc0000f0240, 0x9d2430, 0xa68aa0, 0xc000180ff0, 0xc0002360a0, 0x92)
	/go/pkg/mod/github.com/container-storage-interface/spec@v1.2.0/lib/go/csi/csi.pb.go:5148 +0x14b fp=0xc0012fdb40 sp=0xc0012fdad0 pc=0x84a4eb
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0000f2480, 0xa6d420, 0xc0004eaf00, 0xc000219200, 0xc000180c00, 0xdef1a0, 0x0, 0x0, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:1024 +0x4f4 fp=0xc0012fde18 sp=0xc0012fdb40 pc=0x81f5f4
google.golang.org/grpc.(*Server).handleStream(0xc0000f2480, 0xa6d420, 0xc0004eaf00, 0xc000219200, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:1313 +0xd97 fp=0xc0012fdf48 sp=0xc0012fde18 pc=0x823317
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc0003a7a50, 0xc0000f2480, 0xa6d420, 0xc0004eaf00, 0xc000219200)
	/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:722 +0xbb fp=0xc0012fdfb8 sp=0xc0012fdf48 pc=0x82fa3b
runtime.goexit()
	/go/pkg/csiprow.XXXXogDpOo/go-1.13.3/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc0012fdfc0 sp=0xc0012fdfb8 pc=0x459ea1
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:720 +0xa1

@pohly
Contributor Author

pohly commented Oct 14, 2020

I thought we had serialized gRPC call processing, but apparently not.

@pohly
Contributor Author

pohly commented Oct 14, 2020

No, we only discussed it: kubernetes-csi/csi-driver-host-path#72 (comment)
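
One way such serialization could be done (a sketch only, not what the hostpath driver actually does; serializingInterceptor is a hypothetical name) is a unary gRPC interceptor that takes a global lock, so concurrent CreateVolume calls cannot race on the driver's in-memory volume map:

import (
	"context"
	"sync"

	"google.golang.org/grpc"
)

var serialize sync.Mutex

// serializingInterceptor funnels every unary gRPC call through a single mutex.
func serializingInterceptor(ctx context.Context, req interface{},
	info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	serialize.Lock()
	defer serialize.Unlock()
	return handler(ctx, req)
}

// Wired up when constructing the server:
//   grpc.NewServer(grpc.UnaryInterceptor(serializingInterceptor))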

@msau42
Collaborator

msau42 commented Oct 14, 2020 via email

@pohly
Contributor Author

pohly commented Oct 14, 2020

Can you investigate why that fix isn't working for csi?

Yes, I'm looking into it.

So far I have not been able to reproduce it with a cluster brought up with local-up.sh, so I'll have to go back to the larger 1.18.8 cluster and see whether logs show anything about this transition to "phase: Released".

@pohly
Contributor Author

pohly commented Oct 14, 2020

So far I have not been able to reproduce it with a cluster brought up with local-up.sh

I have now managed to reproduce it there: the key point was that external-provisioner had to run with QPS and burst values high enough to let it create PVs very quickly.

One example PV:

I1014 20:56:52.143441 1021339 pv_controller_base.go:188] enqueued "pvc-0319043d-8e24-4849-8f81-e01a75cb5d19" for sync
I1014 20:56:52.143457 1021339 pv_controller_base.go:414] volumeWorker[pvc-0319043d-8e24-4849-8f81-e01a75cb5d19]
I1014 20:56:52.143466 1021339 pv_controller_base.go:607] storeObjectUpdate: adding volume "pvc-0319043d-8e24-4849-8f81-e01a75cb5d19", version 1123
I1014 20:56:52.143485 1021339 pv_controller.go:537] synchronizing PersistentVolume[pvc-0319043d-8e24-4849-8f81-e01a75cb5d19]: phase: Pending, bound to: "test-jznovn-1/vol-210 (uid: 0319043d-8e24-4849-8f81-e01a75cb5d19)", boundByController: false
I1014 20:56:52.143499 1021339 pv_controller.go:572] synchronizing PersistentVolume[pvc-0319043d-8e24-4849-8f81-e01a75cb5d19]: volume is bound to claim test-jznovn-1/vol-210
I1014 20:56:52.143511 1021339 pv_controller.go:605] synchronizing PersistentVolume[pvc-0319043d-8e24-4849-8f81-e01a75cb5d19]: claim test-jznovn-1/vol-210 not found
I1014 20:56:52.143519 1021339 pv_controller.go:633] volume "pvc-0319043d-8e24-4849-8f81-e01a75cb5d19" is released and reclaim policy "Delete" will be executed
I1014 20:56:52.143537 1021339 pv_controller.go:838] updating PersistentVolume[pvc-0319043d-8e24-4849-8f81-e01a75cb5d19]: set phase Released

So it is the code that should have done the live lookup... but it didn't, because of the check at https://github.com/kubernetes/kubernetes/blob/7ff41c1ba0f38920bb0bc09e70dc70e2205be155/pkg/controller/volume/persistentvolume/pv_controller.go#L584

"pv.kubernetes.io/bound-by-controller" is not set by external-provisioner. Should it do that?

There is only "pv.kubernetes.io/provisioned-by: hostpath.csi.k8s.io" in the newly created PVs.

@pohly
Contributor Author

pohly commented Oct 14, 2020

"pv.kubernetes.io/bound-by-controller" is not set by external-provisioner. Should it do that?

According to the description of that annotation, no.

That makes the code in Kubernetes incorrect.

@pohly
Contributor Author

pohly commented Oct 14, 2020

Removing metav1.HasAnnotation(volume.ObjectMeta, pvutil.AnnBoundByController) from the if check fixes the problem for me.
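
Paraphrased (not quoted verbatim from pv_controller.go), the guard around that live lookup and the proposed change look roughly like this:

// Before: the API-server double-check only runs for PVs that the PV
// controller itself bound to the claim.
if !found && metav1.HasAnnotation(volume.ObjectMeta, pvutil.AnnBoundByController) {
	// re-check the PVC in the informer cache and, if still missing,
	// directly against the API server
}

// After (the change suggested above): run the double-check whenever the
// cached PVC is missing, regardless of who established the binding.
if !found {
	// same double-check
}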

@jsafrane
Contributor

Removing metav1.HasAnnotation(volume.ObjectMeta, pvutil.AnnBoundByController) from the if check fixes the problem for me.

I think this could actually be the right fix. At this point in the PV controller we know that the PV is bound by UID, which is a clear signal that the PVC has existed at some point in time (otherwise pv.spec.claimRef.UID would be empty), so AnnBoundByController should not matter.

@jsafrane
Contributor

Wait. How is it possible that a PVC is dynamically provisioned and the PV controller does not know about it (i.e. does not have the PVC in its cache)? Who puts the volume.beta.kubernetes.io/storage-provisioner annotation there? Is it the scheduler? @pohly, do you use late binding in your test?

@pohly
Contributor Author

pohly commented Oct 15, 2020

Who puts volume.beta.kubernetes.io/storage-provisioner annotation there?

The modified clusterloader2 test does that, if configured to do so; see kubernetes/perf-tests#1530

I added that after I noticed that I couldn't trigger volume creation as quickly as I wanted when relying on the PV controller. With this shortcut, the test then becomes more of a microbenchmark for external-provisioner + CSI driver, but that's exactly what I wanted. Right now it's still waiting for the PV controller to bind the PVs, but I can replace that with "wait for n PVs to exist" and then the PV controller will be completely out of the critical path.

We should debate whether this is actually something that creators of a PVC are allowed to do. It works, so even if it is unintentional, it's probably too late to prevent it via PVC validation.

IMHO it is useful and should be supported. If we agree, then I'll add it to the code that creates PVCs for generic ephemeral volumes, which will save one roundtrip through the API server and the PV controller when creating those volumes.

do you use late binding in your test?

No, immediate binding.
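
For reference, the shortcut discussed above amounts to creating the PVC with the provisioner annotation already set. A rough client-go sketch (createAnnotatedPVC and the concrete names and sizes are placeholders):

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createAnnotatedPVC creates a PVC that already carries the
// storage-provisioner annotation, so external-provisioner can act on it
// without waiting for the PV controller to add that annotation.
func createAnnotatedPVC(ctx context.Context, client kubernetes.Interface) error {
	scName := "pmem-csi-sc-ext4"
	pvc := &v1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "vol-1",
			Namespace: "default",
			Annotations: map[string]string{
				"volume.beta.kubernetes.io/storage-provisioner": "pmem-csi.intel.com",
			},
		},
		Spec: v1.PersistentVolumeClaimSpec{
			AccessModes:      []v1.PersistentVolumeAccessMode{v1.ReadWriteOnce},
			StorageClassName: &scName,
			Resources: v1.ResourceRequirements{
				Requests: v1.ResourceList{v1.ResourceStorage: resource.MustParse("10Gi")},
			},
		},
	}
	_, err := client.CoreV1().PersistentVolumeClaims(pvc.Namespace).Create(ctx, pvc, metav1.CreateOptions{})
	return err
}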

@msau42
Collaborator

msau42 commented Oct 15, 2020 via email

@msau42
Collaborator

msau42 commented Oct 15, 2020 via email

@pohly
Contributor Author

pohly commented Oct 15, 2020

I think it's odd to benchmark the provisioner sidecar but still involve the latter half of pv controller.

Agreed. I'll change that.

If we want to benchmark both pv controller and sidecar, then we should include the whole sequence.

Also agreed.

If we want to benchmark just the sidecar then we could do that with a unit or integration test.

Here I disagree. I want to benchmark the scenario where I have a large number of external-provisioner instances (one per node) and they all cooperate on provisioning a high number of volumes. Doing it with a single instance is just the baseline that I need to compare against. With the clusterloader2 test, I can scale out on a real cluster and it also gives me the apiserver metrics data, so I think it is a better tool for the job than a unit or integration test.

@pohly
Contributor Author

pohly commented Nov 3, 2020

We decided against backporting the Kubernetes change unless there's a good reason why someone needs it.

When doing scale testing, kube-controller-manager's QPS and burst settings can be set high enough that bypassing it is not necessary.
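
For example, kube-controller-manager's --kube-api-qps and --kube-api-burst flags control how fast it may talk to the API server; the values below are purely illustrative:

kube-controller-manager --kube-api-qps=100 --kube-api-burst=200 ...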
