CMEK Disk Creation Sometimes Fails with "disk already exists with same name" #558

@saad-ali

Problem

Provisioning of GCE PDs with CMEK enabled sometimes fails with "disk already exists with same name":

  Type     Reason                Age                From                                                                                                 Message
  ----     ------                ----               ----                                                                                                 -------
  Warning  ProvisioningFailed    14s (x2 over 15s)  pd.csi.storage.gke.io_gke-cluster-1-default-pool-4cede575-43h6_de91f0bc-68b9-451d-826a-43e526adc6a1  failed to provision volume with StorageClass "csi-gce-pd-cmek": rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Normal   ExternalProvisioning  8s (x3 over 16s)   persistentvolume-controller                                                                          waiting for a volume to be created, either by external provisioner "pd.csi.storage.gke.io" or manually created by system administrator
  Normal   Provisioning          4s (x5 over 16s)   pd.csi.storage.gke.io_gke-cluster-1-default-pool-4cede575-43h6_de91f0bc-68b9-451d-826a-43e526adc6a1  External provisioner is provisioning volume for claim "default/pvc-demo"
  Warning  ProvisioningFailed    4s (x3 over 14s)   pd.csi.storage.gke.io_gke-cluster-1-default-pool-4cede575-43h6_de91f0bc-68b9-451d-826a-43e526adc6a1  failed to provision volume with StorageClass "csi-gce-pd-cmek": rpc error: code = AlreadyExists desc = CreateVolume disk already exists with same name and is incompatible: actual disk KMS key name projects/test-project/locations/us-central1/keyRings/TestKeyRing/cryptoKeys/test-key/cryptoKeyVersions/8 did not match expected param projects/test-project/locations/us-central1/keyRings/TestKeyRing/cryptoKeys/test-key

Repro Steps

  1. Deploy GCE PD CSI Driver with csi-provisioner sidecar parameter --timeout=1s
    • This will make it easier to simulate a timeout.
  2. Create a StorageClass enabling CMEK encryption:
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: csi-gce-pd-cmek
      annotations:
        storageclass.kubernetes.io/is-default-class: "true"
    provisioner: pd.csi.storage.gke.io
    parameters:
      type: pd-standard
      disk-encryption-kms-key: projects/test-project/locations/us-central1/keyRings/TestKeyRing/cryptoKeys/test-key
    
  3. Provision a PVC using the StorageClass above.
    • It may take multiple tries to hit the timeout (but I was able to hit it on my first try once I reduced the timeout to 1s).

Proposed Fixes

There are two possible fixes:

  1. Increase the timeout for the external-provisioner sidecar. This won't fix the underlying issue, but it will reduce the likelihood of hitting it.
  2. Make sure the GCE PD CSI Driver CreateVolume call does not fail for CMEK disks when the operation is retried.
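
For the second fix, the error message above suggests why the retry fails: the existing disk reports its key with a trailing /cryptoKeyVersions/8 segment, while the StorageClass parameter names only the CryptoKey. A minimal sketch of a version-insensitive comparison follows, assuming that suffix is the only difference between the two names. The helpers stripKeyVersion and kmsKeyEqual are hypothetical names used for illustration; they are not taken from the driver's code.

```go
package main

import (
	"fmt"
	"strings"
)

// stripKeyVersion (hypothetical helper) removes a trailing
// "/cryptoKeyVersions/<version>" segment from a KMS key name, if present.
func stripKeyVersion(key string) string {
	const marker = "/cryptoKeyVersions/"
	if i := strings.LastIndex(key, marker); i >= 0 {
		return key[:i]
	}
	return key
}

// kmsKeyEqual (hypothetical helper) reports whether two KMS key names refer
// to the same CryptoKey, ignoring any key version suffix on either side.
func kmsKeyEqual(actual, expected string) bool {
	return stripKeyVersion(actual) == stripKeyVersion(expected)
}

func main() {
	// Values copied from the ProvisioningFailed event in this issue.
	actual := "projects/test-project/locations/us-central1/keyRings/TestKeyRing/cryptoKeys/test-key/cryptoKeyVersions/8"
	expected := "projects/test-project/locations/us-central1/keyRings/TestKeyRing/cryptoKeys/test-key"

	fmt.Println(kmsKeyEqual(actual, expected)) // prints: true
}
```

With a comparison like this in the existing-disk compatibility check, a retried CreateVolume would treat the already-created disk as compatible instead of returning AlreadyExists. Whether ignoring the key version is acceptable in all cases is a design decision for the driver.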

/assign
