PVC used by a job doesn't get resized after the pod of the job completed #175

@PhanLe1010

Summary:
We run external-resizer with a storage provider that supports only offline expansion (i.e., it advertises only PluginCapability_VolumeExpansion_OFFLINE). We deployed a job that uses a PVC provisioned by that storage provider. While the job pod is running, we resize the PVC by modifying spec.resources.requests.storage. As expected, the PVC cannot be resized while the pod is running. However, after the job pod completes, the PVC still doesn't get resized: external-resizer doesn't send the resize gRPC call to the storage provider. The PVC stays stuck in this state until we manually delete the job pod.

Reproduce steps:

  1. Deploy external-resizer together with a storage provider (we use Longhorn)

  2. Don't set the --handle-volume-inuse-error flag for external-resizer. This means that, by default, external-resizer handles volume-in-use errors in the resizer controller, link

  3. Deploy a job that uses a PVC as below. The job creates a pod that will sleep for 2 minutes and complete.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: test-job-pvc
      namespace: default
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: longhorn
      resources:
        requests:
          storage: 1Gi
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: test-job
      namespace: default
    spec:
      backoffLimit: 1
      template:
        metadata:
          name: test-job
        spec:
          containers:
            - name: test-job
              image: ubuntu:latest
              imagePullPolicy: IfNotPresent
              securityContext:
                privileged: true
              command: ["/bin/sh"]
              args: ["-c", "echo 'sleep for 120s then exit'; sleep 120"]
              volumeMounts:
                - mountPath: /data
                  name: vol
          restartPolicy: OnFailure
          volumes:
            - name: vol
              persistentVolumeClaim:
                claimName: test-job-pvc
    
  4. While the job pod is running, try to expand the PVC by editing spec.resources.requests.storage.

  5. Observe that the resize fails.

  6. Wait for the job pod to complete.

  7. Observe that the PVC stays stuck in its current state forever. It doesn't get resized because external-resizer never attempts the expansion gRPC call to the storage provider.

Expected Behavior:

Once the job pod has completed, the PVC should no longer be considered in use. Therefore, external-resizer should attempt the expansion gRPC call to the storage provider.
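One way to express "no longer in use" is as a predicate on the pod phase. The sketch below is only an illustration: PodPhase is a simplified stand-in for the v1.PodPhase type from k8s.io/api/core/v1 (the phase names match the Kubernetes ones), and isPodTerminated is a hypothetical helper, not an existing external-resizer function.

```go
package main

import "fmt"

// PodPhase is a simplified stand-in for v1.PodPhase (assumption: the real
// controller would use the type from k8s.io/api/core/v1 instead).
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// isPodTerminated is a hypothetical helper. It reports whether the pod has
// reached a terminal phase, i.e. its containers have stopped and will not be
// restarted, so the volumes it mounted can be treated as unused.
func isPodTerminated(phase PodPhase) bool {
	return phase == PodSucceeded || phase == PodFailed
}

func main() {
	for _, p := range []PodPhase{PodPending, PodRunning, PodSucceeded, PodFailed} {
		fmt.Printf("%s terminated=%v\n", p, isPodTerminated(p))
	}
}
```

With this definition, the job pod from the reproduction steps would count as terminated once its sleep command exits and the pod reaches the Succeeded phase.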

Proposal:
We dug into the source code and found that:

  1. This checker prevents external-resizer from retrying if the PVC has had InUseErrors before AND it is in the ctrl.usedPVCs map.
  2. The problem is that the PVC is never removed from the ctrl.usedPVCs map when a pod moves to the completed phase. The PVC is only removed when the pod is deleted, link
  3. We think the logic over here should be changed to handle the case where the pod becomes completed, i.e.:
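    func (ctrl *resizeController) updatePod(oldObj, newObj interface{}) {
        pod := parsePod(newObj)
        if pod == nil {
            return
        }

        // isPodTerminated is a proposed helper: true once the pod has finished.
        // A finished pod no longer uses its PVCs, so drop it from the in-use
        // map; any other pod is (re)recorded as a user of its PVCs.
        if isPodTerminated(pod) {
            ctrl.usedPVCs.removePod(pod)
        } else {
            ctrl.usedPVCs.addPod(pod)
        }
    }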
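To make the effect of this change concrete, here is a minimal, self-contained sketch of the three points above. Everything in it is an assumption for illustration: the pod struct, the inUseStore type (a guess at the shape of ctrl.usedPVCs), the shouldRetryResize gate, and the pod name test-job-abcde are hypothetical and are not the real external-resizer code.

```go
package main

import "fmt"

// pod is a simplified stand-in for *v1.Pod (assumption for this sketch).
type pod struct {
	name   string
	phase  string   // "Pending", "Running", "Succeeded" or "Failed"
	claims []string // PVC keys in "namespace/name" form
}

// isPodTerminated is the hypothetical helper used in the proposal.
func isPodTerminated(p *pod) bool {
	return p.phase == "Succeeded" || p.phase == "Failed"
}

// inUseStore maps a PVC key to the set of pod names using it.
// This is an assumed shape for ctrl.usedPVCs, not the real type.
type inUseStore map[string]map[string]bool

func (s inUseStore) addPod(p *pod) {
	for _, c := range p.claims {
		if s[c] == nil {
			s[c] = map[string]bool{}
		}
		s[c][p.name] = true
	}
}

func (s inUseStore) removePod(p *pod) {
	for _, c := range p.claims {
		delete(s[c], p.name) // deleting from a nil map is a no-op
		if len(s[c]) == 0 {
			delete(s, c)
		}
	}
}

func (s inUseStore) hasInUse(claim string) bool { return len(s[claim]) > 0 }

// updatePod mirrors the proposed handler: terminated pods are removed,
// all other pods are recorded as users of their PVCs.
func (s inUseStore) updatePod(p *pod) {
	if isPodTerminated(p) {
		s.removePod(p)
	} else {
		s.addPod(p)
	}
}

// shouldRetryResize models the checker from point 1: the resize is skipped
// only if the PVC hit an in-use error before AND is still recorded as used.
func shouldRetryResize(s inUseStore, claim string, hadInUseError bool) bool {
	return !(hadInUseError && s.hasInUse(claim))
}

func main() {
	store := inUseStore{}
	p := &pod{name: "test-job-abcde", phase: "Running", claims: []string{"default/test-job-pvc"}}

	store.updatePod(p)
	fmt.Println(shouldRetryResize(store, "default/test-job-pvc", true)) // false: pod still running

	p.phase = "Succeeded"
	store.updatePod(p)
	fmt.Println(shouldRetryResize(store, "default/test-job-pvc", true)) // true: pod completed
}
```

Without the isPodTerminated branch, the second updatePod call would re-add the completed pod, the gate would keep returning false, and the PVC would stay stuck as described in the reproduction steps.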
    

Env:

  • external-resizer v1.2.0
  • Longhorn v1.2.2

    Labels

    sig/storage: Categorizes an issue or PR as relevant to SIG Storage.