CSI Driver Can Leak Access Points #1467

joelthompson · 2024-10-07T14:46:18Z

/kind bug

What happened?

We have the EFS CSI driver installed with dynamic provisioning and a reclaim policy of Delete.

A flood of requests came in, which eventually caused the CSI controller to fail its health check and be restarted. After the restart, the controller kept trying to provision access points (APs) for new PVCs, but every attempt, including each retry, failed with AccessPointAlreadyExists. Eventually the PVCs were deleted, and the APs were leaked. Most likely, the APs were created but never recorded as provisioned in Kubernetes, so after the restart the controller kept trying to recreate them.

What you expected to happen?

Upon restart, the controller should recognize that the Access Point was already created and "adopt" it. Alternatively, the controller should recognize that this isn't a retriable error and not retry. Finally, when the PVC is deleted, the AP should be deleted according to the reclaim policy. This shouldn't require enabling reuseAccessPoint.
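To make the "adopt" behaviour concrete, here is a minimal sketch against an in-memory stand-in for the EFS API. Everything in it (`AccessPoint`, `fakeCloud`, `FindByToken`, `provision`) is hypothetical and not part of the driver. It assumes the access point created before the restart can be looked up by the same client token used for the retry.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAlreadyExists mirrors the sentinel name used in pkg/cloud; redeclared so the sketch compiles alone.
var ErrAlreadyExists = errors.New("access point already exists")

// AccessPoint is a pared-down, hypothetical record of an EFS access point.
type AccessPoint struct {
	ID          string
	ClientToken string
}

// fakeCloud is an in-memory stand-in for the EFS API, keyed by client token (assumption).
type fakeCloud struct {
	byToken map[string]AccessPoint
	nextID  int
}

// Create fails with ErrAlreadyExists when the token was already used.
func (c *fakeCloud) Create(token string) (AccessPoint, error) {
	if _, ok := c.byToken[token]; ok {
		return AccessPoint{}, ErrAlreadyExists
	}
	c.nextID++
	ap := AccessPoint{ID: fmt.Sprintf("fsap-%d", c.nextID), ClientToken: token}
	c.byToken[token] = ap
	return ap, nil
}

// FindByToken is a hypothetical lookup of an existing access point by client token.
func (c *fakeCloud) FindByToken(token string) (AccessPoint, bool) {
	ap, ok := c.byToken[token]
	return ap, ok
}

// provision creates an access point, or adopts the one left behind by an earlier attempt.
func provision(c *fakeCloud, token string) (AccessPoint, error) {
	ap, err := c.Create(token)
	if errors.Is(err, ErrAlreadyExists) {
		if existing, ok := c.FindByToken(token); ok {
			return existing, nil
		}
	}
	return ap, err
}

func main() {
	c := &fakeCloud{byToken: map[string]AccessPoint{}}
	first, _ := c.Create("pvc-123")        // simulates creation just before the controller crashed
	second, err := provision(c, "pvc-123") // the retry after restart
	fmt.Println(first.ID == second.ID, err) // prints: true <nil>
}
```

With adoption in place, the retry returns the existing access point, so it could be recorded against the PV and later deleted under the reclaim policy.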

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?:

This code here:

aws-efs-csi-driver/pkg/cloud/cloud.go

Lines 190 to 195 in fe845cc

    if err != nil {
    	if isAccessDenied(err) {
    		return nil, ErrAccessDenied
    	}
    	return nil, fmt.Errorf("Failed to create access point: %v", err)
    }

does not check whether the error code is AccessPointAlreadyExists, in which case it should return ErrAlreadyExists. As a result, the code path in

aws-efs-csi-driver/pkg/driver/controller.go

Lines 346 to 348 in fe845cc

    if err == cloud.ErrAlreadyExists {
    	return nil, status.Errorf(codes.AlreadyExists, "Access Point already exists")
    }

is never hit.

Environment

Kubernetes version (use kubectl version): 1.29
Driver version: 2.0.5

Please also attach debug logs to help us better diagnose

Instructions to gather debug logs can be found here

jrakas-dev · 2024-10-09T20:14:47Z

Hi Joel, thanks for opening this issue. The team is looking into it. In the meantime, can you please provide any debug logs you might have so that we can better understand the issue? Thanks!

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 7, 2024
