Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Generic Batching Framework + Batch DescribeVolume API requests #1819

Merged
merged 2 commits into from
Nov 9, 2023

Conversation

torredil
Copy link
Member

Is this a bug fix or adding new feature?

New feature.

What is this PR about? / Why do we need it?

This PR implements a generic batching framework, designed to facilitate the aggregation and execution of tasks -- most notably batched API requests for DescribeVolume -- used by the driver to poll for volume state and determine attachment status.

Today, the driver makes one DV API call per volume, however, DV can return up to 500 results. These changes allow for utilizing the API's capacity to handle up to 500 volumes at once, which significantly helps mitigate rate limit issues with DV and enhance performance when using the driver under heavy load scenarios as a result.

There are two parts to this PR, separated by commits:

  1. Generic Batching Framework.

  2. Batch DescribeVolume API Requests.

What testing is done?

  • make test
  • CI
  • Manual testing

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 31, 2023
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 31, 2023
@torredil
Copy link
Member Author

Kustomize CI job needs to be updated. Driver fails to enter a running state because --batching is passed into the controller container for driver version 1.24 (need to deploy driver using image built for CI instead).

@ConnorJC3 ConnorJC3 force-pushed the master branch 2 times, most recently from 24a8e7b to bddbe0b Compare November 1, 2023 18:08
@torredil
Copy link
Member Author

torredil commented Nov 2, 2023

Re Kustomize CI job needs to be updated -- we decided to run pull-aws-ebs-csi-driver-external-test-kustomize only on release branches. This is better than doing in-place modifications of the image repository / tag values.

An additional advantage in doing the above is that we implicitly validate image promotion for new releases moving forward since we pull from the official public ECR gallery.

@torredil
Copy link
Member Author

torredil commented Nov 2, 2023

/hold will retest once changes to test-infra have been merged.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 2, 2023
@torredil
Copy link
Member Author

torredil commented Nov 2, 2023

/retest

@ConnorJC3
Copy link
Contributor

/close

@k8s-ci-robot
Copy link
Contributor

@ConnorJC3: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ConnorJC3
Copy link
Contributor

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Nov 3, 2023
@k8s-ci-robot
Copy link
Contributor

@ConnorJC3: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 3, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 3, 2023
@torredil
Copy link
Member Author

torredil commented Nov 3, 2023

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 3, 2023
Copy link
Contributor

@ConnorJC3 ConnorJC3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs rebase, one comment, otherwise lgtm

pkg/cloud/cloud.go Outdated Show resolved Hide resolved
docs/options.md Outdated Show resolved Hide resolved
Signed-off-by: Eddie Torres <torredil@amazon.com>
Copy link
Contributor

@AndrewSirenko AndrewSirenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
(Minus the cloud_test merge conflict and 1 question)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2023
Signed-off-by: Eddie Torres <torredil@amazon.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2023
@ConnorJC3
Copy link
Contributor

/lgtm
(reapplying @AndrewSirenko's lgtm)

/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ConnorJC3

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 7, 2023
@torredil
Copy link
Member Author

torredil commented Nov 7, 2023

/retest

@torredil
Copy link
Member Author

torredil commented Nov 8, 2023

For peace of mind, investigated the resizing failure in pull-aws-ebs-csi-driver-e2e-single-az:

1 driver.go:125] "GRPC error" err="rpc error: code = Internal desc = Could not resize volume \"vol-003bbe6304464240d\": rpc error: code = Internal desc = Could not modify volume \"vol-003bbe6304464240d\": volume \"vol-003bbe6304464240d\" in OPTIMIZING state, cannot currently modify"

In conclusion, the error observed above is unrelated to this PR and occurs when a volume takes longer than 1 minute to transition from an optimizing state to completed. See a prior example here. This behavior can be easily reproduced outside of the driver by simply modifying a volume via the CLI or console.

Based on my observation, it is fairly common to see a volume remain in optimizing 99% for up to 5 minutes before transitioning to complete. However, the single AZ tests expect resizing to succeed within one minute:

Nov  7 17:54:03.198: INFO: Unexpected error: 
  <*errors.errorString | 0xc000a9b9a0>: 
    Gave up after waiting 1m0s for pv "pvc-af4167dc-d33b-4b78-a3f1-35d6e2856d2d" to complete resizing
  {
       s: "Gave up after waiting 1m0s for pv \"pvc-af4167dc-d33b-4b78-a3f1-35d6e2856d2d\" to complete resizing",
  }

Given that we recently introduced more tests that exercise this code path in single-az, we are more likely to observe the error moving forward if that timeout value is not relaxed.

/retest

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Nov 8, 2023

@torredil: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-aws-ebs-csi-driver-external-test-kustomize c1a894c link false /test pull-aws-ebs-csi-driver-external-test-kustomize

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@torredil
Copy link
Member Author

torredil commented Nov 8, 2023

ResizeOrModifyDisk needs to be revisited. There is a discrepancy in the logic.

According to cloud, a modification is considered done if the volume is an optimizing state:

func volumeModificationDone(state string) bool {
if state == ec2.VolumeModificationStateCompleted || state == ec2.VolumeModificationStateOptimizing {
return true
}

In the case where ResizeOrModifyDisk hasn't been previously called for a volume, we simply move on to check if the volume is in the desired state. If it is, ResizeOrModifyDisk succeeds even if the volume is still optimizing. However, if checkDesiredState returns an error or the sidecar times out, latestMod won't be nil the second time around and we error out if the volume is optimizing:

} else if state == ec2.VolumeModificationStateOptimizing {
return true, 0, fmt.Errorf("volume %q in OPTIMIZING state, cannot currently modify", volumeID)
}

@torredil
Copy link
Member Author

torredil commented Nov 9, 2023

/retest
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 9, 2023
@torredil
Copy link
Member Author

torredil commented Nov 9, 2023

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 9, 2023
@k8s-ci-robot k8s-ci-robot merged commit 2071f80 into kubernetes-sigs:master Nov 9, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants