AWS disk modification wait method updated to return faster

Summary: We saw that increasing disk size on AWS can take up to a couple of hours. The main reason is the optimization that AWS performs on the disk behind the scenes. [[ This page | https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-volume-modifications.html ]] says this operation can take up to 24 hours. In this diff, the AWS disk modification wait method is changed so that it returns as soon as the volume modification state is "optimizing" rather than waiting for "completed". Note that the added disk size is accessible while the volume is in the "optimizing" state, but the volume may not deliver its optimized performance until the state is "completed".

Test Plan: The resizeNode operation, which uses the method changed in this diff, was called with different values while the sample app was running.

Reviewers: arnav, sanketh

Reviewed By: sanketh

Subscribers: jenkins-bot, yugaware

Differential Revision: https://phabricator.dev.yugabyte.com/D12414
ymahajan · Jul 28, 2021 · d1f8fc0
1 parent 652d3cd
commit d1f8fc0
Showing 1 changed file with 19 additions and 14 deletions.
diff --git a/managed/devops/opscli/ybops/cloud/aws/utils.py b/managed/devops/opscli/ybops/cloud/aws/utils.py
@@ -1077,27 +1077,32 @@ def _update_dns_record_set(hosted_zone_id, domain_name_prefix, ip_list, action):
 
 
 def _wait_for_disk_modifications(ec2_client, vol_ids):
-    num_vols_completed = 0
+    # This function returns as soon as the volume state is optimizing, not completed.
     num_vols_to_modify = len(vol_ids)
-    # It should retry for a 6 hour limit
-    retry_num = int((6 * 3600) / AbstractCloud.SSH_WAIT_SECONDS) + 1
-    # Loop till the progress is at 100 or the limit is reached
-    while retry_num != 0:
+    # It should retry for a 1 hour time limit.
+    retry_num = int((1 * 3600) / AbstractCloud.SSH_WAIT_SECONDS) + 1
+    # Loop till all volumes are modified or the limit is reached.
+    while retry_num > 0:
+        num_vols_modified = 0
         response = ec2_client.describe_volumes_modifications(VolumeIds=vol_ids)
         # The response format can be found here:
         # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2.html#EC2.Client.describe_volumes_modifications
-        for entry in response['VolumesModifications']:
-            if entry['Progress'] == 100:
-                if entry['ModificationState'] != 'completed':
-                    raise YBOpsRuntimeError(("Disk {} could not be modified.").format(
-                        entry['VolumeId']))
-                else:
-                    num_vols_completed += 1
+        for entry in response["VolumesModifications"]:
+            if entry["ModificationState"] == "failed":
+                raise YBOpsRuntimeError(("Modification of disk {} failed.").format(
+                    entry['VolumeId']))
+
+            if entry["ModificationState"] == "optimizing" or \
+                    entry["ModificationState"] == "completed":
+                # Modifying completed.
+                num_vols_modified += 1
+
         # This means all volumes have completed modification.
-        if num_vols_completed == num_vols_to_modify:
+        if num_vols_modified == num_vols_to_modify:
             break
+
         time.sleep(AbstractCloud.SSH_WAIT_SECONDS)
         retry_num -= 1
 
-    if retry_num == 0:
+    if retry_num <= 0:
         raise YBOpsRuntimeError("wait_for_disk_modifications failed. Retry limit reached.")
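
The polling logic in the diff above can be exercised without AWS access. The following is a minimal sketch, not the code in utils.py: FakeEc2Client, DONE_STATES, and the wait_seconds and max_retries parameters are hypothetical names introduced for illustration, and RuntimeError stands in for YBOpsRuntimeError. It assumes only that the client exposes describe_volumes_modifications(VolumeIds=...) and returns a "VolumesModifications" list whose entries carry "VolumeId" and "ModificationState", as in the diff.

```python
import time

# States in which the added capacity is already usable (per the summary above).
DONE_STATES = {"optimizing", "completed"}


class FakeEc2Client:
    """Hypothetical stand-in for a boto3 EC2 client; replays canned states."""

    def __init__(self, states_per_poll):
        # states_per_poll: list of {volume_id: state} dicts, one per poll.
        self._polls = list(states_per_poll)
        self.calls = 0

    def describe_volumes_modifications(self, VolumeIds):
        # Keep returning the last snapshot once the canned polls run out.
        snapshot = self._polls[min(self.calls, len(self._polls) - 1)]
        self.calls += 1
        return {"VolumesModifications": [
            {"VolumeId": vol_id, "ModificationState": snapshot[vol_id]}
            for vol_id in VolumeIds
        ]}


def wait_for_disk_modifications(ec2_client, vol_ids, wait_seconds=0, max_retries=3):
    """Return once every volume is optimizing or completed; raise otherwise."""
    for _ in range(max_retries):
        response = ec2_client.describe_volumes_modifications(VolumeIds=vol_ids)
        entries = response["VolumesModifications"]
        for entry in entries:
            if entry["ModificationState"] == "failed":
                raise RuntimeError(
                    "Modification of disk {} failed.".format(entry["VolumeId"]))
        # Count volumes per poll so a stale count never carries over.
        num_modified = sum(e["ModificationState"] in DONE_STATES for e in entries)
        if num_modified == len(vol_ids):
            return
        time.sleep(wait_seconds)
    raise RuntimeError("wait_for_disk_modifications failed. Retry limit reached.")


if __name__ == "__main__":
    client = FakeEc2Client([
        {"vol-1": "modifying", "vol-2": "optimizing"},
        {"vol-1": "optimizing", "vol-2": "completed"},
    ])
    wait_for_disk_modifications(client, ["vol-1", "vol-2"])
    print("returned after", client.calls, "polls")
```

In this sketch the count is recomputed on every poll, which mirrors the move of num_vols_modified inside the loop in the diff: a volume counted on one pass is not counted again on the next.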