feat(components): AWS SageMaker - Add optional parameter to allow training component to accept parameters related to Debugger (#4283)

* Implemented debugger for training component with sample pipeline, unit tests, and integration test

* Addressed PR feedback, refactored utils.py, made the sample pipeline more succinct, and removed hardcoding from integration tests

* Added a default parameter to the sample pipeline, fixed grammar in the sample README, refactored _utils.py to use f-strings, and fixed the error message offset

* Removed aws secret lines

* Terminate debug rules when terminating a training job, including when termination is requested after the training job has completed; added integration tests for stop_debug_rules, updated the train and sample READMEs, renamed the sample pipeline, removed TensorBoard, and updated the SageMaker SDK to version 2.1.0.

* Removed extra files, cleaned integration test

* Changed integration test to use sample debugger pipeline

* No longer terminate processing jobs created from debug rules; fixed other small issues

* Removed debug from pipeline definition, removed extra line, removed unused function

* Changelog and image tag updates
dstnluong committed Aug 19, 2020
1 parent f50fc0d commit 3ebd075
Showing 28 changed files with 493 additions and 88 deletions.
8 changes: 7 additions & 1 deletion components/aws/sagemaker/Changelog.md
@@ -4,6 +4,12 @@ The version of the AWS SageMaker Components is determined by the docker image ta
Repository: https://hub.docker.com/repository/docker/amazon/aws-sagemaker-kfp-components

---------------------------------------------
**Change log for version 0.8.0**
- Add functionality to configure SageMaker Debugger for Training component

> Pull requests : [#4283](https://github.com/kubeflow/pipelines/pull/4283/)
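The new Debugger inputs on the training component pass raw `DebugHookConfig` / `DebugRuleConfigurations` structures through to SageMaker's `CreateTrainingJob` API. A minimal sketch of plausible values (the bucket name, evaluator image placeholder, and save interval below are hypothetical, not taken from this commit):

```python
# Hypothetical example values; field names follow the CreateTrainingJob API's
# DebugHookConfig and DebugRuleConfigurations structures.
debug_hook_config = {
    "S3OutputPath": "s3://my-bucket/debug-tensors",  # hypothetical bucket
    "CollectionConfigurations": [
        {"CollectionName": "losses", "CollectionParameters": {"save_interval": "10"}},
    ],
}

debug_rule_config = [
    {
        "RuleConfigurationName": "LossNotDecreasing",
        "RuleEvaluatorImage": "<rule-evaluator-image-uri>",  # account/region specific
        "RuleParameters": {"rule_to_invoke": "LossNotDecreasing"},
    },
]
```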

**Change log for version 0.7.0**
- Add functionality to assume role when sending SageMaker requests

@@ -29,7 +35,7 @@ Repository: https://hub.docker.com/repository/docker/amazon/aws-sagemaker-kfp-c

**Change log for version 0.5.1**
-- Update region support for GroudTruth component
+- Update region support for GroundTruth component
- Make `label_category_config` an optional parameter in Ground Truth component

> Pull requests : [#3932](https://github.com/kubeflow/pipelines/pull/3932)
4 changes: 2 additions & 2 deletions components/aws/sagemaker/Dockerfile
@@ -23,8 +23,8 @@ RUN yum update -y \
unzip

RUN pip3 install \
-boto3==1.13.19 \
-sagemaker==1.54.0 \
+boto3==1.14.12 \
+sagemaker==2.1.0 \
pathlib2==2.3.5 \
pyyaml==3.12

8 changes: 4 additions & 4 deletions components/aws/sagemaker/THIRD-PARTY-LICENSES.txt
@@ -1,7 +1,7 @@
-** Amazon SageMaker Components for Kubeflow Pipelines; version 0.7.0 --
+** Amazon SageMaker Components for Kubeflow Pipelines; version 0.8.0 --
https://github.com/kubeflow/pipelines/tree/master/components/aws/sagemaker
Copyright 2019-2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
-** boto3; version 1.12.33 -- https://github.com/boto/boto3/
+** boto3; version 1.14.12 -- https://github.com/boto/boto3/
Copyright 2013-2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
** botocore; version 1.15.33 -- https://github.com/boto/botocore
Botocore
@@ -12,7 +12,7 @@ https://importlib-metadata.readthedocs.io/en/latest/
** s3transfer; version 0.3.3 -- https://github.com/boto/s3transfer/
s3transfer
Copyright 2016 Amazon.com, Inc. or its affiliates. All Rights Reserved.
-** sagemaker; version 1.54.0 -- https://aws.amazon.com/sagemaker/
+** sagemaker; version 2.1.0 -- https://aws.amazon.com/sagemaker/
Amazon SageMaker Python SDK
Copyright 2017-2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
** smdebug-rulesconfig; version 0.1.2 --
@@ -982,4 +982,4 @@ OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <http://unlicense.org/>
2 changes: 1 addition & 1 deletion components/aws/sagemaker/batch_transform/component.yaml
@@ -102,7 +102,7 @@ outputs:
- {name: output_location, description: 'S3 URI of the transform job results.'}
implementation:
container:
-image: amazon/aws-sagemaker-kfp-components:0.7.0
+image: amazon/aws-sagemaker-kfp-components:0.8.0
command: ['python3']
args: [
batch_transform.py,
138 changes: 116 additions & 22 deletions components/aws/sagemaker/common/_utils.py
@@ -22,6 +22,7 @@
import re
import json
from pathlib2 import Path
+from enum import Enum, auto

import boto3
from boto3.session import Session
@@ -36,7 +37,7 @@
from botocore.exceptions import ClientError
from botocore.session import Session as BotocoreSession

-from sagemaker.amazon.amazon_estimator import get_image_uri
+from sagemaker.image_uris import retrieve

import logging
logging.getLogger().setLevel(logging.INFO)
@@ -99,6 +100,9 @@ def get_component_version():
    return component_version


+def print_log_header(header_len, title=""):
+    logging.info(f"{title:*^{header_len}}")

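The new `print_log_header` helper centers a title inside a run of `*` using an f-string format spec. A standalone sketch of the same formatting (returning the string instead of logging it, so it can be inspected):

```python
# Same format spec as print_log_header: center `title` in `header_len`
# characters, padding with '*'.
def format_header(header_len, title=""):
    return f"{title:*^{header_len}}"

banner = format_header(10, "Logs")   # "Logs" centered in 10 chars
separator = format_header(6)         # no title: six asterisks
```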
def print_logs_for_job(cw_client, log_grp, job_name):
    """Gets the CloudWatch logs for SageMaker jobs"""
    try:
@@ -206,12 +210,12 @@ def create_training_job_request(args):
    # TODO: Adjust this implementation to account for custom algorithm resources names that are the same as built-in algorithm names
    algo_name = args['algorithm_name'].lower().strip()
    if algo_name in built_in_algos.keys():
-        request['AlgorithmSpecification']['TrainingImage'] = get_image_uri(args['region'], built_in_algos[algo_name])
+        request['AlgorithmSpecification']['TrainingImage'] = retrieve(built_in_algos[algo_name], args['region'])
        request['AlgorithmSpecification'].pop('AlgorithmName')
        logging.warning('Algorithm name is found as an Amazon built-in algorithm. Using built-in algorithm.')
    # Just to give the user more leeway for built-in algorithm name inputs
    elif algo_name in built_in_algos.values():
-        request['AlgorithmSpecification']['TrainingImage'] = get_image_uri(args['region'], algo_name)
+        request['AlgorithmSpecification']['TrainingImage'] = retrieve(algo_name, args['region'])
        request['AlgorithmSpecification'].pop('AlgorithmName')
        logging.warning('Algorithm name is found as an Amazon built-in algorithm. Using built-in algorithm.')
    else:
@@ -258,6 +262,17 @@ def create_training_job_request(args):

    enable_spot_instance_support(request, args)

+    ### Update DebugHookConfig and DebugRuleConfigurations
+    if args['debug_hook_config']:
+        request['DebugHookConfig'] = args['debug_hook_config']
+    else:
+        request.pop('DebugHookConfig')
+
+    if args['debug_rule_config']:
+        request['DebugRuleConfigurations'] = args['debug_rule_config']
+    else:
+        request.pop('DebugRuleConfigurations')
+
    ### Update tags
    for key, val in args['tags'].items():
        request['Tags'].append({'Key': key, 'Value': val})
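The debug-config branches above either forward the user-supplied structure or strip the template default from the `CreateTrainingJob` request. A self-contained sketch of that behavior (the bucket value is hypothetical):

```python
# Mirror of the DebugHookConfig/DebugRuleConfigurations handling above:
# an empty user value removes the template default from the request.
request = {
    "TrainingJobName": "example-job",
    "DebugHookConfig": {},          # template default
    "DebugRuleConfigurations": [],  # template default
}

args = {
    "debug_hook_config": {"S3OutputPath": "s3://my-bucket/debug-output"},  # hypothetical
    "debug_rule_config": [],  # user supplied no rules
}

if args["debug_hook_config"]:
    request["DebugHookConfig"] = args["debug_hook_config"]
else:
    request.pop("DebugHookConfig")

if args["debug_rule_config"]:
    request["DebugRuleConfigurations"] = args["debug_rule_config"]
else:
    request.pop("DebugRuleConfigurations")
```

Result: the hook config is forwarded, while the empty rule list is dropped so SageMaker never sees the key.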
@@ -282,18 +297,94 @@ def create_training_job(client, args):


def wait_for_training_job(client, training_job_name, poll_interval=30):
-    while(True):
-        response = client.describe_training_job(TrainingJobName=training_job_name)
-        status = response['TrainingJobStatus']
-        if status == 'Completed':
-            logging.info("Training job ended with status: " + status)
-            break
-        if status == 'Failed':
-            message = response['FailureReason']
-            logging.info('Training failed with the following error: {}'.format(message))
-            raise Exception('Training job failed')
-        logging.info("Training job is still in status: " + status)
-        time.sleep(poll_interval)
+    while(True):
+        response = client.describe_training_job(TrainingJobName=training_job_name)
+        status = response['TrainingJobStatus']
+        if status == 'Completed':
+            logging.info("Training job ended with status: " + status)
+            break
+        if status == 'Failed':
+            message = response['FailureReason']
+            logging.info(f'Training failed with the following error: {message}')
+            raise Exception('Training job failed')
+        logging.info("Training job is still in status: " + status)
+        time.sleep(poll_interval)


+def wait_for_debug_rules(client, training_job_name, poll_interval=30):
+    first_poll = True
+    while(True):
+        response = client.describe_training_job(TrainingJobName=training_job_name)
+        if 'DebugRuleEvaluationStatuses' not in response:
+            break
+        if first_poll:
+            logging.info("Polling for status of all debug rules:")
+            first_poll = False
+        if DebugRulesStatus.from_describe(response) != DebugRulesStatus.INPROGRESS:
+            logging.info("Rules have ended with status:\n")
+            print_debug_rule_status(response, True)
+            break
+        print_debug_rule_status(response)
+        time.sleep(poll_interval)


+class DebugRulesStatus(Enum):
+    COMPLETED = auto()
+    ERRORED = auto()
+    INPROGRESS = auto()
+
+    @classmethod
+    def from_describe(self, response):
+        has_error = False
+        for debug_rule in response['DebugRuleEvaluationStatuses']:
+            if debug_rule['RuleEvaluationStatus'] == "Error":
+                has_error = True
+            if debug_rule['RuleEvaluationStatus'] == "InProgress":
+                return DebugRulesStatus.INPROGRESS
+        if has_error:
+            return DebugRulesStatus.ERRORED
+        else:
+            return DebugRulesStatus.COMPLETED
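A self-contained rendition of the aggregation logic above: any rule still `InProgress` dominates the aggregate status, otherwise any `Error` marks the batch errored. (This sketch uses the conventional `cls` name for the classmethod parameter.)

```python
from enum import Enum, auto

class DebugRulesStatus(Enum):
    COMPLETED = auto()
    ERRORED = auto()
    INPROGRESS = auto()

    @classmethod
    def from_describe(cls, response):
        has_error = False
        for debug_rule in response["DebugRuleEvaluationStatuses"]:
            if debug_rule["RuleEvaluationStatus"] == "Error":
                has_error = True
            if debug_rule["RuleEvaluationStatus"] == "InProgress":
                return cls.INPROGRESS
        return cls.ERRORED if has_error else cls.COMPLETED

# An errored rule plus an in-progress rule still reports INPROGRESS,
# so polling continues until every rule has finished.
response = {"DebugRuleEvaluationStatuses": [
    {"RuleConfigurationName": "VanishingGradient", "RuleEvaluationStatus": "Error"},
    {"RuleConfigurationName": "LossNotDecreasing", "RuleEvaluationStatus": "InProgress"},
]}
status = DebugRulesStatus.from_describe(response)
```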


+def print_debug_rule_status(response, last_print=False):
+    """
+    Example of DebugRuleEvaluationStatuses:
+    response['DebugRuleEvaluationStatuses'] =
+        [{
+            "RuleConfigurationName": "VanishingGradient",
+            "RuleEvaluationStatus": "IssuesFound",
+            "StatusDetails": "There was an issue."
+        }]
+
+    If last_print is False:
+        INFO:root:  - LossNotDecreasing: InProgress
+        INFO:root:  - Overtraining: NoIssuesFound
+        ERROR:root: - CustomGradientRule: Error
+
+    If last_print is True:
+        INFO:root:  - LossNotDecreasing: IssuesFound
+        INFO:root:    - RuleEvaluationConditionMet: Evaluation of the rule LossNotDecreasing at step 10 resulted in the condition being met
+    """
+    for debug_rule in response['DebugRuleEvaluationStatuses']:
+        line_ending = "\n" if last_print else ""
+        if 'StatusDetails' in debug_rule:
+            status_details = f"- {debug_rule['StatusDetails'].rstrip()}{line_ending}"
+            line_ending = ""
+        else:
+            status_details = ""
+        rule_status = f"- {debug_rule['RuleConfigurationName']}: {debug_rule['RuleEvaluationStatus']}{line_ending}"
+        if debug_rule['RuleEvaluationStatus'] == "Error":
+            log = logging.error
+            status_padding = 1
+        else:
+            log = logging.info
+            status_padding = 2
+
+        log(f"{status_padding * ' '}{rule_status}")
+        if last_print and status_details:
+            log(f"{(status_padding + 2) * ' '}{status_details}")
+    print_log_header(50)


def get_model_artifacts_from_job(client, job_name):
@@ -314,10 +405,13 @@ def get_image_from_job(client, job_name):


def stop_training_job(client, job_name):
-    try:
-        client.stop_training_job(TrainingJobName=job_name)
-    except ClientError as e:
-        raise Exception(e.response['Error']['Message'])
+    response = client.describe_training_job(TrainingJobName=job_name)
+    if response["TrainingJobStatus"] == "InProgress":
+        try:
+            client.stop_training_job(TrainingJobName=job_name)
+            return job_name
+        except ClientError as e:
+            raise Exception(e.response['Error']['Message'])
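The updated `stop_training_job` now describes the job first and only issues a stop call while it is still `InProgress`. A simplified stub illustrating that guard (`StubClient` is a hypothetical stand-in for the boto3 SageMaker client; the real helper's error handling is omitted):

```python
# Hypothetical stub of the boto3 SageMaker client, recording stop calls.
class StubClient:
    def __init__(self, status):
        self._status = status
        self.stopped = False

    def describe_training_job(self, TrainingJobName):
        return {"TrainingJobStatus": self._status}

    def stop_training_job(self, TrainingJobName):
        self.stopped = True

def stop_training_job(client, job_name):
    # Only stop jobs that are still running; stopping a finished job
    # would fail, so the guard skips the call entirely.
    response = client.describe_training_job(TrainingJobName=job_name)
    if response["TrainingJobStatus"] == "InProgress":
        client.stop_training_job(TrainingJobName=job_name)
        return job_name

running = StubClient("InProgress")
completed = StubClient("Completed")
stop_training_job(running, "job-a")    # issues a stop call
stop_training_job(completed, "job-b")  # no stop call
```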


def create_model(client, args):
@@ -611,12 +705,12 @@ def create_hyperparameter_tuning_job_request(args):
    # TODO: Adjust this implementation to account for custom algorithm resources names that are the same as built-in algorithm names
    algo_name = args['algorithm_name'].lower().strip()
    if algo_name in built_in_algos.keys():
-        request['TrainingJobDefinition']['AlgorithmSpecification']['TrainingImage'] = get_image_uri(args['region'], built_in_algos[algo_name])
+        request['TrainingJobDefinition']['AlgorithmSpecification']['TrainingImage'] = retrieve(built_in_algos[algo_name], args['region'])
        request['TrainingJobDefinition']['AlgorithmSpecification'].pop('AlgorithmName')
        logging.warning('Algorithm name is found as an Amazon built-in algorithm. Using built-in algorithm.')
    # To give the user more leeway for built-in algorithm name inputs
    elif algo_name in built_in_algos.values():
-        request['TrainingJobDefinition']['AlgorithmSpecification']['TrainingImage'] = get_image_uri(args['region'], algo_name)
+        request['TrainingJobDefinition']['AlgorithmSpecification']['TrainingImage'] = retrieve(algo_name, args['region'])
        request['TrainingJobDefinition']['AlgorithmSpecification'].pop('AlgorithmName')
        logging.warning('Algorithm name is found as an Amazon built-in algorithm. Using built-in algorithm.')
    else:
@@ -1135,4 +1229,4 @@ def write_output(output_path, output_value, json_encode=False):
    write_value = json.dumps(output_value) if json_encode else output_value

    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    Path(output_path).write_text(write_value)
4 changes: 3 additions & 1 deletion components/aws/sagemaker/common/train.template.yaml
@@ -21,10 +21,12 @@ VpcConfig:
StoppingCondition:
MaxRuntimeInSeconds: 86400
MaxWaitTimeInSeconds: 86400
+DebugHookConfig: {}
+DebugRuleConfigurations: []
CheckpointConfig:
S3Uri: ''
LocalPath: ''
Tags: []
EnableNetworkIsolation: True
EnableInterContainerTrafficEncryption: False
EnableManagedSpotTraining: False
2 changes: 1 addition & 1 deletion components/aws/sagemaker/deploy/component.yaml
@@ -108,7 +108,7 @@ outputs:
- {name: endpoint_name, description: 'Endpoint name'}
implementation:
container:
-image: amazon/aws-sagemaker-kfp-components:0.7.0
+image: amazon/aws-sagemaker-kfp-components:0.8.0
command: ['python3']
args: [
deploy.py,
4 changes: 2 additions & 2 deletions components/aws/sagemaker/ground_truth/component.yaml
@@ -123,7 +123,7 @@ outputs:
- {name: active_learning_model_arn, description: 'The ARN for the most recent Amazon SageMaker model trained as part of automated data labeling.'}
implementation:
container:
-image: amazon/aws-sagemaker-kfp-components:0.7.0
+image: amazon/aws-sagemaker-kfp-components:0.8.0
command: ['python3']
args: [
ground_truth.py,
@@ -161,4 +161,4 @@ implementation:
--tags, {inputValue: tags},
--output_manifest_location_output_path, {outputPath: output_manifest_location},
--active_learning_model_arn_output_path, {outputPath: active_learning_model_arn}
]
4 changes: 2 additions & 2 deletions components/aws/sagemaker/hyperparameter_tuning/component.yaml
@@ -154,7 +154,7 @@ outputs:
description: 'The registry path of the Docker image that contains the training algorithm'
implementation:
container:
-image: amazon/aws-sagemaker-kfp-components:0.7.0
+image: amazon/aws-sagemaker-kfp-components:0.8.0
command: ['python3']
args: [
hyperparameter_tuning.py,
@@ -200,4 +200,4 @@ implementation:
--best_job_name_output_path, {outputPath: best_job_name},
--best_hyperparameters_output_path, {outputPath: best_hyperparameters},
--training_image_output_path, {outputPath: training_image}
]
4 changes: 2 additions & 2 deletions components/aws/sagemaker/model/component.yaml
@@ -63,7 +63,7 @@ outputs:
- {name: model_name, description: 'The model name SageMaker created'}
implementation:
container:
-image: amazon/aws-sagemaker-kfp-components:0.7.0
+image: amazon/aws-sagemaker-kfp-components:0.8.0
command: ['python3']
args: [
create_model.py,
@@ -83,4 +83,4 @@ implementation:
--network_isolation, {inputValue: network_isolation},
--tags, {inputValue: tags},
--model_name_output_path, {outputPath: model_name}
]
4 changes: 2 additions & 2 deletions components/aws/sagemaker/process/component.yaml
@@ -93,7 +93,7 @@ outputs:
- {name: output_artifacts, description: 'A dictionary containing the output S3 artifacts'}
implementation:
container:
-image: amazon/aws-sagemaker-kfp-components:0.7.0
+image: amazon/aws-sagemaker-kfp-components:0.8.0
command: ['python3']
args: [
process.py,
@@ -121,4 +121,4 @@ implementation:
--tags, {inputValue: tags},
--job_name_output_path, {outputPath: job_name},
--output_artifacts_output_path, {outputPath: output_artifacts}
]
4 changes: 2 additions & 2 deletions components/aws/sagemaker/tests/integration_tests/README.md
@@ -9,7 +9,7 @@

1. In the following Python script, change the bucket name and run the [`s3_sample_data_creator.py`](https://github.com/kubeflow/pipelines/tree/master/samples/contrib/aws-samples/mnist-kmeans-sagemaker#the-sample-dataset) to create an S3 bucket with the sample mnist dataset in the region where you want to run the tests.
2. To prepare the dataset for the SageMaker GroundTruth Component test, follow the steps in the [GroundTruth Sample README](https://github.com/kubeflow/pipelines/tree/master/samples/contrib/aws-samples/ground_truth_pipeline_demo#prep-the-dataset-label-categories-and-ui-template).
-3. To prepare the processing script for the SageMaker Processing Component tests, upload the `scripts/kmeans_preprocessing.py` script to your bucket. This can be done by replacing `<my-bucket> with your bucket name and running `aws s3 cp scripts/kmeans_preprocessing.py s3://<my-bucket>/mnist_kmeans_example/processing_code/kmeans_preprocessing.py`
+3. To prepare the processing script for the SageMaker Processing Component tests, upload the `scripts/kmeans_preprocessing.py` script to your bucket. This can be done by replacing `<my-bucket>` with your bucket name and running `aws s3 cp scripts/kmeans_preprocessing.py s3://<my-bucket>/mnist_kmeans_example/processing_code/kmeans_preprocessing.py`


## Steps to run integration tests
@@ -22,4 +22,4 @@
1. Navigate to the root of this github directory.
1. Run `docker build . -f components/aws/sagemaker/tests/integration_tests/Dockerfile -t amazon/integration_test`
1. Run the image, injecting your environment variable files:
1. Run `docker run --env-file components/aws/sagemaker/tests/integration_tests/.env amazon/integration_test`
@@ -13,9 +13,13 @@
        pytest.param(
            "resources/config/simple-mnist-training", marks=pytest.mark.canary_test
        ),
-        pytest.param("resources/config/fsx-mnist-training", marks=pytest.mark.fsx_test),
+        pytest.param(
+            "resources/config/fsx-mnist-training",
+            marks=pytest.mark.fsx_test
+        ),
        "resources/config/spot-sample-pipeline-training",
        "resources/config/assume-role-training",
+        "resources/config/xgboost-mnist-trainingjob-debugger"
    ],
)
def test_trainingjob(