feat(components): AWS SageMaker - Add optional parameter to allow training component to accept parameters related to Debugger (#4283)

* Implemented debugger for training component with sample pipeline, unit tests, and integration test

* Addressed PR feedback, refactored utils.py, made the sample pipeline more succinct, and removed hardcoding from integration tests

* Added a default parameter to the sample pipeline, fixed grammar in the sample README, refactored _utils.py to use f-strings, and fixed the error message offset

* Removed aws secret lines

* Terminate debug rules when terminating a training job, including when termination is requested after the training job has completed; added integration tests for stop_debug_rules, updated the train and sample READMEs, renamed the sample pipeline, removed TensorBoard, and updated the SageMaker SDK to version 2.1.0.

* Removed extra files, cleaned integration test

* Changed integration test to use sample debugger pipeline

* No longer terminate processing jobs created from debug rules; fixed other small issues

* Removed debug from pipeline definition, removed extra line, removed unused function

* Changelog and image tag updates
dstnluong committed Aug 19, 2020
1 parent f50fc0d commit 3ebd075
Showing 28 changed files with 493 additions and 88 deletions.
8 changes: 7 additions & 1 deletion components/aws/sagemaker/Changelog.md
@@ -4,6 +4,12 @@ The version of the AWS SageMaker Components is determined by the docker image ta
Repository: https://hub.docker.com/repository/docker/amazon/aws-sagemaker-kfp-components

---------------------------------------------
**Change log for version 0.8.0**
- Add functionality to configure SageMaker Debugger for Training component

> Pull requests : [#4283](https://github.com/kubeflow/pipelines/pull/4283/)
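The new Debugger inputs on the training component pass raw `DebugHookConfig` / `DebugRuleConfigurations` structures through to SageMaker's `CreateTrainingJob` API. A minimal sketch of plausible values (the bucket name, evaluator image placeholder, and save interval below are hypothetical, not taken from this commit):

```python
# Hypothetical example values; field names follow the CreateTrainingJob API's
# DebugHookConfig and DebugRuleConfigurations structures.
debug_hook_config = {
    "S3OutputPath": "s3://my-bucket/debug-tensors",  # hypothetical bucket
    "CollectionConfigurations": [
        {"CollectionName": "losses", "CollectionParameters": {"save_interval": "10"}},
    ],
}

debug_rule_config = [
    {
        "RuleConfigurationName": "LossNotDecreasing",
        "RuleEvaluatorImage": "<rule-evaluator-image-uri>",  # account/region specific
        "RuleParameters": {"rule_to_invoke": "LossNotDecreasing"},
    },
]
```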

**Change log for version 0.7.0**
- Add functionality to assume role when sending SageMaker requests

@@ -29,7 +35,7 @@ Repository: https://hub.docker.com/repository/docker/amazon/aws-sagemaker-kfp-c

**Change log for version 0.5.1**
-- Update region support for GroudTruth component
+- Update region support for GroundTruth component
- Make `label_category_config` an optional parameter in Ground Truth component

> Pull requests : [#3932](https://github.com/kubeflow/pipelines/pull/3932)
4 changes: 2 additions & 2 deletions components/aws/sagemaker/Dockerfile
@@ -23,8 +23,8 @@ RUN yum update -y \
unzip

RUN pip3 install \
-boto3==1.13.19 \
-sagemaker==1.54.0 \
+boto3==1.14.12 \
+sagemaker==2.1.0 \
pathlib2==2.3.5 \
pyyaml==3.12

8 changes: 4 additions & 4 deletions components/aws/sagemaker/THIRD-PARTY-LICENSES.txt
@@ -1,7 +1,7 @@
-** Amazon SageMaker Components for Kubeflow Pipelines; version 0.7.0 --
+** Amazon SageMaker Components for Kubeflow Pipelines; version 0.8.0 --
https://github.com/kubeflow/pipelines/tree/master/components/aws/sagemaker
Copyright 2019-2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
-** boto3; version 1.12.33 -- https://github.com/boto/boto3/
+** boto3; version 1.14.12 -- https://github.com/boto/boto3/
Copyright 2013-2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
** botocore; version 1.15.33 -- https://github.com/boto/botocore
Botocore
@@ -12,7 +12,7 @@ https://importlib-metadata.readthedocs.io/en/latest/
** s3transfer; version 0.3.3 -- https://github.com/boto/s3transfer/
s3transfer
Copyright 2016 Amazon.com, Inc. or its affiliates. All Rights Reserved.
-** sagemaker; version 1.54.0 -- https://aws.amazon.com/sagemaker/
+** sagemaker; version 2.1.0 -- https://aws.amazon.com/sagemaker/
Amazon SageMaker Python SDK
Copyright 2017-2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
** smdebug-rulesconfig; version 0.1.2 --
@@ -982,4 +982,4 @@ OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <http://unlicense.org/>
2 changes: 1 addition & 1 deletion components/aws/sagemaker/batch_transform/component.yaml
@@ -102,7 +102,7 @@ outputs:
- {name: output_location, description: 'S3 URI of the transform job results.'}
implementation:
container:
-image: amazon/aws-sagemaker-kfp-components:0.7.0
+image: amazon/aws-sagemaker-kfp-components:0.8.0
command: ['python3']
args: [
batch_transform.py,
138 changes: 116 additions & 22 deletions components/aws/sagemaker/common/_utils.py
@@ -22,6 +22,7 @@
import re
import json
from pathlib2 import Path
+from enum import Enum, auto

import boto3
from boto3.session import Session
@@ -36,7 +37,7 @@
from botocore.exceptions import ClientError
from botocore.session import Session as BotocoreSession

-from sagemaker.amazon.amazon_estimator import get_image_uri
+from sagemaker.image_uris import retrieve

import logging
logging.getLogger().setLevel(logging.INFO)
@@ -99,6 +100,9 @@ def get_component_version():
    return component_version


+def print_log_header(header_len, title=""):
+    logging.info(f"{title:*^{header_len}}")

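The new `print_log_header` helper centers a title inside a run of `*` using an f-string format spec. A standalone sketch of the same formatting (returning the string instead of logging it, so it can be inspected):

```python
# Same format spec as print_log_header: center `title` in `header_len`
# characters, padding with '*'.
def format_header(header_len, title=""):
    return f"{title:*^{header_len}}"

banner = format_header(10, "Logs")   # "Logs" centered in 10 chars
separator = format_header(6)         # no title: six asterisks
```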
def print_logs_for_job(cw_client, log_grp, job_name):
    """Gets the CloudWatch logs for SageMaker jobs"""
    try:
@@ -206,12 +210,12 @@ def create_training_job_request(args):
    # TODO: Adjust this implementation to account for custom algorithm resources names that are the same as built-in algorithm names
    algo_name = args['algorithm_name'].lower().strip()
    if algo_name in built_in_algos.keys():
-        request['AlgorithmSpecification']['TrainingImage'] = get_image_uri(args['region'], built_in_algos[algo_name])
+        request['AlgorithmSpecification']['TrainingImage'] = retrieve(built_in_algos[algo_name], args['region'])
        request['AlgorithmSpecification'].pop('AlgorithmName')
        logging.warning('Algorithm name is found as an Amazon built-in algorithm. Using built-in algorithm.')
    # Just to give the user more leeway for built-in algorithm name inputs
    elif algo_name in built_in_algos.values():
-        request['AlgorithmSpecification']['TrainingImage'] = get_image_uri(args['region'], algo_name)
+        request['AlgorithmSpecification']['TrainingImage'] = retrieve(algo_name, args['region'])
        request['AlgorithmSpecification'].pop('AlgorithmName')
        logging.warning('Algorithm name is found as an Amazon built-in algorithm. Using built-in algorithm.')
    else:
@@ -258,6 +262,17 @@ def create_training_job_request(args):

    enable_spot_instance_support(request, args)

+    ### Update DebugHookConfig and DebugRuleConfigurations
+    if args['debug_hook_config']:
+        request['DebugHookConfig'] = args['debug_hook_config']
+    else:
+        request.pop('DebugHookConfig')
+
+    if args['debug_rule_config']:
+        request['DebugRuleConfigurations'] = args['debug_rule_config']
+    else:
+        request.pop('DebugRuleConfigurations')
+
    ### Update tags
    for key, val in args['tags'].items():
        request['Tags'].append({'Key': key, 'Value': val})
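The debug-config branches above either forward the user-supplied structure or strip the template default from the `CreateTrainingJob` request. A self-contained sketch of that behavior (the bucket value is hypothetical):

```python
# Mirror of the DebugHookConfig/DebugRuleConfigurations handling above:
# an empty user value removes the template default from the request.
request = {
    "TrainingJobName": "example-job",
    "DebugHookConfig": {},          # template default
    "DebugRuleConfigurations": [],  # template default
}

args = {
    "debug_hook_config": {"S3OutputPath": "s3://my-bucket/debug-output"},  # hypothetical
    "debug_rule_config": [],  # user supplied no rules
}

if args["debug_hook_config"]:
    request["DebugHookConfig"] = args["debug_hook_config"]
else:
    request.pop("DebugHookConfig")

if args["debug_rule_config"]:
    request["DebugRuleConfigurations"] = args["debug_rule_config"]
else:
    request.pop("DebugRuleConfigurations")
```

Result: the hook config is forwarded, while the empty rule list is dropped so SageMaker never sees the key.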
@@ -282,18 +297,94 @@ def create_training_job(client, args):


def wait_for_training_job(client, training_job_name, poll_interval=30):
-    while(True):
-        response = client.describe_training_job(TrainingJobName=training_job_name)
-        status = response['TrainingJobStatus']
-        if status == 'Completed':
-            logging.info("Training job ended with status: " + status)
-            break
-        if status == 'Failed':
-            message = response['FailureReason']
-            logging.info('Training failed with the following error: {}'.format(message))
-            raise Exception('Training job failed')
-        logging.info("Training job is still in status: " + status)
-        time.sleep(poll_interval)
+    while(True):
+        response = client.describe_training_job(TrainingJobName=training_job_name)
+        status = response['TrainingJobStatus']
+        if status == 'Completed':
+            logging.info("Training job ended with status: " + status)
+            break
+        if status == 'Failed':
+            message = response['FailureReason']
+            logging.info(f'Training failed with the following error: {message}')
+            raise Exception('Training job failed')
+        logging.info("Training job is still in status: " + status)
+        time.sleep(poll_interval)


+def wait_for_debug_rules(client, training_job_name, poll_interval=30):
+    first_poll = True
+    while(True):
+        response = client.describe_training_job(TrainingJobName=training_job_name)
+        if 'DebugRuleEvaluationStatuses' not in response:
+            break
+        if first_poll:
+            logging.info("Polling for status of all debug rules:")
+            first_poll = False
+        if DebugRulesStatus.from_describe(response) != DebugRulesStatus.INPROGRESS:
+            logging.info("Rules have ended with status:\n")
+            print_debug_rule_status(response, True)
+            break
+        print_debug_rule_status(response)
+        time.sleep(poll_interval)


+class DebugRulesStatus(Enum):
+    COMPLETED = auto()
+    ERRORED = auto()
+    INPROGRESS = auto()
+
+    @classmethod
+    def from_describe(self, response):
+        has_error = False
+        for debug_rule in response['DebugRuleEvaluationStatuses']:
+            if debug_rule['RuleEvaluationStatus'] == "Error":
+                has_error = True
+            if debug_rule['RuleEvaluationStatus'] == "InProgress":
+                return DebugRulesStatus.INPROGRESS
+        if has_error:
+            return DebugRulesStatus.ERRORED
+        else:
+            return DebugRulesStatus.COMPLETED
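A self-contained rendition of the aggregation logic above: any rule still `InProgress` dominates the aggregate status, otherwise any `Error` marks the batch errored. (This sketch uses the conventional `cls` name for the classmethod parameter.)

```python
from enum import Enum, auto

class DebugRulesStatus(Enum):
    COMPLETED = auto()
    ERRORED = auto()
    INPROGRESS = auto()

    @classmethod
    def from_describe(cls, response):
        has_error = False
        for debug_rule in response["DebugRuleEvaluationStatuses"]:
            if debug_rule["RuleEvaluationStatus"] == "Error":
                has_error = True
            if debug_rule["RuleEvaluationStatus"] == "InProgress":
                return cls.INPROGRESS
        return cls.ERRORED if has_error else cls.COMPLETED

# An errored rule plus an in-progress rule still reports INPROGRESS,
# so polling continues until every rule has finished.
response = {"DebugRuleEvaluationStatuses": [
    {"RuleConfigurationName": "VanishingGradient", "RuleEvaluationStatus": "Error"},
    {"RuleConfigurationName": "LossNotDecreasing", "RuleEvaluationStatus": "InProgress"},
]}
status = DebugRulesStatus.from_describe(response)
```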


+def print_debug_rule_status(response, last_print=False):
+    """
+    Example of DebugRuleEvaluationStatuses:
+    response['DebugRuleEvaluationStatuses'] =
+        [{
+            "RuleConfigurationName": "VanishingGradient",
+            "RuleEvaluationStatus": "IssuesFound",
+            "StatusDetails": "There was an issue."
+        }]
+
+    If last_print is False:
+        INFO:root:  - LossNotDecreasing: InProgress
+        INFO:root:  - Overtraining: NoIssuesFound
+        ERROR:root: - CustomGradientRule: Error
+
+    If last_print is True:
+        INFO:root:  - LossNotDecreasing: IssuesFound
+        INFO:root:    - RuleEvaluationConditionMet: Evaluation of the rule LossNotDecreasing at step 10 resulted in the condition being met
+    """
+    for debug_rule in response['DebugRuleEvaluationStatuses']:
+        line_ending = "\n" if last_print else ""
+        if 'StatusDetails' in debug_rule:
+            status_details = f"- {debug_rule['StatusDetails'].rstrip()}{line_ending}"
+            line_ending = ""
+        else:
+            status_details = ""
+        rule_status = f"- {debug_rule['RuleConfigurationName']}: {debug_rule['RuleEvaluationStatus']}{line_ending}"
+        if debug_rule['RuleEvaluationStatus'] == "Error":
+            log = logging.error
+            status_padding = 1
+        else:
+            log = logging.info
+            status_padding = 2
+
+        log(f"{status_padding * ' '}{rule_status}")
+        if last_print and status_details:
+            log(f"{(status_padding + 2) * ' '}{status_details}")
+    print_log_header(50)


def get_model_artifacts_from_job(client, job_name):
@@ -314,10 +405,13 @@ def get_image_from_job(client, job_name):


def stop_training_job(client, job_name):
-    try:
-        client.stop_training_job(TrainingJobName=job_name)
-    except ClientError as e:
-        raise Exception(e.response['Error']['Message'])
+    response = client.describe_training_job(TrainingJobName=job_name)
+    if response["TrainingJobStatus"] == "InProgress":
+        try:
+            client.stop_training_job(TrainingJobName=job_name)
+            return job_name
+        except ClientError as e:
+            raise Exception(e.response['Error']['Message'])
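The updated `stop_training_job` now describes the job first and only issues a stop call while it is still `InProgress`. A simplified stub illustrating that guard (`StubClient` is a hypothetical stand-in for the boto3 SageMaker client; the real helper's error handling is omitted):

```python
# Hypothetical stub of the boto3 SageMaker client, recording stop calls.
class StubClient:
    def __init__(self, status):
        self._status = status
        self.stopped = False

    def describe_training_job(self, TrainingJobName):
        return {"TrainingJobStatus": self._status}

    def stop_training_job(self, TrainingJobName):
        self.stopped = True

def stop_training_job(client, job_name):
    # Only stop jobs that are still running; stopping a finished job
    # would fail, so the guard skips the call entirely.
    response = client.describe_training_job(TrainingJobName=job_name)
    if response["TrainingJobStatus"] == "InProgress":
        client.stop_training_job(TrainingJobName=job_name)
        return job_name

running = StubClient("InProgress")
completed = StubClient("Completed")
stop_training_job(running, "job-a")    # issues a stop call
stop_training_job(completed, "job-b")  # no stop call
```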


def create_model(client, args):
@@ -611,12 +705,12 @@ def create_hyperparameter_tuning_job_request(args):
    # TODO: Adjust this implementation to account for custom algorithm resources names that are the same as built-in algorithm names
    algo_name = args['algorithm_name'].lower().strip()
    if algo_name in built_in_algos.keys():
-        request['TrainingJobDefinition']['AlgorithmSpecification']['TrainingImage'] = get_image_uri(args['region'], built_in_algos[algo_name])
+        request['TrainingJobDefinition']['AlgorithmSpecification']['TrainingImage'] = retrieve(built_in_algos[algo_name], args['region'])
        request['TrainingJobDefinition']['AlgorithmSpecification'].pop('AlgorithmName')
        logging.warning('Algorithm name is found as an Amazon built-in algorithm. Using built-in algorithm.')
    # To give the user more leeway for built-in algorithm name inputs
    elif algo_name in built_in_algos.values():
-        request['TrainingJobDefinition']['AlgorithmSpecification']['TrainingImage'] = get_image_uri(args['region'], algo_name)
+        request['TrainingJobDefinition']['AlgorithmSpecification']['TrainingImage'] = retrieve(algo_name, args['region'])
        request['TrainingJobDefinition']['AlgorithmSpecification'].pop('AlgorithmName')
        logging.warning('Algorithm name is found as an Amazon built-in algorithm. Using built-in algorithm.')
    else:
@@ -1135,4 +1229,4 @@ def write_output(output_path, output_value, json_encode=False):
    write_value = json.dumps(output_value) if json_encode else output_value

    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    Path(output_path).write_text(write_value)
4 changes: 3 additions & 1 deletion components/aws/sagemaker/common/train.template.yaml
@@ -21,10 +21,12 @@ VpcConfig:
StoppingCondition:
MaxRuntimeInSeconds: 86400
MaxWaitTimeInSeconds: 86400
+DebugHookConfig: {}
+DebugRuleConfigurations: []
CheckpointConfig:
S3Uri: ''
LocalPath: ''
Tags: []
EnableNetworkIsolation: True
EnableInterContainerTrafficEncryption: False
EnableManagedSpotTraining: False
2 changes: 1 addition & 1 deletion components/aws/sagemaker/deploy/component.yaml
@@ -108,7 +108,7 @@ outputs:
- {name: endpoint_name, description: 'Endpoint name'}
implementation:
container:
-image: amazon/aws-sagemaker-kfp-components:0.7.0
+image: amazon/aws-sagemaker-kfp-components:0.8.0
command: ['python3']
args: [
deploy.py,
4 changes: 2 additions & 2 deletions components/aws/sagemaker/ground_truth/component.yaml
@@ -123,7 +123,7 @@ outputs:
- {name: active_learning_model_arn, description: 'The ARN for the most recent Amazon SageMaker model trained as part of automated data labeling.'}
implementation:
container:
-image: amazon/aws-sagemaker-kfp-components:0.7.0
+image: amazon/aws-sagemaker-kfp-components:0.8.0
command: ['python3']
args: [
ground_truth.py,
@@ -161,4 +161,4 @@ implementation:
--tags, {inputValue: tags},
--output_manifest_location_output_path, {outputPath: output_manifest_location},
--active_learning_model_arn_output_path, {outputPath: active_learning_model_arn}
]
4 changes: 2 additions & 2 deletions components/aws/sagemaker/hyperparameter_tuning/component.yaml
@@ -154,7 +154,7 @@ outputs:
description: 'The registry path of the Docker image that contains the training algorithm'
implementation:
container:
-image: amazon/aws-sagemaker-kfp-components:0.7.0
+image: amazon/aws-sagemaker-kfp-components:0.8.0
command: ['python3']
args: [
hyperparameter_tuning.py,
@@ -200,4 +200,4 @@ implementation:
--best_job_name_output_path, {outputPath: best_job_name},
--best_hyperparameters_output_path, {outputPath: best_hyperparameters},
--training_image_output_path, {outputPath: training_image}
]
4 changes: 2 additions & 2 deletions components/aws/sagemaker/model/component.yaml
@@ -63,7 +63,7 @@ outputs:
- {name: model_name, description: 'The model name SageMaker created'}
implementation:
container:
-image: amazon/aws-sagemaker-kfp-components:0.7.0
+image: amazon/aws-sagemaker-kfp-components:0.8.0
command: ['python3']
args: [
create_model.py,
@@ -83,4 +83,4 @@ implementation:
--network_isolation, {inputValue: network_isolation},
--tags, {inputValue: tags},
--model_name_output_path, {outputPath: model_name}
]
4 changes: 2 additions & 2 deletions components/aws/sagemaker/process/component.yaml
@@ -93,7 +93,7 @@ outputs:
- {name: output_artifacts, description: 'A dictionary containing the output S3 artifacts'}
implementation:
container:
-image: amazon/aws-sagemaker-kfp-components:0.7.0
+image: amazon/aws-sagemaker-kfp-components:0.8.0
command: ['python3']
args: [
process.py,
@@ -121,4 +121,4 @@ implementation:
--tags, {inputValue: tags},
--job_name_output_path, {outputPath: job_name},
--output_artifacts_output_path, {outputPath: output_artifacts}
]
4 changes: 2 additions & 2 deletions components/aws/sagemaker/tests/integration_tests/README.md
@@ -9,7 +9,7 @@

1. In the following Python script, change the bucket name and run the [`s3_sample_data_creator.py`](https://github.com/kubeflow/pipelines/tree/master/samples/contrib/aws-samples/mnist-kmeans-sagemaker#the-sample-dataset) to create an S3 bucket with the sample mnist dataset in the region where you want to run the tests.
2. To prepare the dataset for the SageMaker GroundTruth Component test, follow the steps in the [GroundTruth Sample README](https://github.com/kubeflow/pipelines/tree/master/samples/contrib/aws-samples/ground_truth_pipeline_demo#prep-the-dataset-label-categories-and-ui-template).
-3. To prepare the processing script for the SageMaker Processing Component tests, upload the `scripts/kmeans_preprocessing.py` script to your bucket. This can be done by replacing `<my-bucket> with your bucket name and running `aws s3 cp scripts/kmeans_preprocessing.py s3://<my-bucket>/mnist_kmeans_example/processing_code/kmeans_preprocessing.py`
+3. To prepare the processing script for the SageMaker Processing Component tests, upload the `scripts/kmeans_preprocessing.py` script to your bucket. This can be done by replacing `<my-bucket>` with your bucket name and running `aws s3 cp scripts/kmeans_preprocessing.py s3://<my-bucket>/mnist_kmeans_example/processing_code/kmeans_preprocessing.py`


## Steps to run integration tests
@@ -22,4 +22,4 @@
1. Navigate to the root of this github directory.
1. Run `docker build . -f components/aws/sagemaker/tests/integration_tests/Dockerfile -t amazon/integration_test`
1. Run the image, injecting your environment variable files:
1. Run `docker run --env-file components/aws/sagemaker/tests/integration_tests/.env amazon/integration_test`
@@ -13,9 +13,13 @@
        pytest.param(
            "resources/config/simple-mnist-training", marks=pytest.mark.canary_test
        ),
-        pytest.param("resources/config/fsx-mnist-training", marks=pytest.mark.fsx_test),
+        pytest.param(
+            "resources/config/fsx-mnist-training",
+            marks=pytest.mark.fsx_test
+        ),
        "resources/config/spot-sample-pipeline-training",
        "resources/config/assume-role-training",
+        "resources/config/xgboost-mnist-trainingjob-debugger"
    ],
)
def test_trainingjob(