[AWS SageMaker] Processing job component (#3944)
* Add TDD processing definition

* Update README

* Update temporary image

* Update component entrypoint

* Add WORKDIR to fix Docker 18 support

* integration test for processing job

* Remove job links

* Add container outputs and tests

* Update default properties

* Remove max_run_time if none provided

* Update integration readme steps

* Updated README with more resources

* Add CloudWatch link back to logs

* Update input and output config to arrays

* Update processing integration test

* Update process README

* Update unit tests

* Updated license version

* Update component image versions

* Update changelog

Co-authored-by: Suraj Kota <surakota@amazon.com>
RedbackThomson and surajkota committed Jun 17, 2020
1 parent 53d0244 commit bea6365
Showing 22 changed files with 880 additions and 10 deletions.
13 changes: 12 additions & 1 deletion components/aws/sagemaker/Changelog.md
@@ -4,10 +4,21 @@ The version of the AWS SageMaker Components is determined by the docker image ta
Repository: https://hub.docker.com/repository/docker/amazon/aws-sagemaker-kfp-components

---------------------------------------------
**Change log for version 0.4.0**
- Add new component for SageMaker Processing Jobs

> Pull requests : [#3944](https://github.com/kubeflow/pipelines/pull/3944)

**Change log for version 0.3.1**
- Explicitly specify component field types

> Pull requests : [#3683](https://github.com/kubeflow/pipelines/pull/3683)

**Change log for version 0.3.0**
- Remove data_location parameters from all components
-  (Use "channes" parameter instead)
+  (Use "channels" parameter instead)

> Pull requests : [#3518](https://github.com/kubeflow/pipelines/pull/3518)
5 changes: 4 additions & 1 deletion components/aws/sagemaker/Dockerfile
@@ -23,18 +23,21 @@ RUN yum update -y \
unzip

 RUN pip3 install \
-    boto3==1.12.33 \
+    boto3==1.13.19 \
     sagemaker==1.54.0 \
     pathlib2==2.3.5 \
     pyyaml==3.12

+WORKDIR /app
+
 COPY LICENSE.txt .
 COPY NOTICE.txt .
 COPY THIRD-PARTY-LICENSES.txt .
 COPY hyperparameter_tuning/src/hyperparameter_tuning.py .
 COPY train/src/train.py .
 COPY deploy/src/deploy.py .
 COPY model/src/create_model.py .
+COPY process/src/process.py .
 COPY batch_transform/src/batch_transform.py .
 COPY workteam/src/workteam.py .
 COPY ground_truth/src/ground_truth.py .
2 changes: 1 addition & 1 deletion components/aws/sagemaker/THIRD-PARTY-LICENSES.txt
@@ -1,4 +1,4 @@
-** Amazon SageMaker Components for Kubeflow Pipelines; version 0.3.1 --
+** Amazon SageMaker Components for Kubeflow Pipelines; version 0.4.0 --
 https://github.com/kubeflow/pipelines/tree/master/components/aws/sagemaker
 Copyright 2019-2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
 ** boto3; version 1.12.33 -- https://github.com/boto/boto3/
2 changes: 1 addition & 1 deletion components/aws/sagemaker/batch_transform/component.yaml
@@ -98,7 +98,7 @@ outputs:
   - {name: output_location, description: 'S3 URI of the transform job results.'}
 implementation:
   container:
-    image: amazon/aws-sagemaker-kfp-components:0.3.1
+    image: amazon/aws-sagemaker-kfp-components:0.4.0
     command: ['python3']
     args: [
       batch_transform.py,
110 changes: 110 additions & 0 deletions components/aws/sagemaker/common/_utils.py
@@ -861,6 +861,116 @@ def enable_spot_instance_support(training_job_config, args):
del training_job_config['StoppingCondition']['MaxWaitTimeInSeconds']
del training_job_config['CheckpointConfig']

def create_processing_job_request(args):
    ### Documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_processing_job
    with open(os.path.join(__cwd__, 'process.template.yaml'), 'r') as f:
        request = yaml.safe_load(f)

    job_name = args['job_name'] if args['job_name'] else 'ProcessingJob-' + strftime("%Y%m%d%H%M%S", gmtime()) + '-' + id_generator()

    request['ProcessingJobName'] = job_name
    request['RoleArn'] = args['role']

    ### Update processing container settings
    request['AppSpecification']['ImageUri'] = args['image']

    if args['container_entrypoint']:
        request['AppSpecification']['ContainerEntrypoint'] = args['container_entrypoint']
    else:
        request['AppSpecification'].pop('ContainerEntrypoint')
    if args['container_arguments']:
        request['AppSpecification']['ContainerArguments'] = args['container_arguments']
    else:
        request['AppSpecification'].pop('ContainerArguments')

    ### Update or pop VPC configs
    if args['vpc_security_group_ids'] and args['vpc_subnets']:
        request['NetworkConfig']['VpcConfig']['SecurityGroupIds'] = args['vpc_security_group_ids'].split(',')
        request['NetworkConfig']['VpcConfig']['Subnets'] = args['vpc_subnets'].split(',')
    else:
        request['NetworkConfig'].pop('VpcConfig')
    request['NetworkConfig']['EnableNetworkIsolation'] = args['network_isolation']
    request['NetworkConfig']['EnableInterContainerTrafficEncryption'] = args['traffic_encryption']

    ### Update input channels, not a required field
    if args['input_config']:
        request['ProcessingInputs'] = args['input_config']
    else:
        request.pop('ProcessingInputs')

    ### Update output channels, must have at least one specified
    if len(args['output_config']) > 0:
        request['ProcessingOutputConfig']['Outputs'] = args['output_config']
    else:
        logging.error("Must specify at least one output channel.")
        raise Exception('Could not create job request')

    if args['output_encryption_key']:
        request['ProcessingOutputConfig']['KmsKeyId'] = args['output_encryption_key']
    else:
        request['ProcessingOutputConfig'].pop('KmsKeyId')

    ### Update cluster config resources
    request['ProcessingResources']['ClusterConfig']['InstanceType'] = args['instance_type']
    request['ProcessingResources']['ClusterConfig']['InstanceCount'] = args['instance_count']
    request['ProcessingResources']['ClusterConfig']['VolumeSizeInGB'] = args['volume_size']

    if args['resource_encryption_key']:
        request['ProcessingResources']['ClusterConfig']['VolumeKmsKeyId'] = args['resource_encryption_key']
    else:
        request['ProcessingResources']['ClusterConfig'].pop('VolumeKmsKeyId')

    if args['max_run_time']:
        request['StoppingCondition']['MaxRuntimeInSeconds'] = args['max_run_time']
    else:
        ### No limit supplied: omit StoppingCondition entirely, because
        ### MaxRuntimeInSeconds is required whenever the block is present
        request.pop('StoppingCondition')

    request['Environment'] = args['environment']

    ### Update tags
    for key, val in args['tags'].items():
        request['Tags'].append({'Key': key, 'Value': val})

    return request


def create_processing_job(client, args):
    """Create a SageMaker processing job."""
    request = create_processing_job_request(args)
    try:
        client.create_processing_job(**request)
        processing_job_name = request['ProcessingJobName']
        logging.info("Created Processing Job with name: " + processing_job_name)
        logging.info("CloudWatch logs: https://{}.console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix"
            .format(args['region'], args['region'], processing_job_name))
        return processing_job_name
    except ClientError as e:
        raise Exception(e.response['Error']['Message'])


def wait_for_processing_job(client, processing_job_name, poll_interval=30):
    """Poll the processing job until it reaches a terminal status."""
    while True:
        response = client.describe_processing_job(ProcessingJobName=processing_job_name)
        status = response['ProcessingJobStatus']
        if status == 'Completed':
            logging.info("Processing job ended with status: " + status)
            break
        if status == 'Failed':
            message = response['FailureReason']
            logging.info('Processing failed with the following error: {}'.format(message))
            raise Exception('Processing job failed')
        if status == 'Stopped':
            # A stopped job never reaches 'Completed', so stop polling here.
            raise Exception('Processing job was stopped')
        logging.info("Processing job is still in status: " + status)
        time.sleep(poll_interval)

def get_processing_job_outputs(client, processing_job_name):
    """Map the S3 outputs of a processing job to a dictionary object."""
    response = client.describe_processing_job(ProcessingJobName=processing_job_name)
    outputs = {}
    for output in response['ProcessingOutputConfig']['Outputs']:
        outputs[output['OutputName']] = output['S3Output']['S3Uri']

    return outputs


def id_generator(size=4, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))
26 changes: 26 additions & 0 deletions components/aws/sagemaker/common/process.template.yaml
@@ -0,0 +1,26 @@
ProcessingJobName: ''
ProcessingInputs: []
ProcessingOutputConfig:
  Outputs: []
  KmsKeyId: ''
RoleArn: ''
ProcessingResources:
  ClusterConfig:
    InstanceType: ''
    InstanceCount: 1
    VolumeSizeInGB: 1
    VolumeKmsKeyId: ''
NetworkConfig:
  EnableInterContainerTrafficEncryption: False
  EnableNetworkIsolation: False
  VpcConfig:
    SecurityGroupIds: []
    Subnets: []
StoppingCondition:
  MaxRuntimeInSeconds: 86400
AppSpecification:
  ImageUri: ''
  ContainerEntrypoint: []
  ContainerArguments: []
Environment: {}
Tags: []
2 changes: 1 addition & 1 deletion components/aws/sagemaker/deploy/component.yaml
@@ -104,7 +104,7 @@ outputs:
   - {name: endpoint_name, description: 'Endpoint name'}
 implementation:
   container:
-    image: amazon/aws-sagemaker-kfp-components:0.3.1
+    image: amazon/aws-sagemaker-kfp-components:0.4.0
     command: ['python3']
     args: [
       deploy.py,
2 changes: 1 addition & 1 deletion components/aws/sagemaker/ground_truth/component.yaml
@@ -119,7 +119,7 @@ outputs:
   - {name: active_learning_model_arn, description: 'The ARN for the most recent Amazon SageMaker model trained as part of automated data labeling.'}
 implementation:
   container:
-    image: amazon/aws-sagemaker-kfp-components:0.3.1
+    image: amazon/aws-sagemaker-kfp-components:0.4.0
     command: ['python3']
     args: [
       ground_truth.py,
ground_truth.py,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -150,7 +150,7 @@ outputs:
     description: 'The registry path of the Docker image that contains the training algorithm'
 implementation:
   container:
-    image: amazon/aws-sagemaker-kfp-components:0.3.1
+    image: amazon/aws-sagemaker-kfp-components:0.4.0
     command: ['python3']
     args: [
       hyperparameter_tuning.py,
2 changes: 1 addition & 1 deletion components/aws/sagemaker/model/component.yaml
@@ -59,7 +59,7 @@ outputs:
   - {name: model_name, description: 'The model name Sagemaker created'}
 implementation:
   container:
-    image: amazon/aws-sagemaker-kfp-components:0.3.1
+    image: amazon/aws-sagemaker-kfp-components:0.4.0
     command: ['python3']
     args: [
       create_model.py,
80 changes: 80 additions & 0 deletions components/aws/sagemaker/process/README.md
@@ -0,0 +1,80 @@
# SageMaker Processing Kubeflow Pipelines component

## Summary
Component to submit SageMaker Processing jobs directly from a Kubeflow Pipelines workflow.
https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html

## Intended Use
For running your data processing workloads, such as feature engineering, data validation, model evaluation, and model interpretation using AWS SageMaker.

## Runtime Arguments
Argument | Description | Optional | Data type | Accepted values | Default |
:--- | :---------- | :----------| :----------| :---------- | :----------|
region | The region where the cluster launches | No | String | | |
endpoint_url | The endpoint URL for the private link VPC endpoint. | Yes | String | | |
job_name | The name of the Processing job. Must be unique within the same AWS account and AWS region | Yes | String | | ProcessingJob-[datetime]-[random id]|
role | The Amazon Resource Name (ARN) that Amazon SageMaker assumes to perform tasks on your behalf | No | String | | |
image | The registry path of the Docker image that contains the processing script | No | String | | |
instance_type | The ML compute instance type | Yes | String | ml.m4.xlarge, ml.m4.2xlarge, ml.m4.4xlarge, ml.m4.10xlarge, ml.m4.16xlarge, ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge, ml.m5.4xlarge, ml.m5.12xlarge, ml.m5.24xlarge, ml.c4.xlarge, ml.c4.2xlarge, ml.c4.4xlarge, ml.c4.8xlarge, ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge, ml.p3.8xlarge, ml.p3.16xlarge, ml.c5.xlarge, ml.c5.2xlarge, ml.c5.4xlarge, ml.c5.9xlarge, ml.c5.18xlarge [and many more](https://aws.amazon.com/sagemaker/pricing/instance-types/) | ml.m4.xlarge |
instance_count | The number of ML compute instances to use in each processing job | Yes | Int | ≥ 1 | 1 |
volume_size | The size of the ML storage volume that you want to provision in GB | Yes | Int | ≥ 1 | 30 |
resource_encryption_key | The AWS KMS key that Amazon SageMaker uses to encrypt data on the storage volume attached to the ML compute instance(s) | Yes | String | | |
output_encryption_key | The AWS KMS key that Amazon SageMaker uses to encrypt the processing job output | Yes | String | | |
max_run_time | The maximum run time in seconds per processing job | Yes | Int | ≤ 432000 (5 days) | 86400 (1 day) |
environment | The environment variables to set in the Docker container | Yes | Dict | Maximum length of 1024. Key Pattern: `[a-zA-Z_][a-zA-Z0-9_]*`. Value Pattern: `[\S\s]*`. Up to 16 key and value entries in the map | |
container_entrypoint | The entrypoint for the processing job. This is in the form of a list of strings that make a command | Yes | List of Strings | | [] |
container_arguments | A list of string arguments to be passed to a processing job | Yes | List of Strings | | [] |
input_config | Parameters that specify Amazon S3 inputs for a processing job | Yes | List of Dicts | | [] |
output_config | Parameters that specify Amazon S3 outputs for a processing job. At least one output is required | No | List of Dicts | | [] |
vpc_security_group_ids | A comma-delimited list of security group IDs, in the form sg-xxxxxxxx | Yes | String | | |
vpc_subnets | A comma-delimited list of subnet IDs in the VPC to which you want to connect your processing job | Yes | String | | |
network_isolation | Isolates the processing container if true | No | Boolean | False, True | True |
traffic_encryption | Encrypts all communications between ML compute instances in distributed processing if true | No | Boolean | False, True | False |
tags | Key-value pairs to categorize AWS resources | Yes | Dict | | {} |
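
The `environment` limits listed in the table can be checked locally before a pipeline run is submitted. The following is a minimal sketch and not part of the component: `validate_environment` is a hypothetical helper name, and the limits are copied from the table row above.

```python
import re

# Limits copied from the `environment` row of the table above.
MAX_ENTRIES = 16
MAX_LENGTH = 1024
KEY_PATTERN = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def validate_environment(environment):
    """Hypothetical helper: return a list of problems found in `environment`."""
    problems = []
    if len(environment) > MAX_ENTRIES:
        problems.append("more than {} entries".format(MAX_ENTRIES))
    for key, value in environment.items():
        # The key must match the whole pattern, not just a prefix.
        if not KEY_PATTERN.fullmatch(key):
            problems.append("invalid key: {!r}".format(key))
        if len(key) > MAX_LENGTH or len(value) > MAX_LENGTH:
            problems.append("key or value too long: {!r}".format(key))
    return problems
```

An empty list means the dictionary satisfies the documented limits; SageMaker itself remains the final authority on what it accepts.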

Notes:
* You can find more information about how container entrypoint and arguments are used at the [Build Your Own Processing Container](https://docs.aws.amazon.com/sagemaker/latest/dg/build-your-own-processing-container.html#byoc-run-image) documentation.
* Each key and each value in the `environment` string-to-string map can be up to 1024 characters long. SageMaker supports up to 16 entries in the map.
* The format for the [`input_config`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProcessingInput.html) field is:
```
[
  {
    'InputName': 'string',
    'S3Input': {
      'S3Uri': 'string',
      'LocalPath': 'string',
      'S3DataType': 'ManifestFile'|'S3Prefix',
      'S3InputMode': 'Pipe'|'File',
      'S3DataDistributionType': 'FullyReplicated'|'ShardedByS3Key',
      'S3CompressionType': 'None'|'Gzip'
    }
  },
]
```
* The format for the [`output_config`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProcessingS3Output.html) field is:
```
[
  {
    'OutputName': 'string',
    'S3Output': {
      'S3Uri': 'string',
      'LocalPath': 'string',
      'S3UploadMode': 'Continuous'|'EndOfJob'
    }
  },
]
```
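
As an illustration of the two formats above, both lists can be written as plain Python literals. This is a sketch under assumptions: the bucket, prefixes, and channel names are placeholders, the `/opt/ml/processing/...` local paths follow the convention used in the SageMaker Processing documentation rather than anything this component requires, and `output_names` is a hypothetical helper.

```python
# Placeholder S3 locations and channel names; substitute your own.
input_config = [
    {
        "InputName": "raw-data",
        "S3Input": {
            "S3Uri": "s3://example-bucket/processing/input/",
            "LocalPath": "/opt/ml/processing/input",
            "S3DataType": "S3Prefix",
            "S3InputMode": "File",
            "S3DataDistributionType": "FullyReplicated",
            "S3CompressionType": "None",
        },
    },
]

# At least one output entry is required by the component.
output_config = [
    {
        "OutputName": "features",
        "S3Output": {
            "S3Uri": "s3://example-bucket/processing/output/",
            "LocalPath": "/opt/ml/processing/output",
            "S3UploadMode": "EndOfJob",
        },
    },
]


def output_names(config):
    """Hypothetical helper: list the OutputName of every output entry."""
    return [entry["OutputName"] for entry in config]
```

The names returned by `output_names(output_config)` are the keys to expect in the component's `output_artifacts` output.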

## Outputs
Name | Description
:--- | :----------
job_name | Processing job name
output_artifacts | A dictionary that maps each `OutputName` in `output_config` to the `S3Uri` of that output
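
The shape of `output_artifacts` can be pictured as a walk over the `ProcessingOutputConfig.Outputs` list of a `describe_processing_job` response. The sketch below uses a hand-written response fragment instead of a live AWS call; `map_outputs` is an illustrative name, and the output names and S3 URIs are made up.

```python
def map_outputs(describe_response):
    """Build {OutputName: S3Uri} from a describe_processing_job style response."""
    outputs = {}
    for output in describe_response["ProcessingOutputConfig"]["Outputs"]:
        outputs[output["OutputName"]] = output["S3Output"]["S3Uri"]
    return outputs


# Hand-written fragment with only the fields the mapping reads.
sample_response = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {"OutputName": "train", "S3Output": {"S3Uri": "s3://example-bucket/out/train"}},
            {"OutputName": "test", "S3Output": {"S3Uri": "s3://example-bucket/out/test"}},
        ]
    }
}

artifacts = map_outputs(sample_response)
```

Downstream pipeline steps can then look up a location by output name, for example `artifacts["train"]`.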

## Requirements
* [Kubeflow pipelines SDK](https://www.kubeflow.org/docs/pipelines/sdk/install-sdk/)
* [Kubeflow set-up](https://www.kubeflow.org/docs/aws/deploy/install-kubeflow/)

## Resources
* [Create Processing Job API documentation](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateProcessingJob.html)
* [Boto3 API reference](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_processing_job)