Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(components): AWS SageMaker - Add optional parameter to allow training component to accept parameters related to Debugger #4283

Merged
merged 13 commits into from
Aug 19, 2020

Conversation

dstnluong
Copy link
Contributor

@dstnluong dstnluong commented Jul 28, 2020

  • Add functionality to accept and process debugger related parameters

  • Updated KFP logs to show debugger rules status and to wait until all rules have finished

  • Added and updated unit tests and integration tests

  • Added a new sample pipeline which uses the newly added parameters with a README that explains how to run the sample

  • Minor typo fixes in various files

Checklist:

  • The title for your pull request (PR) should follow our title convention. Learn more about the pull request title convention used in this repository.

    PR titles examples:

    • fix(frontend): fixes empty page. Fixes #1234
      Use fix to indicate that this PR fixes a bug.
    • feat(backend): configurable service account. Fixes #1234, fixes #1235
      Use feat to indicate that this PR adds a new feature.
    • chore: set up changelog generation tools
      Use chore to indicate that this PR makes some changes that users don't need to know.
    • test: fix CI failure. Part of #1234
      Use part of to indicate that a PR is working on an issue, but shouldn't close the issue when merged.
  • Do you want this pull request (PR) cherry-picked into the current release branch?

    If yes, use one of the following options:

    • (Recommended.) Ask the PR approver to add the cherrypick-approved label to this PR. The release manager adds this PR to the release branch in a batch update.
    • After this PR is merged, create a cherry-pick PR to add these changes to the release branch. (For more information about creating a cherry-pick PR, see the Kubeflow Pipelines release guide.)

@kubeflow-bot
Copy link

This change is Reviewable

@google-cla
Copy link

google-cla bot commented Jul 28, 2020

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@k8s-ci-robot
Copy link
Contributor

Hi @dstnluong. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dstnluong
Copy link
Contributor Author

/assign @RedbackThomson

…e more succinct, removed hardcoding from integration tests
@google-cla
Copy link

google-cla bot commented Jul 30, 2020

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

…ple README, refactored _utils.py for fstrings and fixed offset for errors
@google-cla
Copy link

google-cla bot commented Aug 4, 2020

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added cla: yes and removed cla: no labels Aug 7, 2020
@google-cla
Copy link

google-cla bot commented Aug 7, 2020

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added cla: no and removed cla: yes labels Aug 7, 2020
@google-cla
Copy link

google-cla bot commented Aug 7, 2020

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@@ -8,6 +8,7 @@
import shutil

from sagemaker.amazon.amazon_estimator import get_image_uri
Copy link
Contributor

@akartsky akartsky Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That deprecated version is for SageMaker SDK 2.0 @akartsky

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right we can choose to keep using the old version

for some reason i thought this was added to _utils.py file as a new dependency so i thought of using the latest one https://github.com/kubeflow/pipelines/blob/master/components/aws/sagemaker/common/_utils.py#L39

@dstnluong dstnluong force-pushed the debugcomponent branch 3 times, most recently from ac4f2a5 to 8c95a04 Compare August 18, 2020 04:50

:output Trainingjob-1234567890-RuleName-a1112312
"""
job_name = rule_evaluation_job_arn.split("/")[1].split("-")
Copy link
Contributor

@surajkota surajkota Aug 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rule_evaluation_job_arn.split("/")[1] should be sufficient ?

also the prefix might not be always Trainingjob-

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to replace the initial "trainingjob" with "Trainingjob" and change the rule name as well. There is an edge case where the rule name contains the "-" character, just something to note.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this method is no longer needed

sagemaker==2.1.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RedbackThomson can an improvement to reduce duplication be part of your PR ?

Dustin had to change sagemaker version at multiple places

@surajkota
Copy link
Contributor

/ok-to-test
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: surajkota

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@akartsky
Copy link
Contributor

/retest

@akartsky
Copy link
Contributor

/retry

@dstnluong
Copy link
Contributor Author

/retest

@akartsky
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot merged commit 3ebd075 into kubeflow:master Aug 19, 2020
Jeffwan pushed a commit to Jeffwan/pipelines that referenced this pull request Dec 9, 2020
…ining component to accept parameters related to Debugger (kubeflow#4283)

* Implemented debugger for training component with sample pipeline, unit tests, and integration test

* Implemented changes from PR, refactored utils.py, made sample pipeline more succinct, removed hardcoding from integration tests

* Added default parameter for sample pipeline and fixed grammar for sample README, refactored _utils.py for fstrings and fixed offset for errors

* Removed aws secret lines

* Terminate debug rules when terminating training job, Terminate debug rules if terminate is pressed after training job has completed, added integration tests for stop_debug_rules, updated READMEs for train and sample, renamed sample pipeline, removed tensorboard, updated sagemaker version to sagemaker 2.1.0.

* Terminate debug rules when terminating training job, Terminate debug rules if terminate is pressed after training job has completed, added integration tests for stop_debug_rules, updated READMEs for train and sample, renamed sample pipeline, removed tensorboard, updated sagemaker version to sagemaker 2.1.0.

* Removed extra files, cleaned integration test

* Changed integration test to use sample debugger pipeline

* Processing jobs created from debug rules will not terminate, fixing other small issues

* Removed debug from pipeline definition, removed extra line, removed unused function

* Changelog and image tag updates
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants