Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support "cancelled" and "failed" Shutdown operations [MLG-468] #6627

Merged
merged 1 commit into from
Apr 26, 2023

Conversation

tayritenour
Copy link
Contributor

@tayritenour tayritenour commented Apr 25, 2023

Description

MLG-468

Supports sending Shutdown operations that are "cancelled" or "failed". This allows us on the python harness side to send operations like searcher.Shutdown(failure=True) inside of a Custom Searcher SearchMethod to signal to the master that an experiment should be considered failed.

Before, we made the assumption that any failures by individual trials in Custom Searches were all equally valid. However in use cases like the Deepspeed Autotuning project, we have special trials that cannot fail or the whole job should be stopped and considered failed. This change would allow the user to control that.

Test Plan

Here is a minimal example of a Custom Searcher that can simulate the new functionality:
test_shutdown.zip

To run, try both:
det experiment create dummy_searcher/no_fail.yaml .
and
det experiment create dummy_searcher/fail.yaml .

Screenshot 2023-04-24 at 5 52 36 PM

Each Custom Searcher should create two experiments, one being the orchestrator of the other agent experiment, which actually runs the prospective trials. The orchestrator will schedule a single trial, and on close, choose to Shutdown the agent experiment with a failure state or not depending on the yaml. Only the agent experiment should be notated as "failed".

Commentary (optional)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

@tayritenour tayritenour added User-facing API Change python Pull requests that update Python code go Pull requests that update Go code labels Apr 25, 2023
@tayritenour tayritenour requested a review from a team as a code owner April 25, 2023 01:12
@tayritenour tayritenour requested review from eecsliu and removed request for a team April 25, 2023 01:12
@cla-bot cla-bot bot added the cla-signed label Apr 25, 2023
@tayritenour tayritenour changed the title [MLG-468] Support "cancelled" and "failed" Shutdown operations Support "cancelled" and "failed" Shutdown operations [MLG-468] Apr 25, 2023
@tayritenour tayritenour changed the title Support "cancelled" and "failed" Shutdown operations [MLG-468] feat: Support "cancelled" and "failed" Shutdown operations [MLG-468] Apr 25, 2023
Comment on lines 168 to 169
// Deprecated: But maintained here for backwards compatibility
int32 placeholder = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an internal API, so I vote to just break it. Also the comment above this line.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would agree to remove this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, testing the changes to remove this

Copy link
Contributor

@rb-determined-ai rb-determined-ai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine to me, but I'm not sure I'm the right person to review it.

Maybe get somebody on backend to glance at it real quick?

@tayritenour tayritenour requested a review from stoksc April 25, 2023 20:00
Copy link
Contributor

@stoksc stoksc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@tayritenour tayritenour force-pushed the MLG-468 branch 3 times, most recently from 5789ace to 7b37726 Compare April 25, 2023 23:22
@tayritenour tayritenour merged commit f0c8ef1 into determined-ai:main Apr 26, 2023
@dannysauer dannysauer added this to the 0.22.0 milestone Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed go Pull requests that update Go code python Pull requests that update Python code User-facing API Change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants