feat: Support "cancelled" and "failed" Shutdown operations [MLG-468] #6627

tayritenour · 2023-04-25T01:12:30Z

Description

Adds support for sending Shutdown operations marked as "cancelled" or "failed". On the Python harness side, a Custom Searcher SearchMethod can now send an operation such as searcher.Shutdown(failure=True) to signal to the master that the experiment should be considered failed.

Previously, we assumed that a failure of any individual trial in a Custom Search was an equally valid outcome. However, in use cases like the DeepSpeed Autotuning project, some special trials must not fail; if one does, the whole job should be stopped and considered failed. This change lets the user control that.
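
As a rough illustration of the use case above, here is a minimal sketch. It does not import the real harness: `Shutdown` is a stand-in dataclass that only mirrors the `cancel` and `failure` flags discussed in this PR, and `CriticalTrialSearch`, its simplified `on_trial_exited_early` signature, and the request IDs are hypothetical names chosen for the example (the real SearchMethod hooks take more arguments).

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Shutdown:
    """Stand-in for the searcher Shutdown operation (assumption, not the real class)."""

    cancel: bool = False
    failure: bool = False


class CriticalTrialSearch:
    """Hypothetical search method that fails the experiment if a critical trial fails."""

    def __init__(self, critical_trials: Set[str]) -> None:
        self.critical_trials = critical_trials

    def on_trial_exited_early(self, request_id: str) -> List[Shutdown]:
        # A critical trial failed: stop the search and mark the experiment failed.
        if request_id in self.critical_trials:
            return [Shutdown(failure=True)]
        # Any other trial failure is tolerated; no new operations are issued.
        return []


search = CriticalTrialSearch(critical_trials={"baseline"})
print(search.on_trial_exited_early("baseline"))
print(search.on_trial_exited_early("candidate-3"))
```

Keeping the decision inside the search method means the user, not the master, defines which trial failures are fatal.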

Test Plan

Here is a minimal example of a Custom Searcher that can simulate the new functionality:
test_shutdown.zip

To run, try both:
det experiment create dummy_searcher/no_fail.yaml .
and
det experiment create dummy_searcher/fail.yaml .

Each Custom Searcher should create two experiments: an orchestrator experiment and an agent experiment, which actually runs the prospective trials. The orchestrator schedules a single trial and, when that trial closes, shuts down the agent experiment with or without a failure state, depending on the yaml. Only the agent experiment should be marked as "failed".
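
The attached archive is not reproduced here. The sketch below is only a hypothetical outline of the orchestrator's decision when the single trial closes: the `fail` config key, the one-argument `on_trial_closed` function, and the stand-in `Shutdown` dataclass are all assumptions made for illustration, not the contents of test_shutdown.zip or the real harness API.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Shutdown:
    """Stand-in for the searcher Shutdown operation (assumption, not the real class)."""

    cancel: bool = False
    failure: bool = False


def on_trial_closed(config: Dict[str, Any]) -> List[Shutdown]:
    """Hypothetical hook: shut the agent experiment down once the trial closes."""
    # "fail" is an assumed config key standing in for whatever the yaml sets.
    should_fail = bool(config.get("fail", False))
    return [Shutdown(failure=should_fail)]


no_fail_ops = on_trial_closed({"fail": False})  # analogous to no_fail.yaml
fail_ops = on_trial_closed({"fail": True})      # analogous to fail.yaml
print(no_fail_ops, fail_ops)
```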

Commentary (optional)

Checklist

Changes have been manually QA'd
User-facing API changes need the "User-facing API Change" label.
Release notes should be added as a separate file under docs/release-notes/.
See Release Note for details.
Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

rb-determined-ai · 2023-04-25T18:27:00Z

proto/src/determined/experiment/v1/searcher.proto

+  // Deprecated: But maintained here for backwards compatibility
  int32 placeholder = 1;


It's an internal API, so I vote to just break it. The comment above this line should go too.

stoksc · 2023-04-25

I would agree to remove this.

tayritenour · 2023-04-25

Makes sense, testing the changes to remove this.

rb-determined-ai

This looks fine to me, but I'm not sure I'm the right person to review it.

Maybe get somebody on backend to glance at it real quick?

stoksc

lgtm

tayritenour added the User-facing API Change, python, and go labels Apr 25, 2023

tayritenour requested review from mpkouznetsov, garrett361 and rb-determined-ai April 25, 2023 01:12

tayritenour requested a review from a team as a code owner April 25, 2023 01:12

tayritenour requested review from eecsliu and removed request for a team April 25, 2023 01:12

cla-bot bot added the cla-signed label Apr 25, 2023

tayritenour force-pushed the MLG-468 branch from 3ee1a2e to 9a24973 Compare April 25, 2023 16:58

tayritenour changed the title ~~[MLG-468] Support "cancelled" and "failed" Shutdown operations~~ Support "cancelled" and "failed" Shutdown operations [MLG-468] Apr 25, 2023

tayritenour changed the title ~~Support "cancelled" and "failed" Shutdown operations [MLG-468]~~ feat: Support "cancelled" and "failed" Shutdown operations [MLG-468] Apr 25, 2023

rb-determined-ai reviewed Apr 25, 2023

rb-determined-ai approved these changes Apr 25, 2023

tayritenour requested a review from stoksc April 25, 2023 20:00

stoksc approved these changes Apr 25, 2023

tayritenour force-pushed the MLG-468 branch 3 times, most recently from 5789ace to 7b37726 Compare April 25, 2023 23:22

feat: Support "cancelled" and "failed" Shutdown operations [MLG-468]

b5c953a

tayritenour force-pushed the MLG-468 branch from 7b37726 to b5c953a Compare April 26, 2023 17:33

tayritenour merged commit f0c8ef1 into determined-ai:main Apr 26, 2023

dannysauer added this to the 0.22.0 milestone Feb 6, 2024
