feat: Support "cancelled" and "failed" Shutdown operations [MLG-468] #6627
Conversation
// Deprecated: But maintained here for backwards compatibility
int32 placeholder = 1;
It's an internal API, so I vote to just break it. Also the comment above this line.
I would agree to remove this.
Makes sense, testing the changes to remove this
This looks fine to me, but I'm not sure I'm the right person to review it.
Maybe get somebody on backend to glance at it real quick?
lgtm
5789ace to 7b37726
Description
MLG-468
Supports sending Shutdown operations that are "cancelled" or "failed". This allows us, on the Python harness side, to send operations like searcher.Shutdown(failure=True) inside of a Custom Searcher SearchMethod to signal to the master that an experiment should be considered failed.
Before, we assumed that all failures of individual trials in Custom Searches were equally valid. However, in use cases like the DeepSpeed Autotuning project, there are special trials that must not fail; if one does, the whole job should be stopped and considered failed. This change lets the user control that.
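As a rough illustration of the intended usage, here is a minimal, self-contained sketch of a search method that fails the whole experiment only when one designated trial fails. The Shutdown dataclass below is a local stand-in, not the harness class; only the failure=True keyword comes from this PR's description. The class name CriticalTrialSearch, the simplified callback signatures, and the request id strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Shutdown:
    """Local stand-in for searcher.Shutdown (assumption: not the real class).

    Only the `failure` keyword is taken from the PR description.
    """
    failure: bool = False


class CriticalTrialSearch:
    """Hypothetical search method with one trial that must not fail."""

    def __init__(self, critical_request_id: str) -> None:
        self.critical_request_id = critical_request_id

    def on_trial_exited_early(self, request_id: str) -> List[Shutdown]:
        # A failure of the critical trial should fail the whole experiment.
        if request_id == self.critical_request_id:
            return [Shutdown(failure=True)]
        # Any other trial failure is tolerated, so no operation is sent.
        return []

    def on_trial_closed(self, request_id: str) -> List[Shutdown]:
        # In this single-trial sketch, a normal close ends the search
        # without marking the experiment as failed.
        return [Shutdown()]
```

In a real Custom Searcher the callbacks take additional arguments and return other operation types as well; the sketch only shows where the failure flag would be decided.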
Test Plan
Here is a minimal example of a Custom Searcher that can simulate the new functionality:
test_shutdown.zip
To run, try both:
det experiment create dummy_searcher/no_fail.yaml .
and
det experiment create dummy_searcher/fail.yaml .
Each Custom Searcher should create two experiments: an orchestrator experiment and an agent experiment, which actually runs the prospective trials. The orchestrator will schedule a single trial and, on close, choose whether to Shutdown the agent experiment with a failure state, depending on the yaml. Only the agent experiment should be marked as "failed".
Commentary (optional)
Checklist
docs/release-notes/. See Release Note for details.
Ticket