Proposal: Sweepable API #5993

Conversation
Codecov Report

|          | main | #5993   |
|----------|------|---------|
| Coverage | ?    | 68.54%  |
| Files    | ?    | 1146    |
| Lines    | ?    | 245450  |
| Branches | ?    | 25637   |
| Hits     | ?    | 168240  |
| Misses   | ?    | 70502   |
| Partials | ?    | 6708    |
> # AutoML.Net Sweepable API proposal
@ericstj @eerhardt @michaelgsharp @justinormont
Hey all!
This proposal is to bring over the lower layers of our tooling's AutoML. All of the techniques that we worked with MSR to use in tooling (NNI, FLAML) are built on top of this.
@LittleLittleCloud is proposing that we bring over these lower layers first, and then either bring over the actual NNI+FLAML-based AutoML next or upgrade AutoML.NET to the NNI+FLAML-based approach. I think it's possible we could keep, or nearly keep, the same AutoML.NET API.
I'm open to keeping these layers internal if we need to, but sweeping is a common thing to do in the ML world. Several community members have come up with their own ways to sweep over models and parameters. This would just expose our method of doing it.
The end goal is to have only one ML.NET AutoML.
@jwood803 (I saw your like) we'd love to hear your feedback as well.
Hey, @JakeRadMSFT. I definitely like having this API! It helps make ML.NET more on par with scikit-learn, which has grid search. Plus, I believe this will help model creators get better models.
> I'm open to keeping these layers internal if we need to

My suggestion would be to start with them internal, and ensure they can meet the existing AutoML scenarios. Then make them public over time, once we have data showing they need to be public.
> The end goal is to only have one ML.NET AutoML.

💯
Sounds good. If we want to start internal first, a smooth start without breaking changes would be the search space, introduced either through a package reference or source code. The reasons are:
- The search space has the fewest dependencies. It only depends on Newtonsoft.Json, which makes it easy to introduce.
- After introducing it, we can use it to create search spaces for the existing trainers in AutoML.Net, which also gives us a chance to improve the performance of AutoML.Net experiments by using larger, better search spaces in HPO.
Sounds like a fine plan, including keeping it internal for a bit.
If it fits within your methods, I'd recommend having the default sweeping space { range, scaling, data type } travel as part of the component being swept (as its current params do). For AutoML.NET, due to non-great reasons, we kept a duplicate copy. Traveling with the component is cleaner, since the ranges show up (and disappear) with the availability of the component. For instance, as soon as the user includes the NuGet for FastTree, the ranges are immediately available. In addition, the ranges can be created inline with the new component.
Some historical perspective: there is also a command-line sweeper in ML.NET, which may still be working (anything untested is assumed non-functional). It lets users sweep an individual component (or sets of components), including fully switching out portions of the pipeline. The MAML command-line sweeper was very powerful, but hard to use. Mentioned previously: #5019 (comment)
> It only depends on newtonsoft
If possible, can we use System.Text.Json?
> I'd recommend having the default sweeping space { range, scaling, data type } as part of the component being swept (as current params)
I agree that this is a good strategy. It's better to keep the defaults with the component/trainer.
> If possible, can we use System.Text.Json?

Currently no, because the auto-binding experience depends heavily on JToken. However, there's a plan to remove this dependency, so maybe in the future.
```csharp
public class Option
{
    [Range(2, 32768, init: 2, logBase: true)]
```
Do we need to have a reflection-based solution to start? Or would the "weakly-typed" API solve most of the scenarios:

```csharp
var ss = new SearchSpace();
ss.Add("WindowSize", new UniformIntOption(2, 32768, true, 2));
ss.Add("SeriesLength", new ChoiceOption(2, 3, 4));
ss.Add("UseSoftmax", new ChoiceOption(true, false));
ss.Add("AnotherOption", ss.Clone());
```
The reason I ask is: I would not add multiple ways to do something right away. Instead, make the "core" thing first, and the simpler ones can be built later, if necessary.
The reflection API (let's call it "strongly-typed", as the counterpart of "weakly-typed") is just a handy way to create a search space, and it's built on top of the "weakly-typed" API. So if we must make a choice at the beginning, I would pick "weakly-typed" to implement/migrate first.
```csharp
    .Append(context.BinaryClassification.Calibrators.Naive(labelColumnName: @"Survived", scoreColumnName: @"Score"));
```

After the sweepable pipeline is created, one can call `BuildTrainingPipeline` to convert it to an ML.NET pipeline.
Will the type of the pipeline be the same as a normal ML.NET pipeline? What happens if someone tries to run it without calling that method first?
No, it won't be the same as an ML.NET pipeline. The user needs to call `BuildTrainingPipeline` with the chosen parameters to convert it to an ML.NET pipeline, and only after that can they use `Fit`/`Transform` to train their model.
I feel that `BuildTrainingPipeline` might be too detailed an API to expose. It would be better if it were wrapped so that only a `Run` method is exposed. We can discuss that later, though.
```csharp
Console.WriteLine($"trial {i++}");

// convert sweepable pipeline to ml.net pipeline
var trainingPipeline = pipeline.BuildTrainingPipeline(context, param);
```
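Filled out, the trial loop quoted above might look roughly like the sketch below. This is only a sketch based on names used in this thread (`Propose`, `BuildTrainingPipeline`, `Fit`); the `RandomSearchTuner` type and the `tuner.Update` feedback call are assumptions, not the final API.

```csharp
// Hypothetical sweep loop over a sweepable pipeline (names partly assumed).
var context = new MLContext();
var tuner = new RandomSearchTuner(pipeline.SearchSpace); // assumed tuner type

for (int i = 0; i < 100; i++)
{
    Console.WriteLine($"trial {i}");

    // ask the tuner for the next parameter set from the search space
    var param = tuner.Propose();

    // convert the sweepable pipeline to an ML.NET pipeline and train it
    var trainingPipeline = pipeline.BuildTrainingPipeline(context, param);
    var model = trainingPipeline.Fit(trainData);

    // evaluate, then report the metric back to the tuner (assumed API)
    var metrics = context.BinaryClassification.Evaluate(model.Transform(testData));
    tuner.Update(param, metrics.Accuracy);
}
```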
Just curious, but would there ever be a need to save the current state of the sweeper so that it can be stopped and then resumed at a later date?
If the sweeper supports that, then yes. In fact, continued training is already supported in the Model Builder CFO sweeper; it's just not exposed yet.
Maybe for the future, but how about "import experiments"? That would help if someone wants to distribute training to multiple nodes: (a) save each experiment's results to a db, (b) before starting the next iteration, fetch new results from the db and import them.
@torronen That's a wonderful suggestion, thanks!
> ## Current feedback
>
> - The lambda function in `CreateSweepableEstimator` should not accept `MLContext` as its first parameter.
> - Should provide a higher-level API for training, similar to the Experiment API in AutoML.Net.
Should that higher-level API be part of this, though? Wouldn't it belong more on the AutoML side and less on the sweeper side, since the sweeper is "lower level" and is expected to be more hands-on?
Yup, the higher-level API will live alongside the existing AutoML Experiment API.
From a user perspective this looks great and promising! As I understand it, pipeline.SearchSpace will automatically keep track of the experiments, their results, and their parameters. So, basically, we could start each new tuning experiment by first running the best parameters from the last experiment, then looping with tuner.Propose. Is that correct? If possible, I'd also like to extract the results from pipeline.SearchSpace to better understand how the tuning affects performance. For example, I might want to plot a graph of how the number of trees affects FastForest, to help me find the optimal number considering training time vs. accuracy.
@torronen Quite close, but not exactly accurate. Pipeline.SearchSpace only defines the boundary of the parameters, and it's stateless. If you want to track parameters and their results under the current API, you need to do it yourself (save each parameter set and its result in a dictionary or something similar).
That's great advice. It's also something we want to dive into heavily: how to help users understand the training process, like the trend line of a metric, the importance of each parameter, etc.
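Since the search space is stateless, tracking trials yourself could look something like the minimal sketch below. The `Parameter` type name and the tuner API are assumptions; only `BuildTrainingPipeline` and `Propose` come from this thread.

```csharp
// Keep your own history of (parameter set, metric), since the
// search space only defines boundaries and stores no results.
var history = new List<(Parameter Param, double Metric)>();

for (int i = 0; i < 50; i++)
{
    var param = tuner.Propose();                               // assumed tuner API
    var model = pipeline.BuildTrainingPipeline(context, param) // from this thread
                        .Fit(trainData);
    var metric = context.BinaryClassification
                        .Evaluate(model.Transform(testData)).Accuracy;
    history.Add((param, metric));
}

// e.g. inspect the best trial, or dump the history to plot trees vs. accuracy
var best = history.OrderByDescending(t => t.Metric).First();
```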
@LittleLittleCloud ok, I see. I assume the Tuner needs to keep the results.

In general, the sweeping process runs through a loop, so it's very flexible in how you perform training, and you can customize it however you want: hold-out strategy, metric, or even training on multiple threads/on the cloud...
Not sure if these are valid for this plan, but my next questions might be:
The optimization metric is a number calculated from the predicted and truth values, so as long as you have access to both (which you do), you can calculate whatever metric you want. If ML.NET's built-in optimization metrics can't meet your needs, you can define your own metric and calculate it yourself.
At a higher level, the API won't be that easy to customize across the entire training process; a fluent style would be preferred there.
But you'll still have access to all the low-level APIs, so there will be a way to customize the training.
@torronen reply inline
@LittleLittleCloud BTW, does this proposal and search space API allow optimizing the booster in LightGBM? Or should users run the tuning for DART, Gradient, and GOSS separately if they expect it to matter? Asking because it is an object, not a value type, and the object has parameters inside it. Or maybe the user can write their own if...else or switch depending on what the tuner wants to run next?
Yes, you can. It really doesn't matter whether the property you want to sweep over is a value type or not; you can always pass a factory method that creates the trainer based on the parameters swept from the search space.
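A hedged sketch of that factory approach for the LightGBM booster. The `SearchSpace` option types follow the examples earlier in this thread, but the `CreateSweepableEstimator` signature, the `context.AutoML()` entry point, and the `param[...]` accessors are assumptions about the eventual API rather than its final shape.

```csharp
// Put the booster choice into the search space as a plain string,
// then build the matching booster Options object inside the factory.
var ss = new SearchSpace();
ss.Add("booster", new ChoiceOption("gbdt", "dart", "goss"));
ss.Add("numberOfLeaves", new UniformIntOption(4, 1024, true, 4));

var lgbm = context.AutoML().CreateSweepableEstimator((ctx, param) =>
{
    var options = new LightGbmBinaryTrainer.Options
    {
        NumberOfLeaves = param["numberOfLeaves"].AsType<int>(),
        // map the swept string onto the booster object ML.NET expects
        Booster = param["booster"].AsType<string>() switch
        {
            "dart" => (BoosterParameterBase.OptionsBase)new DartBooster.Options(),
            "goss" => new GossBooster.Options(),
            _ => new GradientBooster.Options(),
        },
    };
    return ctx.BinaryClassification.Trainers.LightGbm(options);
}, ss);
```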
@LittleLittleCloud Merging this in, as the sweepable estimator itself is already checked in.
This is the initial design/proposal doc for a Sweepable API.
I would appreciate it if you could each review it and give me your feedback.