Proposal: Sweepable API #5993


Merged · 2 commits merged into dotnet:main on Oct 12, 2022

Conversation

@LittleLittleCloud (Contributor) commented on Nov 2, 2021:

This is the initial design/proposal doc for a Sweepable API.

I would appreciate it if you could each review it and give me your feedback.

@LittleLittleCloud added the AutoML.NET label on Nov 2, 2021
codecov bot commented on Nov 2, 2021:

Codecov Report

❗ No coverage uploaded for pull request base (main@b31b7ca).
The diff coverage is n/a.

```
@@           Coverage Diff           @@
##             main    #5993   +/-   ##
=======================================
  Coverage        ?   68.54%
=======================================
  Files           ?     1146
  Lines           ?   245450
  Branches        ?    25637
=======================================
  Hits            ?   168240
  Misses          ?    70502
  Partials        ?     6708
```

| Flag | Coverage Δ |
| --- | --- |
| Debug | 68.54% <0.00%> (?) |
| production | 63.32% <0.00%> (?) |
| test | 88.67% <0.00%> (?) |

Flags with carried forward coverage won't be shown.

# AutoML.Net Sweepable API proposal
@JakeRadMSFT (Contributor) commented on Nov 3, 2021:

@ericstj @eerhardt @michaelgsharp @justinormont

Hey All!
This proposal is to bring over the lower layers of our tooling's AutoML. All of the techniques that we worked with MSR to use in tooling (NNI, FLAML) are built on top of this.

@LittleLittleCloud is proposing that we bring over these lower layers first and then bring over the actual NNI+FLAML based AutoML next, or upgrade AutoML.NET to be the NNI+FLAML based approach. I think it's possible we could keep, or nearly keep, the same AutoML.NET API.

I'm open to keeping these layers internal if we need to... but sweeping is a common thing to do in the ML world. We've had several community members come up with their own ways to sweep over models and parameters. This would just expose our method of doing it.

The end goal is to only have one ML.NET AutoML.

@JakeRadMSFT (Contributor) commented on Nov 3, 2021:

@jwood803 (I saw your like) we'd love to hear your feedback as well.

A contributor commented:

Hey, @JakeRadMSFT. I definitely like having this API! It helps make ML.NET more on par with scikit-learn, which has grid search. Plus, I believe this will help model creators get better models.

A member commented:

> I'm open to keeping these layers internal if we need to

My suggestion would be to start with them internal and ensure they can meet the existing AutoML scenarios. Then make them public as we go, once we have data showing they need to be public.

> The end goal is to only have one ML.NET AutoML.

💯

@LittleLittleCloud (Author) commented:

Sounds good. If we want to start internal first, a smooth starting point without breaking changes would be the search space, brought in either through a package reference or as source code. The reasons:

- The search space has the fewest dependencies: it only depends on Newtonsoft.Json, which makes it easy to introduce.
- After introducing it, we can use it to create search spaces for the existing trainers in AutoML.Net, which also gives us a chance to improve the performance of AutoML.Net experiments by using larger, better search spaces in HPO.

A contributor commented:

Sounds like a fine plan, including keeping it internal for a bit.

If it fits within your methods, I'd recommend having the default sweeping space { range, scaling, data type } travel with the component being swept (as the current params do). For AutoML.NET, for not-great reasons, we kept a duplicate copy. Traveling with the component is cleaner, since the ranges show up (and disappear) with the availability of the component. For instance, as soon as the user includes the NuGet for FastTree, the ranges are immediately available. In addition, the ranges can be created inline with the new component.

Some historical perspective: there is also a command-line sweeper in ML.NET, which may be working (anything untested is assumed non-functional). It lets users sweep an individual component (or sets of components), including fully switching out portions of the pipeline. The MAML command-line sweeper was very powerful, but hard to use. Mentioned previously: #5019 (comment)
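
A minimal sketch of what "traveling with the component" could look like; the attribute shape follows the `Range` attribute shown later in this proposal, while `FastTreeOption` and `SearchSpace<T>` are illustrative stand-ins, not the actual API:

```csharp
// Illustrative sketch only: the default sweeping space is declared on the
// trainer's own option class, so the ranges ship (and disappear) with the
// component's NuGet instead of living in a duplicate copy inside AutoML.NET.
public class FastTreeOption
{
    [Range(10, 1000, init: 100, logBase: true)]   // range + scaling + data type
    public int NumberOfTrees { get; set; }

    [Range(2, 128, init: 20, logBase: true)]
    public int NumberOfLeaves { get; set; }
}

// A default search space could then be derived from the annotated option class:
// var ss = new SearchSpace<FastTreeOption>();
```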

@eerhardt (Member) commented on Nov 10, 2021:

> It only depends on Newtonsoft.Json

If possible, can we use System.Text.Json?

> I'd recommend having the default sweeping space { range, scaling, data type } travel with the component being swept

I agree that this is a good strategy. It's better to keep the defaults with the component/trainer.

@LittleLittleCloud (Author) commented:

> If possible, can we use System.Text.Json?

Currently no, because the auto-binding experience depends heavily on JToken. However, there's a plan to remove this dependency, so maybe in the future.
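
For context, a hedged sketch of the kind of JToken-based auto-binding described above; the property names and the exact binding code are illustrative, not the actual internals:

```csharp
using Newtonsoft.Json.Linq;

// Illustrative sketch: a sampled parameter arrives as a JSON-like tree and is
// bound onto a strongly-typed option object via JToken, which is why the
// search space currently depends on Newtonsoft.Json.
var parameter = JToken.Parse(@"{ ""WindowSize"": 64, ""UseSoftmax"": true }");
var option = parameter.ToObject<Option>();
```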

```csharp
public class Option
{
    [Range(2, 32768, init: 2, logBase: true)]
```
A member commented:

Do we need to have a reflection-based solution to start? Or would the "weakly-typed" API solve most of the scenarios?

```csharp
var ss = new SearchSpace();
ss.Add("WindowSize", new UniformIntOption(2, 32768, true, 2));
ss.Add("SeriesLength", new ChoiceOption(2, 3, 4));
ss.Add("UseSoftmax", new ChoiceOption(true, false));
ss.Add("AnotherOption", ss.Clone());
```

The same member followed up:

The reason I ask is: I would not add multiple ways to do something right away. Instead, make the "core" thing first, and the simpler ones can be built later, if necessary.

@LittleLittleCloud (Author) commented:

The reflection API (let's call it "strongly-typed", corresponding to "weakly-typed") is just a handy way of creating a search space, and it's built on top of the "weakly-typed" one. So if we must make a choice at the beginning, I would pick "weakly-typed" to implement/migrate first.
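
To illustrate the layering, a sketch under assumed shapes: a `RangeAttribute` exposing the values shown in the proposal's `Range` usage, plus the weakly-typed `SearchSpace`/`UniformIntOption` from above. This is not the actual implementation:

```csharp
using System;
using System.Reflection;

// Illustrative sketch: the strongly-typed API can be pure sugar over the
// weakly-typed one, reflecting over the option class's attributes and emitting
// the same Add(...) calls a user would otherwise write by hand.
static SearchSpace FromOptionType(Type optionType)
{
    var ss = new SearchSpace();
    foreach (var prop in optionType.GetProperties())
    {
        var range = prop.GetCustomAttribute<RangeAttribute>();   // assumed attribute type
        if (range != null)
            ss.Add(prop.Name, new UniformIntOption(range.Min, range.Max, range.LogBase, range.Init));
    }
    return ss;
}
```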

```csharp
    .Append(context.BinaryClassification.Calibrators.Naive(labelColumnName: @"Survived", scoreColumnName: @"Score"));
```

After the sweepable pipeline is created, one can call `BuildTrainingPipeline` to convert it to an ML.NET pipeline.
A contributor commented:

Will the type of the pipeline be the same as a normal ML.NET pipeline? What happens if someone tries to run it without calling that method first?

@LittleLittleCloud (Author) commented:

No, it won't be the same ML.NET pipeline. The user needs to call BuildTrainingPipeline with a passed-in parameter to convert it to an ML.NET pipeline, and only after that can they use Fit/Transform to train their model.

I feel like BuildTrainingPipeline might be too detailed an API to expose. It would be better if it were wrapped so that only a Run method is exposed. We can discuss that later, though.

```csharp
Console.WriteLine($"trial {i++}");

// convert the sweepable pipeline to an ML.NET pipeline
var trainingPipeline = pipeline.BuildTrainingPipeline(context, param);
```
A contributor commented:

Just curious, but would there ever be a need to save the current state of the sweeper so that it can be stopped and then resumed at a later date?

@LittleLittleCloud (Author) commented on Nov 22, 2021:

If the sweeper supports that, then yes. In fact, continued training is already supported in the Model Builder CFO sweeper; it's just not exposed yet.

@torronen (Contributor) commented:

Maybe for the future, but how about "import experiments"? That would help in case someone wants to distribute training to multiple nodes: a) save each experiment's results to a db, b) before starting the next iteration, fetch new results from the db and import them.

@LittleLittleCloud (Author) commented:

@torronen
That's a wonderful suggestion, thanks!


## Current feedback
- The lambda function in `CreateSweepableEstimator` should not accept `MLContext` as its first parameter.
- Should provide a higher-level API for training, similar to the Experiment API in AutoML.Net.
A contributor commented:

Should that higher-level API be part of this, though? Wouldn't it belong more on the AutoML side and less on the sweeper side, since the sweeper is "lower level" and is expected to be more hands-on?

@LittleLittleCloud (Author) commented:

Yup, the higher-level API will live alongside the existing AutoML Experiment API.

@torronen (Contributor) commented on Dec 9, 2021:

From a user perspective this looks great and promising! As I understand it, pipeline.SearchSpace will automatically keep track of the experiments, their results, and their parameters. So, basically, we could start each new tuning experiment by first running the best parameters from the last experiment, then looping with tuner.Propose. Is that correct?

If possible, I'd also like to extract the results from pipeline.SearchSpace to get a better understanding of how the tuning affects performance. For example, I might want to plot a graph of how the number of trees affects FastForest, to help me find the optimal number considering training time vs. accuracy.

@LittleLittleCloud (Author) commented:

> From a user perspective this looks great and promising! As I understand it, pipeline.SearchSpace will automatically keep track of the experiments, their results, and their parameters. So, basically, we could start each new tuning experiment by first running the best parameters from the last experiment, then looping with tuner.Propose. Is that correct?

@torronen Quite close, but not exactly. Pipeline.SearchSpace only defines the boundaries of the parameters, and it's stateless. If you want to track parameters and their results under the current API, you need to do it yourself (save each parameter and its result in a dictionary or something similar).
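
For example, a minimal sketch of doing that tracking yourself; `TrainAndEvaluate` is a hypothetical stand-in for your own train/evaluate code:

```csharp
using System.Collections.Generic;
using System.Linq;

// Keep each proposed parameter together with its metric, then seed the next
// experiment with the best one found so far.
var history = new List<(Parameter Parameter, double Metric)>();

foreach (var parameter in tuner.Propose())
{
    var metric = TrainAndEvaluate(parameter);   // hypothetical: your own training loop
    history.Add((parameter, metric));
    tuner.Update(parameter, metric);
}

var best = history.OrderByDescending(t => t.Metric).First();
```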

> If possible, I'd also like to extract the results from pipeline.SearchSpace to get a better understanding of how the tuning affects performance. For example, I might want to plot a graph of how the number of trees affects FastForest, to help me find the optimal number considering training time vs. accuracy.

That's great advice. It's also something we want to dive into deeply: how to help users understand the training process, e.g. the trend line of the metric, the importance of each parameter, and so on.

@torronen (Contributor) commented on Dec 10, 2021:

@LittleLittleCloud ok, I see. I assume the Tuner needs to keep the results.

One thing in general: the sweeping process runs through a loop, so it's very flexible. You can customize how training is performed however you want: hold-out strategy, metric, or even training on multiple threads/in the cloud...

```csharp
// Pseudocode
// Input: Tuner, SearchSpace, SweepablePipeline, trainDataset, testDataset
foreach (var parameter in tuner.Propose())
{
    var mlnetPipeline = sweepablePipeline.BuildTrainingPipeline(parameter);

    // There's really no limit on how you train your model; this example uses a
    // hold-out strategy.
    var model = mlnetPipeline.Fit(trainDataset);
    var eval = model.Transform(testDataset);

    // There's also no limit on how you calculate your metric: use a built-in
    // optimization metric or define your own.
    var metric = CalculateYourMetric(eval, truth);

    // Some tuners need to see the metric so they can propose the best possible
    // parameter next time.
    tuner.Update(parameter, metric);
}
```

> Not sure if these are valid for this plan, but my next questions might be:
>
> How to change the optimization metric?

The optimization metric is a number calculated from the predicted and truth values, so as long as you have access to the predicted and truth values (which are available), you can calculate whatever metric you want.

> How to create a custom optimization metric? See #5999. Maybe derived from other metrics (maybe as an anonymous method?), or even created as a mini-simulation. Based on my initial experiments with simulating real-life events, simulation results might be more sensitive to tuning than traditional metrics. For example, we may want to catch some of the different types of malicious web requests to catch the bad actors. The same ML metrics may come from detecting 100% of the requests for a single type, but in that case we might only catch one bad actor. In a simulation we might want to see how early intruders are detected: it's not enough to detect an intruder when they log out.

If ML.NET's built-in optimization metrics can't meet your requirements, you can define your own metric and calculate it yourself, as long as you have the predicted and truth values. The ML.NET DataFrame API might be helpful in this case.
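
As a hedged sketch, here is a custom metric computed directly from predictions and truth; `GetColumn<T>` is the standard ML.NET `IDataView` extension, `eval` is the transformed test set from the loop above, and the column names are illustrative for a binary task:

```csharp
using System.Linq;
using Microsoft.ML;

// Pull predictions and ground truth out of the evaluation IDataView, then
// compute any metric you like; recall is shown here as an example.
var predicted = eval.GetColumn<bool>("PredictedLabel").ToArray();
var truth = eval.GetColumn<bool>("Label").ToArray();

double tp = predicted.Zip(truth, (p, t) => p && t).Count(x => x);   // true positives
double fn = predicted.Zip(truth, (p, t) => !p && t).Count(x => x);  // false negatives
var metric = tp / (tp + fn);   // recall
```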

> How to create a new Tuner? Not sure if necessary; it can also be done without being this type of Tuner. This could also be an area where it's made easy for the community to contribute their implementations as extension NuGets, or even as part of Microsoft.ML if the dev team thinks it is feasible.

You can implement the ITuner interface to create your own tuner.
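
A hedged sketch of what that could look like; the `ITuner` shape here simply mirrors the `Propose`/`Update` pseudocode above, and `FeatureSpaceDim`/`SampleFromFeatureSpace` are assumed search-space helpers, so treat this as illustrative rather than the final interface:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative random-search tuner: each proposal samples a point uniformly
// from the unit hypercube and lets the search space map it onto concrete
// parameter values.
public class RandomSearchTuner : ITuner
{
    private readonly SearchSpace _searchSpace;
    private readonly Random _rnd = new Random(0);

    public RandomSearchTuner(SearchSpace searchSpace) => _searchSpace = searchSpace;

    public IEnumerable<Parameter> Propose()
    {
        while (true)
        {
            var x = Enumerable.Range(0, _searchSpace.FeatureSpaceDim)
                              .Select(_ => _rnd.NextDouble())
                              .ToArray();
            yield return _searchSpace.SampleFromFeatureSpace(x);
        }
    }

    // Random search ignores feedback; smarter tuners use the reported metric
    // to steer the next proposal.
    public void Update(Parameter parameter, double metric) { }
}
```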

With the higher-level API it won't be that easy to customize the entire training process; a fluent style will be preferred there:

```csharp
var experiment = new AutoMLExperiment()
                     .AddPipeline(sweepablePipeline)
                     .AddTuner(tuner)
                     .AddMetric(async (predict, actual) => { /* return metric */ })
                     .Train();
```

But you'll still have access to all the low-level APIs, so there will be a way to customize the training.

@LittleLittleCloud (Author) commented:

@torronen replies are inline above.

@torronen (Contributor) commented on Feb 4, 2022:

@LittleLittleCloud BTW, does this proposal and the search space API allow optimizing the booster in LightGBM? Or should users run the tuning for DART, Gradient, and GOSS separately if they expect it to matter? Asking because it is an object, not a value type, and the object has parameters inside it.

Or maybe the user can write their own if...else or switch, depending on what the tuner wants to run next?

@LittleLittleCloud (Author) commented:

> @LittleLittleCloud BTW, does this proposal and the search space API allow optimizing the booster in LightGBM? Or should users run the tuning for DART, Gradient, and GOSS separately if they expect it to matter? Asking because it is an object, not a value type, and the object has parameters inside it.
>
> Or maybe the user can write their own if...else or switch, depending on what the tuner wants to run next?

Yes, you can. It really doesn't matter whether the property you want to sweep over is a value type or not; you can always pass a factory method that creates the trainer based on the parameter swept from the search space.
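
A hedged sketch of what that could look like, combining the `ChoiceOption`/`UniformIntOption` shapes from this thread with ML.NET's LightGBM option types; the `Auto()` entry point, `CreateSweepableEstimator`, and the `param[...].AsType<T>()` accessors are assumptions following the proposal, not settled API:

```csharp
// Illustrative sketch: put the booster choice in the search space and let a
// factory method construct the trainer, mapping the swept string onto the
// (non-value-type) booster options object.
var ss = new SearchSpace();
ss.Add("Booster", new ChoiceOption("gbdt", "dart", "goss"));
ss.Add("NumberOfLeaves", new UniformIntOption(4, 1024, logBase: true));

var estimator = context.Auto().CreateSweepableEstimator((ctx, param) =>
{
    var option = new LightGbmBinaryTrainer.Options
    {
        NumberOfLeaves = param["NumberOfLeaves"].AsType<int>(),
        Booster = param["Booster"].AsType<string>() switch
        {
            "dart" => new DartBooster.Options(),
            "goss" => new GossBooster.Options(),
            _ => new GradientBooster.Options(),
        },
    };
    return ctx.BinaryClassification.Trainers.LightGbm(option);
}, ss);
```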

@michaelgsharp (Contributor) commented:

@LittleLittleCloud Merging this in as the sweepable estimator itself is already checked in.

@michaelgsharp merged commit 20692fe into dotnet:main on Oct 12, 2022
@ghost locked as resolved and limited conversation to collaborators on Nov 12, 2022
Labels: AutoML.NET (Automating various steps of the machine learning process)