Proposal: Sweepable API #5993
# AutoML.Net Sweepable API proposal

## Overview

The Sweepable API lets ML.NET users create their own search space and pipeline for hyper-parameter optimization (HPO). It has three major parts: [search space](#search-space), [sweepable estimator](#sweepable-estimator), and [tuner](#tuner). For now, all of the API lives under the `Sweepable()` extension.

## search space

A search space defines the range of hyper-parameters for a tuner to search over. The Sweepable API provides two ways to create one.

via attribute
```csharp
public class Option
{
    // log-scaled uniform integer in [2, 32768], initial value 2
    [Range(2, 32768, init: 2, logBase: true)]
    public int WindowSize {get; set;}

    // one of [2, 3, 4]
    [Choice(2, 3, 4)]
    public int SeriesLength {get; set;}

    // one of [true, false]
    [Choice]
    public bool UseSoftmax {get; set;}

    // nested search space
    [Option]
    public Option AnotherOption {get; set;}
}

var ss = new SearchSpace<Option>();

// Each search space exposes an n-dimensional feature space where every feature is in [0, 1).
// The search space handles the mapping between the HPO space and the feature space, so a tuner
// only needs to search the feature space, whose dimension and range are both known.
var parameter = ss.SampleFromFeatureSpace(new double[] { 0, 0, 0, 0, 0, 0 });

// auto-binding
parameter.WindowSize.Should().Be(2);
parameter.SeriesLength.Should().Be(2);
parameter.UseSoftmax.Should().BeTrue();
parameter.AnotherOption.WindowSize.Should().Be(2);

// a search space can also map a parameter back to the feature space
ss.MappingToFeatureSpace(parameter).Should().BeEquivalentTo(0.0, 0.0, 0.0, 0.0, 0.0, 0.0);
```

> **Review comment:** Do we need to have a reflection-based solution to start? Or would the "weakly-typed" API (the from-scratch form shown below) solve most of the scenarios? The reason I ask is: I would not add multiple ways to do something right away. Instead, make the "core" thing first; the simpler ones can be built later, if necessary.
>
> **Review reply:** The reflection API (let's call it "strongly-typed", corresponding with "weakly-typed") is just a handy way of creating a search space, and it's built on top of the "weakly-typed" one. So if we must make a choice at the beginning, I would pick "weakly-typed" to implement/migrate first.
or, correspondingly, from scratch
```csharp
var ss = new SearchSpace();
ss.Add("WindowSize", new UniformIntOption(2, 32768, true, 2));
ss.Add("SeriesLength", new ChoiceOption(2, 3, 4));
ss.Add("UseSoftmax", new ChoiceOption(true, false));
ss.Add("AnotherOption", ss.Clone());

var parameter = ss.SampleFromFeatureSpace(new double[] { 0, 0, 0, 0, 0, 0 });

// auto-binding doesn't exist for the from-scratch API
parameter["WindowSize"].AsType<int>().Should().Be(2);
parameter["SeriesLength"].AsType<int>().Should().Be(2);
parameter["UseSoftmax"].AsType<bool>().Should().BeTrue();
parameter["AnotherOption"]["WindowSize"].AsType<int>().Should().Be(2);

// a search space can also map a parameter back to the feature space
ss.MappingToFeatureSpace(parameter).Should().BeEquivalentTo(0.0, 0.0, 0.0, 0.0, 0.0, 0.0);
```
Currently, in order to make auto-binding work, there is a limitation on the parameter types that can be added to a search space: each must be either a JSON primitive type or a nested search space.
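The log-scaled mapping that `SampleFromFeatureSpace` performs for an option like `[Range(2, 32768, init: 2, logBase: true)]` can be illustrated with a small sketch. This is Python for illustration only; the function name and details are hypothetical, not the actual SearchSpace implementation:

```python
import math

def uniform_int_from_feature(x: float, low: int, high: int, log_base: bool = True) -> int:
    """Map one feature value x in [0, 1] to an integer hyper-parameter in [low, high].

    Hypothetical sketch: with log_base=True the range is swept geometrically,
    so small values get as much resolution per step as large ones.
    """
    if log_base:
        lo, hi = math.log(low), math.log(high)
        return round(math.exp(lo + x * (hi - lo)))
    return round(low + x * (high - low))

print(uniform_int_from_feature(0.0, 2, 32768))  # lower bound: 2
print(uniform_int_from_feature(0.5, 2, 32768))  # geometric midpoint: 256
```

Because the tuner only ever sees the uniform `[0, 1)` feature space, it needs no knowledge of each option's scale or type; the search space owns that mapping in both directions.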
## sweepable estimator

A sweepable estimator lets users combine a search space with an estimator, in a way similar to how ML.NET estimators/pipelines are created. You use `CreateSweepableEstimator`, which accepts a lambda function and a search space, to create a sweepable estimator. There is also an `.Append` extension method, so you can append sweepable estimators just as you append other ML.NET estimators. The example below creates a pipeline with two sweepable estimators for the `titanic` dataset via the `Sweepable()` extension: one for the text featurizer and one for fast tree.
```csharp
var context = new MLContext();
var fastTreeSS = new SearchSpace<FastTreeOption>();
var textFeaturizeSS = new SearchSpace<FeaturizeTextOption>();

var pipeline = context.Transforms.Categorical.OneHotEncoding(new[] { new InputOutputColumnPair(@"Sex", @"Sex"), new InputOutputColumnPair(@"Embarked", @"Embarked") })
    .Append(context.Transforms.Concatenate(@"TextFeature", @"Name", "Ticket", "Cabin"))
    .Append(context.Sweepable().CreateSweepableEstimator(
        (mlContext, option) =>
        {
            var textOption = new TextFeaturizingEstimator.Options
            {
                CaseMode = option.CaseMode,
                KeepDiacritics = option.KeepDiacritics,
                KeepNumbers = option.KeepNumbers,
                KeepPunctuations = option.KeepPunctuations,
                CharFeatureExtractor = new WordBagEstimator.Options()
                {
                    NgramLength = option.WordBagEstimatorOption.NgramLength,
                    UseAllLengths = option.WordBagEstimatorOption.UseAllLengths,
                    Weighting = option.WordBagEstimatorOption.WeightingCriteria,
                },
            };

            return mlContext.Transforms.Text.FeaturizeText("TextFeature", textOption);
        },
        textFeaturizeSS))
    .Append(context.Transforms.Concatenate(@"Features", new[] { @"Sex", @"Embarked", @"Pclass", @"Age", @"SibSp", @"Parch", @"Fare", "TextFeature" }))
    .Append(context.Transforms.Conversion.ConvertType("Survived", "Survived", Data.DataKind.Boolean))
    .Append(context.Sweepable().CreateSweepableEstimator(
        (mlContext, option) => mlContext.BinaryClassification.Trainers.FastForest(labelColumnName: "Survived", featureColumnName: "Features", numberOfLeaves: option.NumberOfLeaves, numberOfTrees: option.NumberOfTrees),
        fastTreeSS))
    .Append(context.BinaryClassification.Calibrators.Naive(labelColumnName: @"Survived", scoreColumnName: @"Score"));
```
After the sweepable pipeline is created, call `BuildTrainingPipeline` to convert it into an ML.NET pipeline.

> **Review comment:** Will the type of the pipeline be the same as a normal ML.NET pipeline? What happens if someone tries to run it without calling that method first?
>
> **Review reply:** No, it won't be the same ML.NET pipeline; the user needs to call `BuildTrainingPipeline` first.
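Conceptually, a sweepable pipeline is a list of stages, some fixed and some carrying a (factory, search space) pair, and `BuildTrainingPipeline` walks that list, calling each sweepable factory with its slice of the chosen parameter. A minimal sketch of that idea follows, in Python with hypothetical names; the real API is the C# `CreateSweepableEstimator`/`BuildTrainingPipeline` shown above:

```python
class SweepablePipeline:
    """Hypothetical sketch: stages are (factory, search_space_key) pairs;
    fixed stages have no key and ignore the sampled parameter."""

    def __init__(self):
        self.stages = []

    def append(self, factory, key=None):
        self.stages.append((factory, key))
        return self

    def build_training_pipeline(self, param):
        # Instantiate each stage: sweepable stages receive their slice of
        # the sampled parameter, fixed stages are built as-is.
        return [f(param.get(k)) if k else f(None) for f, k in self.stages]

pipeline = (SweepablePipeline()
            .append(lambda _: "OneHotEncoding")                     # fixed stage
            .append(lambda p: f"FastForest(leaves={p['leaves']})",  # sweepable stage
                    key="fastTree"))
concrete = pipeline.build_training_pipeline({"fastTree": {"leaves": 4}})
print(concrete)  # ['OneHotEncoding', 'FastForest(leaves=4)']
```

The design choice this illustrates: a sweepable pipeline cannot be fitted directly because its sweepable stages are still factories, not estimators; only after a tuner supplies a concrete parameter can a runnable pipeline be materialized.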
## tuner

A tuner takes a search space and runs an HPO algorithm over it. The Sweepable API ships a few default tuning algorithms (grid search, random search), and more smart HPO algorithms are coming soon.

Using a tuner is quite similar to using an enumerator. The code below shows how a tuner works together with a search space and a sweepable estimator/pipeline.
```csharp
var ss = pipeline.SearchSpace;
var tuner = new GridSearchTuner(ss);
var df = DataFrame.LoadCsv(@"titanic.csv");
var trainTestSplit = context.Data.TrainTestSplit(df, 0.1);
var bestAccuracy = 0.0;
var i = 0;
foreach (var param in tuner.Propose())
{
    Console.WriteLine($"trial {i++}");

    // convert the sweepable pipeline to an ML.NET pipeline
    var trainingPipeline = pipeline.BuildTrainingPipeline(context, param);

    var model = trainingPipeline.Fit(trainTestSplit.TrainSet);
    var eval = model.Transform(trainTestSplit.TestSet);
    var accuracy = context.BinaryClassification.Evaluate(eval, "Survived").Accuracy;
    if (accuracy > bestAccuracy)
    {
        Console.WriteLine("Found best accuracy");
        Console.WriteLine("Current best parameter");
        Console.WriteLine(JsonConvert.SerializeObject(param));
        bestAccuracy = accuracy;

        Console.WriteLine($"Trial {i}: Current Best Accuracy {bestAccuracy}, Current Accuracy {accuracy}");
    }
}
```

> **Review comment:** Just curious, but would there ever be a need to save the current state of the sweeper so that it can be stopped and then resumed at a later date?
>
> **Review reply:** If the sweeper supports that, then yes. In fact, continue-training has already been supported in the Model Builder CFO sweeper; it's just not exposed yet.
>
> **Review comment:** Maybe for the future, but how about "import experiments"? That would help in case someone wants to distribute training to multiple nodes: a) save each experiment's results to a db, b) before starting the next iteration, fetch new results from the db and import them.
You can visit [here](https://github.com/dotnet/machinelearning-tools/blob/main/src/Microsoft.ML.ModelBuilder.AutoMLService.Example/Program.cs) to try out the complete training code.
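To make the "tuner as enumerator" idea concrete, here is a toy grid-search tuner over the `[0, 1)` feature space. This is a Python sketch; the class name mirrors `GridSearchTuner` from the example above, but it is not the actual implementation:

```python
import itertools

class GridSearchTuner:
    """Hypothetical sketch: enumerate a Cartesian grid of feature vectors
    over a d-dimensional feature space where each feature is in [0, 1)."""

    def __init__(self, dim, steps=3):
        self.dim = dim      # dimension of the search space's feature space
        self.steps = steps  # grid resolution per dimension

    def propose(self):
        # Each proposed vector would be handed to
        # SearchSpace.SampleFromFeatureSpace to obtain concrete parameters.
        axis = [i / self.steps for i in range(self.steps)]
        yield from itertools.product(axis, repeat=self.dim)

tuner = GridSearchTuner(dim=2, steps=3)
trials = list(tuner.propose())
print(len(trials))  # 9 candidate feature vectors
print(trials[0])    # (0.0, 0.0)
```

Because `propose` is just an iterator over feature vectors, the training loop above reduces to a plain `foreach`, and swapping in a smarter tuner (random search, FLAML-style) changes only how the next vector is chosen.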
## The difference between the Sweepable API and the existing API in AutoML.Net

The existing API in AutoML.Net performs HPO over a pre-defined search space and set of learners using the SMAC tuning algorithm, while the Sweepable API lets users customize those settings: they can define their own search space, create a pipeline in much the same way as in ML.NET, and pick the tuner that best suits their experiment.
## Q & A

What's the difference between the Sweepable API and AutoML.Net experiments?
- AutoML.Net experiments provide an out-of-box AutoML experience with fixed HPO-related options, while the Sweepable API requires the user to set up those options (search space, pipeline, tuner) manually before training. The Sweepable API is not aimed at replacing the existing AutoML.Net API, but at providing another option for customers whose requirements can't be satisfied by the default AutoML experiment.

Will AutoML.Net benefit from the Sweepable API?
- Yes it will. Along with the Sweepable API, we can also leverage the [flaml](https://github.com/microsoft/FLAML) technique in the tuning algorithm, which helps improve tuning speed and performance.

Why do we need the Sweepable API, who will be the beneficiaries, and what's the most common use case?
- The Sweepable API is prepared for HPO, a feature the community has requested ever since ML.NET was released ([#613](https://github.com/dotnet/machinelearning/issues/613), [#2260](https://github.com/dotnet/machinelearning/issues/2260), [#5930](https://github.com/dotnet/machinelearning/issues/5930), [#1875](https://github.com/dotnet/machinelearning-modelbuilder/issues/1875), etc.). The major beneficiaries will be those who'd like to run AutoML with a different search space or with trainers other than the fixed ones provided by AutoML.Net. The most common use case, as I imagine it, is in a notebook along with the DataFrame API.

What's the timeline for integrating the Sweepable API into ML.NET?
- It depends. The Sweepable API has three major parts, all available in the Model Builder repo, with a thin layer on top that makes the three parts work with ML.NET. If the plan is to move only that thin layer into AutoML.Net first and open-source the rest later, it will take 2-3 weeks. If we want to move all parts into ML.NET together, it will cost much more time.

## Current feedback
- The lambda function in `CreateSweepableEstimator` should not accept `MLContext` as the first parameter.
- Should provide a higher-level API for training, similar to the Experiment API in AutoML.Net.
  > **Review comment:** Should that higher-level API be part of this, though? Wouldn't it belong more on the AutoML side and less on the sweeper side, since the sweeper is "lower level" and expected to be more hands-on?
  >
  > **Review reply:** Yup, the higher-level API will go along with the existing AutoML Experiment API.
- (From @torronen) Provide ways to access intermediate results (accuracy/parameter/pipeline) to help better understand the training process.
## Discussion

@ericstj @eerhardt @michaelgsharp @justinormont

Hey All! This proposal is to bring over the lower layers of our tooling's AutoML. All of the techniques that we worked with MSR to use in tooling (NNI, FLAML) are built on top of this.

@LittleLittleCloud is proposing that we bring over these lower layers first, and then either bring over the actual NNI+FLAML-based AutoML next or upgrade AutoML.NET to the NNI+FLAML-based approach. I think it's possible we could keep, or nearly keep, the same AutoML.NET API.

I'm open to keeping these layers internal if we need to, but sweeping is a common thing to do in the ML world. We've had several community members come up with their own ways to sweep over models and parameters; this would just expose our method of doing it. The end goal is to have only one ML.NET AutoML.

---

@jwood803 (I saw your like) we'd love to hear your feedback as well.

---

Hey, @JakeRadMSFT. I definitely like having this API! It helps make ML.NET more on par with scikit-learn, which has grid search. Plus, I believe it will help model creators get better models.

---

My suggestion would be to start with them internal, and ensure they can meet the existing AutoML scenarios. Then make them public as we go, when we have data that supports their needing to be public.

💯

---

Sounds good. If we want to start internal first, a smooth start without breaking changes would be the search space, either through a package reference or source code.

---

Sounds like a fine plan, including keeping this internal for a bit.

If it fits within your methods, I'd recommend having the default sweeping space { range, scaling, data type } be part of the component being swept (as with current params). For AutoML․NET, due to non-great reasons, we kept a duplicate copy. Traveling with the component is cleaner, since the ranges show up (and disappear) with the availability of the component. For instance, as soon as the user includes the NuGet for FastTree, the ranges are immediately available. In addition, the ranges can be created inline with a new component.

Some historical perspective: there is also a command-line sweeper in ML․NET, which may still be working (anything untested is assumed non-functional). It lets users sweep an individual component (or sets of components), including fully switching out portions of the pipeline. The MAML command-line sweeper was very powerful, but hard to use. Mentioned previously: #5019 (comment)

---

If possible, can we use System.Text.Json?

I agree that this is a good strategy. It's better to keep the defaults with the component/trainer.

---

Currently no, because the auto-binding experience depends heavily on `JToken`. However, there's a plan to remove this dependency, so maybe in the future.