Proposal: Experiment API #6118

Merged

New file: `docs/specs/AutoML Experiment API Proposal.md` (+85 lines)
## Overview
The Experiment API is a set of APIs built to work with `SweepablePipeline`. Its aim is to make the interaction among `Tuner`, `Search Space` and `Sweepable Pipeline` transparent to external customers.
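
For reference, here is a rough sketch of the shapes these pieces are assumed to have in the snippets below. The member names mirror the examples in this document; they are a reading aid, not the actual ML.NET surface.

```csharp
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.SearchSpace; // assumed home of SearchSpace/Parameter in this sketch

// Hypothetical shapes, inferred from the usage in this proposal.
public interface ITuner
{
    void SetSearchSpace(SearchSpace searchSpace);    // tell the tuner what to search over
    IEnumerable<Parameter> Proposal();               // stream of proposed parameters
    void Update(Parameter parameter, double score);  // feed a trial result back to the tuner
}

public interface ISweepablePipeline
{
    SearchSpace SearchSpace { get; }                                      // search space derived from the pipeline
    IEstimator<ITransformer> BuildTrainingPipeline(Parameter parameter);  // concrete ML.NET pipeline for one trial
}
```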

## Problem
Suppose that you have a `sweepable pipeline`, a `tuner` and a `searchSpace`, and you want to optimize the `sweepable pipeline` over the given `searchSpace` using that `tuner`. Without the `Experiment API`, you would need to manually interact with the tuner and search space, build the ML.NET training pipeline, train and evaluate the model, and keep track of the best model/parameter. This process exposes too many details and is not easy for users to start with.

```csharp
// Without the Experiment API, the training process exposes too many details.
var pipeline = ...;   // SweepablePipeline
var tuner = ...;      // Tuner

// The search space comes with the pipeline.
var searchSpace = pipeline.SearchSpace;
tuner.SetSearchSpace(searchSpace);

var bestScore = double.MinValue;
ITransformer bestModel = null;

foreach (var parameter in tuner.Proposal())
{
    // Construct an ML.NET training pipeline from the proposed parameter.
    var mlPipeline = pipeline.BuildTrainingPipeline(parameter);

    // Train and evaluate.
    var model = mlPipeline.Fit(trainData);
    var score = model.Evaluate(testData);

    // Keep track of the best model/parameter.
    if (score > bestScore)
    {
        bestScore = score;
        bestModel = model;
    }

    // Report the score back to the tuner.
    tuner.Update(parameter, score);
}
```

## Solution: Experiment API

With the Experiment API, `pipeline`, `tuner` and `searchspace` become transparent to users, so they don't have to know how those parts work with each other. They are replaced by a higher-level concept: `Experiment`. An `Experiment` takes input from the user, such as training time, search strategy, train/test/validation datasets, and the model-saving strategy. Once all input is provided, the experiment takes care of the rest of the training process.

```csharp
// Experiment API.
var pipeline = ...; // SweepablePipeline
var tuner = ...;    // Tuner

var experiment = pipeline.CreateExperiment(trainTime: 100, trainDataset: "train.csv", split: "cv", folds: 10, metric: "AUC", tuner: tuner, monitor: monitor);
```
**Review comment (Contributor):**
It'd be great to have an overload that takes in a class `ExperimentOptions` where you can define these parameters:

```csharp
var experimentOptions = new ExperimentOptions
{
    TrainTime = 100,
    TrainDataset = "train.csv",
    Split = "cv",
    Folds = 10,
    Metric = "AUC",
    Tuner = tuner,
    Monitor = monitor
};

var experiment = pipeline.CreateExperiment(experimentOptions);
```

For the split parameter, it'd be good to have an enum of some sort so you have the option of using it like this: `Split = Split.CV`.

Same for the metric, it'd be great to leverage the existing metric classes. For example, for binary classification: `Metric = BinaryClassificationMetrics.AreaUnderRocCurve`.
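
As a rough sketch (not part of this PR), the suggested options type with the split-enum variant might look something like this; the `Tuner` and `Monitor` property types are placeholders:

```csharp
// Hypothetical sketch of the suggested options class; names follow the comment above.
public enum Split
{
    TrainTest,
    CV
}

public class ExperimentOptions
{
    public int TrainTime { get; set; }       // training budget, e.g. in seconds
    public string TrainDataset { get; set; } // path to the training data
    public Split Split { get; set; }         // validation strategy
    public int Folds { get; set; }           // number of folds when Split is Split.CV
    public string Metric { get; set; }       // e.g. "AUC"; or a strongly typed metric, as suggested above
    public object Tuner { get; set; }        // placeholder for the tuner type
    public object Monitor { get; set; }      // placeholder for the monitor type
}
```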

**Reply (PR author):**
I would hold my opinion on the metric class (using the existing metric classes). Using a string will actually be easier because:

- ML.NET has different metric classes for different scenarios, which makes it hard to code against unless we also end up with different APIs for creating experiments for each scenario.
- Using a string allows us to add metrics that are not supported by ML.NET (like forecasting, which doesn't have an evaluation metric).

The only downside is that there is little restriction on the input, but that can be solved by documentation or by having a metric enum instead.
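
One possible shape for the "metric enum instead" idea is a single enum that spans scenarios, including metrics ML.NET has no built-in evaluator for. This is a hypothetical sketch, not part of the proposal:

```csharp
// Hypothetical cross-scenario metric enum (member names are illustrative).
public enum ExperimentMetric
{
    // binary classification
    AUC,
    Accuracy,
    // regression
    RSquared,
    RootMeanSquaredError,
    // forecasting (no built-in ML.NET evaluation metric today)
    SMAPE
}

// Usage under this sketch: experiment.SetEvaluationMetric(ExperimentMetric.AUC);
```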


```csharp
// Or the fluent-style API.
var experiment = pipeline.CreateExperiment();

experiment.SetTrainingTime(100)
          .SetDataset(trainDataset: "train.csv", split: "cv", fold: 10)
          .SetEvaluationMetric(metric: "AUC") // or a lambda function which returns a score
          .SetTuner(tuner)
          .SetMonitor(monitor);

experiment.Run();
monitor.ShowProgress();

// trial 1: score ... parameter ...
// trial 2: score ... parameter ...
```
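
The monitor itself is not specified in this proposal. As a minimal sketch under that assumption, it could be a simple callback object that records each trial and prints progress; `ReportTrial` is an assumed callback name, not part of the proposed API:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical monitor, inferred from the usage above.
public class ConsoleMonitor
{
    private readonly List<(object Parameter, double Score)> _trials = new();

    // Called by the experiment after each trial completes (assumed hook).
    public void ReportTrial(object parameter, double score)
        => _trials.Add((parameter, score));

    // Print one line per trial collected so far.
    public void ShowProgress()
    {
        for (int i = 0; i < _trials.Count; i++)
        {
            Console.WriteLine($"trial {i + 1}: score {_trials[i].Score} parameter {_trials[i].Parameter}");
        }
    }
}
```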

### Default classifiers and regressors

It would be useful to provide an API that returns a combination of all available trainers with default search spaces.

```csharp
var featurizePipeline = ...; // data-featurization pipeline

// regression
var pipeline = featurizePipeline.Append(context.AutoML().Regressors(labelColumn: "label", useLgbm: true, useFastTree: false, ...));

// binary classification
var pipeline = featurizePipeline.Append(context.AutoML().BinaryClassification(labelColumn: "label", useLgbm: true, useFastTree: false, ...));

// multi-class classification
var pipeline = featurizePipeline.Append(context.AutoML().MultiClassification(labelColumn: "label", useLgbm: true, useFastTree: false, ...));

// univariate forecasting
var pipeline = featurizePipeline.Append(context.AutoML().Forecasting(labelColumn: "label", horizon: ...));

// create an Experiment
var exp = pipeline.CreateExperiment();
...

```