Description
The following is a preliminary list of required scenarios for the direct access API, which we will use to focus the work. The goal is that the experience of doing these should be good and unproblematic. Strictly speaking, everything here is possible to do right now using the components as they stand implemented today. However, I would say that it isn't necessarily a joy to do them, and there are lots of potential "booby traps" lurking in the code unless you do everything exactly correctly (e.g., #580).
- **Simple train and predict**: Start with a dataset in a text file. Run text featurization on the text values. Train a linear model over that. (I am thinking sentiment classification.) Out of the result, produce some structure over which you can get predictions programmatically (e.g., the prediction does not happen over a file as it did during training).
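  Just to fix ideas, a sketch of what the calling pattern might look like; every type and method name below is hypothetical, and `SentimentInput`/`SentimentPrediction` are user-defined POCOs:

  ```csharp
  // All names hypothetical: load the text file, featurize, train, then get an
  // in-memory prediction object (no file round-trip at prediction time).
  var data = TextLoader.Load("sentiment-train.tsv");
  var model = new TextFeaturizer(outputColumn: "Features", inputColumn: "Text")
      .Append(new LinearBinaryClassifier(label: "Label", features: "Features"))
      .Fit(data);
  var engine = model.MakePredictionEngine<SentimentInput, SentimentPrediction>();
  var prediction = engine.Predict(new SentimentInput { Text = "I loved this movie" });
  ```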
- **Multi-threaded prediction**: A twist on "Simple train and predict", where we account for the fact that multiple threads may want predictions at the same time. Because we deliberately do not reallocate internal memory buffers on every single prediction, the PredictionEngine (or its estimator/transformer based successor) is, like most stateful .NET objects, fundamentally not thread safe. This is deliberate and as designed. However, some mechanism to enable multi-threaded scenarios (e.g., a web server servicing requests) should be possible and performant in the new API.
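  One plausible pattern, sketched with standard .NET primitives (the pool type is mine, and the engine factory stands in for whatever the final API exposes):

  ```csharp
  using System;
  using System.Collections.Concurrent;

  // Each request rents a PredictionEngine (not thread safe) from a pool, so
  // concurrent callers never share one engine's internal buffers.
  public sealed class EnginePool<TIn, TOut>
  {
      private readonly ConcurrentBag<PredictionEngine<TIn, TOut>> _pool =
          new ConcurrentBag<PredictionEngine<TIn, TOut>>();
      private readonly Func<PredictionEngine<TIn, TOut>> _factory;

      public EnginePool(Func<PredictionEngine<TIn, TOut>> factory) => _factory = factory;

      public TOut Predict(TIn input)
      {
          var engine = _pool.TryTake(out var pooled) ? pooled : _factory();
          try { return engine.Predict(input); }
          finally { _pool.Add(engine); } // return the engine for the next request
      }
  }
  ```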
- **Train, save/load model, predict**: Serve the scenario where training and prediction happen in different processes (or even different machines). The actual test will not run in different processes, but will simulate the idea that the "communication pipe" is just a serialized model of some form.
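  Sketched (save/load names hypothetical), the pipe is just a stream:

  ```csharp
  // Training process: persist the trained pipeline to a stream.
  using (var stream = File.Create("model.zip"))
      model.Save(stream);

  // Prediction process (possibly another machine): reconstitute and predict.
  using (var stream = File.OpenRead("model.zip"))
  {
      var loaded = TransformerChain.Load(stream);
      var engine = loaded.MakePredictionEngine<SentimentInput, SentimentPrediction>();
  }
  ```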
- **Train with validation set**: Similar to the simple train scenario, but also support a validation set. The learner might be trees with early stopping.
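  E.g., hypothetically:

  ```csharp
  // Hypothetical: a tree trainer that consults a validation set for early stopping.
  var model = new FastTreeBinaryClassifier(label: "Label", features: "Features")
      .Fit(trainData, validationData: validationData);
  ```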
- **Train with initial predictor**: Similar to the simple train scenario, but starting from an existing predictor and continuing training from it. The scenario might be one of the online linear learners that can take advantage of this, e.g., averaged perceptron.
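  Again purely illustrative (all names hypothetical):

  ```csharp
  // Hypothetical: an online learner warm-started from a previously trained predictor.
  var initial = new AveragedPerceptronTrainer(label: "Label", features: "Features")
      .Fit(firstBatch);
  var continued = new AveragedPerceptronTrainer(label: "Label", features: "Features")
      .Fit(secondBatch, initialPredictor: initial);
  ```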
- **Evaluation**: Similar to the simple train scenario, except instead of producing some predictive structure, be able to score another "test" data file, run the result through an evaluator, and get metrics like AUC, accuracy, PR curves, and whatnot. Getting metrics out of this should be as straightforward and unannoying as possible.
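  Something like (hypothetical names, metrics as plain properties):

  ```csharp
  // Hypothetical: score the test file, evaluate, read metrics off a plain object.
  var scored = model.Transform(TextLoader.Load("sentiment-test.tsv"));
  var metrics = new BinaryClassifierEvaluator().Evaluate(scored, label: "Label");
  Console.WriteLine($"AUC: {metrics.Auc:0.000}, Accuracy: {metrics.Accuracy:0.000}");
  ```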
- **Auto-normalization and caching**: It should be relatively easy for normalization and caching to be introduced for training, if the trainer supports or would benefit from that.
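  Perhaps something along these lines (names hypothetical):

  ```csharp
  // Hypothetical: one opt-in call each for caching and normalization, rather than
  // hand-wiring those components into the pipeline.
  var model = featurizer
      .AppendCacheCheckpoint()                   // cache featurized rows for the iterative trainer
      .Append(new SdcaBinaryClassifier(label: "Label", features: "Features",
          autoNormalize: true))                  // normalization added if the trainer benefits
      .Fit(data);
  ```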
- **File-based saving of data**: Come up with a transform pipeline. Transform the training and test data, and save the featurized data to some file, using the `.idv` format. Train and evaluate multiple models over that pre-featurized data. (Useful for sweeping scenarios, where you are training many times on the same data, and don't necessarily want to transform it every single time.)
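  A sketch of the intent (saver/loader names hypothetical):

  ```csharp
  // Hypothetical: featurize once, save to .idv, then reload for many training runs.
  var featurized = featurizer.Fit(trainData).Transform(trainData);
  BinarySaver.Save(featurized, "train-featurized.idv");
  var reloaded = BinaryLoader.Load("train-featurized.idv");   // feed this to each sweep run
  ```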
- **Decomposable train and predict**: Train on the Iris multiclass problem, which will require a transform on labels. Be able to reconstitute the pipeline for a prediction-only task, which will essentially "drop" the transform over labels, while retaining the property that the predicted label for this has a key type, the probability outputs for the classes have the class labels as slot names, etc. This should be doable without ugly compromises like, say, injecting a dummy label.
- **Cross-validation**: Have a mechanism to do cross-validation, that is: you come up with a data source (optionally with a stratification column), come up with an instantiable transform and trainer pipeline, and it will handle (1) splitting up the data, (2) training the separate pipelines on in-fold data, (3) scoring on the out-fold data, (4) returning the set of evaluations and, optionally, the trained pipes. (People always want metrics out of xfold; they sometimes want the actual models too.)
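  Roughly (helper and parameter names hypothetical):

  ```csharp
  // Hypothetical: the helper owns splitting, per-fold training, and out-of-fold scoring.
  var folds = CrossValidator.Run(data, pipeline, numFolds: 5,
      stratificationColumn: "GroupId", returnModels: true);
  foreach (var fold in folds)
      Console.WriteLine(fold.Metrics.Auc);   // fold.Model is there when you want it
  ```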
- **Reconfigurable predictions**: The following should be possible: a user trains a binary classifier, and through the test evaluator gets a PR curve; then, based on the PR curve, picks a new threshold and configures the scorer (or, more precisely, instantiates a new scorer over the same predictor) with some threshold derived from that.
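  For instance (all names hypothetical, including the user-supplied threshold-picking function):

  ```csharp
  // Hypothetical: same predictor, new scorer with a retuned decision threshold.
  var metrics = new BinaryClassifierEvaluator().Evaluate(model.Transform(testData));
  float threshold = ChooseThreshold(metrics.PrCurve);   // user-defined selection logic
  var retuned = model.WithDecisionThreshold(threshold);
  ```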
- **Introspective training**: Models that produce outputs and are otherwise black boxes are of limited use; it is often also necessary to understand, at least to some degree, what was learnt. To outline critical scenarios that have come up multiple times:
  - When I train a linear model, I should be able to inspect the coefficients.
  - When I train a tree ensemble, I should be able to inspect the trees.
  - When I train an LDA transform, I should be able to inspect the topics.

  I view it as essential from a usability perspective that this be discoverable to someone without having to read documentation. E.g.: if I have `var lda = new LdaTransform().Fit(data)` (I don't insist on that exact signature, just giving the idea), then if I were to type `lda.` in Visual Studio, one of the auto-complete targets should be something like `GetTopics`.
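  Continuing that hypothetical signature, the kind of accessors meant:

  ```csharp
  // Hypothetical accessors of the sort auto-complete should surface:
  var linear = new LinearBinaryClassifier(label: "Label", features: "Features").Fit(data);
  foreach (var (name, weight) in linear.GetCoefficients())
      Console.WriteLine($"{name}: {weight}");

  var lda = new LdaTransform().Fit(data);
  var topics = lda.GetTopics();   // the discoverability target named above
  ```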
- **Exporting models**: Models, when defined, ought to be exportable, e.g., to ONNX, PFA, text, etc.
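  If the exporters hang off the trained model, perhaps (hypothetical names):

  ```csharp
  // Hypothetical converter entry points on the trained model:
  model.SaveAsOnnx("sentiment.onnx");
  model.SaveAsPfa("sentiment.pfa.json");
  model.SaveAsText("sentiment.txt");
  ```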
- **Visibility**: It should, possibly through the debugger, be not such a pain to actually see what is happening to your data when you apply this or that transform. E.g.: if I were to have the text `"Help I'm a bug!"`, I should be able to see the steps where it is normalized to `"help i'm a bug"`, then tokenized into `["help", "i'm", "a", "bug"]`, then mapped into term numbers `[203, 25, 3, 511]`, then projected into the sparse float vector `{3:1, 25:1, 203:1, 511:1}`, etc. etc.
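  Beyond the debugger itself, one possible affordance (entirely hypothetical):

  ```csharp
  // Hypothetical: materialize a few rows at every stage of the pipeline for inspection.
  var preview = pipeline.Preview(data, maxRows: 5);
  foreach (var stage in preview.Stages)
      Console.WriteLine($"{stage.Name}: {stage.Rows[0]}");
  ```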
- **Meta-components**: Meta-components (e.g., components that themselves instantiate components) should not be booby-trapped. When specifying what trainer OVA should use, a user will be able to specify any binary classifier. If they specify a regression or multiclass classifier, ideally that should be a compile error.
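  The natural way to get the compile error is a generic constraint; a sketch with made-up interface and trainer names:

  ```csharp
  // Hypothetical: OVA is constrained to binary-classifier trainers, so handing it a
  // regression trainer fails at compile time rather than at run time.
  public sealed class Ova<TTrainer> where TTrainer : IBinaryClassifierTrainer
  {
      public Ova(TTrainer binaryTrainer) { /* wrap one trainer per class */ }
  }

  var ova = new Ova<AveragedPerceptronTrainer>(
      new AveragedPerceptronTrainer(label: "Label", features: "Features"));
  // new Ova<PoissonRegressionTrainer>(...) would not compile.
  ```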
- **Extensibility**: We can't possibly write every conceivable transform and should not try. It should somehow be possible for a user to inject custom code to, say, transform data. This might have a much steeper learning curve than the other usages (which merely involve usage of already established components), but should still be possible.
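  For instance (names hypothetical, `InputRow`/`OutputRow` user-defined), a lambda-backed mapping transform:

  ```csharp
  // Hypothetical: user code injected as a custom row-to-row mapping transform.
  var custom = new CustomMappingTransform<InputRow, OutputRow>(
      (input, output) => output.CleanText = input.Text.Trim().ToLowerInvariant());
  var pipeline = custom.Append(new TextFeaturizer(outputColumn: "Features",
      inputColumn: "CleanText"));
  ```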
Companion piece for #583.
/cc @Zruty0 , @eerhardt , @ericstj , @zeahmed , @CESARDELATORRE .