Direct API: Auto-normalization #433

Normalization is one of the details of training that happens after a loader/transform pipeline is created, but before the cache. We have typically done this for users automatically. API usage is distinct, in that people tend to like implicit behavior in tools but dislike implicit behavior in APIs. Even so, at least offering a convenience for normalization is appropriate.

## Existing Method

Those familiar with this codebase may be aware of an existing method in the `TrainUtils` utility class that serves a similar function.

https://github.com/dotnet/machinelearning/blob/2501049f5cb60ed2c9ec191d2937cab7b59824da/src/Microsoft.ML.Data/Commands/TrainCommand.cs#L492

The goal of *that* method is not so much to provide a convenient API as to factor out code common to the various commands that train models (e.g., train, traintest, cross-validation, and some transforms like train-and-score). The same is true of many methods in that `TrainUtils` class. We can see this in the first few lines:

https://github.com/dotnet/machinelearning/blob/2501049f5cb60ed2c9ec191d2937cab7b59824da/src/Microsoft.ML.Data/Commands/TrainCommand.cs#L499-L506

This is beneficial for giving all of these commands consistent behavior from a command-line perspective, but the condition where the method simply exits would be inappropriate in an ML.NET API. Imagine someone designing a method with a parameter `bool doNothing`, where the first thing the method does is return without doing anything if that parameter is true. Again, that is appropriate from the point of view of factoring out common code, but not appropriate for an API. The method also communicates important information to the user via the console, which again is not the most helpful option for an API.
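To make the contrast concrete, here is a minimal sketch. Every name in it (`NormalizationStyles`, `CommandStyle`, `ApiStyle`, and the `string` standing in for a pipeline) is hypothetical and is not taken from ML.NET. It only illustrates the shape of the two designs.

```csharp
using System;

public static class NormalizationStyles
{
    // Hypothetical command-style helper: the caller passes a flag, and the
    // helper may return immediately after reporting to the console.
    public static bool CommandStyle(bool doNothing, ref string pipeline)
    {
        if (doNothing)
        {
            Console.WriteLine("Not adding a normalizer.");
            return false;
        }
        pipeline += " -> normalize";
        return true;
    }

    // Hypothetical API-style helper: no "do nothing" flag and no console
    // output. The caller decides whether to call it at all, and everything
    // the caller needs to know comes back through the return value.
    public static string ApiStyle(string pipeline)
    {
        return pipeline + " -> normalize";
    }
}
```

In the first form the decision to skip lives inside the helper, which suits shared command code. In the second form the decision stays with the caller, which is what API users generally expect.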

## Proposed API Helpers

Nonetheless, this function does several helpful things: it detects whether a trainer wants normalization, whether the data is already normalized, and, if appropriate and necessary, applies normalization.

This would probably take the form of a static method on the `NormalizerTransform` class, perhaps following this signature:

```csharp
public static bool CreateIfNeeded(IHostEnvironment env, ref RoleMappedData data, ITrainer trainer)
```

We could also have two additional methods to provide key information:

```csharp
public static bool FeatureVectorIsNormalized(RoleMappedData data)
public static bool NeedsNormalization(this ITrainer trainer)
```
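As a rough illustration of how these three methods might fit together, the sketch below uses simplified stand-in types rather than the real ML.NET ones. `SketchData`, `ISketchTrainer`, `SketchTrainer`, and the `IsNormalized` and `WantsNormalization` properties are assumptions made so the example is self-contained. The `IHostEnvironment` parameter is omitted, and the min-max rescaling is only a placeholder for whatever normalizer would actually be applied.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for RoleMappedData: a feature vector plus a flag
// playing the role of "is normalized" metadata.
public sealed class SketchData
{
    public float[] Features { get; }
    public bool IsNormalized { get; }

    public SketchData(float[] features, bool isNormalized)
    {
        Features = features;
        IsNormalized = isNormalized;
    }
}

// Hypothetical stand-in for ITrainer.
public interface ISketchTrainer
{
    bool WantsNormalization { get; }
}

// Trivial trainer used only to exercise the sketch.
public sealed class SketchTrainer : ISketchTrainer
{
    public bool WantsNormalization { get; }
    public SketchTrainer(bool wantsNormalization) { WantsNormalization = wantsNormalization; }
}

public static class NormalizerSketch
{
    public static bool FeatureVectorIsNormalized(SketchData data) => data.IsNormalized;

    public static bool NeedsNormalization(this ISketchTrainer trainer) => trainer.WantsNormalization;

    // Returns true only when normalization was applied, in which case
    // 'data' is replaced with the normalized version.
    public static bool CreateIfNeeded(ref SketchData data, ISketchTrainer trainer)
    {
        if (!trainer.NeedsNormalization() || FeatureVectorIsNormalized(data))
            return false;

        // Placeholder min-max scaling to [0, 1]; not the real transform.
        float min = data.Features.DefaultIfEmpty(0f).Min();
        float max = data.Features.DefaultIfEmpty(0f).Max();
        float range = max - min;
        float[] scaled = data.Features
            .Select(v => range == 0f ? 0f : (v - min) / range)
            .ToArray();

        data = new SketchData(scaled, isNormalized: true);
        return true;
    }
}
```

A caller would write something like `bool applied = NormalizerSketch.CreateIfNeeded(ref data, trainer);` and can then decide for itself how, or whether, to report that a normalizer was added, instead of having the helper write to the console.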

For reference, the opening lines of the existing method discussed under Existing Method:

```csharp
ch.CheckUserArg(Enum.IsDefined(typeof(NormalizeOption), autoNorm), nameof(TrainCommand.Arguments.NormalizeFeatures),
    "Normalize option is invalid. Specify one of 'norm=No', 'norm=Warn', 'norm=Auto', or 'norm=Yes'.");

if (autoNorm == NormalizeOption.No)
{
    ch.Info("Not adding a normalizer.");
    return false;
}
```
