Skip to content

Estimator arguments should take output column name as first parameter, any inputs as subsequent parameters #2064

Closed

Description

So, a somewhat embarrassing blunder to report... not quite sure how to say this.

Throughout our codebase, for years we've always specified the name of the output column of a transform first, then the source(s).

That's good. But when estimators were introduced, somehow, nearly all of them were introduced in the reverse order: nearly all of them specify the inputs, then the outputs. This was probably an unconscious mistake, but it's one with fairly wide consequences, since that mistaken pattern was copied again and again as people did the estimator conversion work, to the point where most (not all!) estimators are now inconsistent with the whole rest of the codebase!

This should be corrected: it is at least inconsistent, and even if not inconsistent actually rather obnoxious practically because specifying the name of the output first has practical benefits, and makes a lot more sense, since if you're specifying a transformation the most important information someone will want to know is what you're calling your output!

The Story Until a Few Months Ago

So, throughout our codebase, for years, it has been our practice that when specifying a transform, you specify the name of the output, then you specify the name of the input(s) (if any). The reason for this is practical: the outputs are usually the result of a well defined single calculation (the application of the transform), whereas what is taken as an "input" to a transform can have various meanings since it is sometimes a multivariate function in its inputs more often than in its outputs. (E.g., concatenate can have multiple inputs, text can have multiple inputs, MI feature selection takes multiple heterogeneous inputs, but all produce single output columns.)

This was our practice certainly when this code was a tool, as we see in the command line help string, specifying name then source.

[Argument(ArgumentType.Multiple | ArgumentType.Required, HelpText = "New column definition(s) (optional form: name:src)", ShortName = "col", SortOrder = 1)]
public Column[] Column;

This trend continued during our initial attempts at an API, as seen in this early discussion in PR 405, and as seen here:

https://github.com/zeahmed/machinelearning/blob/16f7883933b56f8fd86077bf0fd262b24374e9d0/src/Microsoft.ML.Data/Transforms/ConvertTransform.cs#L116

and here

https://github.com/zeahmed/machinelearning/blob/16f7883933b56f8fd86077bf0fd262b24374e9d0/src/Microsoft.ML.Data/Transforms/DropSlotsTransform.cs#L226

and numerous other places.

This is done for the practical reason that, when a transform produces an
output, what outputs it has are usually finite and well defined, whereas it can take multiple examles. The most conspicuous and widely used example of this is
the concatenation transform. Also included are things like the text featurizing transform, and other such things.

So far so good...

But Then...

Now, somehow, through some mechanism that wasn't quite clear to me, as IEstimator implementations are being created, someone commits a fateful mistake of reversing inputs and outputs. Suddenly instead of being typically parameterized as name and a sometimes optional source, we instead have the required input and a sometimes optional output. What a disaster! And, I did not catch it in review. An early example:

public TextTransform(IHostEnvironment env, IEnumerable<string> inputColumns, string outputColumn,
Action<Settings> advancedSettings = null)

Then, as people use this as a template for their own estimator conversion work, nearly all estimators copied this mistake, until practically all estimators had this problem. This includes much of the extensions.

public static ColumnCopyingEstimator CopyColumns(this TransformsCatalog catalog, params (string source, string name)[] columns)
=> new ColumnCopyingEstimator(CatalogUtils.GetEnvironment(catalog), columns);

Now then, the reason why I know any of this is that @stephentoub wrote and says, "hey, how come you have your column concatenation operation specify output then inputs? That's different from everywhere else! I know it's less convenient, but it's better to be consistent."

public static ColumnConcatenatingEstimator Concatenate(this TransformsCatalog catalog, string outputColumn, params string[] inputColumns)

And, after thinking that he must be very confused, since again, the pattern of name then source being very, very well defined throughout the codebase, I look and find he is absolutely correct, and a thoroughly massive blunder had somehow made it throughout the entire codebase under our very noses, including it must be said mine. :) :)

So, this is bad, but happily this was caught before v1. But, needless to say this must be fixed immediately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

Labels

APIIssues pertaining the friendly API

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions