Skip to content

Conditions in KeyToVectorMappingEstimator.GetOutputSchema seem wrong #2185

Open

Description

While reviewing #2176, I incidentally read what I take to be some bugs in this method, the schema shape propagation for key values:

public override SchemaShape GetOutputSchema(SchemaShape inputSchema)

As the name suggests, this "key to vector" transformer this produces a vector valued column out of a key-valued column. That key-valued column can be either a scalar, vector, or unknown size vector. Much of the logic in this method did not seem to conform to my understanding of what the transformer is doing (and, in one case, should be doing but isn't). It may be that we need some more test coverage for this, in addition to fixing the bugs, if I am right about any of these.

Bug 1: The Type Test

Let's review the type test:

if ((col.ItemType.GetItemType().RawKind == default) || !(col.ItemType is VectorType || col.ItemType is PrimitiveType))

I see some problems here:

  • This is the "key to vector" transform, yet, there is no test to see if it is a key?

  • Instead first we see that it fails if it's of an unknown DataKind (which seems totally besides the point).

  • We also see that we have this totally unnecessary ItemType.GetItemType(), which considering the description is not necessary.

  • Related: later on we see col.ItemType is VectorType, which according to the documentation here could never be the case anyway:

/// <summary>
/// The 'raw' type of column item: must be a primitive type or a structured type.
/// </summary>
public readonly ColumnType ItemType;

  • col.ItemType is PrimitiveType is totally redundant given that we already had the much stronger test that it was one of the "known" datakinds. (Which, again, is still completely wrong.)

So, I think practically, we might have a much stronger test if we dropped practically all this matter and just tested col.IsKey. Maybe also verify that the raw type is not ulong or something. But, the rest of this doesn't make a lot of sense to me.

Bug 2: Error message

Now what does that test lead to... it leads to this exception being thrown:

throw Host.ExceptSchemaMismatch(nameof(inputSchema), "input", colInfo.Input);

If you trace through to what becomes the exception message, it becomes this:

return $"Could not find {columnRole} column '{columnName}'";

Which is of course not right. "Luckily," the test above was so screwed up that the only conceivable way a user could run into it is if they fed in an image type, if I interpret it correctly.

Bug 3: The metadata tests.

if (col.Metadata.TryFindColumn(MetadataUtils.Kinds.KeyValues, out var keyMeta))
if (col.Kind != SchemaShape.Column.VectorKind.VariableVector && keyMeta.ItemType is TextType)
metadata.Add(new SchemaShape.Column(MetadataUtils.Kinds.SlotNames, SchemaShape.Column.VectorKind.Vector, keyMeta.ItemType, false));
if (!colInfo.Bag && (col.Kind == SchemaShape.Column.VectorKind.Scalar || col.Kind == SchemaShape.Column.VectorKind.Vector))
metadata.Add(new SchemaShape.Column(MetadataUtils.Kinds.CategoricalSlotRanges, SchemaShape.Column.VectorKind.Vector, NumberType.I4, false));
if (!colInfo.Bag || (col.Kind == SchemaShape.Column.VectorKind.Scalar))
metadata.Add(new SchemaShape.Column(MetadataUtils.Kinds.IsNormalized, SchemaShape.Column.VectorKind.Scalar, BoolType.Instance, false));

The first batch of metadata should verify that the type is of col.Kind == SchemaShape.Column.VectorKind.Vector. Instead, it only verifies it's not a variable vector. but apparently scalar values are just fine. :)

For the second metadata, this is sort of a bug more on the transformer itself than the estimator: given that the behavior of bagging is only relevant on vectors of keys, CategoricalSlotRanges should be added as metadata on a scalar source column only (or, I guess, ideally on a vector of length 1); if it's not a scalar (or vector of length 1), then it could not be a one-hot vector anyway, and so could not be encoding a categorical value. Might be more a @codemzs problem.

The third metadata looks actually reasonable, at least at first glance.

Bug 4: The column type.

If bagging is off, and the input is a variable sized vector, then the output will be a variable sized vector as well. Yet we see the result is unconditionally a .Vector, which does not cover that case:

result[colInfo.Output] = new SchemaShape.Column(colInfo.Output, SchemaShape.Column.VectorKind.Vector, NumberType.R4, false, new SchemaShape(metadata));

/cc @Ivanidzo4ka

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

    Labels

    APIIssues pertaining the friendly APIP1Priority of the issue for triage purpose: Needs to be fixed soon.bugSomething isn't workingimageBugs related image datatype tasks

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions