Description
openedon Jan 18, 2019
While reviewing #2176, I incidentally read what I take to be some bugs in this method, the schema shape propagation for key values:
As the name suggests, this "key to vector" transformer this produces a vector valued column out of a key-valued column. That key-valued column can be either a scalar, vector, or unknown size vector. Much of the logic in this method did not seem to conform to my understanding of what the transformer is doing (and, in one case, should be doing but isn't). It may be that we need some more test coverage for this, in addition to fixing the bugs, if I am right about any of these.
Bug 1: The Type Test
Let's review the type test:
I see some problems here:
-
This is the "key to vector" transform, yet, there is no test to see if it is a key?
-
Instead first we see that it fails if it's of an unknown
DataKind
(which seems totally besides the point). -
We also see that we have this totally unnecessary
ItemType.GetItemType()
, which considering the description is not necessary. -
Related: later on we see
col.ItemType is VectorType
, which according to the documentation here could never be the case anyway:
machinelearning/src/Microsoft.ML.Core/Data/IEstimator.cs
Lines 46 to 49 in bb92c06
col.ItemType is PrimitiveType
is totally redundant given that we already had the much stronger test that it was one of the "known" datakinds. (Which, again, is still completely wrong.)
So, I think practically, we might have a much stronger test if we dropped practically all this matter and just tested col.IsKey
. Maybe also verify that the raw type is not ulong
or something. But, the rest of this doesn't make a lot of sense to me.
Bug 2: Error message
Now what does that test lead to... it leads to this exception being thrown:
If you trace through to what becomes the exception message, it becomes this:
Which is of course not right. "Luckily," the test above was so screwed up that the only conceivable way a user could run into it is if they fed in an image type, if I interpret it correctly.
Bug 3: The metadata tests.
machinelearning/src/Microsoft.ML.Data/Transforms/KeyToVector.cs
Lines 774 to 780 in bb92c06
The first batch of metadata should verify that the type is of col.Kind == SchemaShape.Column.VectorKind.Vector
. Instead, it only verifies it's not a variable vector. but apparently scalar values are just fine. :)
For the second metadata, this is sort of a bug more on the transformer itself than the estimator: given that the behavior of bagging is only relevant on vectors of keys, CategoricalSlotRanges
should be added as metadata on a scalar source column only (or, I guess, ideally on a vector of length 1); if it's not a scalar (or vector of length 1), then it could not be a one-hot vector anyway, and so could not be encoding a categorical value. Might be more a @codemzs problem.
The third metadata looks actually reasonable, at least at first glance.
Bug 4: The column type.
If bagging is off, and the input is a variable sized vector, then the output will be a variable sized vector as well. Yet we see the result is unconditionally a .Vector
, which does not cover that case:
/cc @Ivanidzo4ka