Commit 1473f55 (parent 62a5b34): Key type documentation. (dotnet#3194)
2 files changed: +190 −142 lines
docs/code/KeyValues.md: 135 additions, 120 deletions

# Key Values

Most commonly, in key-valued data, each value takes one of a limited number of
distinct values. We might view them as an enumeration into a set. They are
represented in memory using unsigned integers: most commonly `uint`, but
`byte`, `ushort`, and `ulong` are possible representation types as well.

A more formal description of key values and types is
[here](IDataViewTypeSystem.md#key-types). *This* document's motivation is less
to describe what key types and values are, and more to describe why key types
are necessary and helpful things to have. Necessarily, this document is more
anecdotal in its descriptions to motivate its content.

Let's take a few examples of transformers that produce keys:

* The `ValueToKeyMappingTransformer` has a dictionary of unique values
  observed when it was fit, each mapped to a key value. The key type's count
  indicates the number of items in the set, and through the `KeyValues`
  annotation "remembers" what each key is representing.

* The `TokenizingByCharactersTransformer` will take input strings and produce
  key values representing the characters observed in the string. The
  `KeyValues` annotation "remembers" what each key is representing. (Note that
  unlike many other key-valued operations, this uses a representation type of
  `ushort` instead of `uint`.)

* The `HashingTransformer` performs a hash of input values, and produces a key
  value with count equal to the range of the hash function: if a `b`-bit hash
  was used, it will produce values with a key type of count `2ᵇ`.

Note that in the first two cases, these are enumerating into a set with actual
specific values, whereas in the last case we are also enumerating into a set,
but one without values, since hashes don't intrinsically correspond to a
single item.
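
The hashing case can be sketched in a few lines. This is a conceptual
illustration in Python, not ML.NET's actual hash function or API; the function
name and the choice of MD5 are assumptions made for the sketch.

```python
import hashlib

def hash_to_key(value: str, bits: int) -> int:
    """Hash `value` into one of 2**bits buckets (a key type of count 2**bits)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    full = int.from_bytes(digest[:8], "little")
    return full & ((1 << bits) - 1)  # keep only the low `bits` bits

# A 10-bit hash enumerates into a set of 1024 possible keys, but there is no
# dictionary of "original values" behind those keys.
assert 0 <= hash_to_key("apple", 10) < 1024
```

The key type's count (`2ᵇ`) is known from the hash width alone, which is why a
hash can produce a well-formed key type even though no value set was observed.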

## Keys as Intermediate Values

Explicitly invoking transforms that produce key values, and using those key
values, is sometimes helpful. However, given that trainers typically expect
the feature vector to be a vector of floating point values and *not* keys, in
typical usage most uses of keys are as some sort of intermediate value on the
way to that final feature vector. (Unless, say, doing something like preparing
labels for a multiclass trainer.)

So why not go directly to the feature vector from whatever the input was, and
forget this key stuff? Actually, to take text processing as the canonical
example, we used to. However, by structuring the transforms as, say, text to
key to vector, rather than text to vector *directly*, we were able to make a
more flexible pipeline and re-use smaller, simpler components. Having multiple
composable transformers instead of one "omnibus" transformer that does
everything makes the process easier to understand, maintain, and exploit for
novel purposes, while giving people greater visibility into what actually
happens.

So for example, the `TokenizingByCharactersTransformer` above might appear to
be a strange choice: *why* represent characters as keys? The reason is that
the n-gram transform, which often comes after it, is written to ingest keys,
not text, and so we can use the same transform for both the n-gram
featurization of words and n-char grams.
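
To see why ingesting keys buys this reuse, consider a minimal sketch
(illustrative Python, not the ML.NET implementation): an n-gram extractor
written against key indices is indifferent to whether those keys came from
characters or from word tokens.

```python
def ngrams(keys, n):
    """Return all contiguous n-grams from a sequence of key indices."""
    return [tuple(keys[i:i + n]) for i in range(len(keys) - n + 1)]

char_keys = [3, 1, 2, 1]   # e.g. keys produced from characters
word_keys = [17, 4, 9]     # e.g. keys produced from word tokens

# The same code serves both featurizations:
assert ngrams(char_keys, 2) == [(3, 1), (1, 2), (2, 1)]
assert ngrams(word_keys, 2) == [(17, 4), (4, 9)]
```

One component, two featurizations: the "text to key" step is what makes the
n-gram step reusable.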

Now, much of this complexity is hidden from the user: most users will just use
the text featurization transform, select some options for n-grams and
chargrams, and not necessarily have to be aware of these internal keys.
Similarly, such a user can use the categorical or categorical hash transforms
without knowing that internally they are just the term or hash transform
followed by a `KeyToVectorMappingTransformer`. But keys are still there, and
it would be impossible to really understand ML.NET's featurization pipeline
without understanding keys. Any user that wants to debug how, say, the text
transform's multiple steps resulted in a particular featurization will have to
inspect the key values to get that understanding.

## The Representation of Keys

As an actual CLR data type, key values are stored as some form of unsigned
integer (most commonly `uint`, but the other unsigned integer types are legal
as well). One common confusion that arises from this is to ascribe too much
importance to the fact that it is a `uint`, and think these are somehow just
numbers. This is incorrect.

Most importantly, the cardinality of the set being enumerated is part of the
type itself. In an `IDataView`, key columns are represented by the
`KeyDataViewType` (or a vector of those types), with `RawType` being one of
the aforementioned .NET unsigned numeric types, and, most critically, `Count`
holding the cardinality of the set being represented. By encoding this in the
schema, downstream `ITransformer`s can tell how many distinct values the
column can take.
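
As a rough sketch of the information such a type carries, here is a
hypothetical `KeyType` class loosely mirroring `KeyDataViewType`'s `RawType`
and `Count` (illustrative Python, not the actual ML.NET type): the point is
that the representation type alone is not enough, since the count is part of
the type.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KeyType:
    raw_type: str   # "byte", "ushort", "uint", or "ulong"
    count: int      # cardinality of the underlying set

    def __post_init__(self):
        # The count must be positive and fit in the representation type.
        limits = {"byte": 2**8, "ushort": 2**16, "uint": 2**32, "ulong": 2**64}
        if not 0 < self.count <= limits[self.raw_type]:
            raise ValueError("count must fit in the representation type")

fruit_keys = KeyType("uint", count=3)     # e.g. apple, pear, orange
char_keys = KeyType("ushort", count=80)   # e.g. characters seen in a corpus
assert fruit_keys.count == 3
```

Two key columns with the same `uint` storage but different counts are
different types, and downstream components can rely on that.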

For keys, the concept of order and difference has no inherent, real meaning as
it does for numbers, or at least, the meaning is different and highly domain
dependent. Consider a numeric `uint` type (specifically,
`NumberDataViewType.UInt32`), with values `0`, `1`, and `2`. The difference
between `0` and `1` is `1`, and the difference between `1` and `2` is `1`,
because they're numbers. Very well: now consider that you call
`ValueToKeyMappingEstimator.Fit` to get the transformer over the input tokens
`apple`, `pear`, and `orange`: this will also map to keys physically
represented as the `uint`s `1`, `2`, and `3` respectively, which correspond to
the logical ordinal indices `0`, `1`, and `2`, again respectively.

Yet for a key, is the difference between the logical indices `0` and `1`, `1`?
No, the difference is that `0` maps to `apple` and `1` maps to `pear`. Also,
order doesn't mean one key is somehow "larger"; it just sometimes means we saw
one before another -- or something else, if sorting by value happened to be
selected, or if the dictionary was constructed in some other fashion.

There's also the matter of default values. For key values, the default key
value should be the "missing" value for the key. So logically, `0` is the
missing value for any key type. The alternative is that the default value
would be whatever value happened to correspond to the "first" key, which would
be very strange and unnatural. Consider the `apple`, `pear`, and `orange`
example above -- it would be inappropriate for the default value to be
`apple`, since that's fairly arbitrary. Or, to extend this reasoning to sparse
`VBuffer`s, would it be appropriate for a sparse `VBuffer` of keys to have a
value of `apple` for every implicit value? That doesn't make sense. So, the
default value is the missing value.
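
A minimal sketch of this convention (illustrative Python, not the
`ValueToKeyMappingEstimator` API): distinct values are assigned physical keys
starting at `1`, so the default `0` is free to mean "missing" or
"unrecognized".

```python
def fit_value_to_key(tokens):
    """Map each distinct token, in order of first appearance, to 1, 2, 3, ..."""
    mapping = {}
    for tok in tokens:
        if tok not in mapping:
            mapping[tok] = len(mapping) + 1
    return mapping

def apply_value_to_key(mapping, tok):
    return mapping.get(tok, 0)  # 0 == missing: the token was never observed

mapping = fit_value_to_key(["apple", "pear", "orange"])
assert mapping == {"apple": 1, "pear": 2, "orange": 3}
assert apply_value_to_key(mapping, "pear") == 2
assert apply_value_to_key(mapping, "banana") == 0  # unseen -> missing
```

Because `0` is reserved, a sparse vector of keys whose implicit entries
default to `0` naturally reads as "missing" rather than as the first item.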

One of the more confusing consequences of this is that since, practically,
these key values are more often than not used as indices of one form or
another, and the first non-missing value is `1`, in certain circumstances,
like writing out key values to text, non-missing values will be written out
starting at `0`, even though physically they are stored starting from `1` --
that is, the representation value for a non-missing value is written as that
value minus `1`.

It may be tempting to think to avoid this by using nullables, for instance,
`uint?` instead of `uint`, since `default(uint?)` is `null`, a perfectly
intuitive missing value. However, since this has some performance and space
implications, and so many critical transformers use this as an intermediate
format for featurization, the decision was that the performance gain from not
using nullables justified this modest bit of extra complexity. Note, however,
that if you take a key value with representation type `uint` and map it to a
`uint?` through operations like `MLContext.Data.CreateEnumerable`, it will
perform this more intuitive mapping.
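
The two views can be sketched as a pair of conversions (an illustrative sketch
of the convention, not `CreateEnumerable` itself): physically, `0` is missing
and valid keys start at `1`; logically, missing is `null` (here `None`) and
indices start at `0`.

```python
from typing import Optional

def to_logical(physical: int) -> Optional[int]:
    """0 -> None (missing); k -> k - 1 (zero-based logical index)."""
    return None if physical == 0 else physical - 1

def to_physical(logical: Optional[int]) -> int:
    """None (missing) -> 0; zero-based logical index k -> k + 1."""
    return 0 if logical is None else logical + 1

assert to_logical(0) is None          # physical 0 is the missing value
assert to_logical(1) == 0             # first valid key is logical index 0
assert to_physical(to_logical(3)) == 3
```

This is exactly the "minus one" behavior seen when key values are written out
to text, and the `None`-based view is the intuitive nullable mapping.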

## As an Enumeration of a Set: `KeyValues` Annotation

Since keys are an enumeration of some underlying set, there is often a
collection holding those items. This is expressed through the `KeyValues`
annotation kind. Note that this annotation is not part of the
`KeyDataViewType` structure itself, but rather part of the annotations of the
column with that type, as accessible through the `DataViewSchema.Column`
extension methods `HasKeyValues` and `GetKeyValues`.

Practically, the type of this is most often a vector of text. However, other
types are possible, and when `ValueToKeyMappingEstimator.Fit` is applied to an
input column with some item type, the resulting annotation type would be a
vector of that input item type. So if you were to apply it to a
`NumberDataViewType.Int32` column, you'd have a vector of
`NumberDataViewType.Int32` annotations.

How this annotation is used downstream depends on the purposes of whoever is
consuming it, but common uses are, in multiclass classification, determining
the human-readable class names, or, if used in featurization, determining the
names of the features, or part of the names of the features.
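
A small sketch of both uses (illustrative Python; the column name `Fruit` and
the slot-naming scheme are assumptions for the sketch): the annotation is just
a vector of length `Count` mapping each logical key index back to its original
item.

```python
# A KeyValues-style annotation on a key-typed column of count 3:
key_values = ["apple", "pear", "orange"]

def class_name(logical_index):
    """Human-readable name for a multiclass prediction expressed as a key."""
    return key_values[logical_index]

assert class_name(1) == "pear"

# Slot names for a one-hot (key-to-vector) featurization of this column:
feature_names = [f"Fruit.{v}" for v in key_values]
assert feature_names == ["Fruit.apple", "Fruit.pear", "Fruit.orange"]
```

Without the annotation, the key column would still be usable for training, but
predictions and feature slots could only be reported as bare indices.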

Note that `KeyValues` kind annotation data is optional, since it is not always
sensible to have specific values in all cases where key values are
appropriate. For example, consider the output of the `k`-means clustering
algorithm. If there were five clusters, then the prediction would indicate the
cluster by a value with a key type of count five. Yet, there is no "value"
associated with each key.

Another example is hash-based featurization: if you apply, say, a 10-bit hash,
you know you're enumerating into a set of 1024 values, so a key type is
appropriate. However, because it's a hash, you don't have any particular
"original values" associated with it.
