|
1 | 1 | # Key Values
|
2 | 2 |
|
3 |
| -Most commonly, key-values are used to encode items where it is convenient or |
4 |
| -efficient to represent values using numbers, but you want to maintain the |
5 |
| -logical "idea" that these numbers are keys indexing some underlying, implicit |
6 |
| -set of values, in a way more explicit than simply mapping to a number would |
7 |
| -allow you to do. |
| 3 | +Most commonly, in key-valued data, each value takes one of a limited number of |
| 4 | +distinct values. We might view them as being the enumeration into a set. They |
| 5 | +are represented in memory using unsigned integers. Most commonly this is |
| 6 | +`uint`, but `byte`, `ushort`, and `ulong` are possible types to use as well. |
8 | 7 |
|
9 | 8 | A more formal description of key values and types is
|
10 | 9 | [here](IDataViewTypeSystem.md#key-types). *This* document's motivation is less
|
11 | 10 | to describe what key types and values are, and more to instead describe why
|
12 | 11 | key types are necessary and helpful things to have. Necessarily, this document
|
13 | 12 | is more anecdotal in its descriptions to motivate its content.
|
14 | 13 |
|
15 |
| -Let's take a few examples of transforms that produce keys: |
| 14 | +Let's take a few examples of transformers that produce keys: |
16 | 15 |
|
17 |
| -* The `TermTransform` forms a dictionary of unique observed values to a key. |
18 |
| - The key type's count indicates the number of items in the set, and through |
19 |
| - the `KeyValue` metadata "remembers" what each key is representing. |
| 16 | +* The `ValueToKeyMappingTransformer` has a dictionary of unique values |
| 17 | + observed when it was fit, each mapped to a key-value. The key type's count |
| 18 | + indicates the number of items in the set, and through the `KeyValue` |
| 19 | + annotation "remembers" what each key is representing. |
20 | 20 |
|
21 |
| -* The `HashTransform` performs a hash of input values, and produces a key |
22 |
| - value with count equal to the range of the hash function, which, if a b bit |
23 |
| - hash was used, will produce a 2ᵇ hash. |
| 21 | +* The `TokenizingByCharactersTransformer` will take input strings and produce |
| 22 | + key values representing the characters observed in the string. The |
| 23 | + `KeyValue` annotation "remembers" what each key is representing. (Note that |
| 24 | + unlike many other key-valued operations, this uses a representation type of |
| 25 | + `ushort` instead of `uint`.) |
24 | 26 |
|
25 |
| -* The `CharTokenizeTransform` will take input strings and produce key values |
26 |
| - representing the characters observed in the string. |
| 27 | +* The `HashingTransformer` performs a hash of input values, and produces a key |
| 28 | + value with count equal to the range of the hash function, which, if a `b` |
| 29 | + bit hash was used, will produce values with a key-type of count `2ᵇ`. |
| 30 | + |
| 31 | +Note that in the first two cases, these are enumerating into a set with actual |
| 32 | +specific values, whereas in the last case we are also enumerating into a set, |
| 33 | +but one without values, since hashes don't intrinsically correspond to a |
| 34 | +single item. |
27 | 35 |
|
28 | 36 | ## Keys as Intermediate Values
|
29 | 37 |
|
30 | 38 | Explicitly invoking transforms that produce key values, and using those key
|
31 |
| -values, is sometimes helpful. However, given that most trainers expect the |
32 |
| -feature vector to be a vector of floating point values and *not* keys, in |
| 39 | +values, is sometimes helpful. However, given that trainers typically expect |
| 40 | +the feature vector to be a vector of floating point values and *not* keys, in |
33 | 41 | typical usage the majority of usages of keys is as some sort of intermediate
|
34 | 42 | value on the way to that final feature vector. (Unless, say, doing something
|
35 |
| -like preparing labels for a multiclass learner.) |
36 |
| - |
37 |
| -So why not go directly to the feature vector, and forget this key stuff? |
38 |
| -Actually, to take text as the canonical example, we used to. However, by |
39 |
| -structuring the transforms from, say, text to key to vector, rather than text |
40 |
| -to vector *directly*, we are able to simplify a lot of code on the |
41 |
| -implementation side, which is both less for us to maintain, and also for users |
42 |
| -gives consistency in behavior. |
43 |
| - |
44 |
| -So for example, the `CharTokenize` above might appear to be a strange choice: |
45 |
| -*why* represent characters as keys? The reason is that the ngram transform is |
46 |
| -written to ingest keys, not text, and so we can use the same transform for |
47 |
| -both the n-gram featurization of words, as well as n-char grams. |
| 43 | +like preparing labels for a multiclass trainer.) |
| 44 | + |
| 45 | +So why not go directly to the feature vector from whatever the input was, and |
| 46 | +forget this key stuff? Actually, to take text processing as the canonical |
| 47 | +example, we used to. However, by structuring the transforms from, say, text to |
| 48 | +key to vector, rather than text to vector *directly*, we were able to make a |
| 49 | +more flexible pipeline, and re-use smaller, simpler components. Having |
| 50 | +multiple composable transformers instead of one "omni-bus" transformer that |
| 51 | +does everything makes the process easier to understand, maintain, and exploit |
| 52 | +for novel purposes, while giving people greater visibility into the |
| 53 | +composability of what actually happens. |
| 54 | + |
| 55 | +So for example, the `TokenizingByCharactersTransformer` above might appear to |
| 56 | +be a strange choice: *why* represent characters as keys? The reason is that |
| 57 | +the ngram transform, which often comes after it, is written to ingest keys, not |
| 58 | +text, and so we can use the same transform for both the n-gram featurization |
| 59 | +of words, as well as n-char grams. |
48 | 60 |
|
49 | 61 | Now, much of this complexity is hidden from the user: most users will just use
|
50 |
| -the `text` transform, select some options for n-grams, and chargrams, and not |
51 |
| -be aware of these internal invisible keys. Similarly, use the categorical or |
52 |
| -categorical hash transforms, without knowing that internally it is just the |
53 |
| -term or hash transform followed by a `KeyToVector` transform. But, keys are |
54 |
| -still there, and it would be impossible to really understand ML.NET's |
55 |
| -featurization pipeline without understanding keys. Any user that wants to |
56 |
| -understand how, say, the text transform resulted in a particular featurization |
| 62 | +the text featurization transform, select some options for n-grams and |
| 63 | +chargrams, and not necessarily be aware that these keys are used internally. |
| 64 | +Similarly, this user can use the categorical or categorical |
| 65 | +hash transforms, without knowing that internally it is just the term or hash |
| 66 | +transform followed by a `KeyToVectorMappingTransformer`. But, keys are still |
| 67 | +there, and it would be impossible to really understand ML.NET's featurization |
| 68 | +pipeline without understanding keys. Any user that wants to debug how, say, |
| 69 | +the text transform's multiple steps resulted in a particular featurization |
57 | 70 | will have to inspect the key values to get that understanding.
|
58 | 71 |
|
59 |
| -## Keys are not Numbers |
| 72 | +## The Representation of Keys |
60 | 73 |
|
61 | 74 | As an actual CLR data type, key values are stored as some form of unsigned
|
62 |
| -integer (most commonly `uint`). The most common confusion that arises from |
63 |
| -this is to ascribe too much importance to the fact that it is a `uint`, and |
64 |
| -think these are somehow just numbers. This is incorrect. |
| 75 | +integer (most commonly `uint`, but the other unsigned integer types are legal |
| 76 | +as well). One common confusion that arises from this is to ascribe too much |
| 77 | +importance to the fact that it is a `uint`, and think these are somehow just |
| 78 | +numbers. This is incorrect. |
| 79 | + |
| 80 | +Most importantly, the cardinality of the set they're enumerating is part of |
| 81 | +the type, which is critical information. In an `IDataView`, these are represented |
| 82 | +by the `KeyDataViewType` (or a vector of those types), with `RawType` being |
| 83 | +one of the aforementioned .NET unsigned numeric types, and most critically |
| 84 | +`Count` holding the cardinality of the set being represented. By encoding this |
| 85 | +in the schema, downstream `ITransformer`s can tell the cardinality of the set. |
65 | 86 |
|
66 | 87 | For keys, the concept of order and difference has no inherent, real meaning as
|
67 | 88 | it does for numbers, or at least, the meaning is different and highly domain
|
68 |
| -dependent. Consider a numeric `U4` type, with values `0`, `1`, and `2`. The |
69 |
| -difference between `0` and `1` is `1`, and the difference between `1` and `2` |
70 |
| -is `1`, because they're numbers. Very well: now consider that you train a term |
71 |
| -transform over the input tokens `apple`, `pear`, and `orange`: this will also |
72 |
| -map to the keys logically represented as the numbers `0`, `1`, and `2` |
73 |
| -respectively. Yet for a key, is the difference between keys `0` and `1`, `1`? |
| 89 | +dependent. Consider a numeric `uint` type (specifically, |
| 90 | +`NumberDataViewType.UInt32`), with values `0`, `1`, and `2`. The difference |
| 91 | +between `0` and `1` is `1`, and the difference between `1` and `2` is `1`, |
| 92 | +because they're numbers. Very well: now consider that you call |
| 93 | +`ValueToKeyMappingEstimator.Fit` to get the transformer over the input tokens |
| 94 | +`apple`, `pear`, and `orange`: this will also map to the keys physically |
| 95 | +represented as the `uint`s `1`, `2`, and `3` respectively, which corresponds |
| 96 | +to the logical ordinal indices of `0`, `1`, and `2`, again respectively. |
| 97 | + |
| 98 | +Yet for a key, is the difference between the logical indices `0` and `1`, `1`? |
74 | 99 | No, the difference is `0` maps to `apple` and `1` to `pear`. Also order
|
75 |
| -doesn't mean one key is somehow "larger," it just means we saw one before |
76 |
| -another -- or something else, if sorting by value happened to be selected. |
77 |
| - |
78 |
| -Also: ML.NET's vectors can be sparse. Implicit entries in a sparse vector are |
79 |
| -assumed to have the `default` value for that type -- that is, implicit values |
80 |
| -for numeric types will be zero. But what would be the implicit default value |
81 |
| -for a key value be? Take the `apple`, `pear`, and `orange` example above -- it |
82 |
| -would inappropriate for the default value to be `0`, because that means the |
83 |
| -result is `apple`, would be appropriate. The only really appropriate "default" |
84 |
| -choice is that the value is unknown, that is, missing. |
85 |
| - |
86 |
| -An implication of this is that there is a distinction between the logical |
87 |
| -value of a key-value, and the actual physical value of the value in the |
88 |
| -underlying type. This will be covered more later. |
89 |
| - |
90 |
| -## As an Enumeration of a Set: `KeyValues` Metadata |
91 |
| - |
92 |
| -While keys can be used for many purposes, they are often used to enumerate |
93 |
| -items from some underlying set. In order to map keys back to this original |
94 |
| -set, many transform producing key values will also produce `KeyValues` |
95 |
| -metadata associated with that output column. |
96 |
| - |
97 |
| -Valid `KeyValues` metadata is a vector of length equal to the count of the |
98 |
| -type of the column. This can be of varying types: it is often text, but does |
99 |
| -not need to be. For example, a `term` applied to a column would have |
100 |
| -`KeyValue` metadata of item type equal to the item type of the input data. |
101 |
| - |
102 |
| -How this metadata is used downstream depends on the purposes of who is |
103 |
| -consuming it, but common uses are: in multiclass classification, for |
| 100 | +doesn't mean one key is somehow "larger," it just sometimes means we saw one |
| 101 | +before another -- or something else, if sorting by value happened to be |
| 102 | +selected, or if the dictionary was constructed in some other fashion. |
| 103 | + |
| 104 | +There's also the matter of default values. For key values, the default key |
| 105 | +value should be the "missing" value for the key. So logically, `0` is the |
| 106 | +missing value for any key type. The alternative is that the default value |
| 107 | +would be whatever key value happened to correspond to the "first" key value, |
| 108 | +which would be very strange and unnatural. Consider the `apple`, `pear`, and |
| 109 | +`orange` example above -- it would be inappropriate for the default value to |
| 110 | +be `apple`, since that's fairly arbitrary. Or, to extend this reasoning to |
| 111 | +sparse `VBuffer`s, would it be appropriate for a sparse `VBuffer` of keys to |
| 112 | +have a value of `apple` for every implicit value? That doesn't make sense. So, |
| 113 | +the default value is the missing value. |
| 114 | + |
| 115 | +One of the more confusing consequences of this follows from the fact that, |
| 116 | +practically, these key values are more often than not used as indices of one |
| 117 | +form or another, while the first non-missing value is `1`. In certain |
| 118 | +circumstances, say, when writing out key values to text, non-missing values |
| 119 | +will be written out starting at `0`, even though physically they are stored |
| 120 | +starting from the `1` value -- that is, a non-missing value is written as its |
| 121 | +representation value minus `1`. |
| 122 | + |
| 123 | +It may be tempting to think to avoid this by using nullables, for instance, |
| 124 | +`uint?` instead of `uint`, since `default(uint?)` is `null`, a perfectly |
| 125 | +intuitive missing value. However, since this has some performance and space |
| 126 | +implications, and so many critical transformers use this as an intermediate |
| 127 | +format for featurization, the decision was that the performance gain we get |
| 128 | +from not using nullables justified this modest bit of extra complexity. Note |
| 129 | +however, that if you take a key-value with representation type `uint` and map |
| 130 | +it to a `uint?` through operations like `MLContext.Data.CreateEnumerable`, it |
| 131 | +will perform this more intuitive mapping. |
| 132 | + |
| 133 | +## As an Enumeration of a Set: `KeyValues` Annotation |
| 134 | + |
| 135 | +Since keys are an enumeration of some underlying set, there is often a |
| 136 | +collection holding those items. This is expressed through the `KeyValues` |
| 137 | +annotation kind. Note that this annotation is not part of the |
| 138 | +`KeyDataViewType` structure itself, but rather the annotations of the column |
| 139 | +with that type, as accessible through the `DataViewSchema.Column` extension |
| 140 | +methods `HasKeyValues` and `GetKeyValues`. |
| 141 | + |
| 142 | +Practically, the type of this is most often a vector of text. However, other |
| 143 | +types are possible, and when `ValueToKeyMappingEstimator.Fit` is applied to an |
| 144 | +input column with some item type, the resulting annotation type would be a |
| 145 | +vector of that input item type. So if you were to apply it to a |
| 146 | +`NumberDataViewType.Int32` column, you'd have a vector of |
| 147 | +`NumberDataViewType.Int32` annotations. |
| 148 | + |
| 149 | +How this annotation is used downstream depends on the purposes of who is |
| 150 | +consuming it, but common uses are, in multiclass classification, for |
104 | 151 | determining the human readable class names, or if used in featurization,
|
105 |
| -determining the names of the features. |
106 |
| - |
107 |
| -Note that `KeyValues` data is optional, and sometimes is not even sensible. |
108 |
| -For example, if we consider a clustering algorithm, the prediction of the |
109 |
| -cluster of an example would. So for example, if there were five clusters, then |
110 |
| -the prediction would indicate the cluster by `U4<0-4>`. Yet, these clusters |
111 |
| -were found by the algorithm itself, and they have no natural descriptions. |
112 |
| - |
113 |
| -## Actual Implementation |
114 |
| - |
115 |
| -This may be of use only to writers or extenders of ML.NET, or users of our |
116 |
| -API. How key values are presented *logically* to users of ML.NET, is distinct |
117 |
| -from how they are actually stored *physically* in actual memory, both in |
118 |
| -ML.NET source and through the API. For key values: |
119 |
| - |
120 |
| -* All key values are stored in unsigned integers. |
121 |
| -* The missing key values is always stored as `0`. See the note above about the |
122 |
| - default value, to see why this must be so. |
123 |
| -* Valid non-missing key values are stored from `1`, onwards, irrespective of |
124 |
| -whatever we claim in the key type that minimum value is. |
125 |
| - |
126 |
| -So when, in the prior example, the term transform would map `apple`, `pear`, |
127 |
| -and `orange` seemingly to `0`, `1`, and `2`, values of `U4<0-2>`, in reality, |
128 |
| -if you were to fire up the debugger you would see that they were stored with |
129 |
| -`1`, `2`, and `3`, with unrecognized values being mapped to the "default" |
130 |
| -missing value of `0`. |
131 |
| - |
132 |
| -Nevertheless, we almost never talk about this, no more than we would talk |
133 |
| -about our "strings" really being implemented as string slices: this is purely |
134 |
| -an implementation detail, relevant only to people working with key values at |
135 |
| -the source level. To a regular non-API user of ML.NET, key values appear |
136 |
| -*externally* to be simply values, just as strings appear to be simply strings, |
137 |
| -and so forth. |
138 |
| - |
139 |
| -There is another implication: a hypothetical type `U1<4000-4002>` is actually |
140 |
| -a sensible type in this scheme. The `U1` indicates that is stored in one byte, |
141 |
| -which would on first glance seem to conflict with values like `4000`, but |
142 |
| -remember that the first valid key-value is stored as `1`, and we've identified |
143 |
| -the valid range as spanning the three values 4000 through 4002. That is, |
144 |
| -`4000` would be represented physically as `1`. |
145 |
| - |
146 |
| -The reality cannot be seen by any conventional means I am aware of, save for |
147 |
| -viewing ML.NET's workings in the debugger or using the API and inspecting |
148 |
| -these raw values yourself: that `4000` you would see is really stored as the |
149 |
| -`byte` `1`, `4001` as `2`, `4002` as `3`, and a missing value stored as `0`. |
| 152 | +determining the names of the features, or part of the names of the features. |
| 153 | + |
| 154 | +Note that `KeyValues` kind annotation data is optional, since it is not always |
| 155 | +sensible to have specific values in all cases where key values are |
| 156 | +appropriate. For example, consider the output of the `k`-means clustering |
| 157 | +algorithm. If there were five clusters, then the prediction would indicate the |
| 158 | +cluster by a value with key-type of count five. Yet, there is no "value" |
| 159 | +associated with each key. |
| 160 | + |
| 161 | +Another example is hash based featurization: if you apply, say, a 10-bit hash, |
| 162 | +you know you're enumerating into a set of 1024 values, so a key type is |
| 163 | +appropriate. However, because it's a hash you don't have any particular |
| 164 | +"original values" associated with it. |