|
95 | 95 | <para>Generally, ensemble models provide better coverage and accuracy than single decision trees.
|
96 | 96 | Each tree in a decision forest outputs a Gaussian distribution.</para>
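<para>As a rough illustration only (not the library's actual implementation), the per-tree Gaussians can be combined into a single predictive mean and variance by treating the ensemble as an equal-weight mixture; the sketch below assumes each tree reports a leaf mean and variance.</para>
<code language='python'>
import numpy as np

def ensemble_distribution(tree_means, tree_vars):
    """Combine per-tree Gaussian outputs (mean_i, var_i) into one predictive
    mean and variance, treating the ensemble as an equal-weight mixture of
    Gaussians (an assumption made here purely for illustration)."""
    tree_means = np.asarray(tree_means, dtype=float)
    tree_vars = np.asarray(tree_vars, dtype=float)
    mean = tree_means.mean()
    # Law of total variance: average within-tree variance plus the
    # variance of the per-tree means.
    var = tree_vars.mean() + tree_means.var()
    return mean, var

# Example: three trees voting on a single input.
print(ensemble_distribution([1.0, 1.2, 0.8], [0.05, 0.04, 0.06]))
</code>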
|
97 | 97 | <para>For more information, see: </para>
|
98 |
| - <list> |
| 98 | + <list type='bullet'> |
99 | 99 | <item><description><a href='http://en.wikipedia.org/wiki/Random_forest'>Wikipedia: Random forest</a></description></item>
|
100 | 100 | <item><description><a href='http://jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf'>Quantile regression forest</a></description></item>
|
101 | 101 | <item><description><a href='https://blogs.technet.microsoft.com/machinelearning/2014/09/10/from-stumps-to-trees-to-forests/'>From Stumps to Trees to Forests</a></description></item>
|
|
146 | 146 | <summary>
|
147 | 147 | Trains a tree ensemble, or loads it from a file, then maps a numeric feature vector
|
148 | 148 | to three outputs:
|
149 |
| - <list> |
| 149 | + <list type='number'> |
150 | 150 | <item><description>A vector containing the individual tree outputs of the tree ensemble.</description></item>
|
151 | 151 | <item><description>A vector indicating, for each tree in the ensemble, the leaf that the feature vector falls into.</description></item>
|
152 | 152 | <item><description>A vector indicating, for each tree in the ensemble, the path that the feature vector takes to reach its leaf.</description></item>
|
|
157 | 157 | </summary>
|
158 | 158 | <remarks>
|
159 | 159 | In machine learning, it is a common and powerful approach to use an already trained model when defining new features.
|
160 |
| - <para>One such example would be the use of model's scores as features to downstream models. For example, we might run clustering on the original features, |
| 160 | + <para>One such example would be the use of a model's scores as features for downstream models. For example, we might run clustering on the original features,
161 | 161 | and use the cluster distances as the new feature set.
|
162 |
| - Instead of consuming the model's output, we could go deeper, and extract the 'intermediate outputs' that are used to produce the final score. </para> |
| 162 | + Instead of consuming the model's output, we could go deeper, and extract the 'intermediate outputs' that are used to produce the final score. </para> |
163 | 163 | There are a number of well-known examples of this technique:
|
164 |
| - <list> |
165 |
| - <item><description>A deep neural net trained on the ImageNet dataset, with the last layer removed, is commonly used to compute the 'projection' of the image into the 'semantic feature space'. |
166 |
| - It is observed that the Euclidean distance in this space often correlates with the 'semantic similarity': that is, all pictures of pizza are located close together, |
| 164 | + <list type='bullet'> |
| 165 | + <item><description>A deep neural net trained on the ImageNet dataset, with the last layer removed, is commonly used to compute the 'projection' of the image into the 'semantic feature space'. |
| 166 | + It is observed that the Euclidean distance in this space often correlates with the 'semantic similarity': that is, all pictures of pizza are located close together, |
167 | 167 | and far away from pictures of kittens. </description></item>
|
168 |
| - <item><description>A matrix factorization and/or LDA model is also often used to extract the 'latent topics' or 'latent features' associated with users and items.</description></item> |
169 |
| - <item><description>The weights of the linear model are often used as a crude indicator of 'feature importance'. At the very minimum, the 0-weight features are not needed by the model, |
170 |
| - and there's no reason to compute them. </description></item> |
| 168 | + <item><description>A matrix factorization and/or LDA model is also often used to extract the 'latent topics' or 'latent features' associated with users and items.</description></item> |
| 169 | + <item><description>The weights of a linear model are often used as a crude indicator of 'feature importance'. At the very minimum, the 0-weight features are not needed by the model, |
| 170 | + and there's no reason to compute them. </description></item> |
171 | 171 | </list>
|
172 | 172 | <para>The tree featurizer uses decision tree ensembles for feature engineering in the same fashion as the examples above.</para>
|
173 |
| - <para>Let's assume that we've built a tree ensemble of 100 trees with 100 leaves each (it doesn't matter whether boosting was used or not in training). |
| 173 | + <para>Let's assume that we've built a tree ensemble of 100 trees with 100 leaves each (it doesn't matter whether boosting was used or not in training). |
174 | 174 | If we associate each leaf of each tree with a sequential integer, we can, for every incoming example x,
|
175 |
| - produce an indicator vector L(x), where Li(x) = 1 if the example x 'falls' into the leaf #i, and 0 otherwise.</para> |
| 175 | + produce an indicator vector L(x), where Li(x) = 1 if the example x 'falls' into the leaf #i, and 0 otherwise.</para> |
176 | 176 | <para>Thus, for every example x, we produce a 10000-dimensional vector L, with exactly 100 ones and the rest zeroes.
|
177 |
| - This 'leaf indicator' vector can be considered the ensemble-induced 'footprint' of the example.</para> |
178 |
| - <para>The 'distance' between two examples in the L-space is actually a Hamming distance, and is equal to the number of trees that do not distinguish the two examples.</para> |
| 177 | + This 'leaf indicator' vector can be considered the ensemble-induced 'footprint' of the example.</para> |
| 178 | + <para>The 'distance' between two examples in the L-space is actually a Hamming distance, and is equal to twice the number of trees that distinguish the two examples (that is, send them to different leaves).</para> |
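<para>The following is a minimal sketch of the leaf-indicator idea using scikit-learn (1.2 or later) rather than the ML.NET API; the forest, its size, and the data are hypothetical, chosen only to mirror the 100-tree example above.</para>
<code language='python'>
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
forest = RandomForestRegressor(n_estimators=100, max_leaf_nodes=100,
                               random_state=0).fit(X, y)

# apply() returns, for every example, the index of the leaf it reaches
# in each of the 100 trees.
leaf_ids = forest.apply(X)                      # shape: (n_samples, 100)

# One-hot encode per tree and concatenate the columns: this is the leaf
# indicator vector L(x), with exactly one 1 per tree.
L = OneHotEncoder(sparse_output=False).fit_transform(leaf_ids)

# Hamming distance in L-space between examples 0 and 1:
# 2 * (number of trees that send them to different leaves).
print(L.shape, int(np.sum(L[0] != L[1])))
</code>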
179 | 179 | <para>We could repeat the same thought process for the non-leaf, or internal, nodes of the trees (we know that each tree has exactly 99 of them in our 100-leaf example),
|
180 |
| - and produce another indicator vector, N (size 9900), for each example, indicating the 'trajectory' of each example through each of the trees.</para> |
181 |
| - <para>The distance in the combined 19900-dimensional LN-space will be equal to the number of 'decisions' in all trees that 'agree' on the given pair of examples.</para> |
| 180 | + and produce another indicator vector, N (size 9900), for each example, indicating the 'trajectory' of each example through each of the trees.</para> |
| 181 | + <para>The distance in the combined 19900-dimensional LN-space will be equal to the number of 'decisions' across all trees on which the two examples diverge (nodes visited by one example but not the other).</para> |
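<para>Continuing the scikit-learn sketch above, decision_path() exposes the per-node trajectory; note that scikit-learn's indicator covers leaves as well as internal nodes, so it corresponds to the combined LN-space rather than N alone.</para>
<code language='python'>
# decision_path() returns a sparse indicator over all nodes of all trees:
# entry (i, j) is 1 if example i passes through node j.
node_indicator, n_nodes_ptr = forest.decision_path(X)
paths = node_indicator.toarray()

# Distance in this space between examples 0 and 1 = number of nodes
# visited by exactly one of them, i.e. the decisions on which they diverge.
print(paths.shape, int(np.sum(paths[0] != paths[1])))
</code>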
182 | 182 | <para>The TreeLeafFeaturizer also produces a third vector, T, defined as Ti(x) = the output of tree #i on example x.</para>
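<para>The per-tree output vector T can likewise be sketched by querying each member of the hypothetical scikit-learn forest directly; for a plain random forest the ensemble prediction is simply the mean of these per-tree outputs.</para>
<code language='python'>
# T(x): column i holds the raw prediction of tree #i for each example.
T = np.stack([tree.predict(X) for tree in forest.estimators_], axis=1)
print(T.shape)                                   # (n_samples, 100)

# Sanity check: the forest's prediction is the average of per-tree outputs.
assert np.allclose(T.mean(axis=1), forest.predict(X))
</code>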
|
183 | 183 | </remarks>
|
184 | 184 | <example>
|
|