@@ -123,10 +123,10 @@ exclude fields that either contain erroneous data or describe the
 `dependent_variable`.
 .. Choose a training percent of `90` which means it randomly selects 90% of the
 source data for training.
-.. If you want to experiment with <<ml-feature-importance,feature importance >>,
-specify a value in the advanced configuration options. In this example, we
-choose to return a maximum of 5 feature importance values per document. This
-option affects the speed of the analysis, so by default it is disabled.
+.. If you want to experiment with <<ml-feature-importance,{feat-imp}>>, specify
+a value in the advanced configuration options. In this example, we choose to
+return a maximum of 5 {feat-imp} values per document. This option affects the
+speed of the analysis, so by default it is disabled.
 .. Use the default memory limit for the job. If the job requires more than this
 amount of memory, it fails to start. If the available memory on the node is
 limited, this setting makes it possible to prevent job execution.
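For readers who configure the job through the API rather than the {kib} wizard, the options described in this hunk map roughly onto the create data frame analytics job request below. This is a sketch: the job ID, index names, and the `FlightDelayMin` dependent variable are assumptions drawn from this tutorial's flight-sample-data example.

```json
PUT _ml/data_frame/analytics/model-flight-delays
{
  "source": { "index": "kibana_sample_data_flights" },
  "dest":   { "index": "model-flight-delays" },
  "analysis": {
    "regression": {
      "dependent_variable": "FlightDelayMin",
      "training_percent": 90,
      "num_top_feature_importance_values": 5
    }
  }
}
```

Omitting `num_top_feature_importance_values` (or setting it to `0`) leaves feature importance disabled, matching the default behavior the text describes.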
@@ -329,16 +329,24 @@ table to show only testing or training data and you can select which fields are
 shown in the table. You can also enable histogram charts to get a better
 understanding of the distribution of values in your data.
 
-If you chose to calculate feature importance, the destination index also
-contains `ml.feature_importance` objects. Every field that is included in the
-{reganalysis} (known as a _feature_ of the data point) is assigned a feature
-importance value. However, only the most significant values (in this case, the
-top 5) are stored in the index. These values indicate which features had the
-biggest (positive or negative) impact on each prediction. In {kib}, you can see
-this information displayed in the form of a decision plot:
+If you chose to calculate {feat-imp}, the destination index also contains
+`ml.feature_importance` objects. Every field that is included in the
+{reganalysis} (known as a _feature_ of the data point) is assigned a {feat-imp}
+value. This value has both a magnitude and a direction (positive or negative),
+which indicates how each field affects a particular prediction. Only the most
+significant values (in this case, the top 5) are stored in the index. However,
+the trained model metadata also contains the average magnitude of the {feat-imp}
+values for each field across all the training data. You can view this
+summarized information in {kib}:
 
 [role="screenshot"]
-image::images/flights-regression-importance.png["A decision plot for feature importance values in {kib}"]
+image::images/flights-regression-total-importance.png["Total {feat-imp} values in {kib}"]
+
+You can also see the {feat-imp} values for each individual prediction in the
+form of a decision plot:
+
+[role="screenshot"]
+image::images/flights-regression-importance.png["A decision plot for {feat-imp} values in {kib}"]
 
 The decision path starts at a baseline, which is the average of the predictions
 for all the data points in the training data set. From there, the feature
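The additive relationship behind the decision plot can be sketched in a few lines: a document's prediction is (approximately) the training-set baseline plus the sum of that document's feature importance values. All numbers and field names below are illustrative, not taken from a real job.

```python
# Sketch of how per-document feature importance values relate to a
# prediction: start at the baseline (the average prediction over the
# training data) and add each feature's signed contribution.
# Values are made up for illustration.
baseline = 47.5  # minutes of delay, averaged over the training set

# Hypothetical ml.feature_importance entries for one document.
feature_importance = [
    {"feature_name": "DistanceKilometers", "importance": 12.3},
    {"feature_name": "OriginWeather", "importance": -3.1},
    {"feature_name": "dayOfWeek", "importance": 0.8},
]

# The decision path: each step moves the running total up or down.
prediction = baseline + sum(f["importance"] for f in feature_importance)
print(round(prediction, 1))  # 57.5
```

Features with positive importance push the predicted delay above the baseline; negative values pull it below.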
@@ -350,12 +358,60 @@ delay. This type of information can help you to understand how models arrive at
 their predictions. It can also indicate which aspects of your data set are most
 influential or least useful when you are training and tuning your model.
 
-If you do not use {kib}, you can see the same information by using the standard
-{es} search command to view the results in the destination index.
+If you do not use {kib}, you can see summarized {feat-imp} values by using the
+{ref}/get-inference.html[get trained model API] and the individual values by
+searching the destination index.
 
 .API example
 [%collapsible]
 ====
+[source,console]
+--------------------------------------------------
+GET _ml/inference/model-flight-delays*?include=total_feature_importance
+--------------------------------------------------
+// TEST[skip:TBD]
+
+The snippet below shows an example of the total feature importance details in
+the trained model metadata:
+
+[source,console-result]
+----
+{
+  "count" : 1,
+  "trained_model_configs" : [
+    {
+      "model_id" : "model-flight-delays-1601312043770",
+      ...
+      "metadata" : {
+        ...
+        "total_feature_importance" : [
+          {
+            "feature_name" : "dayOfWeek",
+            "importance" : {
+              "mean_magnitude" : 0.38674590521018903, <1>
+              "min" : -9.42823116446923, <2>
+              "max" : 8.707461689065173 <3>
+            }
+          },
+          {
+            "feature_name" : "OriginWeather",
+            "importance" : {
+              "mean_magnitude" : 0.18548393012368913,
+              "min" : -9.079576266629092,
+              "max" : 5.142479101907649
+            }
+          ...
+----
+<1> This value is the average of the absolute {feat-imp} values for the
+`dayOfWeek` field across all the training data.
+<2> This value is the minimum {feat-imp} value across all the training data for
+this field.
+<3> This value is the maximum {feat-imp} value across all the training data for
+this field.
+
+To see the top {feat-imp} values for each prediction, search the destination
+index. For example:
+
 [source,console]
 --------------------------------------------------
 GET model-flight-delays/_search
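The summary statistics in the metadata shown above can be reproduced from per-document values: `mean_magnitude` is the mean of the absolute importances for a field, while `min` and `max` are the signed extremes. A small sketch with made-up values:

```python
# Sketch: derive total feature importance statistics for one field from
# its per-document importance values, matching the semantics of
# mean_magnitude, min, and max described in the callouts above.
# The values are illustrative, not from a real job.
doc_importances = [4.0, -2.0, 0.5, -6.5, 1.0]  # one value per training document

mean_magnitude = sum(abs(v) for v in doc_importances) / len(doc_importances)
minimum = min(doc_importances)
maximum = max(doc_importances)

print(mean_magnitude, minimum, maximum)  # 2.8 -6.5 4.0
```

Note that `mean_magnitude` is always non-negative, while `min` and `max` keep their signs, which is why `min` can be strongly negative even for an influential field.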
@@ -399,6 +455,7 @@ The snippet below shows a part of a document with the annotated results:
 }
 ...
 ----
+
 ====
 
 [[flightdata-regression-evaluate]]