
Commit a0d4355

[DOCS] Adds feature importance regression example (#1360) (#1361)
1 parent 9685023 commit a0d4355

6 files changed: +90 -45 lines


docs/en/stack/ml/df-analytics/flightdata-regression.asciidoc

Lines changed: 90 additions & 45 deletions
@@ -106,9 +106,13 @@ image::images/flights-regression-job-1.png["Creating a {dfanalytics-job} in {kib
 [role="screenshot"]
 image::images/flights-regression-job-2.png["Creating a {dfanalytics-job} in {kib}" – continued]
 
+[role="screenshot"]
+image::images/flights-regression-job-3.png["Creating a {dfanalytics-job} in {kib}" – advanced options]
+
 
 .. Choose `kibana_sample_data_flights` as the source index.
 .. Choose `regression` as the job type.
+.. Optionally improve the quality of the analysis by adding a query that removes erroneous data. In this case, we omit flights with a distance of 0 kilometers or less.
 .. Choose `FlightDelayMin` as the dependent variable, which is the field that we
 want to predict with the {reganalysis}.
 .. Add `Cancelled`, `FlightDelay`, and `FlightDelayType` to the list of excluded
@@ -117,16 +121,18 @@ exclude fields that either contain erroneous data or describe the
 `dependent_variable`.
 .. Choose a training percent of `90` which means it randomly selects 90% of the
 source data for training.
-.. Use the default feature importance values.
+.. If you want to experiment with <<ml-feature-importance,feature importance>>,
+specify a value in the advanced configuration options. In this example, we
+choose to return a maximum of 5 feature importance values per document. This
+option affects the speed of the analysis, so by default it is disabled.
 .. Use the default memory limit for the job. If the job requires more than this
 amount of memory, it fails to start. If the available memory on the node is
 limited, this setting makes it possible to prevent job execution.
 .. Add a job ID and optionally a job description.
 .. Add the name of the destination index that will contain the results of the
-analysis. It will contain a copy of the source index data where each document is
-annotated with the results. If the index does not exist, it will be created
-automatically.
-
+analysis. In {kib}, the index name matches the job ID by default. It will
+contain a copy of the source index data where each document is annotated with
+the results. If the index does not exist, it will be created automatically.
 
 .API example
 [%collapsible]
@@ -139,7 +145,7 @@ PUT _ml/data_frame/analytics/model-flight-delays
     "index": [
       "kibana_sample_data_flights"
     ],
-    "query": { <1>
+    "query": {
       "range": {
         "DistanceKilometers": {
           "gt": 0
@@ -148,7 +154,7 @@ PUT _ml/data_frame/analytics/model-flight-delays
     }
   },
   "dest": {
-    "index": "df-flight-delays"
+    "index": "model-flight-delays"
   },
   "analysis": {
     "regression": {
@@ -167,9 +173,6 @@ PUT _ml/data_frame/analytics/model-flight-delays
 }
 --------------------------------------------------
 // TEST[skip:setup kibana sample data]
-
-<1> Optional query that removes erroneous data from the analysis to improve
-quality.
 ====
 --
 
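The hunks above do not show the `analysis` section of the request body, which is where the feature importance option from the step list is configured. The Python sketch below builds an illustrative fragment of that section; the parameter names follow the Elasticsearch create data frame analytics job API, but the exact body used by the commit is not visible in this diff.

[source,python]
----
# Illustrative sketch only: how the feature importance option described in the
# step list might appear in the "analysis" section of the job configuration.
# The exact values and layout used by the commit are not shown in this diff.
import json

analysis = {
    "regression": {
        "dependent_variable": "FlightDelayMin",
        "training_percent": 90,
        # Return at most 5 feature importance values per document.
        "num_top_feature_importance_values": 5,
    }
}

print(json.dumps(analysis, indent=2))
----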
@@ -263,37 +266,36 @@ The API call returns the following response:
       "skipped_docs_count" : 0
     },
     "memory_usage" : {
-      "timestamp" : 1596237978801,
-      "peak_usage_bytes" : 2204548,
+      "timestamp" : 1599773614155,
+      "peak_usage_bytes" : 50156565,
       "status" : "ok"
     },
     "analysis_stats" : {
       "regression_stats" : {
-        "timestamp" : 1596237978801,
+        "timestamp" : 1599773614155,
         "iteration" : 18,
         "hyperparameters" : {
-          "alpha" : 168825.7788898173,
-          "downsample_factor" : 0.9033277769849748,
-          "eta" : 0.04884738703731517,
-          "eta_growth_rate_per_tree" : 1.0299887790757198,
+          "alpha" : 19042.721566629778,
+          "downsample_factor" : 0.911884068909842,
+          "eta" : 0.02331774683318904,
+          "eta_growth_rate_per_tree" : 1.0143154178910303,
           "feature_bag_fraction" : 0.5504020748926737,
-          "gamma" : 1454.4275926774008,
-          "lambda" : 2.1114872989215074,
+          "gamma" : 53.373570122718846,
+          "lambda" : 2.94058933878574,
           "max_attempts_to_add_tree" : 3,
           "max_optimization_rounds_per_hyperparameter" : 2,
-          "max_trees" : 427,
+          "max_trees" : 894,
           "num_folds" : 4,
           "num_splits_per_feature" : 75,
-          "soft_tree_depth_limit" : 5.8014874129785,
+          "soft_tree_depth_limit" : 2.945317520946171,
           "soft_tree_depth_tolerance" : 0.13448633124842999
         },
         "timing_stats" : {
-          "elapsed_time" : 124851,
-          "iteration_time" : 15081
+          "elapsed_time" : 302959,
+          "iteration_time" : 13075
         },
         "validation_loss" : {
-          "loss_type" : "mse",
-          "fold_values" : [ ]
+          "loss_type" : "mse"
         }
       }
     }
@@ -325,6 +327,27 @@ table to show only testing or training data and you can select which fields are
 shown in the table. You can also enable histogram charts to get a better
 understanding of the distribution of values in your data.
 
+If you chose to calculate feature importance, the destination index also
+contains `ml.feature_importance` objects. Every field that is included in the
+{reganalysis} (known as a _feature_ of the data point) is assigned a feature
+importance value. However, only the most significant values (in this case, the
+top 5) are stored in the index. These values indicate which features had the
+biggest (positive or negative) impact on each prediction. In {kib}, you can see
+this information displayed in the form of a decision plot:
+
+[role="screenshot"]
+image::images/flights-regression-importance.png["A decision plot for feature importance values in {kib}"]
+
+The decision path starts at a baseline, which is the average of the predictions
+for all the data points in the training data set. From there, the feature
+importance values are added to the decision path until it arrives at its final
+prediction. The features with the most significant positive or negative impact
+appear at the top. Thus in this example, the features related to the flight
+distance had the most significant influence on this particular predicted flight
+delay. This type of information can help you to understand how models arrive at
+their predictions. It can also indicate which aspects of your data set are most
+influential or least useful when you are training and tuning your model.
+
 If you do not use {kib}, you can see the same information by using the standard
 {es} search command to view the results in the destination index.
 
@@ -333,7 +356,7 @@ If you do not use {kib}, you can see the same information by using the standard
 ====
 [source,console]
 --------------------------------------------------
-GET df-flight-delays/_search
+GET model-flight-delays/_search
 --------------------------------------------------
 // TEST[skip:TBD]
 
@@ -342,13 +365,35 @@ The snippet below shows a part of a document with the annotated results:
 [source,console-result]
 ----
 ...
-  "DestCountry" : "GB",
-  "DestRegion" : "GB-ENG",
-  "OriginAirportID" : "CAN",
-  "DestCityName" : "London",
-  "ml" : {
-    "FlightDelayMin_prediction" : 10.039840698242188,
-    "is_training" : true
+  "DestCountry" : "CH",
+  "DestRegion" : "CH-ZH",
+  "OriginAirportID" : "VIE",
+  "DestCityName" : "Zurich",
+  "ml": {
+    "FlightDelayMin_prediction": 277.5392150878906,
+    "feature_importance": [
+      {
+        "feature_name": "DestCityName",
+        "importance": 0.6285966753441136
+      },
+      {
+        "feature_name": "DistanceKilometers",
+        "importance": 84.4982943868267
+      },
+      {
+        "feature_name": "DistanceMiles",
+        "importance": 103.90011847132116
+      },
+      {
+        "feature_name": "FlightTimeHour",
+        "importance": 3.7119156097309345
+      },
+      {
+        "feature_name": "FlightTimeMin",
+        "importance": 38.700587425831365
+      }
+    ],
+    "is_training": true
   }
 ...
 ----
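To make the decision-path arithmetic described above concrete, here is a small Python sketch (not part of the commit) that rebuilds the prediction for the example document from its stored importance values. The baseline of 46 minutes is a made-up stand-in for the average prediction over the training data, and because only the top 5 importance values are kept in the index the reconstruction is only approximate.

[source,python]
----
# Sketch only: prediction ~ baseline + sum of the document's stored
# feature importance values. The baseline below is hypothetical, and only
# the top 5 importance values are stored, so the result is approximate.
doc_ml = {
    "FlightDelayMin_prediction": 277.5392150878906,
    "feature_importance": [
        {"feature_name": "DestCityName", "importance": 0.6285966753441136},
        {"feature_name": "DistanceKilometers", "importance": 84.4982943868267},
        {"feature_name": "DistanceMiles", "importance": 103.90011847132116},
        {"feature_name": "FlightTimeHour", "importance": 3.7119156097309345},
        {"feature_name": "FlightTimeMin", "importance": 38.700587425831365},
    ],
}

baseline = 46.0  # hypothetical average prediction over the training data set

# Sort so the most influential features come first, as in the decision plot.
contributions = sorted(
    doc_ml["feature_importance"], key=lambda f: abs(f["importance"]), reverse=True
)

path = baseline
for item in contributions:
    path += item["importance"]
    print(f'{item["feature_name"]:<20} {item["importance"]:>10.2f}  running total {path:.2f}')

print("reconstructed prediction ~", round(path, 2))
print("stored prediction        ", doc_ml["FlightDelayMin_prediction"])
----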
@@ -385,16 +430,16 @@ You can alternatively generate these metrics with the
 --------------------------------------------------
 POST _ml/data_frame/_evaluate
 {
-  "index": "df-flight-delays", <1>
+  "index": "model-flight-delays",
   "query": {
     "bool": {
-      "filter": [{ "term": { "ml.is_training": true } }] <2>
+      "filter": [{ "term": { "ml.is_training": true } }] <1>
     }
   },
   "evaluation": {
     "regression": {
-      "actual_field": "FlightDelayMin", <3>
-      "predicted_field": "ml.FlightDelayMin_prediction", <4>
+      "actual_field": "FlightDelayMin", <2>
+      "predicted_field": "ml.FlightDelayMin_prediction", <3>
       "metrics": {
         "r_squared": {},
         "mse": {}
@@ -405,10 +450,9 @@ POST _ml/data_frame/_evaluate
 --------------------------------------------------
 // TEST[skip:TBD]
 
-<1> The destination index which is the output of the {dfanalytics-job}.
-<2> Calculate the training error by evaluating only the training data.
-<3> The field that contains the actual (ground truth) value.
-<4> The field that contains the predicted value.
+<1> Calculate the training error by evaluating only the training data.
+<2> The field that contains the actual (ground truth) value.
+<3> The field that contains the predicted value.
 
 The API returns a response like this:
 
417461
{
418462
"regression" : {
419463
"mse" : {
420-
"value" : 3125.3396943667544
464+
"value" : 2604.920215688451
421465
},
422466
"r_squared" : {
423-
"value" : 0.6659988649180306
467+
"value" : 0.7162091232654141
424468
}
425469
}
426470
}
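For reference, the two metrics reported here can be reproduced by hand. The Python sketch below (not part of the commit) computes mean squared error and R-squared for a few invented actual/predicted delay pairs; the numbers are placeholders, not values from the sample data set.

[source,python]
----
# Sketch of the two metrics the _evaluate API reports for regression:
# mean squared error and R squared. The sample values are invented.
def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical FlightDelayMin values and their predictions.
actual = [0.0, 15.0, 180.0, 75.0, 360.0]
predicted = [12.0, 20.0, 150.0, 90.0, 310.0]

print("mse       :", mse(actual, predicted))
print("r_squared :", r_squared(actual, predicted))
----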
@@ -432,7 +476,7 @@ Next, we calculate the generalization error:
 --------------------------------------------------
 POST _ml/data_frame/_evaluate
 {
-  "index": "df-flight-delays",
+  "index": "model-flight-delays",
   "query": {
     "bool": {
       "filter": [{ "term": { "ml.is_training": false } }] <1>
@@ -460,4 +504,5 @@ about new data. Those steps are not covered in this example. See
 
 If you don't want to keep the {dfanalytics-job}, you can delete it. For example,
 use {kib} or the {ref}/delete-dfanalytics.html[delete {dfanalytics-job} API].
-When you delete {dfanalytics-jobs}, the destination indices remain intact.
+When you delete {dfanalytics-jobs} in {kib}, you have the option to also remove
+the destination indices and index patterns.
The other five changed files are binary image assets (screenshots); their contents are not shown.
