
XGBoost autologging: support per-class importance plots #4523

Merged: 6 commits from linear_importance_plot into mlflow:master on Jul 6, 2021

Conversation

dbczumar (Collaborator) commented on Jul 1, 2021

What changes are proposed in this pull request?

XGBoost 1.5.0-dev introduced support for importance computation on linear estimators. These estimators return importance values for each (feature, class) pair as a num_features-by-num_classes matrix. This PR extends the feature importance plotting support in XGBoost autologging to handle this new importance value format.
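For illustration, a minimal sketch (a hypothetical helper, not the actual MLflow implementation) of how the two importance formats can be normalized to a single num_features-by-num_classes matrix before plotting:

import numpy as np

# Hypothetical sketch: normalize importance values so one plotting path
# handles both the 1D (tree booster) and 2D (linear booster) formats.
def to_importance_matrix(importances):
    arr = np.asarray(importances)
    if arr.ndim == 1:
        # Tree boosters: one value per feature -> a single "class" column
        arr = arr.reshape(-1, 1)
    return arr  # shape: (num_features, num_classes)

print(to_importance_matrix([0.7, 0.2, 0.1]).shape)           # (3, 1)
print(to_importance_matrix([[0.1, 0.2], [0.3, 0.4]]).shape)  # (2, 2)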

How is this patch tested?

  • Unit tests
  • Manual tests on XGBoost 1.5.0-dev:
  1. Linear booster training on Iris

import mlflow
import xgboost as xgb
from sklearn.datasets import load_iris

mlflow.xgboost.autolog()

iris = load_iris()
dtrain = xgb.DMatrix(iris.data, label=iris.target)

bst_params = {"objective": "multi:softprob", "num_class": 3, "booster": "gblinear"}
model = xgb.train(bst_params, dtrain)

[Screenshot: feature_importance_weight plot]

  2. Tree booster training on Iris

import mlflow
import xgboost as xgb
from sklearn.datasets import load_iris

mlflow.xgboost.autolog()

iris = load_iris()
dtrain = xgb.DMatrix(iris.data, label=iris.target)

bst_params = {
    "objective": "multi:softprob",
    "num_class": 3,
    "eval_metric": "mlogloss",
    "booster": "gbtree",
}
model = xgb.train(bst_params, dtrain)

[Screenshot: feature_importance_weight plot]

  3. Linear booster training on MNIST (sklearn digits)

import mlflow
import xgboost as xgb
from sklearn.datasets import load_digits

mlflow.xgboost.autolog()

digits = load_digits()
dtrain = xgb.DMatrix(digits.data, label=digits.target)

bst_params = {"objective": "multi:softprob", "num_class": 10, "booster": "gblinear"}
model = xgb.train(bst_params, dtrain)

[Screenshot: feature_importance_weight plot]

  4. Tree booster training on MNIST (sklearn digits)

import mlflow
import xgboost as xgb
from sklearn.datasets import load_digits

mlflow.xgboost.autolog()

digits = load_digits()
dtrain = xgb.DMatrix(digits.data, label=digits.target)

bst_params = {
    "objective": "multi:softprob",
    "num_class": 10,
    "eval_metric": "mlogloss",
    "booster": "gbtree",
}
model = xgb.train(bst_params, dtrain)

[Screenshot: feature_importance_weight plot]

Release Notes

Add XGBoost autologging support for multi-class feature importance plots


@github-actions bot added labels on Jul 1, 2021: area/tracking (Tracking service, tracking client APIs, autologging), rn/feature (Mention under Features in Changelogs)
dbczumar added 5 commits on July 1, 2021 at 15:32, each Signed-off-by: dbczumar <corey.zumar@databricks.com>
dbczumar force-pushed the linear_importance_plot branch from 0588f67 to f3ce845 on July 1, 2021 at 22:32
dbczumar requested a review from harupy on July 1, 2021 at 22:32
Comment on lines 450 to 452
importances_per_class_by_feature = np.array(
[[importance] for importance in importances_per_class_by_feature]
)
harupy (Member) commented:
Suggested change:

importances_per_class_by_feature = np.array(
    [[importance] for importance in importances_per_class_by_feature[indices]]
)

Can we sort importance as well?

dbczumar (Collaborator, Author) replied:
Thanks for catching this! Done! Here's a screenshot from tree booster on MNIST:

[Screenshot: feature_importance_weight plot]

feature_yloc + offset,
class_importance,
align="center",
height=(0.5 / num_classes),
harupy (Member) commented on Jul 2, 2021:

Suggested change:

height=(0.5 / max(num_classes - 1, 1)),

# alternative approaches:
height=(0.5 / (num_classes - 1 if num_classes > 1 else 1)),
height=(0.5 / ((num_classes - 1) or 1)),

Can we divide by num_classes - 1 to remove the gap between bars?

[Screenshot: feature_importance_weight plot]
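For intuition, a minimal sketch of the geometry, assuming the per-class bar offsets are generated with np.linspace over a 0.5-wide band around each feature's y location (an assumption for illustration, not necessarily the exact plotting code):

import numpy as np

# Assumed offset scheme: num_classes bar centers per feature, evenly
# spaced across a 0.5-wide band around the feature's y location.
num_classes = 3
offsets_per_yloc = np.linspace(-0.25, 0.25, num_classes)

spacing = 0.5 / (num_classes - 1)  # distance between adjacent bar centers
print(offsets_per_yloc)   # [-0.25  0.    0.25]
print(spacing)            # 0.25 -> height == spacing makes bars touch
print(0.5 / num_classes)  # ~0.167 -> shorter than the spacing, leaving a gap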

dbczumar (Collaborator, Author) replied:
Great suggestion! Done!

for class_idx, (offset, class_importance) in enumerate(
zip(offsets_per_yloc, importances_per_class)
):
(bar,) = ax.barh(
harupy (Member) commented:

Nice unpacking :)
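For readers unfamiliar with the idiom, a minimal standalone example (not taken from the PR): ax.barh returns a container of bar artists, and one-element tuple unpacking extracts the single bar, failing loudly if more than one was drawn.

import matplotlib.pyplot as plt

# Illustration only: barh returns a BarContainer; unpacking a
# one-element target pulls out the single Rectangle artist.
fig, ax = plt.subplots()
(bar,) = ax.barh(0, 3.0, align="center", height=0.5)
print(type(bar).__name__)  # Rectangle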

Comment on lines 457 to 458
else:
label_classes_on_plot = True
harupy (Member) commented:
Can we sort the 2D importance matrix (that linear boosters generate) as well?

import numpy as np

features = np.array(["a", "b", "c"])
importance = [
    # class0, class1, class2
    [7, 8, 9],  # a
    [4, 5, 6],  # b
    [1, 2, 3],  # c
]
importances_per_class_by_feature = np.array(importance)
abs_sum = np.abs(importances_per_class_by_feature).sum(axis=1)
# or abs_mean = np.abs(importances_per_class_by_feature).mean(axis=1)
indices = np.argsort(abs_sum)

print(importances_per_class_by_feature[indices])
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]

print(features[indices])
# ['c' 'b' 'a']

dbczumar (Collaborator, Author) replied:
Absolutely! Done! (Chose sum() for the magnitude metric rather than mean()). Here's a screenshot from a linear booster on MNIST:

[Screenshot: feature_importance_weight plot]

dbczumar added 1 commit, Signed-off-by: dbczumar <corey.zumar@databricks.com>
dbczumar (Collaborator, Author) left a comment:

@harupy Thanks for the awesome review feedback! I've addressed your comments. Can you take another look?


harupy (Member) left a comment:

LGTM!

dbczumar merged commit e0e7181 into mlflow:master on Jul 6, 2021