Uplift tree/forest: add feature importance and parallelize forest #220

Merged (4 commits) on Jul 29, 2020

Contributor

@yungmsh yungmsh commented Jul 29, 2020

Add feature importance

  • Compute feature importance using a methodology similar to scikit-learn's (see below):
    cpdef compute_feature_importances(self, normalize=True):
        """Computes the importance of each feature (aka variable)."""
        cdef Node* left
        cdef Node* right
        cdef Node* nodes = self.nodes
        cdef Node* node = nodes
        cdef Node* end_node = node + self.node_count

        cdef double normalizer = 0.

        cdef np.ndarray[np.float64_t, ndim=1] importances
        importances = np.zeros((self.n_features,))
        cdef DOUBLE_t* importance_data = <DOUBLE_t*>importances.data

        with nogil:
            while node != end_node:
                if node.left_child != _TREE_LEAF:
                    # ... and node.right_child != _TREE_LEAF:
                    left = &nodes[node.left_child]
                    right = &nodes[node.right_child]

                    importance_data[node.feature] += (
                        node.weighted_n_node_samples * node.impurity -
                        left.weighted_n_node_samples * left.impurity -
                        right.weighted_n_node_samples * right.impurity)
                node += 1

        importances /= nodes[0].weighted_n_node_samples

        if normalize:
            normalizer = np.sum(importances)

            if normalizer > 0.0:
                # Avoid dividing by zero (e.g., when root is pure)
                importances /= normalizer

        return importances
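
For readers less familiar with Cython, the same impurity-decrease calculation can be sketched in plain NumPy. This is only an illustration of the idea, not the code added in this PR: the flat arrays (`left_child`, `right_child`, `feature`, `impurity`, `weighted_n`), the `TREE_LEAF` sentinel, and the toy tree values are all assumptions made up for the example.

```python
import numpy as np

TREE_LEAF = -1  # assumed sentinel for "no child", standing in for _TREE_LEAF


def compute_feature_importances(left_child, right_child, feature, impurity,
                                weighted_n, n_features, normalize=True):
    """Impurity-decrease importance per feature for one tree (illustrative)."""
    importances = np.zeros(n_features, dtype=np.float64)
    for i in range(len(feature)):
        if left_child[i] == TREE_LEAF:
            continue  # leaves do not split, so they contribute nothing
        left, right = left_child[i], right_child[i]
        # Weighted impurity of the node minus that of its two children
        importances[feature[i]] += (
            weighted_n[i] * impurity[i]
            - weighted_n[left] * impurity[left]
            - weighted_n[right] * impurity[right]
        )
    importances /= weighted_n[0]  # scale by the weighted sample count at the root
    if normalize:
        total = importances.sum()
        if total > 0.0:  # avoid dividing by zero (e.g., when the root is pure)
            importances /= total
    return importances


# Toy tree with made-up values: node 0 splits on feature 1,
# node 1 splits on feature 0, and nodes 2-4 are leaves.
left_child = np.array([1, 3, TREE_LEAF, TREE_LEAF, TREE_LEAF])
right_child = np.array([2, 4, TREE_LEAF, TREE_LEAF, TREE_LEAF])
feature = np.array([1, 0, -2, -2, -2])
impurity = np.array([0.5, 0.4, 0.0, 0.0, 0.0])
weighted_n = np.array([10.0, 6.0, 4.0, 3.0, 3.0])

toy_importances = compute_feature_importances(
    left_child, right_child, feature, impurity, weighted_n, n_features=3)
print(toy_importances)  # approximately [0.48, 0.52, 0.0]
```

A feature that is never used for a split (feature 2 here) ends up with an importance of zero, and a tree with a single pure root returns all zeros instead of dividing by zero.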

Parallelize Forest

  • Use joblib's Parallel to build the trees of UpliftRandomForestClassifier in parallel.
  • Speeds up training by ~60% on a dataset with shape=(40000, 20) using 50 estimators and n_jobs=8 (on a 16GB, 3.1GHz machine).
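
The pattern behind this change is that each tree in the forest can be fit independently, so the fits can be dispatched to a pool of workers. The sketch below shows that pattern only; it is not the PR's implementation. To stay dependency-free it swaps in the standard library's `concurrent.futures.ThreadPoolExecutor` in place of joblib's `Parallel`, and `fit_one_tree` and `fit_forest` are hypothetical stand-ins rather than causalml functions.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fit_one_tree(seed, rows):
    """Hypothetical stand-in for fitting one tree on a bootstrap sample."""
    rng = random.Random(seed)  # per-tree seed keeps results reproducible
    sample = [rng.choice(rows) for _ in rows]  # sample with replacement
    return {"seed": seed, "mean": sum(sample) / len(sample)}


def fit_forest(rows, n_estimators=50, n_jobs=8):
    """Fit n_estimators independent 'trees' using n_jobs workers."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # pool.map returns results in input order, regardless of finish order
        return list(pool.map(lambda s: fit_one_tree(s, rows),
                             range(n_estimators)))


data = list(range(100))
forest = fit_forest(data, n_estimators=10, n_jobs=4)
print(len(forest))
```

Threads are used here only to show the structure. Pure-Python, CPU-bound tree building would generally need process-based workers to see a real speedup, which is what joblib's default backend provides.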


all_importances = [tree.feature_importances_ for tree in self.uplift_forest]
self.feature_importances_ = np.mean(all_importances, axis=0)
self.feature_importances_ /= self.feature_importances_.sum() # normalize to add to 1
Collaborator

Do we need a self.feature_importances_.sum() > 0 check here, as in your compute_feature_importances() reference, to avoid dividing by zero (e.g., when the root is pure)? This might be an extreme edge case, though.

Contributor Author

Good callout. I left it out because it should be a rare, extreme case (also, if the root is pure, the user is not using uplift trees correctly).

Contributor Author

But yeah, if we want to prevent an error from being raised, we can add the condition.

Collaborator

Agreed with your point that it's minor and more of a user error. Thanks for confirming!
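
For reference, the guard discussed in this thread could look roughly like the sketch below. It is an assumption about how the check might be written, not code from the merged commit, and `aggregate_importances` is a hypothetical helper name. Note that dividing a NumPy float array by a zero sum yields NaN values with a RuntimeWarning rather than raising an exception, so the guard mainly prevents NaNs from leaking into feature_importances_.

```python
import numpy as np


def aggregate_importances(all_importances):
    """Average per-tree importances and normalize them to sum to 1.

    If every tree reports zero importance (e.g., all roots are pure),
    the zeros are returned unchanged instead of dividing by zero.
    """
    mean_importances = np.mean(all_importances, axis=0)
    total = mean_importances.sum()
    if total > 0.0:
        mean_importances = mean_importances / total
    return mean_importances


# Two hypothetical trees over two features
normal_case = aggregate_importances([[1.0, 3.0], [1.0, 1.0]])
# Degenerate forest where no tree made a useful split
degenerate_case = aggregate_importances([[0.0, 0.0], [0.0, 0.0]])
print(normal_case, degenerate_case)
```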

Collaborator

@paullo0106 paullo0106 left a comment

LGTM!

Contributor Author

yungmsh commented Jul 29, 2020

Looks like the build is failing here (https://travis-ci.com/github/uber/causalml/jobs/366341859). Has anyone seen this kind of error before? @jeongyoonlee @ppstacy @paullo0106

Contributor Author

yungmsh commented Jul 29, 2020

Never mind, I just re-ran the build and it's passing.

@yungmsh yungmsh merged commit 7d68f5f into master Jul 29, 2020
huigangchen pushed a commit that referenced this pull request Jan 27, 2021
Uplift tree/forest: add feature importance and parallelize forest