
Display labels of categorical features in split nodes #86

Open
pplonski opened this issue Apr 14, 2020 · 21 comments
Labels
enhancement New feature or request

Comments

@pplonski

Sklearn's decision tree doesn't work with categorical values. Before using a decision tree, I convert categoricals, for example to integers (with LabelEncoder). The tree visualization then shows the converted value. Is there an option to handle categoricals better?

@parrt
Owner

parrt commented Apr 14, 2020

Label-encoding categorical variables for decision trees or random forests is the easiest and most correct solution. That means converting every category level to a unique integer, so it sounds like you're doing the right thing. If you pass in the list of category names, those should display in the trees.
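For the target, that encoding step might look like the following minimal sketch (toy data, not from this thread): label-encode before fitting and keep `enc.classes_` around, since it holds the original names in code order and is the list you would hand to dtreeviz as `class_names`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: a string target, label-encoded before fitting
df = pd.DataFrame({"x": [0, 1, 2, 3, 4, 5],
                   "animal": ["cat", "cat", "cat", "dog", "dog", "dog"]})
enc = LabelEncoder()
y = enc.fit_transform(df["animal"])      # cat -> 0, dog -> 1

clf = DecisionTreeClassifier(max_depth=1).fit(df[["x"]], y)

# enc.classes_ lists the original names in code order; pass it as class_names
print(list(enc.classes_))  # ['cat', 'dog']
```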

@parrt parrt added the duplicate This issue or pull request already exists label Apr 14, 2020
@pplonski
Author

pplonski commented Apr 14, 2020

I was trying to swap columns between categorical and integer values, but I can't make it work.

Here is the example:

import pandas as pd
from sklearn import tree
from dtreeviz.trees import *

# example data set
df = pd.DataFrame({"feature_1": ["a","a","a","a","a","b","b","b","b","b"],
                   "feature_2": [0,0,0,0,1,0,1,1,1,1],
                   "target": [0,0,0,0,0,1,1,1,1,1]})
# apply categorical conversion
df["feature_1_converted"] = [0,0,0,0,0,1,1,1,1,1]
# train the tree with the converted feature
classifier = tree.DecisionTreeClassifier(max_depth=1)
classifier.fit(df[["feature_1_converted", "feature_2"]], df["target"])
# try to plot the tree with the original features
viz = dtreeviz(classifier,
               df[["feature_1", "feature_2"]],
               df["target"],
               target_name='target',
               feature_names=["feature_1", "feature_2"],
               class_names=["0", "1"])

I got this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-26-61889aa7e363> in <module>
      4                target_name='target',
      5               feature_names=["feature_1", "feature_2"],
----> 6                class_names=["0", "1"]
      7               )  
      8 

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/dtreeviz/trees.py in dtreeviz(tree_model, X_train, y_train, feature_names, target_name, class_names, precision, orientation, instance_orientation, show_root_edge_labels, show_node_labels, show_just_path, fancy, histtype, highlight_path, X, max_X_features_LR, max_X_features_TD, label_fontsize, ticks_fontsize, fontname, colors, scale)
    778 
    779     shadow_tree = ShadowDecTree(tree_model, X_train, y_train,
--> 780                                 feature_names=feature_names, class_names=class_names)
    781 
    782     if X is not None:

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/dtreeviz/shadow.py in __init__(self, tree_model, X_train, y_train, feature_names, class_names)
     58             y_train = y_train.values
     59         self.y_train = y_train
---> 60         self.node_to_samples = ShadowDecTree.node_samples(tree_model, X_train)
     61         if self.isclassifier():
     62             self.unique_target_values = np.unique(y_train)

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/dtreeviz/shadow.py in node_samples(tree_model, data)
    198         # Doc say: "Return a node indicator matrix where non zero elements
    199         #           indicates that the samples goes through the nodes."
--> 200         dec_paths = tree_model.decision_path(data)
    201 
    202         # each sample has path taken down tree

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/sklearn/tree/_classes.py in decision_path(self, X, check_input)
    495             indicates that the samples goes through the nodes.
    496         """
--> 497         X = self._validate_X_predict(X, check_input)
    498         return self.tree_.decision_path(X)
    499 

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/sklearn/tree/_classes.py in _validate_X_predict(self, X, check_input)
    378         """Validate X whenever one tries to predict, apply, predict_proba"""
    379         if check_input:
--> 380             X = check_array(X, dtype=DTYPE, accept_sparse="csr")
    381             if issparse(X) and (X.indices.dtype != np.intc or
    382                                 X.indptr.dtype != np.intc):

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    529                     array = array.astype(dtype, casting="unsafe", copy=False)
    530                 else:
--> 531                     array = np.asarray(array, order=order, dtype=dtype)
    532             except ComplexWarning:
    533                 raise ValueError("Complex data not supported\n"

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
    536 
    537     """
--> 538     return array(a, dtype, copy=False, order=order)
    539 
    540 

ValueError: could not convert string to float: 'a'

@parrt
Owner

parrt commented Apr 14, 2020

Can you use the LabelEncoder()? Or do it manually via https://github.com/parrt/stratx/blob/master/notebooks/support.py

from pandas.api.types import is_string_dtype, is_object_dtype, is_categorical_dtype

def df_string_to_cat(df:pd.DataFrame) -> dict:
    # Convert string/object columns to ordered categoricals and
    # remember each column's category labels
    catencoders = {}
    for colname in df.columns:
        if is_string_dtype(df[colname]) or is_object_dtype(df[colname]):
            df[colname] = df[colname].astype('category').cat.as_ordered()
            catencoders[colname] = df[colname].cat.categories
    return catencoders


def df_cat_to_catcode(df):
    # Replace each categorical column with its integer codes (1-based)
    for col in df.columns:
        if is_categorical_dtype(df[col]):
            df[col] = df[col].cat.codes + 1
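In case it helps to see the round trip in plain pandas (toy data, not from this thread): a pandas categorical carries its own label table, so the code-to-label mapping comes for free and can be stashed before encoding.

```python
import pandas as pd

# Toy frame; astype('category') builds the label table for us
df = pd.DataFrame({"feature_1": ["a", "a", "b", "b", "c"]})
cat = df["feature_1"].astype("category")
df["feature_1"] = cat.cat.codes + 1      # 1-based codes, matching df_cat_to_catcode
labels = list(cat.cat.categories)        # code i+1 -> labels[i]

print(df["feature_1"].tolist())  # [1, 1, 2, 2, 3]
print(labels)                    # ['a', 'b', 'c']
```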

@parrt
Owner

parrt commented Apr 14, 2020

Ooh! You can't use

df[["feature_1", "feature_2"]]

gotta use the encoded feature_1.
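A minimal sketch of the fix for the example above: hand the same encoded columns to both fit() and dtreeviz() (the dtreeviz call is left as a comment here since rendering it also requires graphviz).

```python
import pandas as pd
from sklearn import tree

df = pd.DataFrame({"feature_1": ["a","a","a","a","a","b","b","b","b","b"],
                   "feature_2": [0,0,0,0,1,0,1,1,1,1],
                   "target": [0,0,0,0,0,1,1,1,1,1]})
df["feature_1_converted"] = [0,0,0,0,0,1,1,1,1,1]

# Train and visualize with the SAME encoded columns
X = df[["feature_1_converted", "feature_2"]]
classifier = tree.DecisionTreeClassifier(max_depth=1)
classifier.fit(X, df["target"])

# viz = dtreeviz(classifier, X, df["target"], target_name='target',
#                feature_names=["feature_1_converted", "feature_2"],
#                class_names=["0", "1"])
print(classifier.score(X, df["target"]))  # 1.0
```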

@parrt parrt closed this as completed Apr 14, 2020
Repository owner deleted a comment from pplonski Apr 14, 2020
@parrt
Owner

parrt commented Apr 14, 2020

I do it all the time. have you looked at the examples?

@parrt
Owner

parrt commented Apr 14, 2020

@tlapusan did we break something? Can you take a look and help @pplonski ?

@tlapusan
Collaborator

tlapusan commented Apr 16, 2020 via email

@tlapusan
Collaborator

@tlapusan did we break something? Can you take a look and help @pplonski ?

@parrt we didn't break anything ;)

@pplonski it was very helpful that you sent your code; it helped a lot with debugging.
The issue is that you trained the model using df[["feature_1_converted", "feature_2"]] but called the dtreeviz() method with df[["feature_1", "feature_2"]]. You need to use the same set of columns in both.

Please leave a comment if it's working for you now.

[Screenshot attached: Screen Shot 2020-04-16 at 11 58 58 AM]

@pplonski
Author

Do you think it's possible to pass the original categorical values, to be printed in the output tree? I would like to see 'a' and 'b' for feature_1 in the plot.

@pplonski
Author

The expected output tree:

[image attached]

@parrt
Owner

parrt commented Apr 16, 2020

Hang on, you're not talking about the target. OK, let me look.

@parrt
Owner

parrt commented Apr 16, 2020

@tlapusan ha! We don't have an example where the tree nodes split on cat vars! We should think about this. Nonetheless, you gotta pass encoded vars into the classifier. We just need a cat split node example that shows how to get labels. Here's an example with a catvar split node, such as ProductID. Not sure we can label those when there are so many. Maybe we just label the cats on either side of the split point?
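One possible shape for that idea, sketched outside dtreeviz with made-up data and labels: sklearn stores the split as a numeric threshold sitting between two category codes, so we can recover just the two labels straddling the split point instead of listing all of them.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up data: a label-encoded categorical with six levels
labels = ["ash", "birch", "cedar", "elm", "fir", "oak"]
df = pd.DataFrame({"tree_kind": [0, 1, 2, 3, 4, 5] * 5,
                   "target":    [0, 0, 0, 1, 1, 1] * 5})
clf = DecisionTreeClassifier(max_depth=1).fit(df[["tree_kind"]], df["target"])

# Recover just the category labels on either side of the root split
thresh = clf.tree_.threshold[0]
left_label = labels[int(thresh)]        # last category routed left
right_label = labels[int(thresh) + 1]   # first category routed right
print(f"split point: {left_label} | {right_label}")
```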

@parrt parrt reopened this Apr 16, 2020
@pplonski
Author

Is there an option to pass encoding as an argument?

@parrt parrt added enhancement New feature or request and removed duplicate This issue or pull request already exists labels Apr 16, 2020
@parrt parrt changed the title How categorical values are handled? Display labels of categorical features in split nodes Apr 16, 2020
@tlapusan
Collaborator

@parrt we do have an example for categorical nodes, but not in the readme page. We have one here, on the titanic dataset (the cabin_label feature): https://github.com/parrt/dtreeviz/blob/master/notebooks/tree_structure_example.ipynb.

It would be nice and helpful to create a GitHub wiki, to document the library even better. Putting everything in the readme is kind of hard to follow and browse, especially when the library will contain even more visualizations :)

Right, if the categorical variable has a high cardinality, it's going to be very hard to display the raw labels... and maybe even more confusing to do so. But yes, we need to look at a more concrete example and discuss how it would look.

Only in the case of categorical ordinal features would it make more sense to display raw values. But I don't know an automatic way to detect encoded ordinal features. There are many ways to encode categorical variables, and to implement specific code for all of them... I don't know if it's worth it.

@pplonski what's the cardinality of your categorical features?

@chenhajaj

Hi,
I'm also stuck at the same place. We need to use some categorical features in the tree.

@parrt
Owner

parrt commented May 10, 2020

Hi @chenhajaj, cats are allowed, but it shows their unique cat code at the moment. So we just need a way to indicate the cat label, but what if there are 10,000 labels?

@chenhajaj

Hi @chenhajaj, cats are allowed, but it shows their unique cat code at the moment. So we just need a way to indicate the cat label, but what if there are 10,000 labels?

Can you please direct me to the relevant code? I have one categorical feature with three possible values; I had to convert it to dummy columns, and as you can expect, it looks bad when I plot the decision tree.

@parrt
Owner

parrt commented Dec 6, 2020

Hi guys, we're thinking about how to solve this. Maybe we show up to some n labels, or a specific subset of labels requested by the user.

@pplonski
Author

pplonski commented Dec 6, 2020

@parrt good idea! There can be many ways to handle categoricals, so only the category labels requested by the user should be displayed. Maybe it can be done in a similar way to how class names are displayed? The user gives a dict as an input argument:

feature_category_labels = {
  "feature_1": {
    0: "category_1",
    1: "category_2", 
    ...
  },
  # next features
}
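No such argument exists in dtreeviz today; as a sketch of the idea, a user could build that dict straight from pandas' own encoding (the feature name and variable names here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"feature_1": ["a", "a", "b", "b", "b"]})
cat = df["feature_1"].astype("category")
df["feature_1"] = cat.cat.codes  # 0-based integer codes for training

# Build the proposed {feature: {code: label}} mapping from pandas' encoding
feature_category_labels = {
    "feature_1": dict(enumerate(cat.cat.categories))
}
print(feature_category_labels)  # {'feature_1': {0: 'a', 1: 'b'}}
```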

@mihagazvoda

Hey! Any update on the issue? Is there a workaround? My cardinality is low (<10). Thank you!

@tlapusan
Collaborator

tlapusan commented Oct 7, 2023

@mihagazvoda after spending a few tens of minutes looking into the code, I remembered that we implemented this for the TensorFlow random forest, because it can also support categorical (string) values as features.
You can take a look at the Pclass node.
[Screenshot attached: Screenshot 2023-10-07 at 14 13 28]

A workaround would be to use TF instead of what you are using now... would that be ok for you?
