
Display labels of categorical features in split nodes #86

Open
pplonski opened this issue Apr 14, 2020 · 21 comments
Labels
enhancement New feature or request

Comments

@pplonski

Sklearn's decision tree doesn't work with categorical values. Before using a decision tree, I convert categoricals, for example to integers (with LabelEncoder). The tree visualization then shows the converted value. Is there an option to handle categoricals better?

@parrt
Owner

parrt commented Apr 14, 2020

Label-encoding categorical variables for decision trees or random forests is the easiest and most correct solution. That means converting every category level to a unique integer, so it sounds like you're doing the right thing. If you pass in the list of category names, those should display in the trees.
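For the target, that encoding step might look like the following minimal sketch (toy data, not from this thread): label-encode before fitting and keep `enc.classes_` around, since it holds the original names in code order and is the list you would hand to dtreeviz as `class_names`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: a string target, label-encoded before fitting
df = pd.DataFrame({"x": [0, 1, 2, 3, 4, 5],
                   "animal": ["cat", "cat", "cat", "dog", "dog", "dog"]})
enc = LabelEncoder()
y = enc.fit_transform(df["animal"])      # cat -> 0, dog -> 1

clf = DecisionTreeClassifier(max_depth=1).fit(df[["x"]], y)

# enc.classes_ lists the original names in code order; pass it as class_names
print(list(enc.classes_))  # ['cat', 'dog']
```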

@parrt parrt added the duplicate This issue or pull request already exists label Apr 14, 2020
@pplonski
Author

pplonski commented Apr 14, 2020

I was trying to swap columns between categorical and integer values, but I can't make it work.

Here is the example:

import pandas as pd
from sklearn import tree
from dtreeviz.trees import *

# example data set
df = pd.DataFrame({"feature_1": ["a","a","a","a","a","b","b","b","b","b"],
                   "feature_2": [0,0,0,0,1,0,1,1,1,1],
                   "target": [0,0,0,0,0,1,1,1,1,1]})
# apply categorical conversion
df["feature_1_converted"] = [0,0,0,0,0,1,1,1,1,1]
# train the tree with the converted feature
classifier = tree.DecisionTreeClassifier(max_depth=1)
classifier.fit(df[["feature_1_converted", "feature_2"]], df["target"])
# try to plot the tree with the original features
viz = dtreeviz(classifier,
               df[["feature_1", "feature_2"]],
               df["target"],
               target_name='target',
               feature_names=["feature_1", "feature_2"],
               class_names=["0", "1"])

I got this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-26-61889aa7e363> in <module>
      4                target_name='target',
      5               feature_names=["feature_1", "feature_2"],
----> 6                class_names=["0", "1"]
      7               )  
      8 

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/dtreeviz/trees.py in dtreeviz(tree_model, X_train, y_train, feature_names, target_name, class_names, precision, orientation, instance_orientation, show_root_edge_labels, show_node_labels, show_just_path, fancy, histtype, highlight_path, X, max_X_features_LR, max_X_features_TD, label_fontsize, ticks_fontsize, fontname, colors, scale)
    778 
    779     shadow_tree = ShadowDecTree(tree_model, X_train, y_train,
--> 780                                 feature_names=feature_names, class_names=class_names)
    781 
    782     if X is not None:

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/dtreeviz/shadow.py in __init__(self, tree_model, X_train, y_train, feature_names, class_names)
     58             y_train = y_train.values
     59         self.y_train = y_train
---> 60         self.node_to_samples = ShadowDecTree.node_samples(tree_model, X_train)
     61         if self.isclassifier():
     62             self.unique_target_values = np.unique(y_train)

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/dtreeviz/shadow.py in node_samples(tree_model, data)
    198         # Doc say: "Return a node indicator matrix where non zero elements
    199         #           indicates that the samples goes through the nodes."
--> 200         dec_paths = tree_model.decision_path(data)
    201 
    202         # each sample has path taken down tree

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/sklearn/tree/_classes.py in decision_path(self, X, check_input)
    495             indicates that the samples goes through the nodes.
    496         """
--> 497         X = self._validate_X_predict(X, check_input)
    498         return self.tree_.decision_path(X)
    499 

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/sklearn/tree/_classes.py in _validate_X_predict(self, X, check_input)
    378         """Validate X whenever one tries to predict, apply, predict_proba"""
    379         if check_input:
--> 380             X = check_array(X, dtype=DTYPE, accept_sparse="csr")
    381             if issparse(X) and (X.indices.dtype != np.intc or
    382                                 X.indptr.dtype != np.intc):

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    529                     array = array.astype(dtype, casting="unsafe", copy=False)
    530                 else:
--> 531                     array = np.asarray(array, order=order, dtype=dtype)
    532             except ComplexWarning:
    533                 raise ValueError("Complex data not supported\n"

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
    536 
    537     """
--> 538     return array(a, dtype, copy=False, order=order)
    539 
    540 

ValueError: could not convert string to float: 'a'

@parrt
Owner

parrt commented Apr 14, 2020

Can you use the LabelEncoder()? Or do it manually via https://github.com/parrt/stratx/blob/master/notebooks/support.py

from pandas.api.types import is_string_dtype, is_object_dtype, is_categorical_dtype

def df_string_to_cat(df:pd.DataFrame) -> dict:
    # Convert string/object columns to ordered categoricals and
    # remember each column's category labels
    catencoders = {}
    for colname in df.columns:
        if is_string_dtype(df[colname]) or is_object_dtype(df[colname]):
            df[colname] = df[colname].astype('category').cat.as_ordered()
            catencoders[colname] = df[colname].cat.categories
    return catencoders


def df_cat_to_catcode(df):
    # Replace each categorical column with its integer codes (1-based)
    for col in df.columns:
        if is_categorical_dtype(df[col]):
            df[col] = df[col].cat.codes + 1
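In case it helps to see the round trip in plain pandas (toy data, not from this thread): a pandas categorical carries its own label table, so the code-to-label mapping comes for free and can be stashed before encoding.

```python
import pandas as pd

# Toy frame; astype('category') builds the label table for us
df = pd.DataFrame({"feature_1": ["a", "a", "b", "b", "c"]})
cat = df["feature_1"].astype("category")
df["feature_1"] = cat.cat.codes + 1      # 1-based codes, matching df_cat_to_catcode
labels = list(cat.cat.categories)        # code i+1 -> labels[i]

print(df["feature_1"].tolist())  # [1, 1, 2, 2, 3]
print(labels)                    # ['a', 'b', 'c']
```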

@parrt
Owner

parrt commented Apr 14, 2020

Ooh! You can't use

df[["feature_1", "feature_2"]]

gotta use the encoded feature_1.
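A minimal sketch of the fix for the example above: hand the same encoded columns to both fit() and dtreeviz() (the dtreeviz call is left as a comment here since rendering it also requires graphviz).

```python
import pandas as pd
from sklearn import tree

df = pd.DataFrame({"feature_1": ["a","a","a","a","a","b","b","b","b","b"],
                   "feature_2": [0,0,0,0,1,0,1,1,1,1],
                   "target": [0,0,0,0,0,1,1,1,1,1]})
df["feature_1_converted"] = [0,0,0,0,0,1,1,1,1,1]

# Train and visualize with the SAME encoded columns
X = df[["feature_1_converted", "feature_2"]]
classifier = tree.DecisionTreeClassifier(max_depth=1)
classifier.fit(X, df["target"])

# viz = dtreeviz(classifier, X, df["target"], target_name='target',
#                feature_names=["feature_1_converted", "feature_2"],
#                class_names=["0", "1"])
print(classifier.score(X, df["target"]))  # 1.0
```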

@parrt parrt closed this as completed Apr 14, 2020
Repository owner deleted a comment from pplonski Apr 14, 2020
@parrt
Owner

parrt commented Apr 14, 2020

I do it all the time. have you looked at the examples?

@parrt
Owner

parrt commented Apr 14, 2020

@tlapusan did we break something? Can you take a look and help @pplonski ?

@tlapusan
Collaborator

tlapusan commented Apr 16, 2020 via email

@tlapusan
Collaborator

@tlapusan did we break something? Can you take a look and help @pplonski ?

@parrt we didn't break anything ;)

@pplonski it was very helpful that you sent your code; it helped a lot with debugging.
The issue is that you trained the model using df[["feature_1_converted", "feature_2"]] but called the dtreeviz() method with df[["feature_1", "feature_2"]]. You need to use the same set of columns in both.

Please leave a comment if it's working for you now.

[Screenshot attached: Screen Shot 2020-04-16 at 11 58 58 AM]

@pplonski
Author

Do you think it's possible to pass the original categorical values, to be printed in the output tree? I would like to see 'a' and 'b' for feature_1 in the plot.

@pplonski
Author

The expected output tree:

[image attached]

@parrt
Owner

parrt commented Apr 16, 2020

Hang on, you're not talking about the target. OK, let me look.

@parrt
Owner

parrt commented Apr 16, 2020

@tlapusan ha! We don't have an example where the tree nodes split on cat vars! We should think about this. Nonetheless, you gotta pass encoded vars into the classifier. We just need a cat split node example that shows how to get labels. Here's an example with a catvar split node, such as ProductID. Not sure we can label those when there are so many. Maybe we just label the cats on either side of the split point?
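One possible shape for that idea, sketched outside dtreeviz with made-up data and labels: sklearn stores the split as a numeric threshold sitting between two category codes, so we can recover just the two labels straddling the split point instead of listing all of them.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up data: a label-encoded categorical with six levels
labels = ["ash", "birch", "cedar", "elm", "fir", "oak"]
df = pd.DataFrame({"tree_kind": [0, 1, 2, 3, 4, 5] * 5,
                   "target":    [0, 0, 0, 1, 1, 1] * 5})
clf = DecisionTreeClassifier(max_depth=1).fit(df[["tree_kind"]], df["target"])

# Recover just the category labels on either side of the root split
thresh = clf.tree_.threshold[0]
left_label = labels[int(thresh)]        # last category routed left
right_label = labels[int(thresh) + 1]   # first category routed right
print(f"split point: {left_label} | {right_label}")
```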

@parrt parrt reopened this Apr 16, 2020
@pplonski
Author

Is there an option to pass encoding as an argument?

@parrt parrt added enhancement New feature or request and removed duplicate This issue or pull request already exists labels Apr 16, 2020
@parrt parrt changed the title How categorical values are handled? Display labels of categorical features in split nodes Apr 16, 2020
@tlapusan
Collaborator

@parrt we do have an example for categorical nodes, but not in the readme page. We have one here, on the titanic dataset (the cabin_label feature): https://github.com/parrt/dtreeviz/blob/master/notebooks/tree_structure_example.ipynb.

It would be nice and helpful to create a GitHub wiki, to document the library even better. Putting everything in the readme is kind of hard to follow and browse, especially when the library will contain even more visualizations :)

Right, if the categorical variable has a high cardinality, it's going to be very hard to display the raw labels... and maybe even more confusing to do so. But yes, we need to look at a more concrete example and discuss how it would look.

Only in the case of categorical ordinal features would it make more sense to display raw values. But I don't know an automatic way to detect encoded ordinal features. There are many ways to encode categorical variables, and to implement specific code for all of them... I don't know if it's worth it.

@pplonski what's the cardinality of your categorical features?

@chenhajaj

Hi,
I'm also stuck at the same place. We need to use some categorical features in the tree.

@parrt
Owner

parrt commented May 10, 2020

Hi @chenhajaj, cats are allowed, but it shows their unique cat code at the moment. So we just need a way to indicate the cat label, but what if there are 10,000 labels?

@chenhajaj

Hi @chenhajaj, cats are allowed, but it shows their unique cat code at the moment. So we just need a way to indicate the cat label, but what if there are 10,000 labels?

Can you please direct me to the relevant code? I have one categorical feature with three possible values; I had to convert it to dummy columns, and as you can expect, it looks bad when I plot the decision tree.

@parrt
Owner

parrt commented Dec 6, 2020

Hi guys, we're thinking about how to solve this. Maybe we show up to some n labels, or a specific subset of labels requested by the user.

@pplonski
Author

pplonski commented Dec 6, 2020

@parrt good idea! There can be many ways to handle categoricals, so only the category labels requested by the user should be displayed. Maybe it can be done in a similar way to how class names are displayed? The user gives a dict as an input argument:

feature_category_labels = {
  "feature_1": {
    0: "category_1",
    1: "category_2", 
    ...
  },
  # next features
}
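No such argument exists in dtreeviz today; as a sketch of the idea, a user could build that dict straight from pandas' own encoding (the feature name and variable names here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"feature_1": ["a", "a", "b", "b", "b"]})
cat = df["feature_1"].astype("category")
df["feature_1"] = cat.cat.codes  # 0-based integer codes for training

# Build the proposed {feature: {code: label}} mapping from pandas' encoding
feature_category_labels = {
    "feature_1": dict(enumerate(cat.cat.categories))
}
print(feature_category_labels)  # {'feature_1': {0: 'a', 1: 'b'}}
```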

@mihagazvoda

Hey! Any update on the issue? Is there a workaround? My cardinality is low (<10). Thank you!

@tlapusan
Collaborator

tlapusan commented Oct 7, 2023

@mihagazvoda after spending a few tens of minutes looking into the code, I remembered that we implemented this for the TensorFlow random forest, because it can also support categorical (string) values as features.
You can take a look at the Pclass node.
[Screenshot attached: Screenshot 2023-10-07 at 14 13 28]

A workaround would be to use TF instead of what you are using now... would that be ok for you?
