Display labels of categorical features in split nodes #86
Label-encoding categorical variables for decision trees or random forests is the easiest and most correct solution; that means converting every category level to a unique integer, so it sounds like you're doing the right thing. If you pass in the list of category names, it should display them in the trees. |
I was trying to replace columns Categorical <-> Integer but I can't make it work. Here is the example:

import pandas as pd
from dtreeviz.trees import *
# example data set
df = pd.DataFrame({"feature_1": ["a","a","a","a","a","b","b","b","b","b"],
"feature_2": [0,0,0,0,1,0,1,1,1,1],
"target": [0,0,0,0,0,1,1,1,1,1]})
# apply categorical conversion
df["feature_1_converted"] = [0,0,0,0,0,1,1,1,1,1]
# train the tree with converted feature
classifier = tree.DecisionTreeClassifier(max_depth=1)
classifier.fit(df[["feature_1_converted", "feature_2"]], df["target"])
# try to plot the tree with original features
viz = dtreeviz(classifier,
df[["feature_1", "feature_2"]],
df["target"],
target_name='target',
feature_names=["feature_1", "feature_2"],
               class_names=["0", "1"])

I got this error:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-26-61889aa7e363> in <module>
4 target_name='target',
5 feature_names=["feature_1", "feature_2"],
----> 6 class_names=["0", "1"]
7 )
8
~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/dtreeviz/trees.py in dtreeviz(tree_model, X_train, y_train, feature_names, target_name, class_names, precision, orientation, instance_orientation, show_root_edge_labels, show_node_labels, show_just_path, fancy, histtype, highlight_path, X, max_X_features_LR, max_X_features_TD, label_fontsize, ticks_fontsize, fontname, colors, scale)
778
779 shadow_tree = ShadowDecTree(tree_model, X_train, y_train,
--> 780 feature_names=feature_names, class_names=class_names)
781
782 if X is not None:
~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/dtreeviz/shadow.py in __init__(self, tree_model, X_train, y_train, feature_names, class_names)
58 y_train = y_train.values
59 self.y_train = y_train
---> 60 self.node_to_samples = ShadowDecTree.node_samples(tree_model, X_train)
61 if self.isclassifier():
62 self.unique_target_values = np.unique(y_train)
~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/dtreeviz/shadow.py in node_samples(tree_model, data)
198 # Doc say: "Return a node indicator matrix where non zero elements
199 # indicates that the samples goes through the nodes."
--> 200 dec_paths = tree_model.decision_path(data)
201
202 # each sample has path taken down tree
~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/sklearn/tree/_classes.py in decision_path(self, X, check_input)
495 indicates that the samples goes through the nodes.
496 """
--> 497 X = self._validate_X_predict(X, check_input)
498 return self.tree_.decision_path(X)
499
~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/sklearn/tree/_classes.py in _validate_X_predict(self, X, check_input)
378 """Validate X whenever one tries to predict, apply, predict_proba"""
379 if check_input:
--> 380 X = check_array(X, dtype=DTYPE, accept_sparse="csr")
381 if issparse(X) and (X.indices.dtype != np.intc or
382 X.indptr.dtype != np.intc):
~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
529 array = array.astype(dtype, casting="unsafe", copy=False)
530 else:
--> 531 array = np.asarray(array, order=order, dtype=dtype)
532 except ComplexWarning:
533 raise ValueError("Complex data not supported\n"
~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: could not convert string to float: 'a' |
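The traceback bottoms out in sklearn's check_array, which requires numeric input. A minimal sketch of the fix, reusing the toy frame from above: fit and visualize with the same encoded matrix. The dtreeviz call is left commented here since the only change is which columns get passed in.

```python
import pandas as pd
from sklearn import tree

# same toy frame as in the example above
df = pd.DataFrame({"feature_1": ["a","a","a","a","a","b","b","b","b","b"],
                   "feature_2": [0,0,0,0,1,0,1,1,1,1],
                   "target":    [0,0,0,0,0,1,1,1,1,1]})
# encode the categorical column to integers before fitting
df["feature_1_converted"] = df["feature_1"].map({"a": 0, "b": 1})

X = df[["feature_1_converted", "feature_2"]]
clf = tree.DecisionTreeClassifier(max_depth=1)
clf.fit(X, df["target"])

# the key change: hand dtreeviz the *same encoded matrix* the model was
# trained on; only the display names may differ
# viz = dtreeviz(clf, X, df["target"],
#                target_name='target',
#                feature_names=["feature_1", "feature_2"],
#                class_names=["0", "1"])
```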
Can you use the LabelEncoder()? Or do it manually via https://github.com/parrt/stratx/blob/master/notebooks/support.py
|
ooh! you can't use the raw strings for prediction;
gotta use the encoded feature_1 |
I do it all the time. have you looked at the examples? |
Hi @parrt, sure, I will take a look soon, hopefully today
…On Tue, Apr 14, 2020, 23:05 Terence Parr ***@***.***> wrote:
@tlapusan <https://github.com/tlapusan> did we break something? Can you
take a look and help @pplonski <https://github.com/pplonski> ?
|
@parrt we didn't break anything ;) @pplonski it was very helpful that you sent your code, it helped a lot with debugging. Please leave a comment if it's working for you now. |
Do you think it is possible to pass the original categorical values, to be printed in the output tree? I would like to see 'a' and 'b' in the split nodes. |
Hang on. you're not talking about the target. ok, let me look. |
@tlapusan ha! We don't have an example where the tree nodes are cat vars! We should think about this. Nonetheless, you gotta pass in encoded vars to the classifier. We just need a split cat node example that shows how to get labels. Here's an example with a catvar split node, such as ProductID. Not sure we can label those when there are so many. Maybe we just label the cats on either side of the split point? |
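The "label the cats on either side of the split point" idea can be sketched in plain Python (the code_to_label mapping and the example categories below are made up for illustration, not part of dtreeviz):

```python
def boundary_labels(threshold, code_to_label):
    """For a split of the form "encoded_feature <= threshold", return the
    labels of the two categories immediately on either side of the split
    point, instead of listing every category."""
    codes = sorted(code_to_label)
    left = [c for c in codes if c <= threshold]
    right = [c for c in codes if c > threshold]
    return (code_to_label[left[-1]] if left else None,
            code_to_label[right[0]] if right else None)

# hypothetical encoding of a high-cardinality feature like ProductID
labels = {0: "books", 1: "games", 2: "music", 3: "tools"}
print(boundary_labels(1.5, labels))
```

For a 10,000-label feature this shows only the two categories that straddle the threshold, which keeps the node readable at the cost of hiding the rest.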
Is there an option to pass encoding as an argument? |
@parrt we do have an example for categorical nodes, but not in the readme page. We have one here, on the Titanic dataset (cabin_label feature): https://github.com/parrt/dtreeviz/blob/master/notebooks/tree_structure_example.ipynb. It would be nice and helpful to create a GitHub wiki, to document the library even better. Putting everything in the readme is kind of hard to follow and browse, especially when the library will contain even more visualizations :) Right, if the categorical variable has high cardinality, it's gonna be very hard to display the raw labels... and maybe it's even more confusing to do so. But yes, we need to see and discuss a more concrete example to see how it looks. Only in the case of categorical ordinal features would it make more sense to display raw values. But I don't know an automatic way to detect encoded ordinal features. There are many ways to encode categorical variables, and to implement specific code for all of them... I don't know if it's worth it. @pplonski what's the cardinality of your categorical features? |
Hi, |
Hi @chenhajaj. Cats are allowed, but it shows their unique cat code at the moment. So we just need a way to indicate the cat label, but what if there are 10,000 labels? |
Can you please direct me to the relevant code? I have one feature that is categorical with three possible values; I had to convert it to dummy columns, and as you can expect it looks bad when I plot the decision tree. |
Hi guys, we're thinking about how to solve this. Maybe we show up to some |
@parrt good idea! There can be many ways to handle categoricals, so only the category labels requested by the user could be displayed. Maybe it can be done in a similar way to how class names are displayed? The user gives a dict as an input argument:

feature_category_labels = {
    "feature_1": {
        0: "category_1",
        1: "category_2",
        ...
    },
    # next features
} |
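A sketch of how a viewer could consume such a mapping. Note feature_category_labels is the proposed (hypothetical) argument from the comment above, not an existing dtreeviz parameter:

```python
# hypothetical user-supplied mapping, per the proposal above
feature_category_labels = {
    "feature_1": {0: "category_1", 1: "category_2"},
}

def display_value(feature_name, encoded_value,
                  labels=feature_category_labels):
    """Map an encoded split value back to its human-readable label,
    falling back to the raw number for non-categorical features."""
    return labels.get(feature_name, {}).get(encoded_value, encoded_value)

print(display_value("feature_1", 0))
print(display_value("feature_2", 3))
```

Falling back to the raw value means features without an entry in the dict render exactly as they do today.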
Hey! Any update on the issue? Is there a workaround? My cardinality is low (<10). Thank you! |
@mihagazvoda after spending a few tens of minutes looking into the code, I remembered that we implemented this for TensorFlow random forests, because TF can also support categorical (string) values as features. A workaround would be to use TF instead of what you are using now... Would that be OK for you? |
Sklearn decision trees don't work with categorical values. Before using a decision tree, I'm converting categoricals, for example to integers (with
LabelEncoder
). In the tree visualization, the converted value is presented. Is there an option to handle categoricals better?
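The LabelEncoder workaround mentioned above can be sketched as a minimal round-trip; keeping the fitted encoder around lets you translate codes back to the original strings for display:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(["a", "a", "b", "c", "b"])
enc = LabelEncoder()
codes = enc.fit_transform(s)              # integer codes, one per row
# classes_[i] is the original label for code i; keep it for display
mapping = dict(enumerate(enc.classes_))
originals = enc.inverse_transform(codes)  # back to the raw strings
print(mapping)
print(list(originals))
```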