Update explainer_base.py #424

Open: 4 commits to be merged into base branch main

praveenjune17

Issue context
I get "ValueError: ('Feature', {}, 'has a value outside the dataset.')" when generating counterfactuals with the following combination of settings:
dice_ml.Data = metadata properties for each feature
algorithm = genetic
query_size > 1
permitted_range = None

Why does the code fail for the above combination?
It turns out that the values of the categorical features in the query instance are not label encoded, while the values in feature_to_vary are label encoded. This mismatch makes the code fail with the ValueError. It happens only with the 'genetic' method, and only when permitted_range is not supplied.
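The mismatch described above can be illustrated with a small toy example. This is only a sketch under stated assumptions: `label_encode` and `check_in_range` are hypothetical helpers written for illustration and are not DiCE internals. It shows how a raw category passes a check against raw categories but fails against a label-encoded range.

```python
# Toy illustration only: these helpers are hypothetical, not DiCE internals.
def label_encode(categories):
    """Map each category to an integer code, in sorted order."""
    return {cat: code for code, cat in enumerate(sorted(categories))}


def check_in_range(feature, value, allowed):
    """Raise an error in the style of the one reported when value is not allowed."""
    if value not in allowed:
        raise ValueError("Feature", feature, "has a value outside the dataset.")


workclass = ['Government', 'Other/Unknown', 'Private', 'Self-Employed']
codes = label_encode(workclass)
encoded_range = list(codes.values())  # integer codes instead of strings

# Raw query value checked against the raw categories: passes silently.
check_in_range('workclass', 'Private', workclass)

# Raw query value checked against the label-encoded range: raises.
try:
    check_in_range('workclass', 'Private', encoded_range)
    mismatch_error = None
except ValueError as err:
    mismatch_error = err
```

In this sketch the second check fails because the string 'Private' is compared against integer codes, which mirrors the kind of mismatch reported here.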

Code to recreate the issue.

from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import dice_ml
from dice_ml.utils import helpers # helper functions

dataset = helpers.load_adult_income_dataset()
target = dataset["income"]
train_dataset, test_dataset, y_train, y_test = train_test_split(dataset,
                                                                target,
                                                                test_size=0.2,
                                                                random_state=0,
                                                                stratify=target)
x_train = train_dataset.drop('income', axis=1)
x_test = test_dataset.drop('income', axis=1)

d = dice_ml.Data(features={'age': [17, 90],
                           'workclass': ['Government', 'Other/Unknown', 'Private', 'Self-Employed'],
                           'education': ['Assoc', 'Bachelors', 'Doctorate', 'HS-grad', 'Masters',
                                         'Prof-school', 'School', 'Some-college'],
                           'marital_status': ['Divorced', 'Married', 'Separated', 'Single', 'Widowed'],
                           'occupation': ['Blue-Collar', 'Other/Unknown', 'Professional', 'Sales', 'Service', 'White-Collar'],
                           'race': ['Other', 'White'],
                           'gender': ['Female', 'Male'],
                           'hours_per_week': [1, 99]},
                 outcome_name='income')

numerical = ["age", "hours_per_week"]
categorical = x_train.columns.difference(numerical)
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
transformations = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.

clf = Pipeline(steps=[('preprocessor', transformations),
                      ('classifier', RandomForestClassifier())])
model = clf.fit(x_train, y_train)

# Set the number of data points required in the query set

data_point = 2
m = dice_ml.Model(model=model, backend="sklearn")
exp = dice_ml.Dice(d, m, method="genetic")

# query instance in the form of a dictionary; keys: feature name, values: feature value

query_instance = pd.DataFrame({'age': [22]*data_point,
                               'workclass': ['Private']*data_point,
                               'education': ['HS-grad']*data_point,
                               'marital_status': ['Single']*data_point,
                               'occupation': ['Service']*data_point,
                               'race': ['White']*data_point,
                               'gender': ['Female']*data_point,
                               'hours_per_week': [45]*data_point}, index=list(range(data_point)))

# generate counterfactuals

dice_exp = exp.generate_counterfactuals(query_instance,
                                        total_CFs=4,
                                        desired_class="opposite",
                                        initialization="random")

# visualize the results

dice_exp.visualize_as_dataframe(show_only_changes=True)

Proposed fix
This fix makes sure get_features_range(permitted_range) is executed whether or not permitted_range is supplied.
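The idea behind the fix can be sketched as a single code path that always resolves feature ranges through one function, so the default case and the user-supplied case are treated the same way. This is a minimal sketch, not the actual explainer_base.py change: `ToyDataInterface`, `resolve_ranges`, and the dictionary return shape are assumptions made for illustration, and only the name get_features_range comes from the description above.

```python
# Hypothetical stand-ins for illustration; not the actual DiCE classes.
class ToyDataInterface:
    def __init__(self, default_ranges):
        self.default_ranges = default_ranges

    def get_features_range(self, permitted_range=None):
        """Return the default ranges with any user overrides applied on top.

        The return shape here (a plain dict) is an assumption of this sketch.
        """
        ranges = {name: list(vals) for name, vals in self.default_ranges.items()}
        if permitted_range is not None:
            for name, vals in permitted_range.items():
                ranges[name] = list(vals)
        return ranges


def resolve_ranges(data_interface, permitted_range=None):
    # Single code path: called whether or not permitted_range is supplied,
    # so both cases produce ranges in the same representation.
    return data_interface.get_features_range(permitted_range)


data = ToyDataInterface({'age': [17, 90], 'gender': ['Female', 'Male']})
default_ranges = resolve_ranges(data)
custom_ranges = resolve_ranges(data, {'age': [20, 30]})
```

With one entry point, a validity check that runs afterwards sees ranges built the same way in both cases, rather than a precomputed default in one branch and a freshly computed result in the other.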

Bug fix to resolve "ValueError: ('Feature', {}, 'has a value outside the dataset.')" caused by the 'genetic' method when used for private data with a query instance size > 1

Signed-off-by: Praveenkumar <praveen1050208@gmail.com>
Remove duplicate code

Signed-off-by: Praveenkumar <praveen1050208@gmail.com>
Collaborator

gaugup left a comment

@praveenjune17, could you please add a unit test for this change? It should be easy since you know how to re-create the issue.

Add test query dataset and model

Signed-off-by: Praveenkumar <praveen1050208@gmail.com>
Add test case for the fix

Signed-off-by: Praveenkumar <praveen1050208@gmail.com>
@praveenjune17
Author

@gaugup, please review the test cases.

praveenjune17 requested a review from gaugup on January 1, 2024, 15:20
2 participants