Open
Labels
enhancement: New feature or request
Description
I would like to add a new function `error_rate_by_feature` to the XAI module. This function computes false negative (FN) and false positive (FP) counts and rates, grouped by the values of a chosen feature.
This makes it possible to detect model underperformance in specific segments (e.g., a high FN rate in a specific "Region" or "Category"). The function should handle multi-class inputs by treating `positive_class` as the positive label and all other classes as negative (One-vs-Rest).
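The One-vs-Rest handling described above could look roughly like the following minimal sketch. The helper name `label_error_types` and the toy labels are hypothetical and only illustrate the idea; they are not part of the XAI module.

```python
import numpy as np


def label_error_types(y_true, y_pred, positive_class):
    """Label each row 'TP', 'TN', 'FP' or 'FN', treating positive_class one-vs-rest."""
    # Every class other than positive_class is collapsed into "negative".
    true_pos = np.asarray(y_true) == positive_class
    pred_pos = np.asarray(y_pred) == positive_class
    return np.select(
        [true_pos & pred_pos, true_pos & ~pred_pos, ~true_pos & pred_pos],
        ["TP", "FN", "FP"],
        default="TN",
    )


# Toy three-class example (hypothetical labels).
y_true = ["won", "lost", "open", "won", "lost"]
y_pred = ["won", "won", "open", "lost", "lost"]
print(label_error_types(y_true, y_pred, positive_class="won"))
```

Note that under this scheme a row with true class "open" predicted as "lost" counts as a true negative, since neither label is the positive class.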
Requirements
- Input Parameters:
  - `y_true`: Array-like of true labels.
  - `y_pred`: Array-like of predicted labels.
  - `y_proba`: Array-like of predicted probabilities.
  - `X`: DataFrame containing the features.
  - `feature_name`: str. Feature to group by.
  - `positive_class`: The class to treat as positive.
  - `sort_metric`: str = 'FN'. Metric to sort groups by (options: 'FN', 'FP', 'error_rate', etc.).
  - `bins`: int = 10. For binning numerical features.
  - `threshold`: float = 0.5.
- Functionality:
  - Compute `error_df` internally with an 'error_type' column ('TP', 'TN', 'FP', 'FN'). Here is skeleton code for the binary case:

        import numpy as np

        # Attach labels and predictions to a copy of the features
        error_df = X_test.copy()
        error_df["y_true"] = y_test
        error_df["y_pred"] = y_pred
        error_df["y_proba"] = y_proba
        # Tag each row as a false negative, a false positive, or correct
        error_df["error_type"] = np.select(
            [
                (error_df.y_true == 1) & (error_df.y_pred == 0),
                (error_df.y_true == 0) & (error_df.y_pred == 1),
            ],
            ["false_negative", "false_positive"],
            default="correct",
        )

  - Group by the binned/categorical `feature_name`.
  - Calculate counts and rates for each error type, with error_rate = (FP + FN) / total.
  - Sort the summary by `sort_metric` (default 'FN' count or rate).
  - Output the summary DataFrame.
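To make the functionality steps above concrete, here is one possible sketch of the grouping, rate, and sorting logic. It is an illustration under assumptions, not the final implementation: the name `error_rate_by_feature_sketch` is hypothetical, the rule "bin a numeric feature only when it has more than `bins` distinct values" is an assumption, and `y_proba` and `threshold` are left out because the issue does not specify how they should be used.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def error_rate_by_feature_sketch(y_true, y_pred, X, feature_name,
                                 positive_class, sort_metric="FN", bins=10):
    # One-vs-rest: only positive_class counts as positive.
    true_pos = np.asarray(y_true) == positive_class
    pred_pos = np.asarray(y_pred) == positive_class

    col = X[feature_name]
    # Assumption: bin numeric features with many distinct values, group others as-is.
    if is_numeric_dtype(col) and col.nunique() > bins:
        groups = pd.cut(col, bins=bins)
    else:
        groups = col

    flags = pd.DataFrame(
        {
            "FN": (true_pos & ~pred_pos).astype(int),
            "FP": (~true_pos & pred_pos).astype(int),
        },
        index=X.index,
    )
    summary = flags.groupby(groups, observed=True).agg(
        total=("FN", "count"), FN=("FN", "sum"), FP=("FP", "sum")
    )
    summary["fn_rate"] = summary["FN"] / summary["total"]
    summary["fp_rate"] = summary["FP"] / summary["total"]
    summary["error_rate"] = (summary["FP"] + summary["FN"]) / summary["total"]

    if sort_metric not in summary.columns:
        raise ValueError(f"Unknown sort_metric: {sort_metric!r}")
    return summary.sort_values(sort_metric, ascending=False)


# Toy data (hypothetical) to show the call shape.
X_demo = pd.DataFrame({"region": ["north", "north", "south", "south", "south"]})
y_true_demo = ["won", "won", "lost", "won", "lost"]
y_pred_demo = ["lost", "won", "won", "lost", "lost"]
summary_demo = error_rate_by_feature_sketch(
    y_true_demo, y_pred_demo, X_demo, "region",
    positive_class="won", sort_metric="error_rate",
)
print(summary_demo)
```

Validating `sort_metric` against the summary columns keeps the accepted options in sync with whatever metrics the function ends up computing.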
Code Skeleton to Add:

    def error_rate_by_feature(df, feature, min_count=50):
        # Count rows and each error type per feature value
        summary = (
            df.groupby(feature)
            .agg(
                total=("error_type", "count"),
                false_negatives=("error_type", lambda x: (x == "false_negative").sum()),
                false_positives=("error_type", lambda x: (x == "false_positive").sum()),
            )
        )
        # Rates are relative to all rows in the group
        summary["fn_rate"] = summary["false_negatives"] / summary["total"]
        summary["fp_rate"] = summary["false_positives"] / summary["total"]
        # Drop groups with fewer than min_count rows
        summary = summary[summary["total"] >= min_count]
        return summary.sort_values("fn_rate", ascending=False)

- Example usage:
```
summary = error_rate_by_feature(y_true, y_pred, y_proba, features_df, "size_max", positive_class="won")
```
Here is an example result:
| size_min | total | false_negatives | false_positives | fn_rate | fp_rate |
|---|---|---|---|---|---|
| 10001.0 | 96 | 15 | 2 | 0.156250 | 0.020833 |
| 1001.0 | 297 | 46 | 6 | 0.154882 | 0.020202 |
| 201.0 | 333 | 45 | 14 | 0.135135 | 0.042042 |
| 501.0 | 217 | 25 | 4 | 0.115207 | 0.018433 |
| 51.0 | 332 | 33 | 29 | 0.099398 | 0.087349 |
| 5001.0 | 67 | 3 | 0 | 0.044776 | 0.000000 |
| 11.0 | 371 | 6 | 17 | 0.016173 | 0.045822 |
| 1.0 | 444 | 4 | 12 | 0.009009 | 0.027027 |
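As a quick sanity check on how the rates in this table are defined, the short sketch below recomputes `fn_rate` and `fp_rate` for the first three rows from their counts. It suggests that both rates are shares of all rows in the group (count / total). The conventional false negative rate, FN / (FN + TP), would need per-group TP counts, which the skeleton does not collect.

```python
import numpy as np
import pandas as pd

# Counts copied from the first three rows of the example table.
counts = pd.DataFrame(
    {
        "total": [96, 297, 333],
        "false_negatives": [15, 46, 45],
        "false_positives": [2, 6, 14],
    },
    index=pd.Index([10001.0, 1001.0, 201.0], name="size_min"),
)

fn_rate = counts["false_negatives"] / counts["total"]
fp_rate = counts["false_positives"] / counts["total"]

# Compare against the rates reported in the table.
print(np.allclose(fn_rate, [0.156250, 0.154882, 0.135135], atol=1e-6))
print(np.allclose(fp_rate, [0.020833, 0.020202, 0.042042], atol=1e-6))
```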