FIDAP (Feature Importance by Data Permutation)

Python 3.8+ · License: GPL v3

FIDAP (Feature Importance by DAta Permutation) is a model-agnostic feature importance analysis tool: it estimates how much a trained machine learning model relies on each feature by permuting that feature's values and measuring the resulting change in performance.

Overview

FIDAP evaluates model performance with a task-appropriate metric (R² for regression, accuracy for classification, silhouette score for clustering) while shuffling the data for one feature at a time. If a feature is highly important, shuffling its values significantly degrades the metric; conversely, permuting a low-importance feature has little effect.

This method is inspired by the feature importance analysis described in Breiman's Random Forest paper.
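
At its core, the procedure is a permute-and-re-score loop. The following minimal sketch illustrates the idea for a classifier (an illustration only, not FIDAP's internal code; permutation_drop is a hypothetical helper):

import numpy as np
from sklearn.metrics import accuracy_score

def permutation_drop(model, X, y, col, n_simulations=100, seed=0):
    """Mean drop in accuracy after repeatedly shuffling column `col` of X."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    drops = []
    for _ in range(n_simulations):
        X_perm = np.array(X, copy=True)
        # Permuting one column breaks its link to the target while
        # leaving every other feature untouched.
        X_perm[:, col] = rng.permutation(X_perm[:, col])
        drops.append(baseline - accuracy_score(y, model.predict(X_perm)))
    return float(np.mean(drops))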

Flowchart

Figure 1: Flowchart of the FIDAP method

Installation

Using uv (Recommended)

uv is a fast Python package installer and resolver.

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/vd1371/FIDAP.git
cd FIDAP

# Install FIDAP and its dependencies
uv pip install -e .

# Or install with optional dependencies
uv pip install -e ".[all]"  # Includes keras, xgboost, catboost, and dev tools

Using pip

# Clone the repository
git clone https://github.com/vd1371/FIDAP.git
cd FIDAP

# Install FIDAP
pip install -e .

# Or install with optional dependencies
pip install -e ".[all]"

Development Installation

For development, install with dev dependencies:

uv pip install -e ".[dev]"

This includes testing tools (pytest, pytest-cov), code formatting (black, ruff), and type checking (mypy).

Quick Start

Basic Usage

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from FIDAP import FeatureImportanceAnalyzer

# Load data
data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Analyze feature importance
analyzer = FeatureImportanceAnalyzer(
    model, 
    X_test, 
    y_test,
    n_feature_combination=1,
    n_simulations=100
)

# Run analysis and generate reports
analyzer.run()

# Print results
print(analyzer)

Output Example

Feature                                                                  FIDAP  
--------------------------------------------------------------------------------
F(0,)-sepal length (cm)                                                 -0.0200
F(1,)-sepal width (cm)                                                  -0.0089
F(2,)-petal length (cm)                                                  0.1067
F(3,)-petal width (cm)                                                   0.3000
--------------------------------------------------------------------------------
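
Higher values indicate greater importance: a clearly positive FIDAP score means shuffling that feature degraded the metric, while scores near zero (or slightly negative, which can occur from permutation noise) suggest the model barely relies on that feature.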

Features

  • Model-agnostic: Works with any scikit-learn compatible model
  • Multiple model types: Supports classification, regression, and clustering
  • Feature combinations: Analyze importance of feature combinations
  • Customizable metrics: Use any metric function or sklearn scorer string
  • Comprehensive reports: Generates boxplots and statistical summaries
  • Type hints: Full type annotation support for better IDE integration

Supported Models

Classification

  • Random Forest
  • Support Vector Machine
  • Multi-layer Perceptron
  • Decision Tree
  • Extra Trees
  • Radius Neighbors
  • Passive Aggressive
  • Gradient Boosting
  • CatBoost
  • K Nearest Neighbors
  • Logistic Regression
  • Naïve Bayes

Regression

  • Linear Regression
  • Support Vector Regression
  • Deep Neural Networks (Keras/TensorFlow)
  • Decision Tree
  • Extra Trees
  • Passive Aggressive
  • Gradient Boosting
  • XGBoost
  • CatBoost

Clustering

  • K-Means
  • Mean Shift

API Reference

FeatureImportanceAnalyzer

Main class for feature importance analysis.

Parameters

  • model (Any): Trained prediction or clustering model
  • X (pd.DataFrame, list, or np.ndarray): Input features (2D)
  • Y (pd.DataFrame, pd.Series, list, np.ndarray, or None): Target variable (1D, optional for clustering)
  • features (List[str], optional): Custom feature names
  • metric_fn (str, callable, or None): Metric function (default: auto-detected)
  • n_simulations (int, default=100): Number of permutations per feature
  • pred_fn (str, default="predict"): Prediction method name
  • direc (str or Path, default="."): Output directory
  • verbose (bool, default=False): Print progress messages
  • n_feature_combination (int, default=1): Number of features to permute together
  • output_fig_format (str, default="jpg"): Figure format
  • modelling_type (str, optional): Model type ("classification", "regression", "clustering")
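
A sketch combining several of these parameters (values are illustrative; parameter names follow the list above):

analyzer = FeatureImportanceAnalyzer(
    model,
    X_test,
    y_test,
    features=["sepal_len", "sepal_wid", "petal_len", "petal_wid"],  # custom names
    metric_fn="accuracy",             # sklearn scorer string
    n_simulations=200,
    direc="reports",                  # boxplot and summary CSV are written here
    verbose=True,
    output_fig_format="png",
    modelling_type="classification",
)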

Methods

  • get(verbose=False): Calculate feature importance values
  • boxplot(): Generate and save boxplot
  • summary(): Generate and save statistical summary CSV
  • run(): Run complete analysis (boxplot + summary)
  • __str__(): Return formatted string representation

Attributes

  • features_importance: Dictionary of mean importance scores
  • features_importance_instances: Dictionary of importance value lists
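
Both attributes are populated once get() or run() has executed; a small sketch of reading them (assuming the dictionary keys are the feature labels shown in the printed report):

analyzer.run()
scores = analyzer.features_importance                   # {label: mean importance}
best = max(scores, key=scores.get)                      # most important feature
per_run = analyzer.features_importance_instances[best]  # one value per simulation
print(best, len(per_run))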

Examples

Classification Example

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from FIDAP import FeatureImportanceAnalyzer

data = load_iris()
X, y = data.data, data.target

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

analyzer = FeatureImportanceAnalyzer(
    model, X, y,
    n_simulations=50,
    verbose=True
)
importance = analyzer.get()
analyzer.boxplot()
analyzer.summary()

Regression Example

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import load_diabetes
from FIDAP import FeatureImportanceAnalyzer

data = load_diabetes()
X, y = data.data, data.target

model = GradientBoostingRegressor(random_state=42)
model.fit(X, y)

analyzer = FeatureImportanceAnalyzer(
    model, X, y,
    n_simulations=100,
    metric_fn="r2"
)
analyzer.run()
print(analyzer)

Clustering Example

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from FIDAP import FeatureImportanceAnalyzer

X, _ = make_blobs(n_samples=300, n_features=4, centers=3, random_state=42)

model = KMeans(n_clusters=3, random_state=42, n_init="auto")
model.fit(X)

analyzer = FeatureImportanceAnalyzer(
    model, X,
    n_simulations=50
)
analyzer.run()

Feature Combinations

Analyze importance of feature pairs:

analyzer = FeatureImportanceAnalyzer(
    model, X, y,
    n_feature_combination=2,  # Analyze pairs of features
    n_simulations=100
)
analyzer.run()
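
Note that the number of permuted groups grows combinatorially: with n features and combination size k there are C(n, k) groups, so the four Iris features give 6 pairs at n_feature_combination=2, and runtime grows accordingly.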

Custom Metric

from sklearn.metrics import f1_score

analyzer = FeatureImportanceAnalyzer(
    model, X, y,
    metric_fn=f1_score,  # Custom metric function
    n_simulations=100
)
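
Passing f1_score directly assumes a binary target, since sklearn's f1_score defaults to average="binary". For multiclass data you can wrap it in a small lambda (a sketch using the same metric_fn hook):

from sklearn.metrics import f1_score

analyzer = FeatureImportanceAnalyzer(
    model, X, y,
    # wrapper fixes the multiclass averaging strategy explicitly
    metric_fn=lambda y_true, y_pred: f1_score(y_true, y_pred, average="macro"),
    n_simulations=100,
)
analyzer.run()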

Results and Insights

Using FIDAP, you can sort features by importance. For example, the figure below shows feature importance analysis for a Random Forest model on the Iris dataset. The most critical feature is "petal length," while "sepal width" is the least important.

Figure 2: Feature importance analysis for a Random Forest model on the Iris dataset

Testing

Run tests using pytest:

# Run all tests
pytest

# Run with coverage
pytest --cov=FIDAP --cov-report=html

# Run specific test file
pytest tests/test_feature_importance_analyzer.py

Development

Code Formatting

# Format code with black
black FIDAP tests

# Lint with ruff
ruff check FIDAP tests

# Type checking with mypy
mypy FIDAP

Project Structure

FIDAP/
├── FIDAP/                    # Main package
│   ├── __init__.py
│   ├── FeatureImportanceAnalyzer.py
│   ├── PixelImportanceAnalyzer.py
│   ├── check_X_Y_type_and_shape.py
│   ├── get_metric_fn.py
│   ├── get_type_of_modelling.py
│   ├── prepare_X_Y_features.py
│   ├── plot_box_and_save.py
│   ├── summarize.py
│   ├── get_string_report.py
│   └── get_importance/
│       ├── __init__.py
│       ├── get_features_importance.py
│       └── get_pixel_importance.py
├── tests/                    # Test suite
│   ├── conftest.py
│   ├── test_feature_importance_analyzer.py
│   └── test_helper_functions.py
├── pyproject.toml            # Project configuration
└── README.md

Requirements

Core Dependencies

  • Python >= 3.8
  • NumPy >= 1.20.0
  • Pandas >= 1.3.0
  • SciPy >= 1.7.0
  • scikit-learn >= 1.0.0
  • Matplotlib >= 3.3.0

Optional Dependencies

  • Keras >= 2.8.0 (for DNN models)
  • TensorFlow >= 2.8.0 (for DNN models)
  • XGBoost >= 1.5.0 (for XGBoost models)
  • CatBoost >= 1.0.0 (for CatBoost models)

Citation

If you use FIDAP in your research, please cite:

L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32, 2001.

License

© Vahid Asghari, Amin Baratian 2022. Licensed under the GNU General Public License v3.0 (GPLv3).

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Issues

If you encounter any issues or have questions, please open an issue on GitHub.

Acknowledgments

This project implements the permutation-based feature importance method described in Breiman's Random Forest paper, making it accessible as a standalone tool for any scikit-learn compatible model.
