FIDAP (Feature Importance by Data Permutation)

Python 3.8+ · License: GPL v3

FIDAP (Feature Importance by DAta Permutation) is a model-agnostic feature importance analysis tool: it estimates how much a trained machine learning model relies on each feature by permuting that feature's values and measuring the resulting change in performance.

Overview

FIDAP evaluates model performance with a task-appropriate metric (R² for regression, accuracy for classification, silhouette score for clustering) while shuffling the data for one feature at a time. If a feature is highly important, shuffling its values significantly degrades the metric; conversely, permuting a low-importance feature has little effect.

This method is inspired by the feature importance analysis described in Breiman's Random Forest paper.
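
At its core, the procedure is a permute-and-re-score loop. The following minimal sketch illustrates the idea for a classifier (an illustration only, not FIDAP's internal code; permutation_drop is a hypothetical helper):

import numpy as np
from sklearn.metrics import accuracy_score

def permutation_drop(model, X, y, col, n_simulations=100, seed=0):
    """Mean drop in accuracy after repeatedly shuffling column `col` of X."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    drops = []
    for _ in range(n_simulations):
        X_perm = np.array(X, copy=True)
        # Permuting one column breaks its link to the target while
        # leaving every other feature untouched.
        X_perm[:, col] = rng.permutation(X_perm[:, col])
        drops.append(baseline - accuracy_score(y, model.predict(X_perm)))
    return float(np.mean(drops))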

Flowchart

Figure 1: Flowchart of the FIDAP method

Installation

Using uv (Recommended)

uv is a fast Python package installer and resolver.

# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/vd1371/FIDAP.git
cd FIDAP

# Install FIDAP and its dependencies
uv pip install -e .

# Or install with optional dependencies
uv pip install -e ".[all]"  # Includes keras, xgboost, catboost, and dev tools

Using pip

# Clone the repository
git clone https://github.com/vd1371/FIDAP.git
cd FIDAP

# Install FIDAP
pip install -e .

# Or install with optional dependencies
pip install -e ".[all]"

Development Installation

For development, install with dev dependencies:

uv pip install -e ".[dev]"

This includes testing tools (pytest, pytest-cov), code formatting (black, ruff), and type checking (mypy).

Quick Start

Basic Usage

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from FIDAP import FeatureImportanceAnalyzer

# Load data
data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Analyze feature importance
analyzer = FeatureImportanceAnalyzer(
    model, 
    X_test, 
    y_test,
    n_feature_combination=1,
    n_simulations=100
)

# Run analysis and generate reports
analyzer.run()

# Print results
print(analyzer)

Output Example

Feature                                                                  FIDAP  
--------------------------------------------------------------------------------
F(0,)-sepal length (cm)                                                 -0.0200
F(1,)-sepal width (cm)                                                  -0.0089
F(2,)-petal length (cm)                                                  0.1067
F(3,)-petal width (cm)                                                   0.3000
--------------------------------------------------------------------------------
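
Higher values indicate greater importance: a clearly positive FIDAP score means shuffling that feature degraded the metric, while scores near zero (or slightly negative, which can occur from permutation noise) suggest the model barely relies on that feature.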

Features

  • Model-agnostic: Works with any scikit-learn compatible model
  • Multiple model types: Supports classification, regression, and clustering
  • Feature combinations: Analyze importance of feature combinations
  • Customizable metrics: Use any metric function or sklearn scorer string
  • Comprehensive reports: Generates boxplots and statistical summaries
  • Type hints: Full type annotation support for better IDE integration

Supported Models

Classification

  • Random Forest
  • Support Vector Machine
  • Multi-layer Perceptron
  • Decision Tree
  • Extra Trees
  • Radius Neighbors
  • Passive Aggressive
  • Gradient Boosting
  • CatBoost
  • K Nearest Neighbors
  • Logistic Regression
  • Naïve Bayes

Regression

  • Linear Regression
  • Support Vector Regression
  • Deep Neural Networks (Keras/TensorFlow)
  • Decision Tree
  • Extra Trees
  • Passive Aggressive
  • Gradient Boosting
  • XGBoost
  • CatBoost

Clustering

  • K-Means
  • Mean Shift

API Reference

FeatureImportanceAnalyzer

Main class for feature importance analysis.

Parameters

  • model (Any): Trained prediction or clustering model
  • X (pd.DataFrame, list, or np.ndarray): Input features (2D)
  • Y (pd.DataFrame, pd.Series, list, np.ndarray, or None): Target variable (1D, optional for clustering)
  • features (List[str], optional): Custom feature names
  • metric_fn (str, callable, or None): Metric function (default: auto-detected)
  • n_simulations (int, default=100): Number of permutations per feature
  • pred_fn (str, default="predict"): Prediction method name
  • direc (str or Path, default="."): Output directory
  • verbose (bool, default=False): Print progress messages
  • n_feature_combination (int, default=1): Number of features to permute together
  • output_fig_format (str, default="jpg"): Figure format
  • modelling_type (str, optional): Model type ("classification", "regression", "clustering")
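
A sketch combining several of these parameters (values are illustrative; parameter names follow the list above):

analyzer = FeatureImportanceAnalyzer(
    model,
    X_test,
    y_test,
    features=["sepal_len", "sepal_wid", "petal_len", "petal_wid"],  # custom names
    metric_fn="accuracy",             # sklearn scorer string
    n_simulations=200,
    direc="reports",                  # boxplot and summary CSV are written here
    verbose=True,
    output_fig_format="png",
    modelling_type="classification",
)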

Methods

  • get(verbose=False): Calculate feature importance values
  • boxplot(): Generate and save boxplot
  • summary(): Generate and save statistical summary CSV
  • run(): Run complete analysis (boxplot + summary)
  • __str__(): Return formatted string representation

Attributes

  • features_importance: Dictionary of mean importance scores
  • features_importance_instances: Dictionary of importance value lists
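
Both attributes are populated once get() or run() has executed; a small sketch of reading them (assuming the dictionary keys are the feature labels shown in the printed report):

analyzer.run()
scores = analyzer.features_importance                   # {label: mean importance}
best = max(scores, key=scores.get)                      # most important feature
per_run = analyzer.features_importance_instances[best]  # one value per simulation
print(best, len(per_run))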

Examples

Classification Example

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from FIDAP import FeatureImportanceAnalyzer

data = load_iris()
X, y = data.data, data.target

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

analyzer = FeatureImportanceAnalyzer(
    model, X, y,
    n_simulations=50,
    verbose=True
)
importance = analyzer.get()
analyzer.boxplot()
analyzer.summary()

Regression Example

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import load_diabetes
from FIDAP import FeatureImportanceAnalyzer

data = load_diabetes()
X, y = data.data, data.target

model = GradientBoostingRegressor(random_state=42)
model.fit(X, y)

analyzer = FeatureImportanceAnalyzer(
    model, X, y,
    n_simulations=100,
    metric_fn="r2"
)
analyzer.run()
print(analyzer)

Clustering Example

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from FIDAP import FeatureImportanceAnalyzer

X, _ = make_blobs(n_samples=300, n_features=4, centers=3, random_state=42)

model = KMeans(n_clusters=3, random_state=42, n_init="auto")
model.fit(X)

analyzer = FeatureImportanceAnalyzer(
    model, X,
    n_simulations=50
)
analyzer.run()

Feature Combinations

Analyze importance of feature pairs:

analyzer = FeatureImportanceAnalyzer(
    model, X, y,
    n_feature_combination=2,  # Analyze pairs of features
    n_simulations=100
)
analyzer.run()
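
Note that the number of permuted groups grows combinatorially: with n features and combination size k there are C(n, k) groups, so the four Iris features give 6 pairs at n_feature_combination=2, and runtime grows accordingly.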

Custom Metric

from sklearn.metrics import f1_score

analyzer = FeatureImportanceAnalyzer(
    model, X, y,
    metric_fn=f1_score,  # Custom metric function
    n_simulations=100
)
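
Passing f1_score directly assumes a binary target, since sklearn's f1_score defaults to average="binary". For multiclass data you can wrap it in a small lambda (a sketch using the same metric_fn hook):

from sklearn.metrics import f1_score

analyzer = FeatureImportanceAnalyzer(
    model, X, y,
    # wrapper fixes the multiclass averaging strategy explicitly
    metric_fn=lambda y_true, y_pred: f1_score(y_true, y_pred, average="macro"),
    n_simulations=100,
)
analyzer.run()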

Results and Insights

Using FIDAP, you can sort features by importance. For example, the figure below shows feature importance analysis for a Random Forest model on the Iris dataset. The most critical feature is "petal length," while "sepal width" is the least important.

Figure 2: Feature importance analysis for a Random Forest model on the Iris dataset

Testing

Run tests using pytest:

# Run all tests
pytest

# Run with coverage
pytest --cov=FIDAP --cov-report=html

# Run specific test file
pytest tests/test_feature_importance_analyzer.py

Development

Code Formatting

# Format code with black
black FIDAP tests

# Lint with ruff
ruff check FIDAP tests

# Type checking with mypy
mypy FIDAP

Project Structure

FIDAP/
├── FIDAP/                    # Main package
│   ├── __init__.py
│   ├── FeatureImportanceAnalyzer.py
│   ├── PixelImportanceAnalyzer.py
│   ├── check_X_Y_type_and_shape.py
│   ├── get_metric_fn.py
│   ├── get_type_of_modelling.py
│   ├── prepare_X_Y_features.py
│   ├── plot_box_and_save.py
│   ├── summarize.py
│   ├── get_string_report.py
│   └── get_importance/
│       ├── __init__.py
│       ├── get_features_importance.py
│       └── get_pixel_importance.py
├── tests/                    # Test suite
│   ├── conftest.py
│   ├── test_feature_importance_analyzer.py
│   └── test_helper_functions.py
├── pyproject.toml            # Project configuration
└── README.md

Requirements

Core Dependencies

  • Python >= 3.8
  • NumPy >= 1.20.0
  • Pandas >= 1.3.0
  • SciPy >= 1.7.0
  • scikit-learn >= 1.0.0
  • Matplotlib >= 3.3.0

Optional Dependencies

  • Keras >= 2.8.0 (for DNN models)
  • TensorFlow >= 2.8.0 (for DNN models)
  • XGBoost >= 1.5.0 (for XGBoost models)
  • CatBoost >= 1.0.0 (for CatBoost models)

Citation

If you use FIDAP in your research, please cite:

L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32, 2001.

License

© Vahid Asghari, Amin Baratian 2022. Licensed under the GNU General Public License v3.0 (GPLv3).

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Issues

If you encounter any issues or have questions, please open an issue on GitHub.

Acknowledgments

This project implements the permutation-based feature importance method described in Breiman's Random Forest paper, making it accessible as a standalone tool for any scikit-learn compatible model.
