FIDAP (Feature Importance by DAta Permutation) is a model-agnostic tool that evaluates the importance of features in machine learning models using permutation-based methods.
FIDAP evaluates model performance with an appropriate metric (R² for regression, accuracy for classification, silhouette score for clustering) while shuffling the values of one feature at a time. If a feature is highly important, shuffling its values significantly degrades the metric; conversely, permuting a low-importance feature has minimal impact.
This method is inspired by the feature importance analysis described in Breiman's Random Forest paper.
Figure 1: Flowchart of the FIDAP method
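In code terms, the method reduces to a small loop: measure a baseline score, shuffle one feature column, re-score, and record the drop. Below is a minimal, hypothetical sketch of this idea for a classifier; it is not FIDAP's internal implementation, and the function name and structure are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def permutation_importance_sketch(model, X, y, n_simulations=100, seed=0):
    """Hypothetical sketch of permutation importance for a fitted classifier."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    importance = {}
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_simulations):
            X_perm = X.copy()
            # Shuffle one feature column, breaking its link to the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - accuracy_score(y, model.predict(X_perm)))
        # A large mean drop means the model relied heavily on feature j
        importance[j] = float(np.mean(drops))
    return importance
```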
uv is a fast Python package installer and resolver.
```bash
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/vd1371/FIDAP.git
cd FIDAP

# Install FIDAP and its dependencies
uv pip install -e .

# Or install with optional dependencies
uv pip install -e ".[all]"  # Includes keras, xgboost, catboost, and dev tools
```

Alternatively, with pip:

```bash
# Clone the repository
git clone https://github.com/vd1371/FIDAP.git
cd FIDAP

# Install FIDAP
pip install -e .

# Or install with optional dependencies
pip install -e ".[all]"
```

For development, install with dev dependencies:

```bash
uv pip install -e ".[dev]"
```

This includes testing tools (pytest, pytest-cov), code formatting (black, ruff), and type checking (mypy).
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from FIDAP import FeatureImportanceAnalyzer

# Load data
data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Analyze feature importance
analyzer = FeatureImportanceAnalyzer(
    model,
    X_test,
    y_test,
    n_feature_combination=1,
    n_simulations=100
)

# Run analysis and generate reports
analyzer.run()

# Print results
print(analyzer)
```

Sample output:

```
Feature                          FIDAP
--------------------------------------------------------------------------------
F(0,)-sepal length (cm)        -0.0200
F(1,)-sepal width (cm)         -0.0089
F(2,)-petal length (cm)         0.1067
F(3,)-petal width (cm)          0.3000
--------------------------------------------------------------------------------
```
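Each FIDAP value summarizes how the evaluation metric changed across the permutations. Read against the method described above, a large positive value (petal width) indicates that shuffling the feature substantially degraded performance, while values near or below zero (sepal length, sepal width) indicate that the model barely relies on those features.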
Key features:

- Model-agnostic: Works with any scikit-learn compatible model
- Multiple model types: Supports classification, regression, and clustering
- Feature combinations: Analyze importance of feature combinations
- Customizable metrics: Use any metric function or sklearn scorer string
- Comprehensive reports: Generates boxplots and statistical summaries
- Type hints: Full type annotation support for better IDE integration
Supported models include:

Classification:

- Random Forest
- Support Vector Machine
- Multi-layer Perceptron
- Decision Tree
- Extra Trees
- Radius Neighbors
- Passive Aggressive
- Gradient Boosting
- CatBoost
- K Nearest Neighbors
- Logistic Regression
- Naïve Bayes

Regression:

- Linear Regression
- Support Vector Regression
- Deep Neural Networks (Keras/TensorFlow)
- Decision Tree
- Extra Trees
- Passive Aggressive
- Gradient Boosting
- XGBoost
- CatBoost

Clustering:

- K-Means
- Mean Shift
`FeatureImportanceAnalyzer` is the main class for feature importance analysis.

Parameters:

- `model` (`Any`): Trained prediction or clustering model
- `X` (`pd.DataFrame`, `list`, or `np.ndarray`): Input features (2D)
- `Y` (`pd.DataFrame`, `pd.Series`, `list`, `np.ndarray`, or `None`): Target variable (1D; optional for clustering)
- `features` (`List[str]`, optional): Custom feature names
- `metric_fn` (`str`, `callable`, or `None`): Metric function (default: auto-detected)
- `n_simulations` (`int`, default=100): Number of permutations per feature
- `pred_fn` (`str`, default=`"predict"`): Prediction method name
- `direc` (`str` or `Path`, default=`"."`): Output directory
- `verbose` (`bool`, default=`False`): Print progress messages
- `n_feature_combination` (`int`, default=1): Number of features to permute together
- `output_fig_format` (`str`, default=`"jpg"`): Figure format
- `modelling_type` (`str`, optional): Model type (`"classification"`, `"regression"`, or `"clustering"`)
Methods:

- `get(verbose=False)`: Calculate feature importance values
- `boxplot()`: Generate and save boxplot
- `summary()`: Generate and save statistical summary CSV
- `run()`: Run complete analysis (boxplot + summary)
- `__str__()`: Return formatted string representation
Attributes:

- `features_importance`: Dictionary of mean importance scores
- `features_importance_instances`: Dictionary of importance value lists
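As an illustration of the less-used options, here is a hypothetical constructor call exercising several of the optional parameters documented above; the parameter names come from the list above, while the values and surrounding code are illustrative only.

```python
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

from FIDAP import FeatureImportanceAnalyzer

data = load_iris()
model = RandomForestClassifier(random_state=42).fit(data.data, data.target)

analyzer = FeatureImportanceAnalyzer(
    model,
    data.data,
    data.target,
    features=list(data.feature_names),  # custom feature names
    n_simulations=200,                  # more permutations give smoother estimates
    direc=Path("reports"),              # where figures and CSVs are written
    output_fig_format="png",
    modelling_type="classification",    # bypass auto-detection
)
analyzer.run()
```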
Classification example:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

from FIDAP import FeatureImportanceAnalyzer

data = load_iris()
X, y = data.data, data.target

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

analyzer = FeatureImportanceAnalyzer(
    model, X, y,
    n_simulations=50,
    verbose=True
)

importance = analyzer.get()
analyzer.boxplot()
analyzer.summary()
```

Regression example:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import load_diabetes

from FIDAP import FeatureImportanceAnalyzer

data = load_diabetes()
X, y = data.data, data.target

model = GradientBoostingRegressor(random_state=42)
model.fit(X, y)

analyzer = FeatureImportanceAnalyzer(
    model, X, y,
    n_simulations=100,
    metric_fn="r2"
)

analyzer.run()
print(analyzer)
```

Clustering example:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

from FIDAP import FeatureImportanceAnalyzer

X, _ = make_blobs(n_samples=300, n_features=4, centers=3, random_state=42)

model = KMeans(n_clusters=3, random_state=42, n_init="auto")
model.fit(X)

analyzer = FeatureImportanceAnalyzer(
    model, X,
    n_simulations=50
)
analyzer.run()
```

You can also analyze the importance of feature pairs:

```python
analyzer = FeatureImportanceAnalyzer(
    model, X, y,
    n_feature_combination=2,  # analyze pairs of features
    n_simulations=100
)
analyzer.run()
```

With `n_feature_combination=2`, each simulation permutes both features of a pair together, so the reported score reflects their joint contribution rather than two independent single-feature effects.

Custom metric functions can be passed via `metric_fn`:

```python
from sklearn.metrics import f1_score

analyzer = FeatureImportanceAnalyzer(
    model, X, y,
    metric_fn=f1_score,  # custom metric function
    n_simulations=100
)
```

Note that for multiclass targets `f1_score` requires an averaging mode; assuming FIDAP calls `metric_fn(y_true, y_pred)`, you can pass a wrapped metric such as `functools.partial(f1_score, average="macro")` instead.

Using FIDAP, you can sort features by importance. For example, the figure below shows a feature importance analysis for a Random Forest model on the Iris dataset. The most critical feature is "petal length", while "sepal width" is the least important.
Figure 2: Feature importance analysis for an RF model on the Iris dataset
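To obtain such a ranking programmatically, the `features_importance` attribute can be sorted directly. A short sketch, assuming (as in the sample output above) that it maps feature labels to mean scores:

```python
ranked = sorted(
    analyzer.features_importance.items(),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```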
Run tests using pytest:
```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=FIDAP --cov-report=html

# Run specific test file
pytest tests/test_feature_importance_analyzer.py
```

Code quality:

```bash
# Format code with black
black FIDAP tests

# Lint with ruff
ruff check FIDAP tests

# Type checking with mypy
mypy FIDAP
```

Project structure:

```
FIDAP/
├── FIDAP/                    # Main package
│   ├── __init__.py
│   ├── FeatureImportanceAnalyzer.py
│   ├── PixelImportanceAnalyzer.py
│   ├── check_X_Y_type_and_shape.py
│   ├── get_metric_fn.py
│   ├── get_type_of_modelling.py
│   ├── prepare_X_Y_features.py
│   ├── plot_box_and_save.py
│   ├── summarize.py
│   ├── get_string_report.py
│   └── get_importance/
│       ├── __init__.py
│       ├── get_features_importance.py
│       └── get_pixel_importance.py
├── tests/                    # Test suite
│   ├── conftest.py
│   ├── test_feature_importance_analyzer.py
│   └── test_helper_functions.py
├── pyproject.toml            # Project configuration
└── README.md
```
Requirements:

- Python >= 3.8
- NumPy >= 1.20.0
- Pandas >= 1.3.0
- SciPy >= 1.7.0
- scikit-learn >= 1.0.0
- Matplotlib >= 3.3.0
- Keras >= 2.8.0 (for DNN models)
- TensorFlow >= 2.8.0 (for DNN models)
- XGBoost >= 1.5.0 (for XGBoost models)
- CatBoost >= 1.0.0 (for CatBoost models)
If you use FIDAP in your research, please cite:
L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32, 2001.
© Vahid Asghari, Amin Baratian 2022. Licensed under the GNU General Public License v3.0 (GPLv3).
Contributions are welcome! Please feel free to submit a Pull Request.
If you encounter any issues or have questions, please open an issue on GitHub.
This project implements the permutation-based feature importance method described in Breiman's Random Forest paper, making it accessible as a standalone tool for any scikit-learn compatible model.

