Smart Datalyzer is an intelligent, automated toolkit for comprehensive data analysis, visualization, and reporting. It provides ML readiness scoring, advanced statistical diagnostics, and publication-quality visualizations with minimal effort.
- Smart Dataset Loading: Automatic detection of CSV/XLSX files with type inference
- Duplicate Detection: Identify and report duplicate rows
- Mixed Type Detection: Find columns with inconsistent data types
- Auto Type Conversion: Intelligent conversion of string columns to numeric
- Missing Value Analysis: Detection and imputation suggestions
- Constant Column Detection: Identify features with zero variance
- Scaling Issue Detection: Flag features with extreme value ranges
- Target Leakage Detection: Flag features that leak target information (those that predict the target with >95% accuracy)
- Class Imbalance Analysis: Compute imbalance ratios and distribution statistics
- Feature-Target Association: Statistical tests (ANOVA, Kruskal-Wallis, Chi-square)
- Sensitivity Analysis: Permutation importance for feature ranking
- Model Suggestion: Automatic recommendation (Regression vs Classification)
- Normality Testing: Shapiro-Wilk, D'Agostino, Kolmogorov-Smirnov tests with QQ plots
- Outlier Detection: Z-score based detection with percentage reporting
- Correlation Analysis: Pearson, Spearman, Kendall correlation matrices
- VIF Computation: Variance Inflation Factor for multicollinearity detection
- Mutual Information: Feature importance via mutual information scores
- Covariance Matrix: Full covariance analysis with CSV export
- High Correlation Flagging: Automatic detection of correlated pairs (>0.9)
- Distribution Plots: Histograms with KDE overlays
- Box Plots: Outlier visualization with quartile analysis
- Violin Plots: Distribution density visualization
- Swarm Plots: Individual data point overlay on boxplots
- QQ Plots: Quantile-quantile plots for normality assessment
- Correlation Heatmaps: Multiple correlation methods with annotations
- Feature Importance Charts: RandomForest-based importance ranking
- PCA Variance Plots: Principal component analysis visualization
- t-SNE Scatter Plots: 2D dimensionality reduction visualization
- Interactive HTML Reports: Comprehensive analysis with embedded visualizations
- JSON Export: Machine-readable summary statistics
- PDF Generation: Publication-ready reports (optional)
- Plot Export: High-resolution PNG plots (300 DPI)
- Caching System: Smart caching for faster re-analysis
- Automatic feature engineering recommendations
- ML readiness scoring (0-100)
- Actionable improvement suggestions
- Complete pipeline execution with single flag
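
Several of the checks listed above can be approximated with plain pandas. The following is a minimal sketch, not Smart Datalyzer's actual implementation: the function names `constant_columns`, `zscore_outlier_pct`, and `high_correlation_pairs` are hypothetical, and the z-score threshold of 3.0 is an assumed default. Only the 0.9 correlation cutoff comes from the feature list.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list[str]:
    """Columns with at most one distinct non-missing value (zero variance)."""
    return [col for col in df.columns if df[col].nunique(dropna=True) <= 1]


def zscore_outlier_pct(df: pd.DataFrame, threshold: float = 3.0) -> dict[str, float]:
    """Percentage of values per numeric column whose |z-score| exceeds threshold.

    The threshold of 3.0 is an assumption, not a documented default.
    """
    result = {}
    for col in df.select_dtypes(include="number").columns:
        values = df[col].dropna()
        std = values.std(ddof=0)
        if len(values) == 0 or std == 0:
            result[col] = 0.0  # nothing to flag in empty or constant columns
            continue
        z = (values - values.mean()) / std
        result[col] = float((z.abs() > threshold).mean() * 100)
    return result


def high_correlation_pairs(df: pd.DataFrame, cutoff: float = 0.9) -> list[tuple[str, str, float]]:
    """Numeric column pairs whose absolute Pearson correlation exceeds cutoff."""
    corr = df.select_dtypes(include="number").corr(method="pearson")
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            r = corr.iloc[i, j]
            if pd.notna(r) and abs(r) > cutoff:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs
```

Each function takes a DataFrame and returns a plain Python structure, so the results can be written straight into a JSON summary.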
# Clone the repository
git clone https://github.com/mehmoodulhaq570/smart-datalyzer.git
cd smart-datalyzer
# Install build tools
pip install build
# Build the package
python -m build
# Install
pip install dist/smart_datalyzer-0.1.1-py3-none-any.whl
# Or install in editable (development) mode
pip install -e .

python -m smart-datalyzer data.xlsx "target_column"
Or using the installed command:
smart-datalyzer data.xlsx "target_column"
# Multiple targets
python -m smart-datalyzer data.csv "target1" "target2" "target3"
# General syntax
python -m smart-datalyzer <file> <target> [OPTIONS]
# or
smart-datalyzer <file> <target> [OPTIONS]
Arguments:
  file              Path to dataset (CSV or XLSX)
  target            Target column name(s) - space separated for multiple
Options:
  --stats           Run detailed statistical analysis
  --outliers        Detect and report outliers
  --leakage         Detect target leakage features
  --imbalance       Check class imbalance
  --plots           Generate all visualization plots
  --report          Generate interactive HTML/JSON report
  --auto            Run full automatic analysis (recommended)
  --max_rows N      Limit rows to read (default: 100000)
  --output_dir DIR  Output directory (default: "reports")

Quick Analysis:
python -m smart-datalyzer sales.xlsx "Revenue" --auto
Detailed Statistical Report:
smart-datalyzer customers.csv "Churn" --stats --plots --report
Multiple Targets with Custom Output:
smart-datalyzer experiment.xlsx "Outcome1" "Outcome2" --auto --output_dir results
Outlier & Leakage Detection:
python -m smart-datalyzer medical.csv "Disease" --outliers --leakage

reports/
├── plots/
│   ├── *_distribution.png       # Distribution histograms
│   ├── *_boxplot.png            # Box plots
│   ├── *_violinplot.png         # Violin plots
│   ├── *_swarmplot.png          # Swarm plots
│   ├── *_qqplot.png             # QQ plots
│   ├── correlation_*.png        # Correlation heatmaps
│   ├── feature_importance.png   # Feature importance chart
│   ├── pca_variance.png         # PCA analysis
│   └── tsne_scatter.png         # t-SNE visualization
├── report.html                  # Interactive HTML report
├── summary.json                 # JSON summary statistics
├── covariance_matrix.csv        # Covariance matrix
└── .cache/                      # Analysis cache

from datalyzer.utils import load_dataset
from datalyzer.stats import feature_statistics, detect_outliers
from datalyzer.plots import plot_distributions, plot_correlation
# Load data
df = load_dataset("data.csv")
# Get statistics
stats, readiness, suggestions = feature_statistics(df)
print(f"ML Readiness Score: {readiness}/100")
# Detect outliers
outliers = detect_outliers(df, df.select_dtypes(include=['float64', 'int64']).columns)
# Generate plots
plot_paths = plot_distributions(df, plots_dir="reports/plots")
correlation_paths = plot_correlation(df, plots_dir="reports/plots")

- pandas: Data manipulation
- numpy: Numerical computing
- scipy: Statistical functions
- statsmodels: Advanced statistics
- scikit-learn: Machine learning utilities
- matplotlib: Plotting backend
- seaborn: Statistical visualizations
- rich: Terminal formatting
See requirements.txt for the complete list.
Smart Datalyzer computes an ML readiness score (0-100) based on:
- Missing value percentage
- Constant features
- Numeric vs categorical balance
- Duplicate rows
- Data quality issues
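
The README does not document how these factors are weighted. As an illustration only, a penalty-based score could start at 100 and subtract a capped amount per issue. The function name `readiness_score` and every weight and cap below are assumptions made for this sketch, not the values Smart Datalyzer uses, and the general "data quality issues" factor is left out because its definition is not given here.

```python
import pandas as pd


def readiness_score(df: pd.DataFrame) -> float:
    """Illustrative 0-100 score; all penalty weights and caps are assumptions."""
    if df.empty:
        return 0.0

    missing_pct = df.isna().mean().mean() * 100            # % of missing cells overall
    duplicate_pct = df.duplicated().mean() * 100           # % of rows repeating an earlier row
    constant_pct = (df.nunique(dropna=True) <= 1).mean() * 100  # % of constant columns
    numeric_share = df.select_dtypes(include="number").shape[1] / df.shape[1]

    score = 100.0
    score -= min(missing_pct, 40.0)           # cap each penalty so one issue cannot dominate
    score -= min(duplicate_pct * 0.5, 20.0)
    score -= min(constant_pct * 0.5, 20.0)
    score -= (1.0 - numeric_share) * 10.0     # mild penalty for columns that still need encoding
    return float(max(0.0, min(100.0, score)))
```

Capping each penalty keeps the score interpretable: a dataset with one severe problem still scores above zero, while several moderate problems add up.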
Smart Datalyzer automatically caches analysis results using SHA-256 hashing for:
- Faster re-analysis of same datasets
- Incremental updates
- Reduced computation time
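
One common way to build such a cache is to hash the dataset file and store results under that hash, so an unchanged file is never re-analyzed. The sketch below assumes that approach. The helper names `file_sha256` and `cached_analysis`, the JSON storage format, and the one-file-per-hash layout are hypothetical; the README only states that SHA256 hashing is used and that a `.cache/` directory exists in the output folder.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_analysis(path, analyze, cache_dir=Path("reports/.cache")):
    """Return cached results for an unchanged file, otherwise run analyze(path).

    `analyze` is any callable returning a JSON-serializable result (an assumption
    of this sketch).
    """
    path = Path(path)
    cache_dir = Path(cache_dir)
    cache_file = cache_dir / f"{file_sha256(path)}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = analyze(path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Because the key is derived from the file contents rather than its name or timestamp, renaming a dataset still hits the cache, while any edit to the data produces a new key.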
Smart Datalyzer automatically detects the following and suggests fixes:
- Numeric columns stored as strings
- Categorical features with high cardinality
- Date/time columns
- Mixed-type columns
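
The four checks above can be sketched with pandas as follows. This is an illustration under stated assumptions rather than the tool's own logic: the function name `suggest_type_fixes` is hypothetical, the high-cardinality cutoff (more than half of the values unique) is an assumed heuristic, and date detection is limited to ISO `YYYY-MM-DD` strings to keep the parsing unambiguous.

```python
import pandas as pd


def suggest_type_fixes(df: pd.DataFrame, cardinality_ratio: float = 0.5) -> dict[str, str]:
    """Map column name to a suggested fix; columns with no issue are omitted."""
    suggestions = {}
    for col in df.columns:
        series = df[col].dropna()
        if series.empty:
            continue

        # Mixed-type columns: more than one Python type among the values.
        kinds = {type(value).__name__ for value in series}
        if len(kinds) > 1:
            suggestions[col] = "mixed types: " + ", ".join(sorted(kinds))
            continue
        if kinds != {"str"}:
            continue  # already numeric, boolean, datetime, etc.

        # Numeric columns stored as strings: every value parses as a number.
        if pd.to_numeric(series, errors="coerce").notna().all():
            suggestions[col] = "numeric stored as string"
        # Date columns: every value matches the assumed ISO date format.
        elif pd.to_datetime(series, format="%Y-%m-%d", errors="coerce").notna().all():
            suggestions[col] = "date/time stored as string"
        # High cardinality: a large share of distinct values (assumed cutoff).
        elif series.nunique() / len(series) > cardinality_ratio:
            suggestions[col] = "high-cardinality categorical"
    return suggestions
```

The checks run from most to least specific, so a column of numeric strings is reported once as numeric rather than also being flagged as high cardinality.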
Mehmood Ul Haq
Email: mehmoodulhaq1040@gmail.com
GitHub: @mehmoodulhaq570
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please read CODE_OF_CONDUCT.md first.
For security issues, please see SECURITY.md.
- Fixed swarm plot performance issues with large datasets (added sampling limit of 2000 points)
- Fixed filename sanitization for plots with special characters
- Improved visualization generation speed
- Skipped the class imbalance check for targets with more than 10 unique values
- Initial release
- Multiple target column support
- Comprehensive statistical analysis
- Advanced visualization suite
- Smart auto-analysis mode
- Caching system
- Interactive HTML reports
Built with the modern Python data science stack and best practices for automated data analysis.