A comprehensive Streamlit application for analyzing dataset quality across multiple dimensions.
### Multi-Dimensional Analysis
- Completeness / Missingness
- Consistency / Validity
- Uniqueness / Duplicacy
- Outlier Patterns / Anomalies
- Bias / Class Imbalance
- Temporal Coverage / Stability
- Cardinality / Feature Sparsity
### Context-Aware Scoring
- Historical/Analytical
- Real-time/Streaming
- Customer/Marketing
- Finance/Risk
- Custom/Other
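
The selected use case shifts how much emphasis each dimension receives. A minimal sketch of how such context-aware weighting could be expressed (the `USE_CASE_WEIGHTS` name and the numbers below are illustrative assumptions, not the shipped `config.py` values):

```python
# Hypothetical per-use-case emphasis on each dimension (illustrative values, each row sums to 1.0).
USE_CASE_WEIGHTS = {
    "finance_risk": {
        "completeness": 0.25, "consistency": 0.25, "uniqueness": 0.15,
        "outliers": 0.15, "bias": 0.10, "temporal": 0.05, "cardinality": 0.05,
    },
    "realtime_streaming": {
        "completeness": 0.20, "consistency": 0.20, "uniqueness": 0.10,
        "outliers": 0.10, "bias": 0.05, "temporal": 0.25, "cardinality": 0.10,
    },
}
```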
### Smart Detection
- Auto-detect task type (Classification, Regression, Clustering)
- Auto-detect column types (numeric, categorical, datetime, text)
- Auto-suggest target columns
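
A rough sketch of the kind of heuristics this detection typically relies on (function names and thresholds are illustrative assumptions, not the actual `auto_detector.py` API):

```python
from typing import Optional

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

def detect_column_types(df: pd.DataFrame, cat_threshold: int = 20) -> dict:
    """Classify each column as datetime, numeric, categorical, or text."""
    types = {}
    for col in df.columns:
        s = df[col]
        if is_datetime64_any_dtype(s):
            types[col] = "datetime"
        elif is_numeric_dtype(s):
            types[col] = "numeric"
        elif s.nunique(dropna=True) <= cat_threshold:   # few distinct values -> categorical
            types[col] = "categorical"
        else:
            types[col] = "text"
    return types

def suggest_task_type(df: pd.DataFrame, target: Optional[str]) -> str:
    """No target -> clustering; continuous numeric target -> regression; otherwise classification."""
    if target is None:
        return "clustering"
    s = df[target]
    if is_numeric_dtype(s) and s.nunique(dropna=True) > 20:
        return "regression"
    return "classification"
```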
### Interactive Dashboards
- Dimension-specific breakdowns with detailed visualizations
- Per-dimension descriptions explaining what each metric means
- Actionable recommendations based on findings
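
As an illustration only, a per-dimension breakdown panel in Streamlit might be assembled roughly like this (function and parameter names are hypothetical, not the dashboard's actual code):

```python
import streamlit as st

# Illustrative sketch of one dimension's breakdown panel (not the actual dashboard code).
def render_dimension(name, score, description, findings):
    with st.expander(f"{name}: {score:.0f}%"):
        st.metric(label=name, value=f"{score:.0f}%")   # headline score for this dimension
        st.caption(description)                        # what the metric means
        for finding in findings:                       # actionable items surfaced by the scorer
            st.write(f"- {finding}")
```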
- Clone the repository
- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the application:

  ```bash
  streamlit run app.py
  ```

  Then open your browser to http://localhost:8501
- Upload: Select CSV/Excel file (max 150MB)
- Configure: Choose use case and review auto-detected task type
- Review: View all dimension scores in one place
- Explore: Click "Explore Detailed Breakdown" to dive into each dimension
- Act: Get actionable recommendations based on findings
- Completeness: Percentage of non-missing values. High = minimal gaps and NaN values.
- Consistency: Data type validity and format correctness. High = values match expected types.
- Uniqueness: Duplicate detection and data redundancy. High = minimal duplicates.
- Outliers: Anomalous value detection using statistical methods. High = few extreme values.
- Bias: Class distribution and protected attribute representation. High = fair segment representation.
- Temporal: Time coverage and data freshness. High = good date coverage and recent data.
- Cardinality: Feature diversity and sparsity. High = appropriate feature variation.
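
As a concrete example of what a dimension score means, a minimal completeness calculation could look like this (a sketch, not the actual `scorers/completeness.py` implementation):

```python
import pandas as pd

def completeness_score(df: pd.DataFrame) -> float:
    """Share of cells that are non-missing, expressed as 0-100."""
    if df.size == 0:
        return 0.0
    non_missing = df.size - int(df.isna().sum().sum())
    return 100.0 * non_missing / df.size

df = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", None]})
print(round(completeness_score(df), 1))  # 66.7 -- 4 of 6 cells are populated
```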
```
data-quality-system/
├── app.py
├── config.py
├── requirements.txt
├── pages/
│   ├── 1_home.py
│   ├── 2_overview.py
│   ├── 3_dashboard.py
│   └── 4_recommendations.py
├── modules/
│   ├── file_handler.py
│   ├── auto_detector.py
│   ├── aggregator.py
│   ├── recommendations.py
│   └── scorers/
│       ├── completeness.py
│       ├── consistency.py
│       ├── uniqueness.py
│       ├── outliers.py
│       ├── bias.py
│       ├── temporal.py
│       └── cardinality.py
├── ui/
│   ├── components.py
│   ├── visualizations.py
│   └── __init__.py
└── utils/
    ├── helpers.py
    ├── constants.py
    └── __init__.py
```
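
Each file under `modules/scorers/` handles one dimension, and `aggregator.py` collects their results for the dashboard pages. A hypothetical sketch of that contract (names are illustrative assumptions, not the actual module API):

```python
from dataclasses import dataclass, field

# Hypothetical contract between scorers and the aggregator (illustrative names).
@dataclass
class DimensionResult:
    name: str                 # e.g. "Completeness"
    score: float              # 0-100, independent of other dimensions
    findings: list = field(default_factory=list)   # human-readable issues

def aggregate(df, scorers):
    """Run each scorer on the dataframe and key the results by dimension name."""
    results = [scorer(df) for scorer in scorers]
    return {r.name: r for r in results}
```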
Edit config.py to customize:
- Quality thresholds per use case
- Dimension weights for scoring
- Missing value patterns
- File size limits
- Color schemes
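
For example, entries of roughly this shape could live in config.py (keys and values below are illustrative, not the shipped defaults):

```python
# Illustrative config.py entries -- adjust names and values to match the real file.
MAX_FILE_SIZE_MB = 150                                    # upload limit
MISSING_VALUE_PATTERNS = ["", "NA", "N/A", "null", "?"]   # strings treated as missing on load

QUALITY_THRESHOLDS = {                                    # per-use-case warning cutoffs (percent)
    "finance_risk": {"completeness": 98, "consistency": 95},
    "customer_marketing": {"completeness": 90, "consistency": 85},
}

COLOR_SCHEME = {"good": "#2ecc71", "warn": "#f1c40f", "bad": "#e74c3c"}
```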
- CSV (.csv)
- Excel (.xlsx, .xls)
- Max File Size: 150MB
- Recommended: <100MB for optimal performance
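
A minimal sketch of how an upload could be validated against these limits (the function and constant names are assumptions, not the actual `file_handler.py` API):

```python
import pandas as pd

MAX_FILE_SIZE_MB = 150
ALLOWED_EXTENSIONS = (".csv", ".xlsx", ".xls")

def load_upload(uploaded_file) -> pd.DataFrame:
    """Validate a Streamlit file upload and return it as a DataFrame."""
    name = uploaded_file.name.lower()
    if not name.endswith(ALLOWED_EXTENSIONS):
        raise ValueError("Unsupported file type; upload CSV or Excel.")
    if uploaded_file.size > MAX_FILE_SIZE_MB * 1024 * 1024:
        raise ValueError(f"File exceeds the {MAX_FILE_SIZE_MB}MB limit.")
    return pd.read_csv(uploaded_file) if name.endswith(".csv") else pd.read_excel(uploaded_file)
```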
- No overall quality score - focuses on individual dimension analysis
- Each dimension is scored independently (0-100%)
- Dimension descriptions provided in dashboard for context
- Recommendations adjust based on use case and task type