
An app that analyzes dataset quality across 7 dimensions (completeness, consistency, uniqueness, outliers, bias, temporal coverage, and cardinality) to identify data issues before analysis or model training. Context-aware scoring adapts thresholds based on use case, with interactive dashboards and actionable recommendations for data improvement.


kaverikb/data-quality-system


Data Quality Analysis System

A comprehensive Streamlit application for analyzing dataset quality across multiple dimensions.

Features

  • Multi-Dimensional Analysis

    • Completeness / Missingness
    • Consistency / Validity
    • Uniqueness / Duplication
    • Outlier Patterns / Anomalies
    • Bias / Class Imbalance
    • Temporal Coverage / Stability
    • Cardinality / Feature Sparsity
  • Context-Aware Scoring

    • Historical/Analytical
    • Real-time/Streaming
    • Customer/Marketing
    • Finance/Risk
    • Custom/Other
  • Smart Detection

    • Auto-detect task type (Classification, Regression, Clustering)
    • Auto-detect column types (numeric, categorical, datetime, text)
    • Auto-suggest target columns
  • Interactive Dashboards

    • Dimension-specific breakdowns with detailed visualizations
    • Per-dimension descriptions explaining what each metric means
    • Actionable recommendations based on findings
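
The README does not describe how column types are auto-detected, so the following is only a minimal sketch of one plausible approach in plain Python. The helper name `infer_column_type`, the accepted date formats, and the 0.5 distinct-value cutoff that separates categorical from text are all assumptions made for illustration; the app's `auto_detector.py` may work differently.

```python
from datetime import datetime

# Assumed date formats; a real detector would likely accept more.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y-%m-%d %H:%M:%S")


def _is_number(text):
    """True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def _is_datetime(text):
    """True if the string matches one of the assumed date formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


def infer_column_type(values, categorical_ratio=0.5):
    """Classify raw string values as numeric, datetime, categorical, or text.

    Hypothetical helper: blank and None entries are ignored, and a column
    counts as categorical when its share of distinct values is at or below
    `categorical_ratio` (an illustrative cutoff).
    """
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not present:
        return "text"
    if all(_is_number(v) for v in present):
        return "numeric"
    if all(_is_datetime(v) for v in present):
        return "datetime"
    distinct_ratio = len(set(present)) / len(present)
    return "categorical" if distinct_ratio <= categorical_ratio else "text"
```

Checking numeric before datetime means a column such as `20240101` is treated as numeric; a stricter detector might inspect column names or value ranges as well.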

Installation

  1. Clone the repository
  2. Create a virtual environment:
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
   pip install -r requirements.txt

Usage

Run the application:

streamlit run app.py

Then open your browser to http://localhost:8501

Workflow

  1. Upload: Select a CSV or Excel file (max 150MB)
  2. Configure: Choose use case and review auto-detected task type
  3. Review: View all dimension scores in one place
  4. Explore: Click "Explore Detailed Breakdown" to dive into each dimension
  5. Act: Get actionable recommendations based on findings

Dimensions Explained

  • Completeness: Percentage of non-missing values. High = minimal gaps and NaN values.
  • Consistency: Data type validity and format correctness. High = values match expected types.
  • Uniqueness: Duplicate detection and data redundancy. High = minimal duplicates.
  • Outliers: Anomalous value detection using statistical methods. High = few extreme values.
  • Bias: Class distribution and protected attribute representation. High = fair segment representation.
  • Temporal: Time coverage and data freshness. High = good date coverage and recent data.
  • Cardinality: Feature diversity and sparsity. High = appropriate feature variation.
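
To make the definitions above concrete, here is a minimal sketch of how three of the dimensions (completeness, uniqueness, outliers) could be turned into 0-100 scores. The README does not give the exact formulas, so the function names, the set of missing-value tokens, and the use of Tukey's IQR fences as the "statistical method" for outliers are assumptions, not the app's actual scorers.

```python
import statistics

# Assumed tokens treated as missing; the app's configured patterns may differ.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none"}


def is_missing(value):
    """True for None and for strings that match an assumed missing token."""
    return value is None or str(value).strip().lower() in MISSING_TOKENS


def completeness_score(rows, columns):
    """Percentage of cells that hold a non-missing value."""
    total = len(rows) * len(columns)
    if total == 0:
        return 0.0
    filled = sum(
        1 for row in rows for col in columns if not is_missing(row.get(col))
    )
    return 100.0 * filled / total


def uniqueness_score(rows, columns):
    """Percentage of rows that are distinct across the given columns."""
    if not rows:
        return 0.0
    distinct = {tuple(row.get(col) for col in columns) for row in rows}
    return 100.0 * len(distinct) / len(rows)


def outlier_score(values, k=1.5):
    """Percentage of values inside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    if len(values) < 4:
        return 100.0  # too few points to estimate quartiles meaningfully
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inside = sum(1 for v in values if low <= v <= high)
    return 100.0 * inside / len(values)


rows = [
    {"id": 1, "city": "Pune"},
    {"id": 2, "city": None},
    {"id": 1, "city": "Pune"},  # exact duplicate of the first row
    {"id": 3, "city": "NA"},    # "NA" counts as missing
]
print(completeness_score(rows, ["id", "city"]))  # 6 of 8 cells filled
print(uniqueness_score(rows, ["id", "city"]))    # 3 of 4 rows distinct
```

In each case a higher score means fewer problems, which matches the "High = ..." reading used in the list above.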

File Structure

data-quality-system/
├── app.py
├── config.py
├── requirements.txt
├── pages/
│   ├── 1_home.py
│   ├── 2_overview.py
│   ├── 3_dashboard.py
│   └── 4_recommendations.py
├── modules/
│   ├── file_handler.py
│   ├── auto_detector.py
│   ├── aggregator.py
│   ├── recommendations.py
│   └── scorers/
│       ├── completeness.py
│       ├── consistency.py
│       ├── uniqueness.py
│       ├── outliers.py
│       ├── bias.py
│       ├── temporal.py
│       └── cardinality.py
├── ui/
│   ├── components.py
│   ├── visualizations.py
│   └── __init__.py
└── utils/
    ├── helpers.py
    ├── constants.py
    └── __init__.py

Configuration

Edit config.py to customize:

  • Quality thresholds per use case
  • Dimension weights for scoring
  • Missing value patterns
  • File size limits
  • Color schemes
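
As a rough illustration of what per-use-case thresholds might look like, the sketch below defines a hypothetical settings layout and a small rating helper. The variable names (`QUALITY_THRESHOLDS`, `MISSING_VALUE_PATTERNS`), the `rate` function, and every numeric threshold are placeholders invented for this example; only the use-case labels and the 150MB limit come from this README. Check config.py for the real names and values.

```python
# Hypothetical layout; names and numbers are illustrative placeholders.
MAX_FILE_SIZE_MB = 150  # matches the documented upload limit

MISSING_VALUE_PATTERNS = ["", "NA", "N/A", "null", "None", "-"]

# Minimum score (0-100) a dimension needs to be rated "good" or "fair".
QUALITY_THRESHOLDS = {
    "Historical/Analytical": {"good": 90, "fair": 70},
    "Real-time/Streaming": {"good": 85, "fair": 65},
    "Customer/Marketing": {"good": 90, "fair": 75},
    "Finance/Risk": {"good": 98, "fair": 90},
    "Custom/Other": {"good": 90, "fair": 70},
}


def rate(score, use_case):
    """Map a dimension score to a label using the use case's thresholds.

    Unknown use cases fall back to the "Custom/Other" thresholds.
    """
    thresholds = QUALITY_THRESHOLDS.get(use_case, QUALITY_THRESHOLDS["Custom/Other"])
    if score >= thresholds["good"]:
        return "good"
    if score >= thresholds["fair"]:
        return "fair"
    return "poor"


print(rate(92, "Historical/Analytical"))  # good
print(rate(92, "Finance/Risk"))           # fair under the stricter placeholder
```

The point of the design is that the same raw score can be acceptable for one use case and insufficient for another, which is what context-aware scoring means here.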

Supported Formats

  • CSV (.csv)
  • Excel (.xlsx, .xls)

Limits

  • Max File Size: 150MB
  • Recommended: <100MB for optimal performance
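
The format and size rules above could be enforced before any parsing happens. The following is a minimal sketch under stated assumptions: `check_upload` is a hypothetical helper, not a function from `file_handler.py`, and treating 1 MB as 1024 * 1024 bytes is a guess since the README does not say which convention the app uses.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".csv", ".xlsx", ".xls"}
MAX_FILE_SIZE_MB = 150
RECOMMENDED_FILE_SIZE_MB = 100
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def check_upload(filename, size_bytes):
    """Return (accepted, message) for a candidate upload.

    Hypothetical helper: rejects unsupported extensions and files over the
    hard limit, and warns when the file is at or above the recommended size.
    """
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        return False, f"Unsupported format '{suffix}'; expected CSV or Excel."
    size_mb = size_bytes / BYTES_PER_MB
    if size_mb > MAX_FILE_SIZE_MB:
        return False, f"File is {size_mb:.0f} MB; the limit is {MAX_FILE_SIZE_MB} MB."
    if size_mb >= RECOMMENDED_FILE_SIZE_MB:
        return True, "File accepted, but analysis may be slow at this size."
    return True, "File accepted."


print(check_upload("sales.CSV", 5 * BYTES_PER_MB))
print(check_upload("events.json", 1024))
```

The extension check is case-insensitive, so `sales.CSV` and `sales.csv` are handled the same way.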

Notes

  • No overall quality score: the app reports each dimension separately rather than combining them into one number
  • Each dimension is scored independently (0-100%)
  • Dimension descriptions provided in dashboard for context
  • Recommendations adjust based on use case and task type
