A comprehensive, AI-powered data validation and quality assessment application built with Streamlit and Google Gemini 2.5 Flash. This tool helps data scientists and analysts quickly identify data quality issues, generate validation reports, and receive actionable recommendations for data improvement.
- Multi-format Data Support: Load CSV, JSON, and Excel files seamlessly
- Comprehensive Data Validation: Five validation categories covering data types, ranges, duplicates, missing values, and formats
- AI-Powered Insights: Intelligent summaries and recommendations using Google Gemini 2.5 Flash
- Interactive Dashboard: Modern, responsive Streamlit interface
- Professional PDF Reports: Export detailed validation reports
- 📊 Data Type Validation: Verify data types match expected schemas
- 📏 Range Validation: Check numeric values against min/max constraints
- 🔄 Duplicate Detection: Identify duplicate values in unique constraint columns
- ❓ Missing Value Analysis: Comprehensive null value assessment
- 🔤 Format Validation: Regex pattern matching for data formats
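The five checks listed above can be sketched in plain Python. This is a minimal illustration, not the application's validation engine: the sample rows, column names, limits, and the email pattern are all assumptions made for the example, and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample rows; column names and rules are illustrative only.
rows = [
    {"id": 1, "age": 34, "email": "ann@example.com"},
    {"id": 2, "age": 151, "email": "bob@example"},
    {"id": 2, "age": None, "email": "cy@example.com"},
]

def check_types(rows, column, expected_type):
    """Return indexes of non-null values that are not of the expected type."""
    return [i for i, r in enumerate(rows)
            if r[column] is not None and not isinstance(r[column], expected_type)]

def check_range(rows, column, low, high):
    """Return indexes of non-null values outside the inclusive [low, high] range."""
    return [i for i, r in enumerate(rows)
            if r[column] is not None and not (low <= r[column] <= high)]

def check_duplicates(rows, column):
    """Return values that appear more than once in a column meant to be unique."""
    counts = Counter(r[column] for r in rows if r[column] is not None)
    return sorted(v for v, n in counts.items() if n > 1)

def check_missing(rows, column):
    """Return indexes of rows where the column is None or an empty string."""
    return [i for i, r in enumerate(rows) if r[column] in (None, "")]

def check_format(rows, column, pattern):
    """Return indexes of non-null values that do not fully match the regex."""
    regex = re.compile(pattern)
    return [i for i, r in enumerate(rows)
            if r[column] is not None and not regex.fullmatch(str(r[column]))]

# Each entry lists the offending row indexes (or duplicated values).
report = {
    "data_type": check_types(rows, "age", int),
    "range": check_range(rows, "age", 0, 120),
    "duplicates": check_duplicates(rows, "id"),
    "missing_values": check_missing(rows, "age"),
    "format": check_format(rows, "email", r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}
print(report)
```

An empty list means the check passed for that column. Returning row indexes rather than a bare pass/fail flag keeps enough detail for a column-wise report.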
- Real-time quality scoring with visual metrics
- Column-wise detailed analysis with interactive charts
- Executive summary with AI-generated insights
- Data improvement recommendations
- Professional PDF report generation with charts and insights
pip install streamlit pandas matplotlib google-generativeai python-dotenv IPython reportlab plotly numpy markdown2 graphviz
- Google Gemini API Key: Required for AI-powered insights and recommendations
- Clone the repository
  git clone https://github.com/rebel47/data-validation.git
  cd data-validation
- Install dependencies
  pip install streamlit pandas matplotlib google-generativeai python-dotenv IPython reportlab plotly numpy markdown2 graphviz
- Set up environment variables
  Create a .env file in the project root (adjust path as needed):
  # Create secrets.env or .env file
  GEMINI_API_KEY=your_gemini_api_key_here
- Run the application
  streamlit run app.py
Create a secrets.env file (or adjust the path in the code) with:
  GEMINI_API_KEY=your_google_gemini_api_key
- Visit Google AI Studio
- Create an account or sign in
- Generate a new API key
- Add the key to your environment file
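As a rough sketch of how the key might be picked up at startup: the dependency list includes python-dotenv, so the application presumably loads the file with that package. The version below uses only the standard library so it runs anywhere. The helper names `parse_env_text` and `get_gemini_key` are hypothetical and not taken from app.py.

```python
import os

def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def get_gemini_key(env_path="secrets.env"):
    """Prefer an exported variable, then fall back to the env file."""
    key = os.environ.get("GEMINI_API_KEY")
    if not key and os.path.exists(env_path):
        with open(env_path, encoding="utf-8") as handle:
            key = parse_env_text(handle.read()).get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("Google Gemini API key not found")
    return key
```

Checking the process environment first means a key exported in the shell or set by a deployment platform takes precedence over the file on disk.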
- Launch the application using streamlit run app.py
- Upload your dataset (CSV, JSON, or Excel format)
- Review the data preview and basic statistics
- Navigate through validation tabs to explore different quality checks
- Generate and download comprehensive PDF reports
- Data Preview: Quick overview of your dataset structure
- Assessment Summary: AI-powered overview with quality scoring
- Data Type Test: Schema validation and type checking
- Range Test: Numeric boundary validation
- Duplicates Test: Uniqueness constraint verification
- Missing Values Test: Null value analysis
- Format Test: Pattern and format validation
- Click "Download PDF Report" to generate a comprehensive validation report
- Reports include executive summaries, detailed findings, and AI-generated recommendations
- All charts and metrics are embedded in the PDF
data-validation/
├── app.py # Main Streamlit application
├── auth.py # Authentication module (if applicable)
├── secrets.env # Environment variables (create this)
├── requirements.txt # Python dependencies (optional)
└── README.md # This file
- Data Scientists: Validate datasets before model training
- Data Engineers: Quality checks in ETL pipelines
- Business Analysts: Ensure data integrity for reporting
- Data Stewards: Comprehensive data governance and quality monitoring
- Teams: Generate stakeholder-ready validation reports
The application leverages Google Gemini 2.5 Flash to provide:
- Executive Summaries: High-level data quality assessments
- Detailed Insights: Column-specific analysis and recommendations
- Improvement Recommendations: Actionable steps to enhance data quality
- Business Impact Analysis: Understanding the implications of data issues
The tool provides detailed validation results including:
- ✅ Pass/Fail Status: Clear indicators for each validation check
- 📊 Quality Scores: Percentage-based quality metrics
- 📈 Visual Charts: Interactive pie charts and progress indicators
- 📝 Detailed Reports: Comprehensive analysis with AI insights
- 🎯 Actionable Recommendations: Specific steps to improve data quality
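To make the pass/fail status and percentage-based score concrete, here is one simple way such a score could be computed. The formula (passed checks divided by total checks) is an assumption for illustration; the application's actual scoring may weight checks differently.

```python
def quality_score(results):
    """Percentage of checks that passed; results maps check name -> bool."""
    if not results:
        return 0.0
    passed = sum(1 for ok in results.values() if ok)
    return round(100.0 * passed / len(results), 1)

# Hypothetical outcome of the five validation categories for one dataset.
results = {
    "data_type": True,
    "range": False,
    "duplicates": False,
    "missing_values": False,
    "format": True,
}

score = quality_score(results)
status = {name: "PASS" if ok else "FAIL" for name, ok in results.items()}
print(f"Quality score: {score}%")
print(status)
```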
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Data Models: Custom classes replacing Pydantic models
- Validation Engine: Comprehensive rule-based validation system
- AI Integration: Google Gemini API for intelligent insights
- Report Generation: ReportLab-based PDF creation
- UI Components: Modern Streamlit interface with custom CSS
API Key Error
Error: Google Gemini API key not found
Solution: Ensure GEMINI_API_KEY is set in your environment file
File Upload Error
Error: Unsupported file format
Solution: Use CSV, JSON, or Excel (.xlsx, .xls) files only
Memory Issues with Large Files
Solution: Consider chunking large datasets or increasing system memory
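One way to chunk a large CSV is to read a fixed number of rows at a time and accumulate the statistics, so the whole file never sits in memory. The sketch below uses the standard library; the function names and the sample data are hypothetical. With pandas, the `chunksize` parameter of `read_csv` gives a similar iterator of DataFrames.

```python
import csv
import io
from collections import Counter
from itertools import islice

def iter_chunks(reader, size):
    """Yield lists of at most `size` rows until the reader is exhausted."""
    while True:
        chunk = list(islice(reader, size))
        if not chunk:
            return
        yield chunk

def count_missing(handle, chunk_size=1000):
    """Tally empty cells per column while holding only one chunk in memory."""
    missing = Counter()
    rows_seen = 0
    reader = csv.DictReader(handle)
    for chunk in iter_chunks(reader, chunk_size):
        rows_seen += len(chunk)
        for row in chunk:
            for column, value in row.items():
                if column is None:
                    continue  # extra cells beyond the header are ignored
                if value is None or value.strip() == "":
                    missing[column] += 1
    return rows_seen, dict(missing)

# Small in-memory sample standing in for a large file on disk.
sample = io.StringIO(
    "id,age,email\n"
    "1,34,ann@example.com\n"
    "2,,bob@example.com\n"
    "3,29,\n"
)
rows_seen, missing = count_missing(sample, chunk_size=2)
print(rows_seen, missing)
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the `StringIO` sample.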
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: This README and code comments
This project is licensed under the MIT License - see the LICENSE file for details.
- Streamlit: For the amazing web app framework
- Google Gemini: For powerful AI capabilities
- ReportLab: For PDF generation capabilities
- Plotly: For interactive visualizations
- Pandas: For data manipulation and analysis
Built with ❤️ by rebel47
If you find this project helpful, please consider giving it a ⭐ star on GitHub!