
πŸš€ DataSpark - AI-Powered Data Preprocessing Platform

DataSpark is an AI-powered platform that automatically analyzes, cleans, and prepares CSV and image datasets for machine learning and analytics.



✨ Features

🎯 Core Capabilities

  • Smart CSV Analysis: AI-powered column analysis with quality scoring (0-100)
  • Image Dataset Processing: Batch processing of image archives with quality detection
  • Intelligent Recommendations: Context-aware preprocessing suggestions with confidence scores
  • Background Processing: Non-blocking job queue for large datasets
  • JWT Authentication: Secure token-based authentication
  • Analytics Dashboard: Usage statistics and performance metrics

πŸ“Š CSV Processing

  • Quality scoring and duplicate detection
  • 15+ preprocessing actions (imputation, scaling, encoding, outlier handling)
  • Feature importance estimation
  • Class imbalance detection
  • Automatic pipeline generation
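To illustrate the kind of checks behind quality scoring and duplicate detection, here is a minimal stdlib-only sketch (not DataSpark's actual implementation; `quality_report` and its penalty weighting are hypothetical):

```python
import csv
import io

def quality_report(csv_text: str) -> dict:
    """Score a CSV on a 0-100 scale based on missing cells and duplicate rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {"score": 0, "missing": 0, "duplicates": 0}
    total_cells = len(rows) * len(rows[0])
    missing = sum(1 for row in rows for v in row.values() if v in ("", None))
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(row.values())
        if key in seen:
            duplicates += 1
        seen.add(key)
    # Penalize the share of missing cells and of duplicated rows equally.
    penalty = (missing / total_cells + duplicates / len(rows)) / 2
    return {
        "score": round(100 * (1 - penalty)),
        "missing": missing,
        "duplicates": duplicates,
    }
```

A real scorer would weigh more factors (type consistency, outliers, imbalance), but the shape is the same: several normalized penalties folded into a single 0-100 number.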

πŸ–ΌοΈ Image Processing

  • Corrupted image detection
  • Format and dimension analysis
  • 8+ transformations (resize, enhance, filter, rotate)
  • Batch processing from ZIP archives
  • Quality warnings and recommendations
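Corruption detection typically relies on Pillow's decoding (e.g. `Image.verify()`); as a self-contained illustration, a cheap first pass can compare a file's header bytes against the known image signatures. This sketch (the `detect_corruption` helper is hypothetical) flags files whose magic bytes don't match their extension:

```python
from typing import Optional

def detect_corruption(filename: str, data: bytes) -> Optional[str]:
    """Return a warning string if the header doesn't match a known image signature."""
    signatures = {
        ".png": b"\x89PNG\r\n\x1a\n",   # standard PNG magic bytes
        ".jpg": b"\xff\xd8\xff",         # JPEG SOI marker
        ".jpeg": b"\xff\xd8\xff",
        ".gif": b"GIF8",                 # GIF87a / GIF89a
    }
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    expected = signatures.get(ext)
    if expected is None:
        return f"{filename}: unsupported format"
    if not data.startswith(expected):
        return f"{filename}: header does not match {ext} signature (possibly corrupted)"
    return None
```

A signature check catches truncated or mislabeled files quickly; fully decoding each image afterwards catches corruption deeper in the file.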

πŸš€ Quick Start

Backend (FastAPI)

cd backend
pip install -r requirements.txt
cp .env.example .env
python -m app.main

Access: http://localhost:8000/docs

Frontend (Next.js) ✨ NEW!

cd frontend
npm install
cp .env.local.example .env.local
npm run dev

Access: http://localhost:3000

Docker (Full Stack)

cd backend
docker-compose up -d

πŸ“– Detailed Guide: See QUICKSTART.md


πŸ“ Project Structure

dataspark/
β”œβ”€β”€ backend/                    # Backend API (FastAPI)
β”‚   β”œβ”€β”€ app/
β”‚   β”‚   β”œβ”€β”€ core/              # Configuration, security, database
β”‚   β”‚   β”œβ”€β”€ models/            # SQLAlchemy ORM models
β”‚   β”‚   β”œβ”€β”€ schemas/           # Pydantic request/response models
β”‚   β”‚   β”œβ”€β”€ services/          # Business logic
β”‚   β”‚   β”œβ”€β”€ api/               # API endpoints
β”‚   β”‚   └── utils/             # Utility functions
β”‚   β”œβ”€β”€ uploads/               # File storage
β”‚   β”œβ”€β”€ Dockerfile             # Container definition
β”‚   β”œβ”€β”€ docker-compose.yml     # Multi-service setup
β”‚   └── requirements.txt       # Python dependencies
β”œβ”€β”€ frontend/                   # Frontend (Next.js 14) ✨ NEW!
β”‚   β”œβ”€β”€ app/                   # Next.js App Router
β”‚   β”‚   β”œβ”€β”€ page.tsx          # Landing page
β”‚   β”‚   β”œβ”€β”€ login/            # Login page
β”‚   β”‚   β”œβ”€β”€ register/         # Registration page
β”‚   β”‚   └── dashboard/        # Dashboard (protected)
β”‚   β”œβ”€β”€ lib/                   # API client & utilities
β”‚   β”œβ”€β”€ store/                 # State management (Zustand)
β”‚   β”œβ”€β”€ components/            # Reusable components
β”‚   └── package.json           # Node dependencies
β”œβ”€β”€ PROGRESS.md                # Development progress tracker
β”œβ”€β”€ QUICKSTART.md              # Quick start guide
└── README.md                  # This file

🎯 Use Cases

For Students & Beginners

  • Learn data preprocessing without coding
  • Understand dataset quality issues
  • Get AI-powered recommendations

For Data Analysts

  • Quick dataset exploration
  • Automated cleaning pipelines
  • Quality assessment reports

For ML Engineers

  • Rapid preprocessing for experiments
  • Feature engineering automation
  • Dataset preparation for training

For Hackathons

  • Fast data preparation
  • Professional API backend
  • Resume-worthy project

πŸ“Š API Endpoints

Authentication

  • POST /auth/register - Create account
  • POST /auth/login - Get JWT tokens
  • GET /auth/me - Get user info
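For readers unfamiliar with the token format behind `/auth/login`, this is what HS256 JWT signing boils down to, sketched with only the standard library (the `SECRET` constant is hypothetical; production services load the key from the environment and use a maintained library rather than hand-rolling this):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"change-me"  # hypothetical; never hard-code a real signing key

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_token(payload: dict) -> str:
    """Build an HS256-style JWT: header.payload.signature, each base64url-encoded."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = b64url(hmac.new(SECRET, f"{header}.{body}".encode(), hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"

def verify_token(token: str) -> bool:
    """Recompute the signature and compare in constant time."""
    header, body, sig = token.split(".")
    expected = b64url(hmac.new(SECRET, f"{header}.{body}".encode(), hashlib.sha256).digest())
    return hmac.compare_digest(sig, expected)
```

The server returns such a token from `/auth/login`; the client then sends it back as `Authorization: Bearer <token>` on protected endpoints like `/auth/me`.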

CSV Processing

  • POST /csv/analyze - Upload and analyze CSV
  • POST /csv/process - Apply preprocessing actions

Image Processing

  • POST /images/analyze - Analyze image dataset
  • POST /images/process - Apply transformations
  • GET /images/jobs/{id}/status - Check job status
  • GET /images/jobs/{id}/download - Download results
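The submit-then-poll pattern behind the job endpoints can be sketched with an in-memory store and a worker thread (a simplification: `jobs`, `submit_job`, and `job_status` are hypothetical names, and the real backend would persist job state rather than keep it in a dict):

```python
import threading
import uuid

jobs: dict = {}  # hypothetical in-memory job store: id -> {"status", "result"}

def submit_job(work, *args) -> str:
    """Run `work` on a background thread and return a job id the client can poll."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "processing", "result": None}

    def runner():
        try:
            jobs[job_id]["result"] = work(*args)
            jobs[job_id]["status"] = "complete"
        except Exception as exc:
            jobs[job_id] = {"status": "failed", "result": str(exc)}

    threading.Thread(target=runner, daemon=True).start()
    return job_id

def job_status(job_id: str) -> dict:
    """What GET /images/jobs/{id}/status would return."""
    return jobs.get(job_id, {"status": "not_found", "result": None})
```

A client calls the process endpoint once, receives the job id, and polls the status endpoint until it reads `complete`, then fetches the download.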

Analytics

  • GET /api/history - Upload history
  • GET /api/analytics - Usage statistics

Full API Documentation: http://localhost:8000/docs (when running)


πŸ› οΈ Technology Stack

Backend

  • FastAPI - Modern, fast web framework
  • SQLAlchemy - SQL toolkit and ORM
  • Pydantic - Data validation
  • JWT - Secure authentication
  • Pandas - Data manipulation
  • scikit-learn - ML preprocessing
  • Pillow - Image processing

Database

  • SQLite - Development
  • PostgreSQL - Production (recommended)

Deployment

  • Docker - Containerization
  • Railway/Heroku - Cloud hosting
  • Vercel - Frontend hosting (Phase 5)

πŸŽ“ What Makes DataSpark Special?

1. AI-Powered Analysis

DataSpark goes beyond basic statistics, providing:

  • Quality scores based on multiple factors
  • Confidence levels for each recommendation
  • Feature importance estimation
  • Automatic issue detection
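Per-recommendation confidence scores can come from simple graded rules. A toy sketch of the idea (the `recommend` function, its thresholds, and its action names are illustrative, not DataSpark's actual rules):

```python
def recommend(missing_ratio: float, n_unique: int, n_rows: int) -> list:
    """Toy rule-based recommender: each suggestion carries a confidence score."""
    recs = []
    if missing_ratio >= 0.5:
        # Mostly-empty columns rarely carry signal.
        recs.append({"action": "drop_column", "confidence": 0.9})
    elif missing_ratio > 0:
        recs.append({"action": "impute_median", "confidence": 0.8})
    if 0 < n_unique <= max(2, n_rows // 100):
        # Low cardinality relative to row count suggests a categorical column.
        recs.append({"action": "one_hot_encode", "confidence": 0.7})
    return recs
```

The point is the output shape: every suggested action pairs with a confidence, so the UI can surface strong recommendations first.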

2. Production-Ready Architecture

  • Clean, maintainable code structure
  • Industry-standard authentication
  • Docker support
  • Comprehensive documentation
  • API-first design

3. Resume/Portfolio Ready

This project demonstrates:

  • Modern FastAPI development
  • Clean architecture principles
  • JWT authentication
  • Docker containerization
  • RESTful API design
  • ML/AI integration
  • Production deployment

πŸ“ˆ Development Status

| Phase | Status | Description |
|-------|--------|-------------|
| Phase 1 | βœ… Complete | Codebase cleanup & refactor |
| Phase 2 | βœ… Complete | Authentication upgrade (JWT) |
| Phase 3 | βœ… Complete | AI-powered preprocessing |
| Phase 4 | πŸ”„ Planned | Background processing (Celery) |
| Phase 5 | βœ… Complete | Frontend (Next.js) |
| Phase 6 | πŸ“‹ Planned | Production deployment |
| Phase 7 | πŸ“‹ Planned | Observability & monitoring |
| Phase 8 | πŸ“‹ Planned | Testing & quality |
| Phase 9 | πŸ“‹ Planned | Polish & branding |

Current Status: Backend 95% Complete | Frontend 40% Complete | Deployment 50% Complete

πŸ“– Detailed Progress: See PROGRESS.md


πŸš€ Deployment

Railway (Recommended)

  1. Connect GitHub repository
  2. Set environment variables from .env.example
  3. Deploy automatically
  4. Add PostgreSQL addon

Heroku

  1. Create new app
  2. Add PostgreSQL addon
  3. Set config vars
  4. Deploy from Git

Docker

docker build -t dataspark .
docker run -p 8000:8000 dataspark

πŸ§ͺ Testing

Manual Testing

# Run the test script
python test_api.py

Using the Interactive Docs

  1. Start the server
  2. Open http://localhost:8000/docs
  3. Try the endpoints directly in the browser

Using cURL

# Health check
curl http://localhost:8000/health

# Register user
curl -X POST http://localhost:8000/auth/register \
  -H "Content-Type: application/json" \
  -d '{"username":"test","email":"test@example.com","password":"test123"}'

🀝 Contributing

Contributions are welcome! Here's how:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ†˜ Support

  • Documentation: Check /docs endpoint when running
  • Quick Start: See QUICKSTART.md
  • Progress: See PROGRESS.md
  • Issues: Create a GitHub issue

🎯 Roadmap

Short Term (Next 2 Weeks)

  • Add Celery for background jobs
  • Implement rate limiting
  • Add more preprocessing actions
  • Create sample datasets

Medium Term (Next Month)

  • Build Next.js frontend
  • Add user dashboard
  • Implement file management
  • Add export formats (JSON, Parquet)

Long Term (Next 3 Months)

  • Deploy to production
  • Add team collaboration features
  • Implement API versioning
  • Add webhook support
  • Create mobile app

πŸ’‘ Inspiration

DataSpark was built to solve a common problem: data preprocessing is tedious and time-consuming. By combining AI-powered analysis with automated preprocessing, DataSpark makes data preparation accessible to everyone.


🌟 Show Your Support

If you find DataSpark useful:

  • ⭐ Star this repository
  • πŸ› Report bugs
  • πŸ’‘ Suggest features
  • 🀝 Contribute code
  • πŸ“’ Share with others

πŸ‘¨β€πŸ’» Author

Built with ❀️ for the data science community


πŸ“š Learn More


Ready to transform your data preprocessing workflow? Get started now! πŸš€

git clone <repository-url>
cd dataspark/backend
pip install -r requirements.txt
python -m app.main
