DataSpark is an AI-powered platform that automatically analyzes, cleans, and prepares CSV and image datasets for machine learning and analytics.
- Smart CSV Analysis: AI-powered column analysis with quality scoring (0-100)
- Image Dataset Processing: Batch processing of image archives with quality detection
- Intelligent Recommendations: Context-aware preprocessing suggestions with confidence scores
- Background Processing: Non-blocking job queue for large datasets
- JWT Authentication: Secure token-based authentication
- Analytics Dashboard: Usage statistics and performance metrics
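To illustrate what token-based authentication involves, here is a minimal standard-library sketch of how an HS256 JWT is signed and verified. The secret, claim names, and lifetime are illustrative only; this is not DataSpark's implementation, which would use a maintained library such as PyJWT.

```python
# Minimal sketch of HS256 JWT creation/verification using only the stdlib.
# SECRET and the claims are illustrative, not DataSpark's actual config.
import base64
import hashlib
import hmac
import json
import time

SECRET = b"change-me"  # hypothetical; a real app loads this from the environment

def _b64(data: bytes) -> str:
    # JWTs use unpadded URL-safe base64
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def create_token(username: str, ttl_seconds: int = 1800) -> str:
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps({"sub": username, "exp": int(time.time()) + ttl_seconds}).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_token(token: str) -> str:
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    # Restore base64 padding before decoding the claims
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims["sub"]

token = create_token("alice")
print(verify_token(token))  # alice
```

In practice the token travels in an `Authorization: Bearer <token>` header, and the server re-derives the signature on every request rather than storing session state.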
- Quality scoring and duplicate detection
- 15+ preprocessing actions (imputation, scaling, encoding, outlier handling)
- Feature importance estimation
- Class imbalance detection
- Automatic pipeline generation
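As a rough illustration of what actions like imputation, scaling, and encoding do to a table (this is a sketch, not DataSpark's implementation), here are three of them applied with pandas:

```python
# Illustrative sketch of three common preprocessing actions:
# mean imputation, min-max scaling, and one-hot encoding.
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 33],
    "income": [30000, 52000, None, 61000],
    "city": ["NY", "SF", "NY", "LA"],
})

# Imputation: fill missing numeric values with the column mean
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Scaling: min-max scale numeric columns into [0, 1]
df[num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())

# Encoding: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["city"])

print(df.columns.tolist())  # age, income, city_LA, city_NY, city_SF
```

An automated pipeline chains such steps in order, deciding per column which action (if any) applies.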
- Corrupted image detection
- Format and dimension analysis
- 8+ transformations (resize, enhance, filter, rotate)
- Batch processing from ZIP archives
- Quality warnings and recommendations
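To show what corrupted-image detection can look like in practice, here is a hedged Pillow sketch; the function name and return shape are hypothetical, not DataSpark's API:

```python
# Sketch of corrupted-image detection with Pillow (illustrative only).
import io

from PIL import Image

def check_image(data: bytes) -> dict:
    """Return basic quality info, or flag the file as corrupted."""
    try:
        with Image.open(io.BytesIO(data)) as img:
            img.verify()  # cheap structural check; does not decode pixels
        # verify() leaves the image unusable, so reopen to read metadata
        with Image.open(io.BytesIO(data)) as img:
            return {"ok": True, "format": img.format, "size": img.size}
    except Exception:
        return {"ok": False}

# A valid in-memory PNG passes; truncated bytes are flagged as corrupted
buf = io.BytesIO()
Image.new("RGB", (64, 48)).save(buf, format="PNG")
print(check_image(buf.getvalue()))
print(check_image(buf.getvalue()[:20]))
```

A batch processor would run a check like this on every file extracted from the ZIP archive before attempting any transformation.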
```bash
cd backend
pip install -r requirements.txt
cp .env.example .env
python -m app.main
```

Access: http://localhost:8000/docs
```bash
cd frontend
npm install
cp .env.local.example .env.local
npm run dev
```

Access: http://localhost:3000
```bash
cd backend
docker-compose up -d
```

📖 Detailed Guide: See QUICKSTART.md
```
dataspark/
├── backend/                 # Backend API (FastAPI)
│   ├── app/
│   │   ├── core/            # Configuration, security, database
│   │   ├── models/          # SQLAlchemy ORM models
│   │   ├── schemas/         # Pydantic request/response models
│   │   ├── services/        # Business logic
│   │   ├── api/             # API endpoints
│   │   └── utils/           # Utility functions
│   ├── uploads/             # File storage
│   ├── Dockerfile           # Container definition
│   ├── docker-compose.yml   # Multi-service setup
│   └── requirements.txt     # Python dependencies
├── frontend/                # Frontend (Next.js 14) ✨ NEW!
│   ├── app/                 # Next.js App Router
│   │   ├── page.tsx         # Landing page
│   │   ├── login/           # Login page
│   │   ├── register/        # Registration page
│   │   └── dashboard/       # Dashboard (protected)
│   ├── lib/                 # API client & utilities
│   ├── store/               # State management (Zustand)
│   ├── components/          # Reusable components
│   └── package.json         # Node dependencies
├── PROGRESS.md              # Development progress tracker
├── QUICKSTART.md            # Quick start guide
└── README.md                # This file
```
- Learn data preprocessing without coding
- Understand dataset quality issues
- Get AI-powered recommendations
- Quick dataset exploration
- Automated cleaning pipelines
- Quality assessment reports
- Rapid preprocessing for experiments
- Feature engineering automation
- Dataset preparation for training
- Fast data preparation
- Professional API backend
- Resume-worthy project
- `POST /auth/register` - Create account
- `POST /auth/login` - Get JWT tokens
- `GET /auth/me` - Get user info
- `POST /csv/analyze` - Upload and analyze CSV
- `POST /csv/process` - Apply preprocessing actions
- `POST /images/analyze` - Analyze image dataset
- `POST /images/process` - Apply transformations
- `GET /images/jobs/{id}/status` - Check job status
- `GET /images/jobs/{id}/download` - Download results
- `GET /api/history` - Upload history
- `GET /api/analytics` - Usage statistics
Full API Documentation: http://localhost:8000/docs (when running)
- FastAPI - Modern, fast web framework
- SQLAlchemy - SQL toolkit and ORM
- Pydantic - Data validation
- JWT - Secure authentication
- Pandas - Data manipulation
- scikit-learn - ML preprocessing
- Pillow - Image processing
- SQLite - Development
- PostgreSQL - Production (recommended)
- Docker - Containerization
- Railway/Heroku - Cloud hosting
- Vercel - Frontend hosting (Phase 5)
DataSpark goes beyond basic statistics, providing:
- Quality scores based on multiple factors
- Confidence levels for each recommendation
- Feature importance estimation
- Automatic issue detection
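As a toy illustration of multi-factor quality scoring (DataSpark's actual formula is not shown here, and these weights are made up), a 0-100 column score might combine completeness and uniqueness:

```python
# Toy 0-100 quality score from two factors; weights are illustrative only.
def quality_score(values: list) -> int:
    n = len(values)
    non_null = [v for v in values if v is not None]
    completeness = len(non_null) / n if n else 0.0
    uniqueness = len(set(non_null)) / len(non_null) if non_null else 0.0
    # Weighted blend; a real scorer would also consider types, outliers, etc.
    return round(100 * (0.7 * completeness + 0.3 * uniqueness))

print(quality_score([1, 2, 3]))        # → 100
print(quality_score([1, 2, 2, None]))  # partial completeness and uniqueness
```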
- Clean, maintainable code structure
- Industry-standard authentication
- Docker support
- Comprehensive documentation
- API-first design
This project demonstrates:
- Modern FastAPI development
- Clean architecture principles
- JWT authentication
- Docker containerization
- RESTful API design
- ML/AI integration
- Production deployment
| Phase | Status | Description |
|---|---|---|
| Phase 1 | ✅ Complete | Codebase cleanup & refactor |
| Phase 2 | ✅ Complete | Authentication upgrade (JWT) |
| Phase 3 | ✅ Complete | AI-powered preprocessing |
| Phase 4 | 📋 Planned | Background processing (Celery) |
| Phase 5 | ✅ Complete | Frontend (Next.js) |
| Phase 6 | 📋 Planned | Production deployment |
| Phase 7 | 📋 Planned | Observability & monitoring |
| Phase 8 | 📋 Planned | Testing & quality |
| Phase 9 | 📋 Planned | Polish & branding |
Current Status: Backend 95% Complete | Frontend 40% Complete | Deployment 50%
📊 Detailed Progress: See PROGRESS.md
- Connect GitHub repository
- Set environment variables from `.env.example`
- Deploy automatically
- Add PostgreSQL addon
- Create new app
- Add PostgreSQL addon
- Set config vars
- Deploy from Git
```bash
docker build -t dataspark .
docker run -p 8000:8000 dataspark
```

```bash
# Run the test script
python test_api.py
```

- Start the server
- Open http://localhost:8000/docs
- Try the endpoints directly in the browser
```bash
# Health check
curl http://localhost:8000/health

# Register user
curl -X POST http://localhost:8000/auth/register \
  -H "Content-Type: application/json" \
  -d '{"username":"test","email":"test@example.com","password":"test123"}'
```

Contributions are welcome! Here's how:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation: Check the `/docs` endpoint when running
- Quick Start: See QUICKSTART.md
- Progress: See PROGRESS.md
- Issues: Create a GitHub issue
- Add Celery for background jobs
- Implement rate limiting
- Add more preprocessing actions
- Create sample datasets
- Build Next.js frontend
- Add user dashboard
- Implement file management
- Add export formats (JSON, Parquet)
- Deploy to production
- Add team collaboration features
- Implement API versioning
- Add webhook support
- Create mobile app
DataSpark was built to solve a common problem: data preprocessing is tedious and time-consuming. By combining AI-powered analysis with automated preprocessing, DataSpark makes data preparation accessible to everyone.
If you find DataSpark useful:
- ⭐ Star this repository
- 🐛 Report bugs
- 💡 Suggest features
- 🤝 Contribute code
- 📢 Share with others

Built with ❤️ for the data science community

Ready to transform your data preprocessing workflow? Get started now! 🚀
```bash
git clone <repository-url>
cd dataspark/backend
python -m app.main
```