DataSpark is an AI-powered platform that automatically analyzes, cleans, and prepares CSV and image datasets for machine learning and analytics.
- Smart CSV Analysis: AI-powered column analysis with quality scoring (0-100)
- Image Dataset Processing: Batch processing of image archives with quality detection
- Intelligent Recommendations: Context-aware preprocessing suggestions with confidence scores
- Background Processing: Non-blocking job queue for large datasets
- JWT Authentication: Secure token-based authentication
- Analytics Dashboard: Usage statistics and performance metrics
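To illustrate what token-based authentication involves, here is a minimal standard-library sketch of how an HS256 JWT is signed and verified. The secret, claim names, and lifetime are illustrative only; this is not DataSpark's implementation, which would use a maintained library such as PyJWT.

```python
# Minimal sketch of HS256 JWT creation/verification using only the stdlib.
# SECRET and the claims are illustrative, not DataSpark's actual config.
import base64
import hashlib
import hmac
import json
import time

SECRET = b"change-me"  # hypothetical; a real app loads this from the environment

def _b64(data: bytes) -> str:
    # JWTs use unpadded URL-safe base64
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def create_token(username: str, ttl_seconds: int = 1800) -> str:
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps({"sub": username, "exp": int(time.time()) + ttl_seconds}).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_token(token: str) -> str:
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    # Restore base64 padding before decoding the claims
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims["sub"]

token = create_token("alice")
print(verify_token(token))  # alice
```

In practice the token travels in an `Authorization: Bearer <token>` header, and the server re-derives the signature on every request rather than storing session state.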
- Quality scoring and duplicate detection
- 15+ preprocessing actions (imputation, scaling, encoding, outlier handling)
- Feature importance estimation
- Class imbalance detection
- Automatic pipeline generation
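As a rough illustration of what actions like imputation, scaling, and encoding do to a table (this is a sketch, not DataSpark's implementation), here are three of them applied with pandas:

```python
# Illustrative sketch of three common preprocessing actions:
# mean imputation, min-max scaling, and one-hot encoding.
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 33],
    "income": [30000, 52000, None, 61000],
    "city": ["NY", "SF", "NY", "LA"],
})

# Imputation: fill missing numeric values with the column mean
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Scaling: min-max scale numeric columns into [0, 1]
df[num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())

# Encoding: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["city"])

print(df.columns.tolist())  # age, income, city_LA, city_NY, city_SF
```

An automated pipeline chains such steps in order, deciding per column which action (if any) applies.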
- Corrupted image detection
- Format and dimension analysis
- 8+ transformations (resize, enhance, filter, rotate)
- Batch processing from ZIP archives
- Quality warnings and recommendations
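To show what corrupted-image detection can look like in practice, here is a hedged Pillow sketch; the function name and return shape are hypothetical, not DataSpark's API:

```python
# Sketch of corrupted-image detection with Pillow (illustrative only).
import io

from PIL import Image

def check_image(data: bytes) -> dict:
    """Return basic quality info, or flag the file as corrupted."""
    try:
        with Image.open(io.BytesIO(data)) as img:
            img.verify()  # cheap structural check; does not decode pixels
        # verify() leaves the image unusable, so reopen to read metadata
        with Image.open(io.BytesIO(data)) as img:
            return {"ok": True, "format": img.format, "size": img.size}
    except Exception:
        return {"ok": False}

# A valid in-memory PNG passes; truncated bytes are flagged as corrupted
buf = io.BytesIO()
Image.new("RGB", (64, 48)).save(buf, format="PNG")
print(check_image(buf.getvalue()))
print(check_image(buf.getvalue()[:20]))
```

A batch processor would run a check like this on every file extracted from the ZIP archive before attempting any transformation.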
```bash
cd backend
pip install -r requirements.txt
cp .env.example .env
python -m app.main
```

Access: http://localhost:8000/docs
```bash
cd frontend
npm install
cp .env.local.example .env.local
npm run dev
```

Access: http://localhost:3000
```bash
cd backend
docker-compose up -d
```

📖 Detailed Guide: See QUICKSTART.md
```
dataspark/
├── backend/                 # Backend API (FastAPI)
│   ├── app/
│   │   ├── core/            # Configuration, security, database
│   │   ├── models/          # SQLAlchemy ORM models
│   │   ├── schemas/         # Pydantic request/response models
│   │   ├── services/        # Business logic
│   │   ├── api/             # API endpoints
│   │   └── utils/           # Utility functions
│   ├── uploads/             # File storage
│   ├── Dockerfile           # Container definition
│   ├── docker-compose.yml   # Multi-service setup
│   └── requirements.txt     # Python dependencies
├── frontend/                # Frontend (Next.js 14) ✨ NEW!
│   ├── app/                 # Next.js App Router
│   │   ├── page.tsx         # Landing page
│   │   ├── login/           # Login page
│   │   ├── register/        # Registration page
│   │   └── dashboard/       # Dashboard (protected)
│   ├── lib/                 # API client & utilities
│   ├── store/               # State management (Zustand)
│   ├── components/          # Reusable components
│   └── package.json         # Node dependencies
├── PROGRESS.md              # Development progress tracker
├── QUICKSTART.md            # Quick start guide
└── README.md                # This file
```
- Learn data preprocessing without coding
- Understand dataset quality issues
- Get AI-powered recommendations
- Quick dataset exploration
- Automated cleaning pipelines
- Quality assessment reports
- Rapid preprocessing for experiments
- Feature engineering automation
- Dataset preparation for training
- Fast data preparation
- Professional API backend
- Resume-worthy project
- `POST /auth/register` - Create account
- `POST /auth/login` - Get JWT tokens
- `GET /auth/me` - Get user info
- `POST /csv/analyze` - Upload and analyze CSV
- `POST /csv/process` - Apply preprocessing actions
- `POST /images/analyze` - Analyze image dataset
- `POST /images/process` - Apply transformations
- `GET /images/jobs/{id}/status` - Check job status
- `GET /images/jobs/{id}/download` - Download results
- `GET /api/history` - Upload history
- `GET /api/analytics` - Usage statistics
Full API Documentation: http://localhost:8000/docs (when running)
- FastAPI - Modern, fast web framework
- SQLAlchemy - SQL toolkit and ORM
- Pydantic - Data validation
- JWT - Secure authentication
- Pandas - Data manipulation
- scikit-learn - ML preprocessing
- Pillow - Image processing
- SQLite - Development
- PostgreSQL - Production (recommended)
- Docker - Containerization
- Railway/Heroku - Cloud hosting
- Vercel - Frontend hosting (Phase 5)
DataSpark goes beyond basic statistics, providing:
- Quality scores based on multiple factors
- Confidence levels for each recommendation
- Feature importance estimation
- Automatic issue detection
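As a toy illustration of multi-factor quality scoring (DataSpark's actual formula is not shown here, and these weights are made up), a 0-100 column score might combine completeness and uniqueness:

```python
# Toy 0-100 quality score from two factors; weights are illustrative only.
def quality_score(values: list) -> int:
    n = len(values)
    non_null = [v for v in values if v is not None]
    completeness = len(non_null) / n if n else 0.0
    uniqueness = len(set(non_null)) / len(non_null) if non_null else 0.0
    # Weighted blend; a real scorer would also consider types, outliers, etc.
    return round(100 * (0.7 * completeness + 0.3 * uniqueness))

print(quality_score([1, 2, 3]))        # → 100
print(quality_score([1, 2, 2, None]))  # partial completeness and uniqueness
```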
- Clean, maintainable code structure
- Industry-standard authentication
- Docker support
- Comprehensive documentation
- API-first design
This project demonstrates:
- Modern FastAPI development
- Clean architecture principles
- JWT authentication
- Docker containerization
- RESTful API design
- ML/AI integration
- Production deployment
| Phase | Status | Description |
|---|---|---|
| Phase 1 | ✅ Complete | Codebase cleanup & refactor |
| Phase 2 | ✅ Complete | Authentication upgrade (JWT) |
| Phase 3 | ✅ Complete | AI-powered preprocessing |
| Phase 4 | 📋 Planned | Background processing (Celery) |
| Phase 5 | ✅ Complete | Frontend (Next.js) |
| Phase 6 | 📋 Planned | Production deployment |
| Phase 7 | 📋 Planned | Observability & monitoring |
| Phase 8 | 📋 Planned | Testing & quality |
| Phase 9 | 📋 Planned | Polish & branding |
Current Status: Backend 95% Complete | Frontend 40% Complete | Deployment 50%
📊 Detailed Progress: See PROGRESS.md
- Connect GitHub repository
- Set environment variables from `.env.example`
- Deploy automatically
- Add PostgreSQL addon
- Create new app
- Add PostgreSQL addon
- Set config vars
- Deploy from Git
```bash
docker build -t dataspark .
docker run -p 8000:8000 dataspark
```

```bash
# Run the test script
python test_api.py
```

- Start the server
- Open http://localhost:8000/docs
- Try the endpoints directly in the browser
```bash
# Health check
curl http://localhost:8000/health

# Register user
curl -X POST http://localhost:8000/auth/register \
  -H "Content-Type: application/json" \
  -d '{"username":"test","email":"test@example.com","password":"test123"}'
```

Contributions are welcome! Here's how:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation: Check the `/docs` endpoint when running
- Quick Start: See QUICKSTART.md
- Progress: See PROGRESS.md
- Issues: Create a GitHub issue
- Add Celery for background jobs
- Implement rate limiting
- Add more preprocessing actions
- Create sample datasets
- Build Next.js frontend
- Add user dashboard
- Implement file management
- Add export formats (JSON, Parquet)
- Deploy to production
- Add team collaboration features
- Implement API versioning
- Add webhook support
- Create mobile app
DataSpark was built to solve a common problem: data preprocessing is tedious and time-consuming. By combining AI-powered analysis with automated preprocessing, DataSpark makes data preparation accessible to everyone.
If you find DataSpark useful:
- ⭐ Star this repository
- 🐛 Report bugs
- 💡 Suggest features
- 🤝 Contribute code
- 📢 Share with others

Built with ❤️ for the data science community

Ready to transform your data preprocessing workflow? Get started now! 🚀
```bash
git clone <repository-url>
cd dataspark/backend
python -m app.main
```