An intelligent web application that turns your raw data into actionable insights and analysis reports using a hybrid pipeline of traditional ML techniques and LLM-based natural language orchestration.
AI Data Analyst is an AI-powered Streamlit app that lets users upload CSV datasets and request analysis tasks in natural language, such as:
- Exploratory Data Analysis (EDA)
- Regression Analysis (Linear)
- Classification Analysis (Logistic Regression)
- Predictive Modeling
The system intelligently interprets user intent using GPT, chooses the appropriate analysis pipeline, executes it, and generates:
- 📈 Visual plots
- 📊 Model metrics
- 📋 Explainable feature insights
- 📄 A full natural language report generated via GPT-3.5 Turbo, suitable for both technical and non-technical audiences
✅ Natural language driven orchestration
✅ GPT-based target column inference
✅ Automated pipeline selection (Regression / Classification / Decision Tree)
✅ Robust fallback handling for CSV file encoding
✅ Chat with Data mode using PandasAI integration
✅ Support for:
- Linear Regression
- Logistic Regression
- Actual vs Predicted (Regression)
- Confusion Matrix (Classification)
- Feature Importances (Tree / Coefficients)
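The CSV encoding fallback listed above could look roughly like the sketch below. It is not the app's actual code: the function name `decode_csv_bytes` and the order of candidate encodings are assumptions chosen for illustration. It tries each encoding in turn and returns the first that decodes cleanly.

```python
import csv
import io

# Hypothetical fallback order (an assumption, not taken from the app).
# "latin-1" goes last because it accepts any byte sequence and never fails.
FALLBACK_ENCODINGS = ("utf-8-sig", "cp1252", "latin-1")


def decode_csv_bytes(raw: bytes, encodings=FALLBACK_ENCODINGS):
    """Return (text, encoding) using the first encoding that decodes without error."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode with any of the candidate encodings")


# Example: a file saved on Windows with accented characters.
raw = "name,city\nJosé,Málaga\n".encode("cp1252")
text, used = decode_csv_bytes(raw)
rows = list(csv.reader(io.StringIO(text)))
```

The decoded text can then be handed to `pandas.read_csv(io.StringIO(text))`, so the rest of the pipeline does not need to know which encoding was used.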
| Layer | Tools Used |
|---|---|
| UI | Streamlit |
| ML Models | scikit-learn |
| Plots | Matplotlib, seaborn |
| Data Handling | Pandas |
| LLM Orchestration | OpenAI GPT-3.5 Turbo |
| Data Chat Layer | PandasAI |
| Report Generation | GPT-3.5 + FPDF |
- Bridges the gap between technical data science and business reporting
- Provides transparent, explainable AI reports from raw data
- Eliminates the need for users to write code for EDA and modeling
- Boosts productivity for data teams and analysts
- Supports non-technical stakeholders by generating human-friendly summaries
- Enables interactive data exploration via Chat mode
- Demonstrates practical use of LLM orchestration patterns for real-world analytics
- Upload CSV
- Provide natural language task description
- Optional chat with PandasAI
- GPT analyzes task prompt and dataset schema
- Chooses appropriate pipeline:
- Linear Regression
- Logistic Regression
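One way to picture the routing step is sketched below: a prompt is built from the task and the dataset schema, and the model's reply is validated before any pipeline runs. Everything here is hypothetical (the function names, the JSON reply format, and the choice to fall back to EDA on an invalid reply); the app's real prompt and parsing may differ, and the actual GPT call is left out.

```python
import json

# Hypothetical pipeline identifiers, used only for this illustration.
ALLOWED_PIPELINES = {"eda", "linear_regression", "logistic_regression"}
FALLBACK = {"pipeline": "eda", "target": None}


def build_routing_prompt(task: str, schema: dict) -> str:
    """Describe the task and the column types so the model can pick a pipeline."""
    columns = "\n".join(f"- {name}: {dtype}" for name, dtype in schema.items())
    return (
        "You route data analysis requests.\n"
        f"Task: {task}\n"
        f"Columns:\n{columns}\n"
        'Reply with JSON only, e.g. {"pipeline": "linear_regression", "target": "price"}'
    )


def parse_routing_reply(reply: str, schema: dict) -> dict:
    """Validate the model's reply; fall back to EDA if it is malformed (an assumption)."""
    try:
        choice = json.loads(reply)
    except json.JSONDecodeError:
        return dict(FALLBACK)
    if not isinstance(choice, dict):
        return dict(FALLBACK)
    pipeline, target = choice.get("pipeline"), choice.get("target")
    if not isinstance(pipeline, str) or pipeline not in ALLOWED_PIPELINES:
        return dict(FALLBACK)
    if pipeline == "eda":
        return dict(FALLBACK)
    # A modeling pipeline needs a target column that exists in the dataset.
    if not isinstance(target, str) or target not in schema:
        return dict(FALLBACK)
    return {"pipeline": pipeline, "target": target}


schema = {"sqft": "int64", "price": "float64"}
prompt = build_routing_prompt("Predict price from sqft", schema)
choice = parse_routing_reply('{"pipeline": "linear_regression", "target": "price"}', schema)
```

Validating the reply against the real column names guards against the model naming a target column that is not in the uploaded file.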
- Data is preprocessed (encoding handling, dummy variables for categorical columns)
- Model trained and evaluated
- Plots generated
- Actual vs Predicted
- Confusion Matrix
- Feature Importance
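The preprocessing, training, and evaluation steps for the regression path might be sketched as follows with pandas and scikit-learn. This is a minimal illustration under assumptions, not the app's implementation: `run_linear_regression`, the toy dataset, and the returned dictionary layout are all hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def run_linear_regression(df: pd.DataFrame, target: str,
                          test_size: float = 0.25, seed: int = 0) -> dict:
    """Dummy-encode features, fit a linear model, and collect metrics."""
    # One-hot encode categorical columns; drop_first avoids a redundant dummy.
    X = pd.get_dummies(df.drop(columns=[target]), drop_first=True).astype(float)
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "r2": r2_score(y_test, y_pred),
        # Coefficients serve as the feature insight for a linear model.
        "coefficients": dict(zip(X.columns, model.coef_)),
        # These pairs are what an Actual vs Predicted plot would draw.
        "actual_vs_predicted": list(zip(y_test, y_pred)),
    }


# Toy data (made up): price = 2 * size, plus 3 when city is "B".
df = pd.DataFrame({
    "size": [1, 2, 3, 4, 5, 6, 7, 8],
    "city": ["A", "B", "A", "B", "A", "B", "A", "B"],
})
df["price"] = 2 * df["size"] + 3 * (df["city"] == "B")
result = run_linear_regression(df, target="price")
```

Because the toy target is an exact linear function of the encoded features, the fitted coefficients should recover roughly 2 for `size` and 3 for `city_B`. The classification path would follow the same shape with `LogisticRegression` and a confusion matrix in place of R².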
- Analysis results passed to GPT with structured prompts.
- GPT writes a multi-section report:
- Dataset summary
- Methodology
- Metrics
- Feature Insights
- Plain-English and Technical explanations
- Limitations
- Report saved as a PDF and displayed in app.
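A structured report prompt of the kind described above could be assembled like this. The sketch is an assumption about the general shape only: `build_report_prompt`, the wording of the instructions, and the example `results` values are invented for illustration, and the GPT call and PDF export are omitted.

```python
import json

# Section names follow the report outline listed above.
REPORT_SECTIONS = [
    "Dataset summary",
    "Methodology",
    "Metrics",
    "Feature insights",
    "Plain-English and technical explanations",
    "Limitations",
]


def build_report_prompt(results: dict) -> str:
    """Combine the fixed section outline with the analysis results as JSON."""
    outline = "\n".join(
        f"{i}. {name}" for i, name in enumerate(REPORT_SECTIONS, start=1)
    )
    return (
        "Write an analysis report with these sections, in order:\n"
        f"{outline}\n\n"
        "Use only the facts in the JSON below; do not invent numbers.\n"
        f"{json.dumps(results, indent=2, sort_keys=True)}"
    )


# Made-up results, standing in for the output of an analysis pipeline.
results = {
    "pipeline": "linear_regression",
    "target": "price",
    "r2": 0.87,
    "coefficients": {"sqft": 112.5},
}
report_prompt = build_report_prompt(results)
```

Passing results as JSON rather than free text keeps the numbers the model sees identical to the ones the pipeline computed.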
git clone https://github.com/DSM2499/AI_Data_Cleaner.git
cd AI_Data_Cleaner
python3 -m venv venv
source venv/bin/activate # Mac/Linux
# venv\Scripts\activate # Windows
pip install -r requirements.txt
- Create a .env file
OPEN_AI_KEY=sk-xxxxxxxxxxxxxxxxxxxxxx
streamlit run app.py
Contributions welcome! Feel free to open issues, fork the repo, and suggest improvements.