A comprehensive data science project covering EDA, feature engineering, classical ML, and neural networks, all applied to real financial time-series data.
This project aims to predict whether a stock's closing price will go UP or DOWN the next trading day using historical OHLCV (Open, High, Low, Close, Volume) data. We follow a full pipeline:
- Exploratory Data Analysis (EDA): Understand data structure, trends, and seasonality
- Data Preprocessing & Feature Engineering: Create technical indicators, handle missing values, create lag features
- Statistical Analysis & Visualization: Correlation, stationarity tests, returns distribution
- Classical Machine Learning: Train Logistic Regression, Random Forest, XGBoost models
- Deep Learning (LSTM/GRU): Build sequence models for time-series forecasting
- Model Evaluation & Comparison: Compare performance across models using time-based splits
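The prediction target and the time-based split described above can be sketched as follows. This is a minimal illustration, assuming a single ticker's rows with `date` and `close` columns; the helper names `add_target` and `chronological_split`, the 80/20 ratio, and the sample prices are assumptions made for the example, not part of the project code.

```python
import pandas as pd


def add_target(df: pd.DataFrame) -> pd.DataFrame:
    """Label each row 1 if the next day's close is higher, else 0."""
    out = df.sort_values("date").reset_index(drop=True)
    next_close = out["close"].shift(-1)
    out["target"] = (next_close > out["close"]).astype(int)
    # The final row has no next day, so its label is unknown; drop it.
    return out.iloc[:-1]


def chronological_split(df: pd.DataFrame, train_frac: float = 0.8):
    """Split without shuffling so every test row is later than every training row."""
    cut = int(len(df) * train_frac)
    return df.iloc[:cut], df.iloc[cut:]


# Made-up prices for illustration only.
prices = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=6, freq="D"),
    "close": [10.0, 11.0, 10.5, 10.5, 12.0, 11.0],
})
labeled = add_target(prices)
train, test = chronological_split(labeled)
```

Splitting by position rather than at random keeps future prices out of the training set, which is the point of a time-based split. In this sketch an unchanged close counts as DOWN; that tie-breaking rule is an assumption.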
We use the S&P 500 Stock Prices (2010–2024) dataset from Kaggle.
- date: Trading date
- open, high, low, close: Daily price levels
- volume: Number of shares traded
- Name: Ticker symbol (e.g., AAPL, MSFT)
Note: Your Excel screenshot shows all_stocks_5yr.csv, which is likely the same dataset. We'll load it into pandas for analysis.
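Loading the file into pandas might look like the sketch below. It assumes the column names listed above; the function name `load_prices` and the in-memory sample rows are hypothetical and stand in for the real CSV so the example runs without the download.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = ["date", "open", "high", "low", "close", "volume", "Name"]


def load_prices(source) -> pd.DataFrame:
    """Read the OHLCV CSV from a path or file-like object."""
    df = pd.read_csv(source, parse_dates=["date"])
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # Sort by ticker, then date, so each stock's history is contiguous and ordered.
    return df.sort_values(["Name", "date"]).reset_index(drop=True)


# Tiny in-memory sample with made-up values, standing in for the real file.
sample_csv = io.StringIO(
    "date,open,high,low,close,volume,Name\n"
    "2020-01-03,11.0,11.5,10.8,11.2,1200,MSFT\n"
    "2020-01-02,10.0,10.6,9.9,10.5,1000,MSFT\n"
    "2020-01-02,20.0,20.4,19.7,20.1,900,AAPL\n"
)
df = load_prices(sample_csv)
aapl = df[df["Name"] == "AAPL"]
```

With the real dataset in place, the call would be `load_prices("data/all_stocks_5yr.csv")`.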
- Language: Python 3.9+
- Libraries:
  - pandas, numpy: Data handling
  - matplotlib, seaborn, plotly: Visualization
  - scikit-learn: Classical ML
  - tensorflow / keras: Neural Networks (LSTM)
  - statsmodels, scipy: Statistical tests
  - ta (Technical Analysis library): Feature engineering
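To give a feel for the feature engineering step, here is a small sketch that builds a daily return, a simple moving average, and lagged returns. It uses plain pandas rather than the `ta` library so that it does not assume that library's API. The function name `add_basic_features`, the window size, the number of lags, and the sample prices are all illustrative assumptions.

```python
import pandas as pd


def add_basic_features(df: pd.DataFrame, sma_window: int = 3, n_lags: int = 2) -> pd.DataFrame:
    """Add return, moving-average, and lagged-return columns for one ticker."""
    out = df.copy()
    # Daily simple return: today's close relative to yesterday's close.
    out["return_1d"] = out["close"] / out["close"].shift(1) - 1
    # Simple moving average over the current and previous closes.
    out[f"sma_{sma_window}"] = out["close"].rolling(sma_window).mean()
    # Lagged returns: each row only sees returns from earlier days.
    for lag in range(1, n_lags + 1):
        out[f"return_lag_{lag}"] = out["return_1d"].shift(lag)
    # Early rows lack enough history for every feature; drop them.
    return out.dropna().reset_index(drop=True)


# Made-up closing prices for a single ticker.
prices = pd.DataFrame({"close": [100.0, 102.0, 101.0, 103.0, 104.0]})
features = add_basic_features(prices)
```

On the full dataset, a function like this would need to be applied per ticker (for example via `groupby("Name")`) so that rolling windows and lags do not mix prices from different stocks.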
FinPredict/
│
├── data/ # Raw and processed datasets
├── notebooks/ # Jupyter notebooks for each phase
│   ├── 01_eda.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 03_statistics_visualization.ipynb
│   ├── 04_machine_learning.ipynb
│   └── 05_neural_networks.ipynb
├── models/ # Saved trained models
├── utils/ # Helper functions (feature engineering, plotting, etc.)
├── README.md # This file
└── requirements.txt # Python dependencies
- Clone this repository
- Install dependencies:
  pip install -r requirements.txt
- Download the dataset from Kaggle and place it in data/all_stocks_5yr.csv
- Open notebooks/01_eda.ipynb to begin!
- Stationarity of returns?
- Most predictive features?
- Best performing model? (XGBoost vs LSTM)
- Accuracy on test set?
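When answering the accuracy question, it can help to compare against a trivial baseline, since up/down labels are rarely balanced. The sketch below is one way to do that with numpy; the function names and the label arrays are made up for illustration and are not results from this project.

```python
import numpy as np


def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())


def majority_baseline_accuracy(y_train, y_test) -> float:
    """Accuracy of always predicting the most common training label."""
    values, counts = np.unique(np.asarray(y_train), return_counts=True)
    majority = values[counts.argmax()]
    return float((np.asarray(y_test) == majority).mean())


# Made-up labels and predictions, for illustration only.
y_train = [1, 1, 0, 1, 0, 1]
y_test = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

model_acc = accuracy(y_test, y_pred)
baseline_acc = majority_baseline_accuracy(y_train, y_test)
```

A model is only informative to the extent that `model_acc` exceeds `baseline_acc` on the held-out, later-in-time test period.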
- Add fundamental data (P/E ratio, EPS) from Yahoo Finance
- Try Transformer models for multi-stock forecasting
- Deploy model via Streamlit or FastAPI
- Backtest trading strategy based on predictions
Madhav Madupu | [LinkedIn/GitHub] | Date: December 10, 2025
Built for learning, portfolio showcase, and real-world finance applications.