
📈 StockSage: End-to-End LSTM Stock Price Prediction Pipeline

StockSage is a robust, reproducible machine learning pipeline for stock price prediction using LSTM neural networks. The project leverages DVC for data and model versioning, MLflow for experiment tracking, Hyperopt for automated hyperparameter tuning, and DagsHub for collaborative data science and remote storage.


🚀 Features

  • LSTM Neural Network for time series regression
  • Automated Hyperparameter Tuning with Hyperopt
  • Experiment Tracking with MLflow
  • Reproducible Pipelines using DVC
  • Remote Data & Model Storage with DagsHub
  • Robust Data Preprocessing and validation
  • Easy Configuration via params.yaml

🗂️ Project Structure

StockSage/
├── data/
│   ├── raw/
│   │   └── data.csv
│   └── processed/
│       └── data.csv
├── models/
│   └── model.h5
├── src/
│   ├── preprocess.py
│   ├── train.py
│   └── evaluate.py
├── params.yaml
├── dvc.yaml
├── requirements.txt
├── .env
└── README.md

⚙️ Setup & Installation

1. Clone the Repository

git clone https://github.com/yourusername/StockSage.git
cd StockSage

2. Install Dependencies

pip install -r requirements.txt

3. Configure Environment Variables

Create a .env file in the project root:

MLFLOW_TRACKING_URI=http://your-mlflow-server:5000
MLFLOW_TRACKING_USERNAME=your_username
MLFLOW_TRACKING_PASSWORD=your_password

4. Prepare Data

  • Place your raw stock data as data/raw/data.csv.
  • The file must include a CloseUSD column (target) and any number of numeric feature columns.
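As a quick sanity check before running the pipeline, you can verify that the raw CSV meets these requirements. This is a minimal sketch (the `validate_raw_data` helper is illustrative, not part of the pipeline):

```python
import pandas as pd

def validate_raw_data(path: str) -> pd.DataFrame:
    """Check that the raw CSV has the CloseUSD target and numeric features."""
    data = pd.read_csv(path)
    if "CloseUSD" not in data.columns:
        raise ValueError("data.csv must contain a 'CloseUSD' target column")
    # Coerce every feature column to numeric; non-numeric cells become NaN
    features = data.drop("CloseUSD", axis=1).apply(pd.to_numeric, errors="coerce")
    n_bad = int(features.isna().sum().sum())
    if n_bad:
        print(f"Warning: {n_bad} non-numeric/missing cells will be filled with 0")
    return data
```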

☁️ DagsHub Integration

This project uses DagsHub for:

  • Remote DVC storage: Store and version datasets and models in the cloud.
  • Collaboration: Share experiments, data, and models with your team.
  • Experiment tracking: Integrate MLflow and DVC for a seamless MLOps experience.

To use DagsHub as your DVC remote:

dvc remote add -d origin https://dagshub.com/<username>/<repo>.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user <your-dagshub-username>
dvc remote modify origin --local password <your-dagshub-token>

Push your data and models to DagsHub:

dvc push

🏃 Pipeline Usage


📝 DVC Stage Commands

These stages are already defined in dvc.yaml; the commands below show how to add them manually:

dvc stage add -n preprocess \
    -p preprocess.input,preprocess.output \
    -d src/preprocess.py -d data/raw/data.csv \
    -o data/processed/data.csv \
    python src/preprocess.py

dvc stage add -n train \
    -p train.data,train.model \
    -d src/train.py -d data/processed/data.csv \
    -o models/model.h5 \
    python src/train.py

dvc stage add -n evaluate \
    -d src/evaluate.py -d models/model.h5 -d data/processed/data.csv \
    python src/evaluate.py

Run the Full Pipeline

dvc repro

This will:

  1. Preprocess the data
  2. Train the LSTM model with hyperparameter tuning
  3. Evaluate the model and log metrics

Run Stages Individually

dvc repro preprocess
dvc repro train
dvc repro evaluate

Or run scripts directly:

python src/preprocess.py
python src/train.py
python src/evaluate.py

📋 Configuration

params.yaml

preprocess:
  input: data/raw/data.csv
  output: data/processed/data.csv

train:
  data: data/processed/data.csv
  model: models/model.h5
  learning_rate: 0.001
  momentum: 0.9
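The pipeline scripts can read these values with PyYAML. A minimal sketch (the `load_params` helper is illustrative; the actual scripts may load the file differently):

```python
import yaml

def load_params(path: str = "params.yaml") -> dict:
    """Load pipeline configuration from params.yaml."""
    with open(path) as f:
        return yaml.safe_load(f)

# Example: pull the training hyperparameters
# params = load_params()
# lr = params["train"]["learning_rate"]   # 0.001 in the default config
```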

dvc.yaml

Defines the pipeline stages and their dependencies.


🧠 Model & Training

  • Model: 2-layer LSTM with Dropout and Dense output
  • Input: All features are numeric, reshaped for LSTM
  • Loss: Mean Squared Error (MSE)
  • Optimizer: SGD (learning rate and momentum tuned)
  • Metrics: Root Mean Squared Error (RMSE), accuracy (rounded, optional)
  • Hyperparameters Tuned: LSTM units, dropout, learning rate, momentum, batch size, epochs
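The RMSE metric listed above is simply the square root of the MSE training loss; a minimal NumPy sketch:

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root Mean Squared Error: sqrt of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

For example, `rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])` gives sqrt(4/3) ≈ 1.155.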

📊 Experiment Tracking

  • MLflow logs all hyperparameters, metrics, and models.

  • Access the MLflow UI with:

    mlflow ui

    Then visit http://localhost:5000 (or your configured URI).

  • DagsHub can also visualize MLflow experiments and DVC data lineage in the cloud.


🐍 Example Code Snippet

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv(data_path)
X = data.drop("CloseUSD", axis=1)
y = data["CloseUSD"]

# Ensure all features are numeric; non-numeric cells become NaN, then 0
X = X.apply(pd.to_numeric, errors='coerce').fillna(0)
y = y.fillna(0)

# Split chronologically (no shuffling for time series) and add a timestep axis for the LSTM
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
train_X_lstm = np.expand_dims(X_train.values, axis=2)
# ... model definition and training ...

🐛 Troubleshooting

  • Keras model saving error:
    Ensure model_path ends with .h5 or .keras.

  • DVC parameter errors:
    Remove unused parameters from dvc.yaml or add them to params.yaml.

  • MLflow connection issues:
    Check your .env file and MLflow server status.

  • NaN or dtype errors:
    Ensure all features are numeric and fill missing values before training.

  • DagsHub authentication issues:
    Make sure your DagsHub token is correct and you have access to the repository.


📑 License

MIT License


🙏 Acknowledgments


Happy Predicting!
