StockSage is a robust, reproducible machine learning pipeline for stock price prediction using LSTM neural networks. The project leverages DVC for data and model versioning, MLflow for experiment tracking, Hyperopt for automated hyperparameter tuning, and DagsHub for collaborative data science and remote storage.
- LSTM Neural Network for time series regression
- Automated Hyperparameter Tuning with Hyperopt
- Experiment Tracking with MLflow
- Reproducible Pipelines using DVC
- Remote Data & Model Storage with DagsHub
- Robust Data Preprocessing and validation
- Easy Configuration via `params.yaml`
StockSage/
├── data/
│   ├── raw/
│   │   └── data.csv
│   └── processed/
│       └── data.csv
├── models/
│   └── model.h5
├── src/
│   ├── preprocess.py
│   ├── train.py
│   └── evaluate.py
├── params.yaml
├── dvc.yaml
├── requirements.txt
├── .env
└── README.md
git clone https://github.com/yourusername/StockSage.git
cd StockSage
pip install -r requirements.txt

Create a .env file in the project root:
MLFLOW_TRACKING_URI=http://your-mlflow-server:5000
MLFLOW_TRACKING_USERNAME=your_username
MLFLOW_TRACKING_PASSWORD=your_password
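If you want to load these values in Python, a minimal sketch using python-dotenv looks like this (the python-dotenv dependency is an assumption; the pipeline scripts may wire this up differently):

```python
import os

import mlflow
from dotenv import load_dotenv  # assumes python-dotenv is installed

# Load MLFLOW_TRACKING_URI, MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD
# from .env into the process environment; MLflow honours these variables automatically.
load_dotenv()

# Setting the URI explicitly makes the target server obvious.
mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
print("Tracking to:", mlflow.get_tracking_uri())
```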
- Place your raw stock data as `data/raw/data.csv`.
- The file must include a `CloseUSD` column (target) and any number of numeric feature columns.
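A quick, optional sanity check on the raw file before running the pipeline (illustrative only, not part of the DVC stages):

```python
import pandas as pd

df = pd.read_csv("data/raw/data.csv")
assert "CloseUSD" in df.columns, "data.csv must contain a CloseUSD target column"

# Non-numeric feature columns are coerced and filled during preprocessing,
# but it helps to know about them up front.
non_numeric = df.drop(columns=["CloseUSD"]).select_dtypes(exclude="number").columns
print("Non-numeric feature columns:", list(non_numeric))
```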
This project uses DagsHub for:
- Remote DVC storage: Store and version datasets and models in the cloud.
- Collaboration: Share experiments, data, and models with your team.
- Experiment tracking: Integrate MLflow and DVC for a seamless MLOps experience.
To use DagsHub as your DVC remote:
dvc remote add origin https://dagshub.com/<username>/<repo>.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user <your-dagshub-username>
dvc remote modify origin --local password <your-dagshub-token>

Push your data and models to DagsHub:
dvc push

To manually add pipeline stages (already present in `dvc.yaml`):
dvc stage add -n preprocess \
-p preprocess.input,preprocess.output \
-d src/preprocess.py -d data/raw/data.csv \
-o data/processed/data.csv \
python src/preprocess.py
dvc stage add -n train \
-p train.data,train.model \
-d src/train.py -d data/processed/data.csv \
-o models/model.h5 \
python src/train.py
dvc stage add -n evaluate \
-d src/evaluate.py -d models/model.h5 -d data/processed/data.csv \
python src/evaluate.py

Run the full pipeline:

dvc repro

This will:
- Preprocess the data
- Train the LSTM model with hyperparameter tuning
- Evaluate the model and log metrics
You can also run individual stages:

dvc repro preprocess
dvc repro train
dvc repro evaluate

Or run the scripts directly:
python src/preprocess.py
python src/train.py
python src/evaluate.py

`params.yaml` holds the pipeline configuration:

preprocess:
  input: data/raw/data.csv
  output: data/processed/data.csv

train:
  data: data/processed/data.csv
  model: models/model.h5
  learning_rate: 0.001
  momentum: 0.9

`dvc.yaml` defines the pipeline stages and their dependencies.
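The scripts read these values at run time. A minimal sketch of how that might look with PyYAML (the exact loading code in src/ may differ):

```python
import yaml

# Load pipeline configuration from params.yaml (sketch).
with open("params.yaml") as f:
    params = yaml.safe_load(f)

raw_path = params["preprocess"]["input"]         # data/raw/data.csv
processed_path = params["preprocess"]["output"]  # data/processed/data.csv
learning_rate = params["train"]["learning_rate"]
momentum = params["train"]["momentum"]
```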
- Model: 2-layer LSTM with Dropout and Dense output
- Input: All features are numeric, reshaped for LSTM
- Loss: Mean Squared Error (MSE)
- Optimizer: SGD (learning rate and momentum tuned)
- Metrics: Root Mean Squared Error (RMSE), accuracy (rounded, optional)
- Hyperparameters Tuned: LSTM units, dropout, learning rate, momentum, batch size, epochs
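As a rough sketch, the architecture above could be built in Keras like this; the build_model helper, its signature, and the layer sizes are illustrative rather than copied from src/train.py:

```python
import tensorflow as tf

def build_model(params, n_features):
    """Two stacked LSTM layers with dropout and a single regression output (sketch)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features, 1)),
        tf.keras.layers.LSTM(params["lstm_units"], return_sequences=True),
        tf.keras.layers.Dropout(params["dropout"]),
        tf.keras.layers.LSTM(params["lstm_units"]),
        tf.keras.layers.Dropout(params["dropout"]),
        tf.keras.layers.Dense(1),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(
            learning_rate=params["learning_rate"], momentum=params["momentum"]
        ),
        loss="mse",
        metrics=[tf.keras.metrics.RootMeanSquaredError()],
    )
    return model
```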
- MLflow logs all hyperparameters, metrics, and models.
- Access the MLflow UI with `mlflow ui`, then visit http://localhost:5000 (or your configured URI).
- DagsHub can also visualize MLflow experiments and DVC data lineage in the cloud.
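A sketch of how the Hyperopt search and MLflow logging could fit together; the search-space bounds, metric name, and the tune helper are illustrative, and build_model refers to the sketch in the model section above:

```python
import mlflow
import numpy as np
from hyperopt import STATUS_OK, fmin, hp, tpe

# Illustrative search space; the actual ranges live in src/train.py.
search_space = {
    "lstm_units": hp.choice("lstm_units", [32, 64, 128]),
    "dropout": hp.uniform("dropout", 0.1, 0.5),
    "learning_rate": hp.loguniform("learning_rate", np.log(1e-4), np.log(1e-2)),
    "momentum": hp.uniform("momentum", 0.0, 0.9),
    "batch_size": hp.choice("batch_size", [16, 32, 64]),
    "epochs": hp.choice("epochs", [10, 25, 50]),
}

def tune(train_X, train_y, val_X, val_y, max_evals=20):
    """Run Hyperopt over the search space, logging every trial to MLflow (sketch)."""
    def objective(params):
        with mlflow.start_run(nested=True):
            mlflow.log_params(params)
            model = build_model(params, n_features=train_X.shape[1])  # sketch above
            model.fit(train_X, train_y,
                      epochs=params["epochs"], batch_size=params["batch_size"], verbose=0)
            rmse = float(model.evaluate(val_X, val_y, verbose=0)[1])  # [loss, rmse]
            mlflow.log_metric("rmse", rmse)
            return {"loss": rmse, "status": STATUS_OK}

    with mlflow.start_run():
        return fmin(fn=objective, space=search_space, algo=tpe.suggest, max_evals=max_evals)
```

Logging each trial as a nested run keeps the parent run as a summary of the whole search.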
Example data-preparation excerpt:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset (data_path is read from params.yaml) and separate
# the features from the CloseUSD target
data = pd.read_csv(data_path)
X = data.drop("CloseUSD", axis=1)
y = data["CloseUSD"]

# Ensure all features are numeric and fill missing values
X = X.apply(pd.to_numeric, errors='coerce').fillna(0)
y = y.fillna(0)

# Split chronologically (shuffle=False) and add a third axis so the LSTM
# receives 3-D input of shape (samples, features, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
train_X_lstm = np.expand_dims(X_train.values, axis=2)

# ... model definition and training ...
- **Keras model saving error:** Ensure `model_path` ends with `.h5` or `.keras`.
- **DVC parameter errors:** Remove unused parameters from `dvc.yaml` or add them to `params.yaml`.
- **MLflow connection issues:** Check your `.env` file and MLflow server status.
- **NaN or dtype errors:** Ensure all features are numeric and fill missing values before training.
- **DagsHub authentication issues:** Make sure your DagsHub token is correct and you have access to the repository.
MIT License
Happy Predicting!