🚀 Crypto MLOps Dashboard

Live Demo: https://crypto.gokuladethya.uk/ Preview: https://crypto.gokuladethya.uk/howitworks

Project: Crypto MLOps Dashboard — Real-time BTC/USDT price and sentiment prediction.

Dashboard Code: https://github.com/FrozenWolf-Cyber/crypto

For installation instructions, see INSTALL.md.

🏗️ Project Overview

Crypto MLOps Dashboard is a fully automated MLOps pipeline that predicts crypto trading signals (Buy, Hold, Sell) using:

Time-series price data from Binance API
Textual sentiment data from Yahoo Finance (scraped using Playwright & BeautifulSoup4)

It combines reinforcement-tuned FinBERT, LightGBM, and TST models, with fully automated training, inference, and monitoring across Kubernetes.

Stack: Python, FastAPI, Kafka, Airflow, MLflow, ONNX, Prometheus + Grafana, Vast.ai, PostgreSQL, S3, Nginx, Cert-Manager

🖼️ Orchestration Overview

🧠 Idea Overview

We scrape article data, Bitcoin prices, and other related statistics at regular intervals to power a continuous MLOps pipeline. The system trains three core model types:

TRL (LLM Classifier for Articles) — trained using Gradient-Regularized Policy Optimization (GRPO) to align textual sentiment with market reactions.
LightGBM — trained on tabular, price-based features for short-term movement prediction.
Time Series Transformer (TST) — processes temporal sequences to capture price trend dynamics.

Each model maintains three active versions to monitor distribution shifts and performance degradation over time. All three versions of all three models (total of nine live predictions) are continuously updated and stored in PostgreSQL, which serves data to the analytics dashboard.

⚙️ Deployment & Architecture

The pipeline aims for near-zero downtime, scalability, and rapid model refresh. For instance, if LightGBM completes retraining early, it can be deployed immediately without waiting for other models or disrupting ongoing inference.

To achieve this:

Kafka is used for asynchronous producer-consumer message streaming.
S3-hosted MLflow handles model registry, version tracking, and logging.
Hot-swapping is performed as soon as a model finishes training, avoiding downtime.
Prometheus + Grafana provide real-time observability for inference latency and system load.
Nginx Ingress Reverse Proxy secures endpoints and routes traffic efficiently.
Airflow DAGs orchestrate retraining, data updates, and deployment cycles.

Additionally, historical predictions are reconciled whenever a model version is updated to maintain backward consistency between inferences.

☸️ Why Kubernetes?

I adopted Kubernetes (K8s) primarily to experiment with scalable training and deployment orchestration, and to explore GPU-based pod scheduling.

The original design included connecting Vast.ai nodes as additional GPU-enabled workers in the same cluster, dynamically spinning up training pods through the Kubernetes Operator based on GPU VRAM availability.

However, this approach faced several challenges:

VM Dependency — Kubernetes requires VMs instead of lightweight instances, increasing startup time, instability, and cost.
GPU Resource Scheduling — The official NVIDIA GPU plugin only supports A/H-series GPUs, not cheaper consumer GPUs.
Third-party plugins — Existing GPU schedulers were unstable and often incompatible with mixed GPU environments.
Custom Scheduler via Airflow — Bypassing K8s GPU scheduling was possible but cost-inefficient and complex to maintain.

Despite these challenges, the system serves as a strong experimental sandbox for learning scalable MLOps principles — including distributed training, versioned deployment, and automated monitoring across heterogeneous infrastructure.

Folder Structure

crypto-mlops
├── dags/
│   ├── cleanup_DAG.py                # training event cleanup
│   ├── DAG.py                        # model_training_pipeline: trains LightGBM, TST, TRL models every 5 days; includes consumer_start_dag (resume producer/consumer on startup) and delete_all_process (kill all producer/consumer)
│   └── trl_infer_Dag.py              # trl_inference_pipeline: scrapes past news data, compiles using trl_onnx_maker, performs inference
│
├── k8-setup/
│   ├── airflow-ingress.yaml          # reverse proxy for Airflow (TODO)
│   ├── backend.yaml                  # backend for crypto dashboard + reverse proxy
│   ├── cluster-issuer.yaml           # auto issues/renews TLS/SSL certs (HTTPS) in cluster
│   ├── fast-api.yaml                 # hosts model inference for consumer process, loads latest models from S3 + MLflow, exposes Prometheus hooks
│   ├── generate_secrets.py           # dynamically generates platform-secrets.yaml from OS env vars; assigns credentials for entire cluster
│   ├── grafana-config.yaml           # Grafana hosting configs
│   ├── grafaingress.yaml             # reverse proxy for Grafana
│   ├── kafka-service.yaml            # exposes Kafka to cluster pods
│   ├── kafka.yaml                    # external Docker-based Kafka (default Quixstream template)
│   ├── ml-flow.yaml                  # MLflow hosting pod + service exposure to other pods
│   ├── mlflow_bootstrap.py           # post-reverse-proxy script: updates admin creds, sets public view password, creates secret endpoint for VastAI logging
│   ├── mlflow-ingress.yaml           # reverse proxy for MLflow
│   ├── pod_template.yaml             # pod template for Airflow KubernetesExecutor tasks
│   ├── producer-consumer.yaml        # pod running producer and consumer outside Airflow
│   ├── pvc.yaml                      # PV/PVC claims for persistent storage
│   └── values.yaml                   # Airflow configuration values
│
├── utils/
│   ├── articles_runner/
│   │   ├── past_news_scrape.py       # news scraper using Playwright
│   │   └── scrape.py                 # scraping utilities; saves scraped news locally as CSV
│   │
│   ├── artifact_control/
│   │   ├── model_manager.py          # handles MLflow model logging, promotion, archival in S3 registry
│   │   └── s3_manager.py             # upload/download predictions, price data, etc.
│   │
│   ├── database/
│   │   ├── airflow_db.py             # training status DB updated by VastAI; monitored by training DAG
│   │   ├── db.py                     # handles CRUD ops on crypto prices, predictions, and articles
│   │   └── status_db.py              # DAG hook that logs each task’s status and errors
│   │
│   ├── dockerfile/
│   │   ├── backend.dockerfile        # FastAPI backend for crypto dashboard
│   │   ├── main.py                   # backend server entrypoint
│   │   ├── modelserve.dockerfile     # FastAPI model serving image
│   │   ├── producer_consumer.dockerfile # producer-consumer pod image
│   │   └── worker.dockerfile         # Airflow worker image
│   │
│   ├── producer_consumer/
│   │   ├── consumer_start.py         # resumes producer-consumer setup on pod restart/new setup
│   │   ├── consumer_utils.py         # handles read/write states for communication between independent processes and worker pods
│   │   ├── consumer.py               # consumer triggered per version/model/crypto; listens to topic, performs inference via FastAPI, pushes results to DB, self-deletes after completion
│   │   ├── job_handler.py            # creates producer/consumer by invoking scripts when job is received from worker
│   │   ├── kill_all.py               # kills all active producers and consumers
│   │   └── producer.py               # producer for each coin; pushes messages to topic and writes locally
│   │
│   ├── serve/
│   │   ├── fastapi_app.py            # FastAPI model server; supports model refresh from MLflow/S3; monitored by Prometheus
│   │   ├── trl_inference.py          # loads compiled ONNX models to perform inference
│   │   └── trl_onnx_maker.py         # detects new model pushes, compiles new ONNX models
│   │
│   ├── trainer/                      # training scripts; train_parallel.py triggers concurrent model training (LightGBM/TST/TRL) with early stopping and safe completion; updates status; performs post-train inference and S3 upload
│   │
│   └── utils/
│       ├── kill_vast_ai_instances.py # terminates all VastAI instances
│       ├── post_train_reconcile.py   # reconciles post-train predictions with DB/CSV; kills old consumers, spins new ones for updated model
│       ├── post_train_trl.py         # same as above, but for TRL model
│       ├── pre_train_dataset.py      # slices last 6 months’ data before training; pushes to S3 for VastAI pods
│       └── vast_ai_train.py          # selects budget-efficient VastAI pod; blacklists malformed pods

📊 Label Generation

1️⃣ Time-Series Labels

Derived from BTC/USDT price trends
Assigns Buy, Hold, or Sell based on percentage price change

2️⃣ Text-Based Labels

Scraped articles analyzed for average price change over 12 hours
Labels Buy, Hold, or Sell correspond to market reaction

🏋️‍♂️ Model Training

Time-Series

TST (Temporal Signal Transformer)
LightGBM

Text-Based (TRL-FinBERT + GRPO + LoRA)

Fine-tunes FinBERT using Gradient-Regularized Policy Optimization (GRPO)
Uses LoRA adapters for parameter-efficient training
Treats financial sentiment classification as RL problem with price-change reward signals

Pseudo Algorithm:

logits_old = theta_old(texts)
probs_old = softmax(logits_old)
sampled_actions = sample_actions(probs_old, group_size)

rewards = compute_rewards(sampled_actions, true_labels, price_changes, normalizer)

logits_new = policy(texts)
logp_new = log_probs_from_logits(logits_new, sampled_actions)
logp_old = log_probs_from_logits(logits_old, sampled_actions)

advantages = rewards - group_mean(rewards)

ratio = exp(logp_new - logp_old)
loss_surrogate = -mean(min(ratio * advantages,
                           clip(ratio, 1-clip_eps, 1+clip_eps) * advantages))

kl_loss = KL(policy(texts), reference_model(texts))
loss = loss_surrogate + kl_coef * kl_loss

🗄️ Databases

DB	Purpose
Predictions DB	Latest crypto prices & model predictions
Vast.ai Training DB	Tracks GPU pod training status
Airflow DB	DAG events & monitoring
MLflow PSQL	Centralized model registry

🧱 Model Versioning

Managed via MLflow (S3 backend)
v1: Initial fixed model
v2 & v3: Active production
10 archived models retained
Public MLflow Viewer: https://mlflow.gokuladethya.uk/

⚡ Model Inference (FastAPI)

ONNX-compiled models served on dedicated FastAPI pod
/refresh endpoint reloads models from MLflow dynamically
Instrumented with Prometheus + Grafana monitoring
Hosted behind Nginx ingress

🔄 End-to-End Orchestration

Workflow:

Producer: Fetches Binance data → Kafka topic
Consumer: Listens → inference → updates PostgreSQL (model × version × currency)
Training:
- Slice last 6 months → S3
- Spin Vast.ai GPU pod → train → log to MLflow
- Upload model & predictions to S3
- Airflow monitors pod status & enforces timeouts

Post-Training Reconciliation:

1. Stop old consumers (x-v-2, x-v-3)
2. Pull new FastAPI models from MLflow (/refresh)
3. Rename local predictions (x-v-3 → x-v-2)
4. Start consumers for x-v-2
5. Download new predictions (x-v-3) from S3
6. Infer delta test data → push to PostgreSQL
7. Start consumers for x-v-3
8. Upload updated predictions to S3 (v2 & v3)

📰 TRL Orchestration

Runs every 30 mins for news scraping & inference
Compiles ONNX if not cached
Pushes predictions to PostgreSQL

☸️ Kubernetes Architecture

Namespaces & Pods:

platform: MLflow, FastAPI, Dashboard backend, Producer-Consumer, Airflow
ingress-nginx: Reverse proxy for endpoints
cert-manager: TLS support
prometheus: Prometheus + Grafana dashboards
Kafka (dockerized): Separate persistent setup

Grafana: https://grafana.gokuladethya.uk/

DAG:

Training Pipeline DAG:

Name		Name	Last commit message	Last commit date
Latest commit History 577 Commits
dags		dags
demo		demo
k8-setup		k8-setup
utils		utils
.gitignore		.gitignore
README.md		README.md
__init__.py		__init__.py
install.md		install.md
install.sh		install.sh
train_require.sh		train_require.sh
train_start_time.txt		train_start_time.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🚀 Crypto MLOps Dashboard

🏗️ Project Overview

🖼️ Orchestration Overview

🧠 Idea Overview

⚙️ Deployment & Architecture

☸️ Why Kubernetes?

Folder Structure

📊 Label Generation

1️⃣ Time-Series Labels

2️⃣ Text-Based Labels

🏋️‍♂️ Model Training

Time-Series

Text-Based (TRL-FinBERT + GRPO + LoRA)

🗄️ Databases

🧱 Model Versioning

⚡ Model Inference (FastAPI)

🔄 End-to-End Orchestration

📰 TRL Orchestration

☸️ Kubernetes Architecture

DAG:

🖼️ Dashboard Screenshots

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🚀 Crypto MLOps Dashboard

🏗️ Project Overview

🖼️ Orchestration Overview

🧠 Idea Overview

⚙️ Deployment & Architecture

☸️ Why Kubernetes?

Folder Structure

📊 Label Generation

1️⃣ Time-Series Labels

2️⃣ Text-Based Labels

🏋️‍♂️ Model Training

Time-Series

Text-Based (TRL-FinBERT + GRPO + LoRA)

🗄️ Databases

🧱 Model Versioning

⚡ Model Inference (FastAPI)

🔄 End-to-End Orchestration

📰 TRL Orchestration

☸️ Kubernetes Architecture

DAG:

🖼️ Dashboard Screenshots

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages