Live Demo: https://crypto.gokuladethya.uk/ Preview: https://crypto.gokuladethya.uk/howitworks
Project: Crypto MLOps Dashboard — Real-time BTC/USDT price and sentiment prediction.
Dashboard Code: https://github.com/FrozenWolf-Cyber/crypto
For installation instructions, see INSTALL.md.
Crypto MLOps Dashboard is a fully automated MLOps pipeline that predicts crypto trading signals (Buy, Hold, Sell) using:
- Time-series price data from Binance API
- Textual sentiment data from Yahoo Finance (scraped using Playwright & BeautifulSoup4)
It combines reinforcement-tuned FinBERT, LightGBM, and TST models, with fully automated training, inference, and monitoring across Kubernetes.
Stack: Python, FastAPI, Kafka, Airflow, MLflow, ONNX, Prometheus + Grafana, Vast.ai, PostgreSQL, S3, Nginx, Cert-Manager
We scrape article data, Bitcoin prices, and other related statistics at regular intervals to power a continuous MLOps pipeline. The system trains three core model types:
- TRL (LLM Classifier for Articles) — trained using Gradient-Regularized Policy Optimization (GRPO) to align textual sentiment with market reactions.
- LightGBM — trained on tabular, price-based features for short-term movement prediction.
- Time Series Transformer (TST) — processes temporal sequences to capture price trend dynamics.
Each model maintains three active versions to monitor distribution shifts and performance degradation over time. All three versions of all three models (total of nine live predictions) are continuously updated and stored in PostgreSQL, which serves data to the analytics dashboard.
The pipeline aims for near-zero downtime, scalability, and rapid model refresh. For instance, if LightGBM completes retraining early, it can be deployed immediately without waiting for other models or disrupting ongoing inference.
To achieve this:
- Kafka is used for asynchronous producer-consumer message streaming.
- S3-hosted MLflow handles model registry, version tracking, and logging.
- Hot-swapping is performed as soon as a model finishes training, avoiding downtime.
- Prometheus + Grafana provide real-time observability for inference latency and system load.
- Nginx Ingress Reverse Proxy secures endpoints and routes traffic efficiently.
- Airflow DAGs orchestrate retraining, data updates, and deployment cycles.
Additionally, historical predictions are reconciled whenever a model version is updated to maintain backward consistency between inferences.
I adopted Kubernetes (K8s) primarily to experiment with scalable training and deployment orchestration, and to explore GPU-based pod scheduling.
The original design included connecting Vast.ai nodes as additional GPU-enabled workers in the same cluster, dynamically spinning up training pods through the Kubernetes Operator based on GPU VRAM availability.
However, this approach faced several challenges:
- VM Dependency — Kubernetes requires VMs instead of lightweight instances, increasing startup time, instability, and cost.
- GPU Resource Scheduling — The official NVIDIA GPU plugin only supports A/H-series GPUs, not cheaper consumer GPUs.
- Third-party plugins — Existing GPU schedulers were unstable and often incompatible with mixed GPU environments.
- Custom Scheduler via Airflow — Bypassing K8s GPU scheduling was possible but cost-inefficient and complex to maintain.
Despite these challenges, the system serves as a strong experimental sandbox for learning scalable MLOps principles — including distributed training, versioned deployment, and automated monitoring across heterogeneous infrastructure.
crypto-mlops
├── dags/
│ ├── cleanup_DAG.py # training event cleanup
│ ├── DAG.py # model_training_pipeline: trains LightGBM, TST, TRL models every 5 days; includes consumer_start_dag (resume producer/consumer on startup) and delete_all_process (kill all producer/consumer)
│ └── trl_infer_Dag.py # trl_inference_pipeline: scrapes past news data, compiles using trl_onnx_maker, performs inference
│
├── k8-setup/
│ ├── airflow-ingress.yaml # reverse proxy for Airflow (TODO)
│ ├── backend.yaml # backend for crypto dashboard + reverse proxy
│ ├── cluster-issuer.yaml # auto issues/renews TLS/SSL certs (HTTPS) in cluster
│ ├── fast-api.yaml # hosts model inference for consumer process, loads latest models from S3 + MLflow, exposes Prometheus hooks
│ ├── generate_secrets.py # dynamically generates platform-secrets.yaml from OS env vars; assigns credentials for entire cluster
│ ├── grafana-config.yaml # Grafana hosting configs
│ ├── grafaingress.yaml # reverse proxy for Grafana
│ ├── kafka-service.yaml # exposes Kafka to cluster pods
│ ├── kafka.yaml # external Docker-based Kafka (default Quixstream template)
│ ├── ml-flow.yaml # MLflow hosting pod + service exposure to other pods
│ ├── mlflow_bootstrap.py # post-reverse-proxy script: updates admin creds, sets public view password, creates secret endpoint for VastAI logging
│ ├── mlflow-ingress.yaml # reverse proxy for MLflow
│ ├── pod_template.yaml # pod template for Airflow KubernetesExecutor tasks
│ ├── producer-consumer.yaml # pod running producer and consumer outside Airflow
│ ├── pvc.yaml # PV/PVC claims for persistent storage
│ └── values.yaml # Airflow configuration values
│
├── utils/
│ ├── articles_runner/
│ │ ├── past_news_scrape.py # news scraper using Playwright
│ │ └── scrape.py # scraping utilities; saves scraped news locally as CSV
│ │
│ ├── artifact_control/
│ │ ├── model_manager.py # handles MLflow model logging, promotion, archival in S3 registry
│ │ └── s3_manager.py # upload/download predictions, price data, etc.
│ │
│ ├── database/
│ │ ├── airflow_db.py # training status DB updated by VastAI; monitored by training DAG
│ │ ├── db.py # handles CRUD ops on crypto prices, predictions, and articles
│ │ └── status_db.py # DAG hook that logs each task’s status and errors
│ │
│ ├── dockerfile/
│ │ ├── backend.dockerfile # FastAPI backend for crypto dashboard
│ │ ├── main.py # backend server entrypoint
│ │ ├── modelserve.dockerfile # FastAPI model serving image
│ │ ├── producer_consumer.dockerfile # producer-consumer pod image
│ │ └── worker.dockerfile # Airflow worker image
│ │
│ ├── producer_consumer/
│ │ ├── consumer_start.py # resumes producer-consumer setup on pod restart/new setup
│ │ ├── consumer_utils.py # handles read/write states for communication between independent processes and worker pods
│ │ ├── consumer.py # consumer triggered per version/model/crypto; listens to topic, performs inference via FastAPI, pushes results to DB, self-deletes after completion
│ │ ├── job_handler.py # creates producer/consumer by invoking scripts when job is received from worker
│ │ ├── kill_all.py # kills all active producers and consumers
│ │ └── producer.py # producer for each coin; pushes messages to topic and writes locally
│ │
│ ├── serve/
│ │ ├── fastapi_app.py # FastAPI model server; supports model refresh from MLflow/S3; monitored by Prometheus
│ │ ├── trl_inference.py # loads compiled ONNX models to perform inference
│ │ └── trl_onnx_maker.py # detects new model pushes, compiles new ONNX models
│ │
│ ├── trainer/ # training scripts; train_parallel.py triggers concurrent model training (LightGBM/TST/TRL) with early stopping and safe completion; updates status; performs post-train inference and S3 upload
│ │
│ └── utils/
│ ├── kill_vast_ai_instances.py # terminates all VastAI instances
│ ├── post_train_reconcile.py # reconciles post-train predictions with DB/CSV; kills old consumers, spins new ones for updated model
│ ├── post_train_trl.py # same as above, but for TRL model
│ ├── pre_train_dataset.py # slices last 6 months’ data before training; pushes to S3 for VastAI pods
│ └── vast_ai_train.py # selects budget-efficient VastAI pod; blacklists malformed pods
- Derived from BTC/USDT price trends
- Assigns
Buy,Hold, orSellbased on percentage price change
- Scraped articles analyzed for average price change over 12 hours
- Labels
Buy,Hold, orSellcorrespond to market reaction
- TST (Temporal Signal Transformer)
- LightGBM
- Fine-tunes FinBERT using Gradient-Regularized Policy Optimization (GRPO)
- Uses LoRA adapters for parameter-efficient training
- Treats financial sentiment classification as RL problem with price-change reward signals
Pseudo Algorithm:
logits_old = theta_old(texts)
probs_old = softmax(logits_old)
sampled_actions = sample_actions(probs_old, group_size)
rewards = compute_rewards(sampled_actions, true_labels, price_changes, normalizer)
logits_new = policy(texts)
logp_new = log_probs_from_logits(logits_new, sampled_actions)
logp_old = log_probs_from_logits(logits_old, sampled_actions)
advantages = rewards - group_mean(rewards)
ratio = exp(logp_new - logp_old)
loss_surrogate = -mean(min(ratio * advantages,
clip(ratio, 1-clip_eps, 1+clip_eps) * advantages))
kl_loss = KL(policy(texts), reference_model(texts))
loss = loss_surrogate + kl_coef * kl_loss| DB | Purpose |
|---|---|
| Predictions DB | Latest crypto prices & model predictions |
| Vast.ai Training DB | Tracks GPU pod training status |
| Airflow DB | DAG events & monitoring |
| MLflow PSQL | Centralized model registry |
- Managed via MLflow (S3 backend)
- v1: Initial fixed model
- v2 & v3: Active production
- 10 archived models retained
- Public MLflow Viewer: https://mlflow.gokuladethya.uk/
- ONNX-compiled models served on dedicated FastAPI pod
/refreshendpoint reloads models from MLflow dynamically- Instrumented with Prometheus + Grafana monitoring
- Hosted behind Nginx ingress
Workflow:
-
Producer: Fetches Binance data → Kafka topic
-
Consumer: Listens → inference → updates PostgreSQL (
model × version × currency) -
Training:
- Slice last 6 months → S3
- Spin Vast.ai GPU pod → train → log to MLflow
- Upload model & predictions to S3
- Airflow monitors pod status & enforces timeouts
Post-Training Reconciliation:
1. Stop old consumers (x-v-2, x-v-3)
2. Pull new FastAPI models from MLflow (/refresh)
3. Rename local predictions (x-v-3 → x-v-2)
4. Start consumers for x-v-2
5. Download new predictions (x-v-3) from S3
6. Infer delta test data → push to PostgreSQL
7. Start consumers for x-v-3
8. Upload updated predictions to S3 (v2 & v3)
- Runs every 30 mins for news scraping & inference
- Compiles ONNX if not cached
- Pushes predictions to PostgreSQL
Namespaces & Pods:
- platform: MLflow, FastAPI, Dashboard backend, Producer-Consumer, Airflow
- ingress-nginx: Reverse proxy for endpoints
- cert-manager: TLS support
- prometheus: Prometheus + Grafana dashboards
- Kafka (dockerized): Separate persistent setup
Grafana: https://grafana.gokuladethya.uk/



















