In this project, I show how to build a machine learning workflow and deploy it on Kubernetes: load a large dataset (40+ million records), preprocess it and perform feature engineering, then train a Wide and Deep model and evaluate its performance on a separate test dataset.
There are three main objectives in this project:
- Train a SOTA Wide and Deep neural network that combines the strengths of memorization and generalization for click-through rate (CTR) prediction (for an introduction to CTR prediction, check out this article).
- Demonstrate (a) how to leverage distributed computing frameworks like Apache Spark to handle big datasets, run ETL jobs efficiently, and generate features for downstream ML tasks; and (b) how to use deep learning frameworks like PyTorch to build a neural network architecture from scratch, train it with a GPU accelerator, and then evaluate the model on a separate dataset.
- Show how cloud computing facilitates ML workflows: both the Spark job and the PyTorch job are encapsulated in Docker containers and deployed on a multi-node Kubernetes cluster, using the spark-on-k8s-operator and the pytorch operator respectively. Deploying the complete ML workflow on Kubernetes substantially improves cost and time efficiency, scales to large workloads, and makes the containerized models easy to deliver.
The dataset I used in this project is the Avazu CTR dataset from Kaggle. It contains categorical raw features and a binary label column. For more details about the dataset and the rationale behind the feature engineering in this project, please check out this notebook.
The Spark job's Python script handles all the ETL work and feature engineering. The raw data is stored in a distributed file system (GCS in this project) and loaded by the Spark job. The single and cross-product wide features are one-hot encoded, and the deep features are label encoded (to be mapped to embeddings later during training). The preprocessed features are saved back to GCS.
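For illustration, here is a minimal sketch of what this encoding step could look like in PySpark. The specific columns, the cross feature, and the bucket path are assumptions made for the example (the Avazu schema has many more fields); the project's actual script may differ:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

spark = SparkSession.builder.appName("ctr-etl").getOrCreate()

# Load the raw Avazu CSV from GCS (bucket path is a placeholder).
df = spark.read.csv("gs://<bucket>/avazu/train.csv", header=True)

# Cross-product wide feature: combine two raw categoricals into one column.
df = df.withColumn("site_app_cross",
                   F.concat_ws("_", F.col("site_category"), F.col("app_category")))

wide_cols = ["banner_pos", "site_app_cross"]      # to be one-hot encoded
deep_cols = ["site_id", "app_id", "device_type"]  # label encoded for embeddings

# StringIndexer assigns each category an integer id (the "label encoding"
# used later to look up embeddings); OneHotEncoder expands the wide ids.
indexers = [StringIndexer(inputCol=c, outputCol=f"{c}_idx", handleInvalid="keep")
            for c in wide_cols + deep_cols]
encoder = OneHotEncoder(inputCols=[f"{c}_idx" for c in wide_cols],
                        outputCols=[f"{c}_onehot" for c in wide_cols])

features = Pipeline(stages=indexers + [encoder]).fit(df).transform(df)

# Persist the preprocessed features back to GCS.
features.write.mode("overwrite").parquet("gs://<bucket>/avazu/features/")
```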
The Python script is packaged into a Docker image and executed when the K8s pods run the container. The Dockerfile of the Spark job image can be found here. The Dockerfile of the base image can be found in this repository; it is adapted from the official spark operator image and configured to run on Google Cloud Platform.
The Spark job configuration is here, and the specs can easily be modified to scale up the requested resources. Due to limited quota, I used 2 executors, each with 3 cores.
To start the Spark job, run `kubectl create -f ./k8s_jobs/ctr-spark-job.yaml`
Just like the Spark job, the Python script of the PyTorch job is packaged into a container built with a simple Dockerfile. During training, the model reads feature inputs in batches directly from GCS, performs forward propagation, and then updates its parameters during backpropagation.
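As a sketch of the architecture (the embedding size and hidden widths below are illustrative assumptions, not the exact values used in this project), a minimal Wide and Deep module in PyTorch might look like this:

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    """A minimal Wide & Deep network for binary CTR prediction."""

    def __init__(self, wide_dim, deep_cardinalities, embed_dim=16, hidden=(256, 128)):
        super().__init__()
        # Wide part: a single linear layer over the one-hot wide features
        # (memorization of sparse cross-product interactions).
        self.wide = nn.Linear(wide_dim, 1)

        # Deep part: one embedding table per label-encoded categorical
        # feature (generalization via dense representations), then an MLP.
        self.embeddings = nn.ModuleList(
            [nn.Embedding(card, embed_dim) for card in deep_cardinalities])
        layers, in_dim = [], embed_dim * len(deep_cardinalities)
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers.append(nn.Linear(in_dim, 1))
        self.deep = nn.Sequential(*layers)

    def forward(self, x_wide, x_deep):
        # x_wide: (batch, wide_dim) float one-hots; x_deep: (batch, n_deep) int ids
        deep_in = torch.cat([emb(x_deep[:, i])
                             for i, emb in enumerate(self.embeddings)], dim=1)
        logit = self.wide(x_wide) + self.deep(deep_in)
        return torch.sigmoid(logit).squeeze(1)
```

The wide logit memorizes sparse feature crosses while the embedding MLP generalizes to unseen combinations; the two logits are simply summed before the sigmoid, and the model can be trained against the binary click label with `nn.BCELoss`.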
Due to limited quota, I used a single GPU node as the master node and `num_workers=4`. In production, this can easily be scaled up by modifying the PyTorch job configuration.
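For context, the PyTorch operator injects `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` into every replica, so the training script can initialize distributed training straight from the environment. The snippet below is a hedged sketch of that setup; it reuses the hypothetical `WideAndDeep` module sketched above and a dummy in-memory dataset in place of the GCS-backed one:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset

# The PyTorch operator sets MASTER_ADDR, MASTER_PORT, RANK, and
# WORLD_SIZE in each replica, so env:// initialization needs no flags.
dist.init_process_group(backend="nccl", init_method="env://")

device = torch.device("cuda")
# Hypothetical sizes; WideAndDeep is the module sketched above.
model = WideAndDeep(wide_dim=1000, deep_cardinalities=[100, 50]).to(device)
model = DDP(model, device_ids=[0])  # one GPU per replica; grads are all-reduced

# Dummy stand-in for the GCS-backed dataset; num_workers=4 parallelizes
# batch loading within each pod, matching the job configuration above.
dataset = TensorDataset(torch.rand(4096, 1000),
                        torch.randint(0, 50, (4096, 2)),
                        torch.randint(0, 2, (4096,)).float())
loader = DataLoader(dataset, batch_size=1024, num_workers=4, pin_memory=True)
```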
To start the PyTorch job, run `kubectl create -f ./k8s_jobs/ctr-pytorch-job.yaml`
Checking the logs of the running pods should show output like the following:

A step-by-step notebook that runs the training on a local GPU is also provided in this repository. When training locally, the best logloss on the test dataset is 0.3958. The PyTorch job deployed on Kubernetes used a different batch size, which resulted in a slightly better logloss of 0.3946.
ROC_AUC was selected as the evaluation metric (see this notebook for the reasons). After 15 epochs of training on a local GPU, the Wide and Deep model achieves an ROC_AUC of 0.7497, a significant improvement (~5.4%) over the gradient boosting tree model trained in Spark, which scores an ROC_AUC of 0.7112.
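For reference, both metrics can be computed with scikit-learn; the arrays below are placeholders standing in for the test-set labels and the model's predicted click probabilities:

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

# y_true: binary click labels from the held-out test set;
# y_prob: predicted click probabilities (placeholder values here).
y_true = np.array([0, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.8, 0.3, 0.6, 0.2])

print(f"logloss: {log_loss(y_true, y_prob):.4f}")
print(f"ROC_AUC: {roc_auc_score(y_true, y_prob):.4f}")
```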