This tutorial walks through the steps required to take a Python ML model from your machine to a production deployment on Kubernetes. More specifically, we'll cover:
- Running the model locally
- Turning the ML model into an API
- Containerizing the model
- Storing the container in a registry
- Deploying the model to Kubernetes (with Seldon Core)
- Scaling the model
The tutorial comes with an accompanying video which you might find useful as you work through the steps:
The slides used in the video can be found here.
For this tutorial, we're going to use the Cassava dataset available from the TensorFlow Catalog. This dataset includes leaf images from the cassava plant. Each plant can be classified as either "healthy" or as having one of four diseases (Mosaic Disease, Bacterial Blight, Green Mite, Brown Streak Disease).
We won't go through the steps of training the classifier. Instead, we'll be using a pre-trained one available on TensorFlow Hub. You can find the model details here.
The easiest way to run this example is to clone the repository. Once you've done that, you can just run:
pip install -r requirements.txt
And it'll set you up with all the libraries required to run the code.
The starting point for this tutorial is the Python script app.py. This is typical of the kind of Python code we'd run standalone or in a Jupyter notebook. Let's familiarise ourselves with the code:
from helpers import plot, preprocess
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_hub as hub
# Fixes an issue with Jax and TF competing for GPU
tf.config.experimental.set_visible_devices([], 'GPU')
# Load the model
model_path = './model'
classifier = hub.KerasLayer(model_path)
# Load the dataset and store the class names
dataset, info = tfds.load('cassava', with_info=True)
class_names = info.features['label'].names + ['unknown']
# Select a batch of examples and plot them
batch_size = 9
batch = dataset['validation'].map(preprocess).batch(batch_size).as_numpy_iterator()
examples = next(batch)
plot(examples, class_names)
# Generate predictions for the batch and plot them against their labels
predictions = classifier(examples['image'])
predictions_max = tf.argmax(predictions, axis=-1)
print(predictions_max)
plot(examples, class_names, predictions_max)
First up, we're importing a couple of functions from our helpers.py file:
- plot provides the visualisation of the samples, labels and predictions.
- preprocess is used to resize images to 224x224 pixels and normalize the RGB values.
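The helpers.py file is not reproduced here, so the following is only a dependency-free sketch of what resizing and normalizing involve. It assumes images are nested lists of 8-bit RGB pixels and uses nearest-neighbour sampling for brevity; the real preprocess helper most likely relies on TensorFlow image operations instead, and every function name below is illustrative.

```python
TARGET_SIZE = 224  # the model expects 224x224 inputs, as described above


def resize_nearest(image, out_h, out_w):
    """Resize a nested-list image (rows of [R, G, B] pixels) by nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[row * in_h // out_h][col * in_w // out_w] for col in range(out_w)]
        for row in range(out_h)
    ]


def normalize(image):
    """Scale 8-bit channel values from [0, 255] into [0.0, 1.0]."""
    return [[[channel / 255.0 for channel in pixel] for pixel in row] for row in image]


def preprocess_sketch(image, size=TARGET_SIZE):
    """Illustrative stand-in for helpers.preprocess: resize, then normalize."""
    return normalize(resize_nearest(image, size, size))


# A tiny 2x2 image: black, white / red, blue
tiny = [
    [[0, 0, 0], [255, 255, 255]],
    [[255, 0, 0], [0, 0, 255]],
]
processed = preprocess_sketch(tiny)
print(len(processed), len(processed[0]), processed[0][0], processed[-1][-1])
```

Whatever the implementation, the point is that every image reaches the classifier with the same shape and value range.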
The rest of the code is fairly self-explanatory from the comments. We load the model and dataset, select some examples, make predictions and then plot the results.
Try it yourself by running:
python app.py
Here's what our setup currently looks like:
The problem with running our code like we did earlier is that it's not accessible to anyone who doesn't have the Python script (and all of its dependencies). A good way to solve this is to turn our model into an API.
Typically people turn to popular Python web frameworks like Flask or FastAPI. This is a good approach and gives us lots of flexibility, but it also requires us to do a lot of the work ourselves. We need to implement routes, set up logging, capture metrics and define an API schema, among other things. A simpler way to tackle this problem is to use an inference server. For this tutorial we're going to use the open source MLServer framework.
MLServer supports a bunch of inference runtimes out of the box, but it also supports custom Python code, which is what we'll use for our TensorFlow model.
In order to get our model ready to run on MLServer we need to wrap it in a single Python class with two methods, load() and predict(). Let's take a look at the code (found in model/serve-model.py):
from mlserver import MLModel
from mlserver.codecs import decode_args
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Define a class for our Model, inheriting the MLModel class from MLServer
class CassavaModel(MLModel):

    # Load the model into memory
    async def load(self) -> bool:
        tf.config.experimental.set_visible_devices([], 'GPU')
        model_path = '.'
        self._model = hub.KerasLayer(model_path)
        self.ready = True
        return self.ready

    # Logic for making predictions against our model
    @decode_args
    async def predict(self, payload: np.ndarray) -> np.ndarray:
        # convert payload to tf.tensor
        payload_tensor = tf.constant(payload)

        # Make predictions
        predictions = self._model(payload_tensor)
        predictions_max = tf.argmax(predictions, axis=-1)

        # convert predictions to np.ndarray
        response_data = np.array(predictions_max)

        return response_data
The load() method is used to define any logic required to set up our model for inference. In our case, we're loading the model weights into self._model. The predict() method is where we include all of our prediction logic.
You may notice that we've slightly modified our code from earlier (in app.py). The biggest change is that it is now wrapped in a single class, CassavaModel.
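To see the load-once, predict-many pattern in isolation, here is a minimal sketch that runs without MLServer or TensorFlow. ToyModel and its made-up weights are hypothetical stand-ins, not part of MLServer; only the method names load() and predict() and the argmax step mirror the class above.

```python
import asyncio


def argmax(scores):
    """Return the index of the largest score, like tf.argmax on a single row."""
    return max(range(len(scores)), key=lambda i: scores[i])


class ToyModel:
    """Hypothetical stand-in for CassavaModel, with no external dependencies."""

    def __init__(self):
        self.ready = False
        self._weights = None

    async def load(self) -> bool:
        # Expensive setup happens once, before any request is served.
        self._weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # made-up values
        self.ready = True
        return self.ready

    async def predict(self, payload):
        # Score every input row against each class, then keep the best class index.
        predictions = []
        for row in payload:
            scores = [sum(w * x for w, x in zip(weights, row)) for weights in self._weights]
            predictions.append(argmax(scores))
        return predictions


async def main():
    model = ToyModel()
    await model.load()
    return await model.predict([[0.9, 0.1], [0.2, 0.8]])


result = asyncio.run(main())
print(result)
```

Keeping the setup in load() means the cost of reading weights is paid once at startup rather than on every request.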
The only other task we need to do to run our model on MLServer is to specify a model-settings.json file:
{
    "name": "cassava",
    "implementation": "serve-model.CassavaModel"
}
This is a simple configuration file that tells MLServer how to handle our model. In our case, we've provided a name for our model and told MLServer where to look for our model class (serve-model.CassavaModel).
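The implementation value reads as a dotted path: the part before the last dot names the Python module (here the file serve-model.py) and the final part names the class. The sketch below illustrates that split along with a basic sanity check of the settings. It is not MLServer's own loading code, and the helper name is made up for illustration.

```python
import json

SETTINGS_TEXT = """
{
    "name": "cassava",
    "implementation": "serve-model.CassavaModel"
}
"""


def split_implementation(dotted_path):
    """Split 'module.ClassName' into (module, class_name) at the last dot."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name or not class_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    return module_name, class_name


settings = json.loads(SETTINGS_TEXT)
module_name, class_name = split_implementation(settings["implementation"])
print(settings["name"], module_name, class_name)
```

A check like this can catch a mistyped class path before the server is started.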
We're now ready to serve our model with MLServer. To do that we can simply run:
mlserver start model/
MLServer will now start up, load our cassava model and provide access through both a REST and gRPC API.
Now that our API is up and running, open a new terminal window and navigate back to the root of this repository. We can then send prediction requests to our API using the test.py file by running:
python test.py --local
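The contents of test.py are not shown here, so the following is only a sketch of what a request to the local server might look like. It assumes MLServer's default HTTP port (8080) and the V2 inference protocol route /v2/models/<name>/infer, and it names the input payload to match the argument of predict(). The helper names and the FP32 datatype are assumptions.

```python
import json
import urllib.request

# Assumed defaults: MLServer's HTTP port and the model name from model-settings.json
LOCAL_URL = "http://localhost:8080/v2/models/cassava/infer"


def flatten(values):
    """Flatten arbitrarily nested lists into one flat list (row-major order)."""
    if isinstance(values, list):
        return [item for value in values for item in flatten(value)]
    return [values]


def shape_of(values):
    """Return the dimensions of a regularly nested list."""
    shape = []
    while isinstance(values, list):
        shape.append(len(values))
        values = values[0]
    return shape


def build_request(images):
    """Build a V2 inference protocol request body for a batch of images."""
    return {
        "inputs": [
            {
                "name": "payload",  # matches the predict() argument name
                "shape": shape_of(images),
                "datatype": "FP32",
                "data": flatten(images),
            }
        ]
    }


def send(url, body):
    """POST the request body as JSON and return the decoded response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# One 2x2 RGB "image" stands in for a real 224x224 example
batch = [[[[0.0, 0.1, 0.2], [0.3, 0.4, 0.5]], [[0.6, 0.7, 0.8], [0.9, 1.0, 0.0]]]]
body = build_request(batch)
print(body["inputs"][0]["shape"])
# With the server running: print(send(LOCAL_URL, body))
```

The send() call is left commented out so the sketch runs without a server; uncomment it once MLServer is up.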
Our setup has now evolved and looks like this:
Containers are an easy way to package our application together with its runtime and dependencies. More importantly, containerizing our model allows it to run in a variety of different environments.
Note: you will need Docker installed to run this section of the tutorial. You'll also need a Docker Hub account or another container registry.
Taking our model and packaging it into a container manually can be a pretty tricky process and requires knowledge of writing Dockerfiles. Thankfully, MLServer removes this complexity and provides us with a simple build command.
Before we run this command, we need to provide our dependencies in either a requirements.txt or a conda.env file. The requirements file we'll use for this example is stored in model/requirements.txt:
tensorflow==2.12.0
tensorflow-hub==0.13.0
Notice that we didn't need to include mlserver in our requirements? That's because the builder image has mlserver included already.
We're now ready to build our container image using:
mlserver build model/ -t [YOUR_CONTAINER_REGISTRY]/[IMAGE_NAME]
Make sure you replace YOUR_CONTAINER_REGISTRY and IMAGE_NAME with your Docker Hub username and a suitable name, e.g. "bobsmith/cassava".
MLServer will now build the model into a container image for us. We can check the output of this by running:
docker images
Finally, we want to send this container image to be stored in our container registry. We can do this by running:
docker push [YOUR_CONTAINER_REGISTRY]/[IMAGE_NAME]
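If you would rather script these two steps, the sketch below assembles the same commands from a registry and an image name. The helper names are made up for illustration, and the commands are only printed here; passing each list to subprocess.run would execute them.

```python
import shlex


def image_tag(registry, image_name):
    """Join registry (or Docker Hub username) and image name into a single tag."""
    registry = registry.strip("/")
    image_name = image_name.strip("/")
    if not registry or not image_name:
        raise ValueError("both registry and image name are required")
    return f"{registry}/{image_name}"


def build_and_push_commands(registry, image_name, model_dir="model/"):
    """Return the build and push commands used in this section as argument lists."""
    tag = image_tag(registry, image_name)
    return [
        ["mlserver", "build", model_dir, "-t", tag],
        ["docker", "push", tag],
    ]


commands = build_and_push_commands("bobsmith", "cassava")
for command in commands:
    print(shlex.join(command))
```

Building the tag in one place keeps the build and push steps pointing at the same image.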
Our setup now looks like this, with our model packaged and sent to a container registry:
Now that we've turned our model into a production-ready API, containerized it and pushed it to a registry, it's time to deploy our model.
We're going to use a popular open source framework called Seldon Core to deploy our model. Seldon Core is great because it builds on all of the awesome cloud-native features we get from Kubernetes and adds machine-learning specific features on top.
This tutorial assumes you already have a Kubernetes cluster with Seldon Core up and running. If that's not the case, head over to the installation instructions and get set up first. You'll also need to install the kubectl command line interface.
To create our deployment with Seldon Core we need a small configuration file. You can find this file, named deployment.yaml, in the base folder of this tutorial's repository. It looks like this:
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: cassava
spec:
  protocol: v2
  predictors:
    - componentSpecs:
        - spec:
            containers:
              - image: YOUR_CONTAINER_REGISTRY/IMAGE_NAME
                name: cassava
                imagePullPolicy: Always
      graph:
        name: cassava
        type: MODEL
      name: cassava
Make sure you replace YOUR_CONTAINER_REGISTRY and IMAGE_NAME with your Docker Hub username and a suitable name, e.g. "bobsmith/cassava".
We can apply this configuration file to our Kubernetes cluster just like we would for any other Kubernetes object using:
kubectl create -f deployment.yaml
To check our deployment is up and running we can run:
kubectl get pods
We should see STATUS = Running once our deployment has finalized.
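The same check can be automated. The sketch below parses the JSON returned by kubectl get pods -o json and reports whether every pod has reached the Running phase. The sample data and pod names are made up, and the helper names are illustrative.

```python
import json
import subprocess


def pod_phases(pod_list):
    """Map each pod name to its phase from a parsed `kubectl get pods -o json` result."""
    return {
        item["metadata"]["name"]: item["status"].get("phase", "Unknown")
        for item in pod_list.get("items", [])
    }


def all_running(pod_list):
    """True only if there is at least one pod and every pod is in the Running phase."""
    phases = pod_phases(pod_list)
    return bool(phases) and all(phase == "Running" for phase in phases.values())


def fetch_pods():
    """Call kubectl and parse its JSON output (requires access to a cluster)."""
    output = subprocess.run(
        ["kubectl", "get", "pods", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(output)


# Made-up sample standing in for real cluster output
sample = {
    "items": [
        {"metadata": {"name": "cassava-pod-a"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "cassava-pod-b"}, "status": {"phase": "Pending"}},
    ]
}
print(pod_phases(sample), all_running(sample))
```

Note that the pod phase is a coarser signal than the STATUS column: a pod can be in the Running phase while its containers are still becoming ready, so treat this as a first check rather than a full health probe.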
Now that our model is up and running on a Kubernetes cluster (via Seldon Core), we can send some test inference requests to make sure it's working.
To do this, we simply run the test.py file in the following way:
python test.py --remote
This script will randomly select some test samples, send them to the cluster, gather the predictions and then plot them for us.
A note on running this yourself:
This example is set up to connect to a Kubernetes cluster running locally on your machine. If yours is local too, you'll need to make sure you port forward before sending requests. If your cluster is remote, you'll need to change the inference_url variable on line 21 of test.py.
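As an illustration of what that variable might hold, the sketch below builds an inference URL from its parts. It assumes Seldon Core's usual ingress path layout (/seldon/<namespace>/<deployment>/v2/models/<model>/infer), the default namespace and a port-forward to localhost:8080. None of these values are confirmed by test.py, and the remote hostname is a placeholder, so adjust them to match your cluster.

```python
def build_inference_url(host, namespace="default", deployment="cassava", model="cassava"):
    """Assemble a V2 protocol inference URL for a model served through Seldon Core."""
    host = host.rstrip("/")
    if not host.startswith(("http://", "https://")):
        host = f"http://{host}"
    return f"{host}/seldon/{namespace}/{deployment}/v2/models/{model}/infer"


# Local cluster reached through a port-forward (assumed port)
local_url = build_inference_url("localhost:8080")
# Remote cluster: replace with your ingress address (placeholder hostname)
remote_url = build_inference_url("https://ml.example.com", namespace="models")
print(local_url)
print(remote_url)
```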
Having deployed our model to Kubernetes and tested it, our setup now looks like this:
Our model is now running in a production environment and able to handle requests from external sources. This is awesome but what happens as the number of requests being sent to our model starts to increase? Eventually, we'll reach the limit of what a single server can handle. Thankfully, we can get around this problem by scaling our model horizontally.
Kubernetes and Seldon Core make this really easy to do by simply running:
kubectl scale sdep cassava --replicas=3
We can replace the 3 in --replicas=3 with any number we want to scale to.
To watch the servers scaling out we can run:
kubectl get pods --watch
Once the new replicas have finished rolling out, our setup now looks like this:
In this tutorial we've scaled the model out manually to show how it works. In a real environment we'd want to set up auto-scaling to make sure our prediction API is always online and performing as expected.
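As a rough illustration of what auto-scaling would decide for us, the sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current replicas * current metric / target metric). The metric values and replica bounds are made up, and a real autoscaler also applies tolerances and stabilization windows that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Compute the replica count an HPA-style autoscaler would aim for."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds
    return max(min_replicas, min(max_replicas, raw))


# 3 replicas averaging 90% CPU against a 60% target: scale out
print(desired_replicas(3, 90, 60))
# Load drops to 20% CPU: scale back in
print(desired_replicas(3, 20, 60))
```

The resulting number is the same kind of value we passed to --replicas by hand, just recomputed continuously from observed load.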