The `mlserver` package comes with inference runtime implementations for `scikit-learn` and `xgboost` models.
However, sometimes we may also need to roll out our own inference server, with custom logic to perform inference. To support this scenario, MLServer makes it really easy to create your own extensions, which can then be containerised and deployed in a production environment.
In this example, we create a simple `Hello World JSON` model that parses and modifies a JSON data chunk. This is often useful as a means to quickly bootstrap existing models that utilize JSON-based model inputs.
The next step will be to serve our model using `mlserver`. For that, we will first implement an extension which will serve as the runtime to perform inference using our custom `Hello World JSON` model.
This is a trivial model to demonstrate how to conceptually work with JSON inputs / outputs. In this example, we will:
- Parse the JSON input from the client
- Create a JSON response echoing back the client request as well as a server generated message
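Conceptually, the model boils down to a few lines of plain Python. This sketch (the `echo` helper is ours, purely for illustration, and not part of MLServer) shows the parse-and-echo logic in isolation:

```python
import json

def echo(raw_request: str) -> str:
    """Parse the client's JSON payload and echo it back with a server message."""
    request = json.loads(raw_request)
    response = {
        "request": request,
        "server_response": "Got your request. Hello from the server.",
    }
    return json.dumps(response)

result = json.loads(echo('{"name": "Foo Bar"}'))
print(result["server_response"])
```

The MLServer runtime below does exactly this, plus the plumbing needed to move the JSON in and out of V2 inference payloads.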
%%writefile jsonmodels.py
import json

from typing import Dict, Any

from mlserver import MLModel, types
from mlserver.codecs import StringCodec


class JsonHelloWorldModel(MLModel):
    async def load(self) -> bool:
        # Perform additional custom initialization here.
        print("Initialize model")

        # Set readiness flag for model
        return await super().load()

    async def predict(self, payload: types.InferenceRequest) -> types.InferenceResponse:
        request = self._extract_json(payload)
        response = {
            "request": request,
            "server_response": "Got your request. Hello from the server.",
        }
        response_bytes = json.dumps(response).encode("UTF-8")

        return types.InferenceResponse(
            id=payload.id,
            model_name=self.name,
            model_version=self.version,
            outputs=[
                types.ResponseOutput(
                    name="echo_response",
                    shape=[len(response_bytes)],
                    datatype="BYTES",
                    data=[response_bytes],
                    parameters=types.Parameters(content_type="str"),
                )
            ],
        )

    def _extract_json(self, payload: types.InferenceRequest) -> Dict[str, Any]:
        inputs = {}
        for inp in payload.inputs:
            inputs[inp.name] = json.loads(
                "".join(self.decode(inp, default_codec=StringCodec))
            )

        return inputs
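As a sanity check of the encode step used in `predict` above, here is a plain-Python round trip of the single-element BYTES payload (no MLServer required):

```python
import json

response = {
    "request": {"name": "Foo Bar"},
    "server_response": "Got your request. Hello from the server.",
}
# predict() serializes the dict and encodes it as a single BYTES element...
response_bytes = json.dumps(response).encode("UTF-8")
# ...and reports its byte length as the output shape
shape = [len(response_bytes)]

# The client can reverse the process losslessly
decoded = json.loads(response_bytes)
print(decoded == response)
```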
The next step will be to create 2 configuration files:

- `settings.json`: holds the configuration of our server (e.g. ports, log level, etc.).
- `model-settings.json`: holds the configuration of our model (e.g. input type, runtime to use, etc.).
%%writefile settings.json
{
    "debug": "true"
}
%%writefile model-settings.json
{
    "name": "json-hello-world",
    "implementation": "jsonmodels.JsonHelloWorldModel"
}
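As a side note, `model-settings.json` also accepts an optional `parameters` block, which can be used for instance to pin a model version. The version string below is just an illustrative placeholder:

```json
{
    "name": "json-hello-world",
    "implementation": "jsonmodels.JsonHelloWorldModel",
    "parameters": {
        "version": "v0.1.0"
    }
}
```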
Now that we have our config in-place, we can start the server by running `mlserver start .`. This needs to either be run from the same directory where our config files are, or point to the folder where they are.
mlserver start .
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
We now have our model being served by `mlserver`. To make sure that everything is working as expected, let's send a request from our test set. For that, we can use the Python types that `mlserver` provides out of the box, or we can build our request manually.
import requests
import json
from mlserver.types import InferenceResponse
from mlserver.codecs.string import StringRequestCodec
from pprint import PrettyPrinter
pp = PrettyPrinter(indent=1)
inputs = {"name": "Foo Bar", "message": "Hello from Client (REST)!"}
# NOTE: this uses characters rather than encoded bytes. It is recommended that you use the `mlserver` types to assist in the correct encoding.
inputs_string = json.dumps(inputs)
inference_request = {
    "inputs": [
        {
            "name": "echo_request",
            "shape": [len(inputs_string)],
            "datatype": "BYTES",
            "data": [inputs_string],
        }
    ]
}

endpoint = "http://localhost:8080/v2/models/json-hello-world/infer"
response = requests.post(endpoint, json=inference_request)

print("full response:\n")
print(response)

# retrieve text output as dictionary
inference_response = InferenceResponse.parse_raw(response.text)
raw_json = StringRequestCodec.decode_response(inference_response)
output = json.loads(raw_json[0])
print("\ndata part:\n")
pp.pprint(output)
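The NOTE in the snippet above warns that `shape` is set from the character length of the string rather than its byte length. For plain ASCII the two coincide, but they diverge as soon as non-ASCII characters appear, which is why the `mlserver` types are the safer route. A quick plain-Python illustration:

```python
message = "héllo"  # contains one non-ASCII character
chars = len(message)                   # 5 characters
nbytes = len(message.encode("UTF-8"))  # 6 bytes: "é" takes two bytes in UTF-8
print(chars, nbytes)
```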
Utilizing string data with the gRPC interface can be a bit tricky, so we need to make sure that inputs and outputs are handled correctly. For simplicity, in this case we leverage the Python types that `mlserver` provides out of the box. Alternatively, the gRPC stubs can be regenerated from the V2 specification directly for use by non-Python as well as Python clients.
import json
import grpc
from mlserver.codecs.string import StringRequestCodec
import mlserver.grpc.converters as converters
import mlserver.grpc.dataplane_pb2_grpc as dataplane
import mlserver.types as types
from pprint import PrettyPrinter
pp = PrettyPrinter(indent=1)
model_name = "json-hello-world"
inputs = {"name": "Foo Bar", "message": "Hello from Client (gRPC)!"}
inputs_bytes = json.dumps(inputs).encode("UTF-8")
inference_request = types.InferenceRequest(
    inputs=[
        types.RequestInput(
            name="echo_request",
            shape=[len(inputs_bytes)],
            datatype="BYTES",
            data=[inputs_bytes],
            parameters=types.Parameters(content_type="str"),
        )
    ]
)
inference_request_g = converters.ModelInferRequestConverter.from_types(
    inference_request, model_name=model_name, model_version=None
)

grpc_channel = grpc.insecure_channel("localhost:8081")
grpc_stub = dataplane.GRPCInferenceServiceStub(grpc_channel)

response = grpc_stub.ModelInfer(inference_request_g)

print("full response:\n")
print(response)

# retrieve text output as dictionary
inference_response = converters.ModelInferResponseConverter.to_types(response)
raw_json = StringRequestCodec.decode_response(inference_response)
output = json.loads(raw_json[0])
print("\ndata part:\n")
pp.pprint(output)