A Julia KServe client for ML model inference over gRPC. It supports any implementation of the official KServe protocol, including NVIDIA Triton Inference Server. Install the package from the Julia REPL:
using Pkg
Pkg.add("KServeClient")
For this example, we are going to call an image classification model served by NVIDIA Triton with the following config.pbtxt:
name: "example_cnn_classifier"
platform: "pytorch_libtorch"
max_batch_size: 1
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 224, 224 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
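Per this config, the model expects a Float32 input of shape (1, 224, 224) (the leading 1 comes from max_batch_size) and returns 1000 Float32 scores per item. Below is a minimal sketch of shaping data to match; the rand call is a hypothetical stand-in for a real preprocessed 224x224 image:

img = rand(Float32, 224, 224)        # stand-in for a real preprocessed image
batched = reshape(img, 1, 224, 224)  # prepend the batch dimension from max_batch_size
@assert size(batched) == (1, 224, 224)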
To run inference against this model, set up a connection pool or client, define your model inputs, call the model, and choose which outputs to extract.
using KServeClient
# Create the pool with a single connection, effectively the same thing as not using a pool
kscp = KServeClientPool(1, "https://my-grpc-server:8001")
# Define the inputs using native Julia types
input__0 = InferInput("INPUT__0", zeros(Float32, 1, 224, 224))
# Call inference (blocking)
response = ModelInfer(kscp, "example_cnn_classifier", [input__0])
# Get the output
output__0 = InferOutput("OUTPUT__0", response)
@assert size(output__0) == (1, 1000)
@assert eltype(output__0) == Float32
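From here, post-processing is plain Julia. For example, assuming the 1000 output values are class scores (an assumption about this particular model, not part of the client API), the top-1 prediction is an argmax away:

scores = output__0[1, :]  # drop the batch dimension
top1 = argmax(scores)     # index of the highest-scoring class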
In order to achieve maximum throughput you will need to use concurrency. There is currently an upstream issue where non-pinning concurrency (Threads.@spawn) results in connections being dropped, so use @async until it is fixed. If you are using threading, you can place a single @spawn one level above the @async calls to keep the parent task from being pinned to a single Julia thread.
using KServeClient
using Base.Threads: @spawn

# This time, create a connection pool with 8 connections to take advantage of async
kscp = KServeClientPool(8, "https://my-grpc-server:8001")

N = 256
inp = zeros(Float32, N, 224, 224)

@sync begin
    # The @spawn here keeps the parent task from being pinned to a single thread
    @spawn begin
        for i in 1:N
            let i = i  # capture the loop variable for the async closure
                @async begin
                    input__0 = InferInput("INPUT__0", inp[i:i, :, :])
                    response = ModelInfer(kscp, "example_cnn_classifier", [input__0])
                    output__0 = InferOutput("OUTPUT__0", response)
                    # Do something with the output
                end
            end
        end
    end
end
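With N = 256 tasks and 8 connections, every request is queued at once. If you would rather bound the number of in-flight requests to the pool size, one option is a Base.Semaphore from the standard library (a sketch, not a feature of KServeClient):

using Base: Semaphore, acquire, release

sem = Semaphore(8)  # match the pool size so at most 8 requests are in flight
@sync for i in 1:N
    @async begin
        acquire(sem)
        try
            input__0 = InferInput("INPUT__0", inp[i:i, :, :])
            response = ModelInfer(kscp, "example_cnn_classifier", [input__0])
            output__0 = InferOutput("OUTPUT__0", response)
            # Do something with the output
        finally
            release(sem)  # always free the slot, even if the request throws
        end
    end
end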
The connection-dropping issue is caused by a bug in Downloads.jl; see the upstream issue. An open pull request provides a workaround you can apply until it is merged.
It is currently not recommended to use one client for multiple concurrent requests, as requests can randomly hang or the client can get into a bad state; use a connection pool instead. An upstream fix for this is planned.
There is also some instability when using threads, i.e. non-pinning concurrency with more than one Julia thread. For now, stick to @async; an upstream fix for this is planned.