csvance/KServeClient.jl
KServeClient.jl

A Julia KServe client for ML model inference over gRPC. Supports any implementation of the official KServe prediction protocol, including NVIDIA Triton Inference Server.

Install

using Pkg
Pkg.add("KServeClient")

Basic Usage

For this example, we are going to call an image classification model served by NVIDIA Triton with the following config.pbtxt:

name: "example_cnn_classifier"
platform: "pytorch_libtorch"
max_batch_size: 1
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 224, 224 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

To run inference against this model, set up a connection pool or client, define your model inputs, call the model, and choose which outputs to extract.

using KServeClient

# Create the pool with a single connection, effectively the same thing as not using a pool
kscp = KServeClientPool(1, "https://my-grpc-server:8001")

# Define the inputs using native Julia types
input__0 = InferInput("INPUT__0", zeros(Float32, 1, 224, 224))

# Call inference (blocking)
response = ModelInfer(kscp, "example_cnn_classifier", [input__0])

# Get the output
output__0 = InferOutput(
    "OUTPUT__0",
    response
)

@assert size(output__0) == (1, 1000)
@assert eltype(output__0) == Float32
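Since the extracted output is a plain Julia array, standard post-processing applies directly. As a sketch (the `scores` array below stands in for `output__0` from the example above; the scores themselves are made up for illustration and are not produced by KServeClient.jl), the top-1 class can be recovered with `argmax`:

```julia
# Hypothetical post-processing for a 1x1000 Float32 score array,
# matching the (batch, classes) shape returned in the example above.
scores = zeros(Float32, 1, 1000)
scores[1, 43] = 1.0f0  # pretend class 43 received the highest score

# Drop the batch dimension, then take the index of the largest score
class_id = argmax(vec(scores))
println(class_id)  # prints 43
```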

To achieve maximum throughput you will need to use concurrency. Currently there is an upstream issue where using non-pinning concurrency results in connections being dropped, so use @async until it is fixed. If you are using threading, you can wrap the @async loop in a @spawn one level up to prevent the parent task from being pinned to a single Julia thread.

using KServeClient
using Base.Threads: @spawn

# This time create a connection pool with 8 connections to take advantage of async
kscp = KServeClientPool(8, "https://my-grpc-server:8001")

N = 256
inp = zeros(Float32, N, 224, 224)
@sync begin
    # The @spawn here avoids pinning the parent thread to a single task
    @spawn begin
        for i in 1:N
            let i = i
                @async begin
                    input__0 = InferInput("INPUT__0", inp[i:i, :, :])
                    response = ModelInfer(kscp, "example_cnn_classifier", [input__0])
                    output__0 = InferOutput(
                        "OUTPUT__0",
                        response
                    )
                    # Do something with the output
                end
            end
        end
    end
end

Known Issues

curl_multi_socket_action: 8 deadlock

This is caused by a bug in Downloads.jl: Issue

See the open pull request for a workaround until it is merged.

HTTP/2 Multiplexing

Currently it is not recommended to use a single client for multiple concurrent requests, as connections can randomly hang or get into a bad state. An upstream fix for this is planned.

Non @async concurrency

There is currently some instability when using threads, i.e. non-pinning concurrency with more than one Julia thread. For now, just use @async; an upstream fix for this is planned.

Development Sponsored by Medical Metrics Inc.


An experienced, full-service, global imaging services and solutions company, Medical Metrics, Inc. (MMI) delivers independent, high-quality image analysis you can trust.
