DJL Serving can be divided into a frontend and a backend. The frontend is a netty web server that manages incoming requests and operates the control plane. The backend `WorkLoadManager` handles model batching, workers, and threading for high-performance inference.
If you already have a web server infrastructure and only need high-performance inference, you can use the `WorkLoadManager` on its own. For this reason, it is split out into a separate module.
Using the `WorkLoadManager` is quite simple. First, create a new one through the constructor:

```java
WorkLoadManager wlm = new WorkLoadManager();
```
You can also configure the `WorkLoadManager` by using the static `WlmConfigManager`.
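As a rough sketch of what that configuration might look like, the snippet below assumes `WlmConfigManager` is accessed as a singleton through `getInstance()` and exposes simple setters; the method and property names here are illustrative, so check the javadocs for the actual API.

```java
// Illustrative only: getInstance() and the setter name are assumptions
// about WlmConfigManager's API; consult the javadocs for the real methods.
WlmConfigManager config = WlmConfigManager.getInstance();
config.setJobQueueSize(1000); // hypothetical: cap the number of queued jobs
```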
Then, you can construct a `ModelInfo` for each model you want to run through the wlm.
With the `ModelInfo`, you are able to build a `Job` once you receive input:

```java
ModelInfo modelInfo = new ModelInfo(...);
Job job = new Job(modelInfo, input);
```
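For context, the `input` passed to the `Job` is typically an `ai.djl.modality.Input` carrying the request payload. A minimal sketch is shown below; the file name, content type, and payload are placeholders chosen for illustration.

```java
import ai.djl.modality.Input;

import java.nio.file.Files;
import java.nio.file.Paths;

// Build the request payload that the Job will carry to the model.
byte[] payload = Files.readAllBytes(Paths.get("kitten.jpg")); // placeholder file
Input input = new Input();
input.addProperty("Content-Type", "application/octet-stream");
input.add(payload);
```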
Once you have your job, it can be submitted to the `WorkLoadManager`. It will automatically spin up workers if none are created and manage the number of workers. Then, it returns a `CompletableFuture<Output>` for the result:

```java
CompletableFuture<Output> futureResult = wlm.runJob(job);
```
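Because the result is a standard `CompletableFuture`, you can either block for it or attach a callback using the usual Java API. The handling below is just one way to consume it.

```java
// Block until the inference result is available...
Output output = futureResult.join();
System.out.println("Status code: " + output.getCode());

// ...or handle the result asynchronously without blocking the caller.
futureResult.thenAccept(out -> System.out.println("Status code: " + out.getCode()));
```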
View the javadocs for the `WorkLoadManager` for more options. The latest javadocs can be found on javadoc.io.
You can also build the latest javadocs locally using the following command:

```sh
# for Linux/macOS:
./gradlew javadoc

# for Windows:
..\..\gradlew javadoc
```
The javadocs output is built in the `build/doc/javadoc` folder.
You can pull the WorkLoadManager module from the central Maven repository by including the following dependency:
```xml
<dependency>
    <groupId>ai.djl.serving</groupId>
    <artifactId>wlm</artifactId>
    <version>0.27.0</version>
</dependency>
```