Thanks a lot for the clear examples of using the SWIG APIs. I tried to use this library in a low-latency environment and had to do a number of things to make it faster. Some of them can be described in the documentation, which will save users time; others can be implemented in code. My goal is to make predictions as fast as possible. Here are the benchmarks.
```
Benchmark                                        Mode  Cnt    Score      Error  Units
LightGbmBenchmark.predictDefault                 avgt    5  909.075 ± 2637.741  us/op
LightGbmBenchmark.predictSingleThread            avgt    5   12.242 ±    3.748  us/op
LightGbmBenchmark.predictSingleRow               avgt    5    7.951 ±    0.341  us/op
LightGbmBenchmark.predictSingleRowNoAllocation   avgt    5    6.969 ±    0.155  us/op
LightGbmBenchmark.predictSingleRowFast           avgt    5    2.821 ±    1.061  us/op
LightGbmBenchmark.predictSingleRowUnsafe         avgt    5    2.535 ±    0.110  us/op
```
The base case is the predictDefault benchmark, which calls the predictForMat function. It takes almost 1 ms. Why? Because by default LightGBM tries to parallelize computations: it creates threads under the hood, executes jobs asynchronously, waits for results, etc. None of this makes sense when the job is a single prediction. So I added the following argument: "num_threads=1 device=cpu". The result can be seen in the predictSingleThread benchmark: it dropped to only 12 us. I suggest adding the same argument by default when making predictions.
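To make the suggestion concrete, here is a minimal sketch of passing that parameter string to the predict call. The exact lightgbm4j overload that accepts a parameter string is an assumption of this sketch; only the parameter string itself is from the benchmarks above.

```java
// Sketch: disabling LightGBM's internal thread pool for predictions.
// The commented-out booster call below uses a HYPOTHETICAL signature;
// the real lightgbm4j method may differ.
public class SingleThreadParams {
    // Parameter string applied to the native predict call.
    static final String PREDICT_PARAMS = "num_threads=1 device=cpu";

    public static void main(String[] args) {
        // In real code, something like (hypothetical signature):
        // double[] out = booster.predictForMat(row, 1, numFeatures,
        //         /* isRowMajor = */ true, predictionType, PREDICT_PARAMS);
        System.out.println(PREDICT_PARAMS);
    }
}
```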
Then I found out that there is a method that predicts only a single row; it takes 7.9 us on my machine. Please mention it in the docs! 7.9 us is good, but it can be improved further.
The predictForMatSingleRow method allocates a number of off-heap structures under the hood, makes the prediction, and then frees those structures. If we reuse these structures, we can save some time: the predictSingleRowNoAllocation benchmark takes 6.9 us. I would suggest overloading predictForMatSingleRow by extracting dataBuffer, outBuffer, and outLength into parameters.
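The reuse idea can be sketched with plain NIO direct buffers: allocate the off-heap input and output storage once and hand the same buffers to every call. All names here are illustrative, and the "score" is faked as a row sum so the sketch stays self-contained; the real lightgbm4j SWIG layer has its own allocation helpers.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: reusing pre-allocated off-heap buffers across predictions
// instead of allocating and freeing them inside every call.
public class ReusableBuffers {
    final ByteBuffer dataBuffer;   // native row input, reused per call
    final ByteBuffer outBuffer;    // native prediction output, reused per call

    ReusableBuffers(int numFeatures) {
        dataBuffer = ByteBuffer.allocateDirect(numFeatures * Double.BYTES)
                .order(ByteOrder.nativeOrder());
        outBuffer = ByteBuffer.allocateDirect(Double.BYTES)
                .order(ByteOrder.nativeOrder());
    }

    double predict(double[] row) {
        dataBuffer.clear();
        for (double v : row) {
            dataBuffer.putDouble(v);   // copy features into native memory
        }
        // Here the native predict call would consume dataBuffer's address
        // and write the score into outBuffer. For this self-contained
        // sketch we fake the score as the sum of the features.
        double score = 0;
        for (int i = 0; i < row.length; i++) {
            score += dataBuffer.getDouble(i * Double.BYTES);
        }
        outBuffer.putDouble(0, score);
        return outBuffer.getDouble(0);
    }

    public static void main(String[] args) {
        ReusableBuffers rb = new ReusableBuffers(3);
        System.out.println(rb.predict(new double[]{1.0, 2.0, 3.0}));
    }
}
```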
Still, we can do better. LightGBM allows making the preparations once and then issuing many predictions, by calling LGBM_BoosterPredictForMatSingleRowFastInit and LGBM_BoosterPredictForMatSingleRowFast respectively. lightgbm4j doesn't expose this functionality. However, if we make predictions this way, single-call time drops further, to 2.8 us. async-profiler shows that LightGBM does not call malloc/free at all in this case. These methods should probably be added to the friendly API.
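The init-once/predict-many pattern behind those two C API functions can be sketched generically. FastHandle and the stand-in model function are illustrative only; in LightGBM the handle is created by LGBM_BoosterPredictForMatSingleRowFastInit and consumed by LGBM_BoosterPredictForMatSingleRowFast.

```java
import java.util.function.ToDoubleFunction;

// Sketch of the fast single-row pattern: pay the setup cost (config
// parsing, buffer allocation) once in init, then make many cheap
// predict calls against the prepared handle.
public class FastPredictPattern {
    static final class FastHandle {
        final double[] scratch;                  // reusable row buffer
        final ToDoubleFunction<double[]> model;  // stand-in for the booster
        FastHandle(int numFeatures, ToDoubleFunction<double[]> model) {
            this.scratch = new double[numFeatures];
            this.model = model;
        }
    }

    // Expensive: done once (mirrors ...FastInit).
    static FastHandle init(int numFeatures, ToDoubleFunction<double[]> model) {
        return new FastHandle(numFeatures, model);
    }

    // Cheap: no allocation per call (mirrors ...Fast).
    static double predict(FastHandle h, double[] row) {
        System.arraycopy(row, 0, h.scratch, 0, row.length);
        return h.model.applyAsDouble(h.scratch);
    }

    public static void main(String[] args) {
        // Stand-in "model": a fixed linear function of two features.
        FastHandle h = init(2, r -> 2 * r[0] + r[1]);
        System.out.println(predict(h, new double[]{3.0, 4.0}));
    }
}
```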
And finally, we can do even better. JNI is not the fastest way to move data into native memory. Since SWIG gives us raw addresses, we can read and write at those addresses using Unsafe, or use the friendlier UnsafeBuffer from Agrona. This saves another 0.29 us, which is a 10% speedup at this point. Here is my repo with the experiments: https://github.com/vad0/lightgbm. If anyone is interested in implementing some of the things described here, I'll be happy to help.
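A self-contained sketch of the Unsafe approach: write the feature row straight to a raw native address. In the real setup the address would come from the SWIG wrapper (the pointer behind the allocated data buffer); here we allocate our own memory, and the read-back sum only stands in for the native predict call reading from that address.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Sketch: writing features directly to a raw native address with
// sun.misc.Unsafe instead of going through SWIG's copy routines.
public class UnsafeWrite {
    public static void main(String[] args) throws Exception {
        // Obtain the Unsafe singleton via reflection.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        double[] row = {1.5, 2.5, 3.5};
        long addr = unsafe.allocateMemory(row.length * Double.BYTES);
        try {
            // Write each feature directly at the native address.
            for (int i = 0; i < row.length; i++) {
                unsafe.putDouble(addr + (long) i * Double.BYTES, row[i]);
            }
            // The native predict call would now read from `addr`;
            // we just read the values back and sum them to verify.
            double sum = 0;
            for (int i = 0; i < row.length; i++) {
                sum += unsafe.getDouble(addr + (long) i * Double.BYTES);
            }
            System.out.println(sum);
        } finally {
            unsafe.freeMemory(addr);  // off-heap memory is not GC-managed
        }
    }
}
```

Agrona's UnsafeBuffer wraps the same primitive with bounds-checked, index-based accessors, which is why the issue suggests it as the friendlier option.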