
Performance improvements #81

@vad0

Description

Thanks a lot for the clear examples of using the SWIG APIs. I tried to use this library in a low-latency environment and had to do a bunch of things to make it faster. Some of them can be described in the documentation, which would save users time. Others can be implemented in code. My task is to make predictions as fast as possible. Here are the benchmarks.

Benchmark                                              Mode  Cnt    Score      Error  Units
LightGbmBenchmark.predictDefault                       avgt    5  909.075 ± 2637.741  us/op
LightGbmBenchmark.predictSingleThread                  avgt    5   12.242 ±    3.748  us/op

LightGbmBenchmark.predictSingleRow                     avgt    5    7.951 ±    0.341  us/op
LightGbmBenchmark.predictSingleRowNoAllocation         avgt    5    6.969 ±    0.155  us/op
LightGbmBenchmark.predictSingleRowFast                 avgt    5    2.821 ±    1.061  us/op
LightGbmBenchmark.predictSingleRowUnsafe               avgt    5    2.535 ±    0.110  us/op

The base case is the predictDefault benchmark, which calls the predictForMat function. We see that it takes almost 1 ms. Why? Because by default LightGBM tries to parallelize computations: it creates threads under the hood, executes jobs asynchronously, waits for results, and so on. None of this makes sense when the job is a single low-latency prediction. So I added the following parameter string: "num_threads=1 device=cpu". The result can be seen in the predictSingleThread benchmark: only 12 µs. So I suggest adding the same parameter by default when making predictions.
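To make the idea concrete, here is a minimal, self-contained sketch. The predict call is a stub standing in for the real SWIG-wrapped native call (the name predictForMat matches the library, but this body is not the real implementation); only the parameter-string handling is illustrated:

```java
// Sketch only: how the prediction parameter string would be threaded through
// to the native call. The booster itself is stubbed so this compiles standalone.
public class SingleThreadParams {
    // Hypothetical stand-in for the SWIG-wrapped native prediction call.
    static double[] predictForMat(double[] row, String parameter) {
        // The C API parses "num_threads=1" and skips spinning up its thread pool.
        if (!parameter.contains("num_threads=1")) {
            throw new IllegalStateException("would take the multi-threaded path");
        }
        return new double[] {0.5}; // dummy score
    }

    public static void main(String[] args) {
        String parameter = "num_threads=1 device=cpu"; // disables internal threading
        double[] out = predictForMat(new double[] {1.0, 2.0, 3.0}, parameter);
        System.out.println(out[0]);
    }
}
```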

Then I found out that there is a method which predicts only a single row, and it runs in 7.9 µs on my machine. Please mention it in the docs! 7.9 µs is good, but it can be improved further.

The predictForMatSingleRow method allocates a bunch of off-heap structures under the hood, makes the prediction, then frees those structures. If we reuse the structures instead, we can save some time: the predictSingleRowNoAllocation benchmark runs in 6.9 µs. I would suggest overloading predictForMatSingleRow, extracting dataBuffer, outBuffer and outLength into parameters so the caller can allocate them once and reuse them.
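The proposed overload could look roughly like this. The buffers are allocated once and reused for every call; the native prediction itself is stubbed (it would read dataBuffer and write outBuffer), so treat the names as a sketch, not the real API:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of caller-owned, reusable off-heap buffers replacing the per-call
// allocate/free that predictForMatSingleRow does today.
public class ReusableBuffers {
    static final int NUM_FEATURES = 3;
    // Allocated once at startup, reused for every prediction.
    static final ByteBuffer dataBuffer =
        ByteBuffer.allocateDirect(NUM_FEATURES * Double.BYTES).order(ByteOrder.nativeOrder());
    static final ByteBuffer outBuffer =
        ByteBuffer.allocateDirect(Double.BYTES).order(ByteOrder.nativeOrder());

    static double predict(double[] features) {
        dataBuffer.clear();
        for (double f : features) dataBuffer.putDouble(f);
        // Stub for the native single-row call reading dataBuffer, writing outBuffer.
        outBuffer.putDouble(0, features[0] + features[1] + features[2]);
        return outBuffer.getDouble(0);
    }

    public static void main(String[] args) {
        // Hot path: no per-call allocation, no per-call free.
        System.out.println(predict(new double[] {1, 2, 3}));
    }
}
```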

Still, we can do better. LightGBM allows you to do the preparation once and then make many predictions, by calling LGBM_BoosterPredictForMatSingleRowFastInit and LGBM_BoosterPredictForMatSingleRowFast respectively. lightgbm4j doesn't expose this functionality. However, if we make predictions this way, the time per call drops further, to 2.8 µs. Async-profiler shows that in this case LightGBM does not call malloc/free at all. These methods should probably be added to the friendly API.
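The shape of that init-once/predict-many pattern, sketched with stubs: FastConfig here is a stand-in for the handle returned by LGBM_BoosterPredictForMatSingleRowFastInit, and the "tree walk" is faked with a sum, so this only illustrates where the allocation moves, not what the native code does:

```java
// Sketch of the FastInit/Fast pattern: pay the setup cost (parameter parsing,
// scratch allocation) once; the hot path only copies the row and reads the result.
public class FastPredict {
    // Holds everything the FastInit call would precompute.
    static final class FastConfig {
        final double[] rowScratch;  // reused input buffer
        final double[] outScratch;  // reused output buffer
        FastConfig(int numFeatures) {
            this.rowScratch = new double[numFeatures];
            this.outScratch = new double[1];
        }
    }

    static FastConfig fastInit(int numFeatures) { // once, at startup
        return new FastConfig(numFeatures);
    }

    static double predictFast(FastConfig cfg, double[] row) { // hot path: no malloc/free
        System.arraycopy(row, 0, cfg.rowScratch, 0, row.length);
        double sum = 0; // dummy stand-in for the actual tree traversal
        for (double v : cfg.rowScratch) sum += v;
        cfg.outScratch[0] = sum;
        return cfg.outScratch[0];
    }

    public static void main(String[] args) {
        FastConfig cfg = fastInit(3);
        for (int i = 0; i < 3; i++) {
            System.out.println(predictFast(cfg, new double[] {i, i, i}));
        }
    }
}
```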

And finally, we can do even better. JNI is not the fastest way to move data into native memory. Since SWIG gives us raw addresses, we can read and write at those addresses using Unsafe, or use the friendlier UnsafeBuffer from Agrona. This saves another 0.29 µs, which is a 10% speedup at this point. Here is my repo with the experiments: https://github.com/vad0/lightgbm. If anyone is interested in implementing some of the things described here, I will be ready to help.
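For illustration, writing a feature row straight to a raw native address with Unsafe looks like this. Here the memory is allocated by the snippet itself so it runs standalone; with lightgbm4j the address would instead come from the SWIG wrapper for the off-heap data buffer:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Sketch: write doubles directly to a native address, bypassing JNI array copies.
public class UnsafeRowWriter {
    static final Unsafe UNSAFE;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Writes the row to freshly allocated native memory, reads one slot back, frees.
    static double writeAndReadBack(double[] row, int index) {
        long addr = UNSAFE.allocateMemory((long) row.length * Double.BYTES);
        try {
            for (int i = 0; i < row.length; i++) {
                UNSAFE.putDouble(addr + (long) i * Double.BYTES, row[i]);
            }
            // The native predict call would read directly from addr here.
            return UNSAFE.getDouble(addr + (long) index * Double.BYTES);
        } finally {
            UNSAFE.freeMemory(addr);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeAndReadBack(new double[] {1.5, 2.5, 3.5}, 1));
    }
}
```

Agrona's UnsafeBuffer wraps the same idea behind bounds-checked accessors, which is usually the safer choice in application code.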
