Minimal library code to deploy XGBoost models in C++.
In science, it is very common to prototype algorithms in Python and then put them into production with fast C++ code. Transitioning models from Python to C++ should be as easy as possible, so that new ideas can be tried out rapidly. The FastForest library helps you get your xgboost model into a C++ production environment as quickly as possible.
The mission of this library is to be:
- Easy: deploying your xgboost model should be as painless as it can be
- Fast: thanks to efficient data structures for storing the trees, this library goes easy on your CPU and memory
- Safe: the FastForest objects are immutable, and therefore they are an excellent choice in multithreading environments
- Portable: FastForest has no dependency other than the C++ standard library
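To give an idea of what "efficient data structures for storing the trees" can mean, here is a minimal sketch of a decision tree kept in flat arrays and evaluated with a simple loop. The `FlatTree` struct and the `evaluateTree` and `evaluateForest` functions are hypothetical names made up for this illustration; they are not the actual FastForest internals.

```cpp
#include <cassert>
#include <vector>

// Hypothetical flat layout for one binary decision tree (illustration only).
struct FlatTree {
    std::vector<int> featureIndex;  // feature tested at each node, -1 marks a leaf
    std::vector<float> value;       // split threshold, or the response for a leaf
    std::vector<int> left;          // next node if feature < threshold
    std::vector<int> right;         // next node otherwise
};

// Walk from the root (node 0) down to a leaf and return its response.
float evaluateTree(const FlatTree& tree, const float* features) {
    assert(!tree.featureIndex.empty());
    int node = 0;
    while (tree.featureIndex[node] >= 0) {
        const bool goLeft = features[tree.featureIndex[node]] < tree.value[node];
        node = goLeft ? tree.left[node] : tree.right[node];
    }
    return tree.value[node];
}

// The raw output of a boosted ensemble is the sum of the tree responses.
float evaluateForest(const std::vector<FlatTree>& forest, const float* features) {
    float sum = 0.0f;
    for (const auto& tree : forest) {
        sum += evaluateTree(tree, features);
    }
    return sum;
}
```

Because both functions only read from `const` data, a layout like this can be shared between threads without locking, which is the same property the immutable FastForest objects rely on.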
You can clone this repository, compile and install the library with cmake:
git clone git@github.com:guitargeek/FastForest.git
cd FastForest
mkdir build
cd build
cmake ..
make
sudo make install
Usually, xgboost models are trained via the scikit-learn interface, like in this example with a random toy dataset. At the end, we save the model both in binary format, so it can still be read with xgboost, and in text format, so we can open it with FastForest.
from xgboost import XGBClassifier
from sklearn.datasets import make_classification

# Random toy dataset with five features and two balanced classes
X, y = make_classification(n_samples=10000, n_features=5, random_state=42, n_classes=2, weights=[0.5])
model = XGBClassifier().fit(X, y)
booster = model.get_booster()

# Text dump for FastForest, binary file for reloading with xgboost
booster.dump_model("model.txt")
booster.save_model("model.bin")
In C++, you can now easily load the model into a FastForest and obtain predictions by calling the FastForest object with an array of features:
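#include "fastforest.h"
#include <cmath>
#include <string>
#include <vector>
int main() {
    // The feature names define the order of the values in the input array
    std::vector<std::string> features{"f0", "f1", "f2", "f3", "f4"};
    FastForest fastForest("model.txt", features);
    std::vector<float> input{0.0, 0.2, 0.4, 0.6, 0.8};
    // Apply the logistic transformation to the raw output to get a probability
    float score = 1./(1. + std::exp(-fastForest(input.data())));
}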
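Since the input is a plain array, the values have to be arranged in exactly the order of the feature names given to the constructor. If your data comes as named values, a small helper can do that arrangement for you. The sketch below assumes a hypothetical function `orderFeatures` that is not part of FastForest.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not part of FastForest): arrange named feature values
// in the order given by featureNames.
std::vector<float> orderFeatures(const std::vector<std::string>& featureNames,
                                 const std::map<std::string, float>& namedValues) {
    std::vector<float> ordered;
    ordered.reserve(featureNames.size());
    for (const auto& name : featureNames) {
        const auto it = namedValues.find(name);
        assert(it != namedValues.end());  // every listed feature must be supplied
        ordered.push_back(it->second);
    }
    return ordered;
}
```

The returned vector's `data()` pointer could then be passed to the FastForest object in place of a hand-written array.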
Some things to keep in mind:
- You need to pass the names of the features that you will later use for the prediction to the FastForest constructor. This is necessary because the features are not ordered in the text file, hence you need to define an order yourself.
- Alternatively, you can let the FastForest determine an order automatically by passing an empty vector of strings. Afterwards, the vector is filled with the automatically determined feature names.
- The original order of the features used in the training can't be recovered.
- The FastForest does not apply the logistic transformation. This is intentional, so you will not have any precision loss when you need the untransformed output. Therefore, you need to apply the logistic transformation manually if you trained with objective='binary:logistic' and want to reproduce the results of predict_proba(), like in the code snippet above.
- If you train with the objective='binary:logitraw' parameter, the output you'll get from predict_proba() will be without the logistic transformation, just like from the FastForest.
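If you need the transformed output in several places, the logistic transformation can be wrapped in a small function. This is only a sketch; the name `logistic` is made up for this example and is not part of FastForest.

```cpp
#include <cassert>
#include <cmath>

// Map a raw, untransformed score to a probability in the range from 0 to 1.
inline float logistic(float rawScore) {
    assert(!std::isnan(rawScore));  // a NaN score would silently propagate
    return 1.0f / (1.0f + std::exp(-rawScore));
}
```

A raw score of zero corresponds to a probability of 0.5; positive scores give larger probabilities and negative scores smaller ones.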
So far, FastForest has been benchmarked against the inference engine in the xgboost Python library (underlying C), code generated by m2cgen, and the TMVA framework. For every engine, the same tree ensemble of 1000 trees is used, and inference is done on a single thread.
Engine | Benchmark time |
---|---|
FastForest (g++ (GCC) 9.1.0) | 0.58 s |
m2cgen | 1.3 s |
xgboost 0.82 in Python 3.7.3 | 2.6 s |
ROOT 6.16/00 TMVA | 3.8 s |
The benchmark can be reproduced with the files found in the benchmark directory. The Python scripts have to be run first, as they also train and save the models. The input type of the code generated by m2cgen was changed from double to float for a better comparison with FastForest.
The tests were performed on an Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz.
A FastForest can be serialized to its own binary format. The binary format exactly reflects the memory layout of the FastForest class, so saving and loading is as fast as it can be. The serialization to file is done with the save method:
fastForest.save("forest.bin");
The serialized FastForest can be read back with its constructor, this time the one that does not take a reference to a vector of feature names:
FastForest fastForest("forest.bin");
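To illustrate why a binary format that mirrors the memory layout is fast to save and load, here is a rough sketch that writes flat arrays to a file byte for byte and reads them back. The `ToyForest` struct, its members, and the `saveToy` and `loadToy` functions are assumptions made for this example; the real FastForest file format may look different.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a forest stored as flat arrays (illustration only).
struct ToyForest {
    std::vector<std::int32_t> cutIndices;
    std::vector<float> cutValues;
};

// Write the element count followed by the raw bytes of the vector.
template <class T>
void writeVector(std::ostream& os, const std::vector<T>& v) {
    const std::uint64_t n = v.size();
    os.write(reinterpret_cast<const char*>(&n), sizeof(n));
    os.write(reinterpret_cast<const char*>(v.data()), n * sizeof(T));
}

// Read the element count, resize the vector, and fill it with the raw bytes.
template <class T>
void readVector(std::istream& is, std::vector<T>& v) {
    std::uint64_t n = 0;
    is.read(reinterpret_cast<char*>(&n), sizeof(n));
    v.resize(n);
    is.read(reinterpret_cast<char*>(v.data()), n * sizeof(T));
}

void saveToy(const ToyForest& forest, const std::string& path) {
    std::ofstream os(path, std::ios::binary);
    writeVector(os, forest.cutIndices);
    writeVector(os, forest.cutValues);
}

ToyForest loadToy(const std::string& path) {
    std::ifstream is(path, std::ios::binary);
    assert(is.good());  // the file must exist and be readable
    ToyForest forest;
    readVector(is, forest.cutIndices);
    readVector(is, forest.cutValues);
    return forest;
}
```

No parsing or conversion takes place here, so the cost is essentially that of copying the bytes. The price of such a format is that files are in general only portable between machines with the same endianness and type sizes.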