scikit-learn_bench benchmarks various implementations of machine learning algorithms across data analytics frameworks. It can be extended to add new frameworks and algorithms, and it currently supports the scikit-learn, daal4py, cuML, and XGBoost frameworks for commonly used machine learning algorithms.
See benchmark results here.
- Prerequisites
- How to create conda environment for benchmarking
- How to enable daal4py patching for scikit-learn benchmarks
- Running Python benchmarks with runner script
- Supported algorithms
- Algorithms parameters
- Legacy automatic building and running
- `python` and `scikit-learn` to run python versions
- `pandas` when using its DataFrame as input data format
- `icc`, `ifort`, `mkl`, and `daal` to compile and run native benchmarks
- the machine learning frameworks that you want to test. Check this item for additional information on how to set up the environment.
Create a suitable conda environment for each framework to test. Each item in the list below links to instructions to create an appropriate conda environment for the framework.
To enable daal4py patching for the scikit-learn benchmarks, set the environment variable `FORCE_DAAL4PY_SKLEARN=YES` (for example, `export FORCE_DAAL4PY_SKLEARN=YES`).
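For reference, the sketch below shows roughly what this flag is meant to enable inside the Python benchmarks. It is a minimal sketch only: it assumes daal4py exposes `patch_sklearn()` under `daal4py.sklearn`, which may differ between daal4py releases.

```python
# Rough sketch only: mimic what FORCE_DAAL4PY_SKLEARN=YES is meant to enable.
# Assumes daal4py.sklearn.patch_sklearn() is available in your daal4py release.
import os

if os.environ.get("FORCE_DAAL4PY_SKLEARN", "NO").upper() in ("YES", "Y", "1"):
    from daal4py.sklearn import patch_sklearn
    patch_sklearn()  # swap supported scikit-learn estimators for DAAL-accelerated ones

# After patching, plain scikit-learn code runs on the accelerated backend.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100_000, 16)
KMeans(n_clusters=8).fit(X)
```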
Run `python runner.py --configs configs/config_example.json [--output-format json --verbose]` to launch the benchmarks.
Runner options:
- `configs`: paths to configuration files
- `dummy-run`: run the configuration parser and dataset generation without running the benchmarks
- `verbose`: print additional information while the benchmarks run
- `output-format`: `json` or `csv`; output type of benchmarks to use with their runner
Benchmarks currently support the following frameworks:
- scikit-learn
- daal4py
- cuml
- xgboost
The benchmark configuration lets you select which frameworks to run, which datasets to measure, and the parameters of the algorithms.
You can configure benchmarks by editing a config file. Check config.json schema for more details.
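As an illustration of the kind of configuration the runner consumes, the sketch below writes a small config file and launches the runner on it. The key names used here (`common`, `cases`, `algorithm`, `dataset`, and the dataset fields) are assumptions for illustration only; consult the config.json schema for the authoritative field names.

```python
# Illustrative only: the field names below are assumptions, not the official
# schema -- check the config.json schema referenced above before relying on them.
import json
import subprocess

config = {
    "common": {"lib": ["sklearn"], "data-format": ["pandas"], "dtype": ["float64"]},
    "cases": [
        {
            "algorithm": "kmeans",
            "dataset": [
                {"source": "synthetic", "type": "blobs",
                 "n_samples": 100000, "n_features": 50}
            ],
        }
    ],
}

with open("configs/my_config.json", "w") as f:
    json.dump(config, f, indent=4)

# Runner flags as documented above.
subprocess.run(
    ["python", "runner.py", "--configs", "configs/my_config.json",
     "--output-format", "json", "--verbose"],
    check=True,
)
```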
algorithm | benchmark name | sklearn | daal4py | cuml | xgboost |
---|---|---|---|---|---|
DBSCAN | dbscan | ✅ | ✅ | ✅ | ❌ |
RandomForestClassifier | df_clfs | ✅ | ✅ | ✅ | ❌ |
RandomForestRegressor | df_regr | ✅ | ✅ | ✅ | ❌ |
pairwise_distances | distances | ✅ | ✅ | ❌ | ❌ |
KMeans | kmeans | ✅ | ✅ | ✅ | ❌ |
KNeighborsClassifier | knn_clsf | ✅ | ❌ | ✅ | ❌ |
LinearRegression | linear | ✅ | ✅ | ✅ | ❌ |
LogisticRegression | log_reg | ✅ | ✅ | ✅ | ❌ |
PCA | pca | ✅ | ✅ | ✅ | ❌ |
Ridge | ridge | ✅ | ✅ | ✅ | ❌ |
SVM | svm | ✅ | ✅ | ✅ | ❌ |
train_test_split | train_test_split | ✅ | ❌ | ✅ | ❌ |
GradientBoostingClassifier | gbt | ❌ | ❌ | ❌ | ✅ |
GradientBoostingRegressor | gbt | ❌ | ❌ | ❌ | ✅ |
You can launch benchmarks for each algorithm separately. To do this, go to the directory with the benchmark:
cd <framework>
Run the following command:
python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>
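For example, a hypothetical invocation of the KMeans benchmark from inside the scikit-learn framework directory might look like the sketch below. Only `--dataset-name` comes from the command template above; the benchmark file name, dataset path, and extra parameter are placeholders rather than verified flags.

```python
# Hypothetical example of launching one benchmark file directly (run from
# inside the chosen framework directory). Only --dataset-name is taken from
# the command template above; the file name, dataset path, and --n-clusters
# flag are placeholders -- check the parameter reference for the real options.
import subprocess

subprocess.run(
    ["python", "kmeans.py",
     "--dataset-name", "data/my_dataset.npy",
     "--n-clusters", "10"],
    check=True,
)
```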
You can find the list of supported parameters for each algorithm here:
- Run `make`. This will generate data, compile benchmarks, and run them.
    - To run only scikit-learn benchmarks, use `make sklearn`.
    - To run only native benchmarks, use `make native`.
    - To run only daal4py benchmarks, use `make daal4py`.
    - To run a specific implementation of a specific benchmark, directly request the corresponding file: `make output/<impl>/<bench>.out`.
- If you have activated a conda environment, the build will use daal from the conda environment, if available.