
[FEATURE] Converting traditional ML algorithms using Hummingbird and benchmark model performance. #123

dhrubo-os opened this issue Mar 26, 2023 · 12 comments

@dhrubo-os
Collaborator

Currently, built-in ML algorithms in OpenSearch have to be written in Java, which is often time-consuming because Java lacks mature ML library support.

One initiative we started: write the algorithm in PyTorch, trace it to a TorchScript file, and then load the model file into OpenSearch using ML Commons' model-serving framework.

One bottleneck is that TorchScript cannot import third-party libraries such as scikit-learn, so to include scikit-learn models in OpenSearch we would have to rewrite each algorithm in TorchScript, which is not an ideal solution.

To solve this problem we can use Hummingbird, which converts traditional machine learning models into tensor (neural-network-style) computations for faster execution; at the same time it should let us export the converted model to TorchScript or ONNX so that we can load it into OpenSearch.

In this issue, we would like to investigate whether Hummingbird solves this problem.

The investigation can proceed in the following steps:

  1. Import Hummingbird into the py-ml repo.
  2. Convert a model to TorchScript (we can start with simple stateless models like PCA/KernelPCA; see the sketch after this list).
  3. Run both formats of the algorithm (the original scikit-learn algorithm and the converted PyTorch and ONNX versions) and compare the outputs.
  4. Benchmark to compare performance across all three formats of model execution.
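
As a starting point for steps 2-3, here is a minimal sketch assuming the hummingbird-ml `convert` API and a synthetic dataset (both illustrative, not the actual experiment code):

```python
# Sketch: fit a sklearn PCA, convert it to TorchScript via Hummingbird,
# and compare the two transforms on the same data.
import numpy as np
from sklearn.decomposition import PCA
from hummingbird.ml import convert

X = np.random.rand(1000, 20).astype(np.float32)  # synthetic placeholder data

skl_pca = PCA(n_components=5).fit(X)

# "torch.jit" targets TorchScript; "onnx" targets ONNX. A test input is
# needed for tracing.
hb_pca = convert(skl_pca, "torch.jit", test_input=X)

# The two transforms should agree within a small tolerance.
print(np.allclose(skl_pca.transform(X), hb_pca.transform(X), atol=1e-4))

# The converted container can be saved and the TorchScript artifact loaded
# elsewhere (e.g. by ML Commons' model-serving framework).
hb_pca.save("pca_torchscript")
```
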
@dhrubo-os dhrubo-os added enhancement New feature or request CCI labels Mar 26, 2023
@AlibiZhenis
Contributor

I'd like to work on this

@dhrubo-os
Collaborator Author

Sure, please go ahead.

@AlibiZhenis
Contributor

So I played around with it in this notebook: https://www.kaggle.com/code/alibizhenis/hummingbird
I trained Random Forest, SVC, and KNN classifiers with sklearn and converted them to torch.
Points to note:

  • The performance of the original and converted models was identical.
  • I couldn't convert them to ONNX; I kept getting a "backend not supported" error. I suspect that conversion of these specific models to ONNX is unsupported, rather than ONNX conversion in general.
  • For KNN (and maybe some other models), the conversion wasn't smooth. The normal convert method didn't work; convert_batch had to be used instead. The input shape of the converted model is limited to test_input.shape[0] * k + remainder_size, where test_input and remainder_size are parameters of the convert_batch method and k is any integer. Therefore, if we want the converted model to accept any number of samples, test_input always has to contain exactly one sample. I imagine there are more nuances like this with other models (a sketch of both conversion paths follows this list).
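
To make the two conversion paths concrete, here is a rough sketch assuming the hummingbird-ml `convert` / `convert_batch` APIs and a synthetic dataset:

```python
# Sketch: a Random Forest goes through the normal convert(), while KNN is
# converted with convert_batch() using a single-sample test_input so the
# traced model can accept any number of rows.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from hummingbird.ml import convert, convert_batch

X = np.random.rand(500, 10).astype(np.float32)
y = np.random.randint(0, 2, 500)

rf = RandomForestClassifier(n_estimators=50).fit(X, y)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

hb_rf = convert(rf, "pytorch")                  # straightforward conversion
hb_knn = convert_batch(knn, "pytorch", X[:1])   # one-sample batch, as noted above

# Predictions from the converted models should match the originals.
print((hb_rf.predict(X) == rf.predict(X)).all())
print((hb_knn.predict(X) == knn.predict(X)).all())
```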

@AlibiZhenis
Contributor

I compared three sklearn transformer models (PCA, KernelPCA, TruncatedSVD) here: https://www.kaggle.com/code/alibizhenis/hummingbird-pca

  • I couldn't convert them to ONNX again, but I converted all of them to TorchScript.
  • All three pairs of models (original vs. converted) produce outputs whose elements agree to about the 4th decimal place (sketch below).
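
For completeness, a small sketch of that comparison (same caveats as above: synthetic data and the assumed Hummingbird API):

```python
# Sketch: compare sklearn vs. Hummingbird/TorchScript outputs for the three
# transformers and check agreement to roughly 4 decimal places.
import numpy as np
from sklearn.decomposition import PCA, KernelPCA, TruncatedSVD
from hummingbird.ml import convert

X = np.random.rand(1000, 20).astype(np.float32)

for Model in (PCA, KernelPCA, TruncatedSVD):
    skl_model = Model(n_components=5).fit(X)
    hb_model = convert(skl_model, "torch.jit", test_input=X)
    close = np.allclose(skl_model.transform(X), hb_model.transform(X), atol=1e-4)
    print(Model.__name__, "agrees to ~1e-4:", close)
```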

@dhrubo-os
Collaborator Author

Thanks for your experiments. Could you also please run the same algorithms with sklearn-onnx and compare the outputs on the same dataset?

@AlibiZhenis
Contributor

AlibiZhenis commented Apr 2, 2023

I tested them in the same notebook with skl2onnx. The results for PCA and KernelPCA matched, while the converted TruncatedSVD model produced a completely different result for some reason.
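
For reference, a minimal skl2onnx sketch along those lines (dataset and names are illustrative):

```python
# Sketch: export a fitted sklearn PCA with skl2onnx, run it with onnxruntime,
# and compare against the original transform.
import numpy as np
import onnxruntime as rt
from sklearn.decomposition import PCA
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

X = np.random.rand(1000, 20).astype(np.float32)
skl_pca = PCA(n_components=5).fit(X)

onnx_model = convert_sklearn(
    skl_pca, initial_types=[("input", FloatTensorType([None, X.shape[1]]))]
)
sess = rt.InferenceSession(
    onnx_model.SerializeToString(), providers=["CPUExecutionProvider"]
)
(onnx_out,) = sess.run(None, {"input": X})

print(np.allclose(skl_pca.transform(X), onnx_out, atol=1e-4))
```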

@dhrubo-os
Collaborator Author

Thanks for the investigation. Can we also try to find an ARIMA model to convert to TorchScript or ONNX? Mainly, we are interested in getting a forecasting model converted to TorchScript or ONNX.

After doing that, I would like you to wrap up all your investigations in this package's experiment branch:

  1. Add a notebook for the PCA-related experiments with a side-by-side comparison of the original, TorchScript, and ONNX versions.
  2. Add another notebook for the other algorithms with the same original/TorchScript/ONNX comparison.
  3. If you make progress with any forecasting model, please add another notebook for that as well.

Thanks for your hard work.

@AlibiZhenis
Contributor

Which framework would you like me to use for time series models? I don't think sklearn supports any. There is the statsmodels package, but it's not supported by Hummingbird.

@dhrubo-os
Collaborator Author

Yeah, agreed. This is a part where we need some investigation as well.

@AlibiZhenis
Contributor

I added the notebooks for PCA and classification.

Upon further research on time series forecasting, I concluded the following:

  • Hummingbird doesn't support any time series models.
  • There are some great time series packages, like statsmodels, pmdarima, sktime, etc., but I haven't found any way to convert their models to ONNX or TorchScript.

@AlibiZhenis
Contributor

Upon further research, I couldn't find ways to convert models from popular time series packages like statsmodels. Nonetheless, I found ways to use some models in TorchScript and ONNX (mostly deep learning):

  • HuggingFace's Time Series Transformer
  • NVIDIA's TSPP, which supports their own TFT model, XGBoost, AutoARIMA, and LSTM models. The platform provides a way to convert these models to torch and ONNX (a generic TorchScript export sketch follows this list).
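
As a very rough illustration of the deep-learning route (this is generic PyTorch, not TSPP's or HuggingFace's actual export code), an LSTM forecaster can be exported to TorchScript like so:

```python
# Sketch: a tiny LSTM forecaster exported to TorchScript with torch.jit.script.
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, input_size=1, hidden_size=32, horizon=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)            # (batch, seq_len, hidden)
        return self.head(out[:, -1, :])  # predict `horizon` future values

model = LSTMForecaster()
scripted = torch.jit.script(model)       # TorchScript module
scripted.save("lstm_forecaster.pt")      # loadable outside Python

dummy = torch.randn(4, 24, 1)            # batch of 4 series, 24 time steps
print(scripted(dummy).shape)             # torch.Size([4, 1])
```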

@dhrubo-os dhrubo-os changed the title [FEATURE] Convering traditional ML algorithms using Hummingbird and benchmark model performance. [FEATURE] Converting traditional ML algorithms using Hummingbird and benchmark model performance. Apr 27, 2023
@dhrubo-os
Collaborator Author

Can we perform the following experiments?

  1. Try these models for forecasting on a time series dataset.
  2. Convert the models to TorchScript/ONNX and perform forecasting on the same dataset.
  3. Use any traditional forecasting model (ARIMA works) for forecasting on the same dataset (see the sketch after this list).
  4. Then compare the results.
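
A minimal sketch of step 3, assuming statsmodels is acceptable for the ARIMA baseline (the series, ARIMA order, and metric are placeholders):

```python
# Sketch: fit a classical ARIMA baseline with statsmodels and produce a
# forecast to compare against the converted (TorchScript/ONNX) models.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))   # placeholder time series

train, test = series[:150], series[150:]

arima = ARIMA(train, order=(2, 1, 2)).fit()    # order is a placeholder choice
arima_forecast = arima.forecast(steps=len(test))

# Compare against forecasts from the converted models (step 2) with e.g. MAE.
print("ARIMA MAE:", np.mean(np.abs(arima_forecast - test)))
```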
