Description
I'm using the LGBMRegressor through the scikit-learn API. Some of my models give different results when I call .predict() in a Docker environment on my local Mac and in the same Docker environment on an AWS EC2 instance. This happens even though the model uses deterministic=True, force_row_wise=True and num_threads=1.
First question: is it expected that, even with these flags set, results can differ between machines? Under the deterministic section of the docs, I see the following bullet point:
when you use the different seeds, different LightGBM versions, the binaries compiled by different compilers, or in different systems, the results are expected to be different
This suggests the behavior may be expected, although I had hoped that running in a Docker environment would make results reproducible. The problem is that, as I write tests for my code base, I can't guarantee that tests which pass locally on my computer will also pass in CI/CD or elsewhere. If this is expected behavior, how are people including LGBM code in test suites that don't run on the same hardware?
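To make the testing question concrete, here is a minimal sketch of a tolerance-based comparison, which is one common way to write tests that should survive rounding-level differences between machines. The helper name assert_predictions_close and the tolerance values are hypothetical and chosen only for illustration. The lists stand in for .predict() outputs.

```python
import math


def assert_predictions_close(actual, expected, rel_tol=1e-6, abs_tol=1e-9):
    """Raise AssertionError if any pair of predictions differs beyond tolerance.

    The default tolerances are illustrative assumptions, not recommended values.
    """
    if len(actual) != len(expected):
        raise AssertionError(
            f"length mismatch: {len(actual)} vs {len(expected)}"
        )
    for i, (a, e) in enumerate(zip(actual, expected)):
        if not math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol):
            raise AssertionError(f"prediction {i} differs: {a!r} vs {e!r}")


# Stand-ins for predictions stored from one machine and produced on another.
reference = [10.0, 20.5, 31.25]
other_machine = [10.0000001, 20.5, 31.25]

# Passes: the first values differ only at the 1e-7 level.
assert_predictions_close(other_machine, reference)
```

A check like this absorbs small rounding noise only. It would still fail if a tiny input difference sends a row down a different branch of a tree, because the prediction can then change by an arbitrary amount.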
If this is not expected behavior, is there a data or model setup that these flags would not cover? Before the LGBMRegressor, I have a pipeline that applies various data transformations. Purely by guessing and checking, I found that removing a CyclicalFeatures (https://feature-engine.trainindata.com/en/1.7.x/api_doc/creation/CyclicalFeatures.html#feature_engine.creation.CyclicalFeatures) transformation from the pipeline gave me reproducible results between my local machine and the EC2 box. This transformation isn't doing anything stochastic; it simply transforms a feature into sine and cosine representations. Is there a reason why mapping a feature to the -1 to 1 range would introduce non-deterministic behavior?
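One hypothetical mechanism, not confirmed for this case, is that sine and cosine values can differ in the last bit between math libraries or CPU architectures, and a tree split compares a feature against a fixed threshold. The sketch below assumes the encoding has the usual sin(2πx/max) and cos(2πx/max) form. It simulates a second platform by nudging the result one floating-point step with math.nextafter (Python 3.9+). The functions cyclical_encode and goes_left are illustrative stand-ins, not feature-engine or LightGBM code.

```python
import math


def cyclical_encode(value, max_value):
    """Map a periodic value to (sin, cos); assumed form of the encoding."""
    angle = 2.0 * math.pi * value / max_value
    return math.sin(angle), math.cos(angle)


def goes_left(feature_value, threshold):
    """Stand-in for a single tree split: rows with value <= threshold go left."""
    return feature_value <= threshold


# Hour 5 of a 24-hour cycle, encoded on "platform A".
sin_a, _ = cyclical_encode(5, 24)

# Simulated "platform B": the same quantity, one representable step larger.
sin_b = math.nextafter(sin_a, math.inf)

# Hypothetical split threshold that happens to sit exactly on platform A's value.
threshold = sin_a

print(goes_left(sin_a, threshold))  # True
print(goes_left(sin_b, threshold))  # False
```

In this contrived case the two values differ by one unit in the last place, yet the row takes a different branch. Whether real thresholds ever land this close to encoded values depends on the data and on how the model was trained.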
I have a minimal example which includes data, a saved pipeline, and a driver script. If useful, I could relabel the data to remove any sensitive information and provide it, provide a minimal working Docker environment, etc., but just wanted to ask the above questions first.
Thanks.