
[python-package] Use scikit-learn interpretation of negative n_jobs and change default to number of cores #5105

Merged (33 commits), Jun 19, 2022

Conversation

@david-cortes (Contributor)

The scikit-learn interface for lightgbm has a parameter n_jobs with a default value of -1, which is interpreted as using the default number of OMP threads.

According to the scikit-learn glossary, a negative n_jobs has a different meaning (uses joblib's formula), so -1 means to use all available threads:
https://scikit-learn.org/stable/glossary.html#term-n_jobs

This PR changes the meaning of negative n_jobs to match that of scikit-learn, which I think is how most people would expect scikit-learn compatible libraries to behave.

I'm not sure what requirements there are for the tests that are run on Python. This change could be tested, but it would imply fitting models using multiple threads per function call; I'm not sure whether that causes any issue like in the R interface.
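For reference, the joblib convention referenced above can be sketched as follows (a minimal illustration, not LightGBM's actual code; the `total_cpus` parameter is hypothetical, added here so the behavior is easy to check on any machine):

```python
import os

def interpret_n_jobs(n_jobs, total_cpus=None):
    """Sketch of the scikit-learn/joblib convention: n_jobs=-1 means
    all CPUs, -2 means all but one, etc. (n_cpus + 1 + n_jobs)."""
    total = total_cpus if total_cpus is not None else (os.cpu_count() or 1)
    if n_jobs < 0:
        # Clamp to at least one thread for very negative values.
        return max(total + 1 + n_jobs, 1)
    return n_jobs

print(interpret_n_jobs(-1, total_cpus=8))  # -1 -> 8 (all CPUs)
print(interpret_n_jobs(-2, total_cpus=8))  # -2 -> 7 (all but one)
```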

@StrikerRUS (Collaborator)

@david-cortes Thanks a lot for this PR! I support changing the default value of n_jobs to match scikit-learn behavior.

I'm not sure what kind of requirements are there for the tests that are run on Python.

I think we can at least check the transformed value of nthreads that reaches the cpp side. Refer to #4972 (comment).

@david-cortes (Contributor, Author)

@StrikerRUS What would you think of changing the default to the number of physical cores instead?

@StrikerRUS (Collaborator)

What would you think of changing the default to the number of physical cores instead?

TBH, I'm not sure. I see some inconsistency: the scikit-learn glossary talks about CPUs (e.g. "If set to -1, all CPUs are used"), but under the hood joblib detects the number of threads (given the default only_physical_cores=False):
https://github.com/scikit-learn/scikit-learn/blob/582fa30a31ffd1d2afc6325ec3506418e35b88c2/sklearn/ensemble/_forest.py#L444-L454

@david-cortes (Contributor, Author)

By the way, the linter is complaining about the order of the imports. According to the docs here, the Python code should be PEP8-compliant save for a few exceptions. What the linter is complaining about is outside the scope of PEP8 and not mentioned in the docs. I'm also not sure what the right import order would be according to the linter (the logs don't say which linter is complaining).

@david-cortes (Contributor, Author)

david-cortes commented Mar 30, 2022

What would you think of changing the default to the number of physical cores instead?

TBH, I'm not sure. I see some inconsistency: the scikit-learn glossary talks about CPUs (e.g. "If set to -1, all CPUs are used"), but under the hood joblib detects the number of threads (given the default only_physical_cores=False) https://github.com/scikit-learn/scikit-learn/blob/582fa30a31ffd1d2afc6325ec3506418e35b88c2/sklearn/ensemble/_forest.py#L444-L454

I mean, what would you think about setting the default value to the output returned from joblib.cpu_count(only_physical_cores=True) (while still interpreting negative numbers the same way as scikit-learn).

@jmoralez (Collaborator)

By the way, the linter is complaining about the order of the imports.

That refers to what is defined here.

Imports should be grouped in the following order:
1. Standard library imports.
2. Related third party imports.
3. Local application/library specific imports.

So the joblib import should go just above numpy.
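A sketch of that grouping (illustrative module names only; the commented-out lines stand in for packages that may not be installed in every environment):

```python
# 1. Standard library imports.
import copy
import os

# 2. Related third-party imports, alphabetical within the group --
#    which is why `import joblib` would sit just above `import numpy`.
# import joblib
# import numpy as np

# 3. Local application/library-specific imports.
# from .basic import Booster

# Trivial use of the imported modules so the example is self-checking.
print(copy.deepcopy({"n_jobs": os.cpu_count() or 1}))
```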

@StrikerRUS (Collaborator)

I mean, what would you think about setting the default value to the output returned from joblib.cpu_count(only_physical_cores=True) (while still interpreting negative numbers the same way as scikit-learn).

According to the scikit-learn docs and our own recommendations, we should set only_physical_cores=True. But it looks like scikit-learn under the hood uses only_physical_cores=False, so we won't match their behavior exactly.

I'm for using only_physical_cores=True in our package.
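As a sketch, the proposed default could be resolved like this (assuming joblib is available; the stdlib fallback is illustrative and counts logical CPUs, not physical cores):

```python
def detect_default_n_jobs():
    """Number of physical cores via joblib, as proposed above;
    falls back to os.cpu_count() if joblib is unavailable or too old."""
    try:
        from joblib import cpu_count
        return cpu_count(only_physical_cores=True)
    except (ImportError, TypeError):
        import os
        return os.cpu_count() or 1

n = detect_default_n_jobs()
print(n)  # e.g. 4 on a machine with 4 physical cores
```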

@david-cortes david-cortes changed the title [Python] Use scikit-learn interpretation of negative n_jobs [Python] Use scikit-learn interpretation of negative n_jobs and change default to number of cores Mar 30, 2022
@david-cortes (Contributor, Author)

Updated.

@david-cortes (Contributor, Author)

Well, it seems joblib's CPU count function wasn't very robust after all.

@david-cortes (Contributor, Author)

Looks like the failing tests are due to the Dask interface having different default arguments from the scikit-learn interface. I'm not familiar with Dask, so I'll ask: should the Dask interface also default to the same n_jobs and use the same interpretation of negative n_jobs?

Review threads (outdated, resolved) on python-package/lightgbm/compat.py and python-package/lightgbm/sklearn.py.
david-cortes and others added 5 commits May 15, 2022 (each co-authored-by: Nikita Titov <nekit94-08@mail.ru>)
@StrikerRUS (Collaborator)

But I can't imagine any other sources of inconsistency between the standard and sklearn interfaces in this PR.

Given that Linux jobs at Azure Pipelines always fail while Windows jobs fail irregularly (32b73a2: all successful; a31fced: Windows regular; 21cd3b2: Windows sdist), I'm almost sure the cause is randomness introduced by a different number of threads.

@StrikerRUS (Collaborator)

StrikerRUS commented May 29, 2022

Well, the issue of non-deterministic results of the linear tree model given different numbers of threads isn't related to this PR anyway.

@david-cortes Let's unblock this PR by explicitly specifying LightGBM's default number of threads (0) here, because that param is commented out in the config file:

-gbm = lgb.LGBMClassifier(**fd.params)
+gbm = lgb.LGBMClassifier(**fd.params, n_jobs=0)

@david-cortes (Contributor, Author)

david-cortes commented May 30, 2022

Well, the issue of non-deterministic results of the linear tree model given different number of threads isn't related to this PR anyway.

@david-cortes Let's unblock this PR by specifying LightGBM default number of threads (0) here, because in config file that param is commented out:

gbm = lgb.LGBMClassifier(**fd.params)

gbm = lgb.LGBMClassifier(**fd.params, n_jobs=0)

Updated, but I think you should open an issue so as not to forget about those differences. There might be some big bug underneath.

@StrikerRUS (Collaborator)

Updated, but I think you should open an issue so as not to forget about those differences. There might be some big bug underneath.

Thanks a lot! Sure, will open a separate issue for investigating this.

@StrikerRUS (Collaborator)

@jmoralez @jameslamb I'd really appreciate your review of this PR.

@StrikerRUS (Collaborator) left a review:

Thank you so much for the hard work on this PR!
LGTM except for one minor suggestion to simplify the codebase and a note about using a temp folder in the tests.

Comment on lines +854 to +858:

    n_jobs = self.n_jobs
    for alias in _ConfigAliases.get("num_threads"):
        if alias in predict_params:
            n_jobs = predict_params.pop(alias)
    predict_params["num_threads"] = self._process_n_jobs(n_jobs)
Collaborator:

This can be simplified with _choose_param_value():

Suggested change:

    -n_jobs = self.n_jobs
    -for alias in _ConfigAliases.get("num_threads"):
    -    if alias in predict_params:
    -        n_jobs = predict_params.pop(alias)
    -predict_params["num_threads"] = self._process_n_jobs(n_jobs)
    +predict_params = _choose_param_value("num_threads", predict_params, self.n_jobs)
    +predict_params["num_threads"] = self._process_n_jobs(predict_params["num_threads"])

@david-cortes (Contributor, Author):

The problem is that _choose_param_value treats None as if the parameter were not passed, whereas after this PR None has a special meaning. For example, one can pass n_jobs=2 in the constructor and then n_jobs=None in the call to predict; if that function were used, the None would not have the desired effect.
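A toy illustration of the pitfall (a hypothetical helper mimicking the None-coalescing behavior described above, not LightGBM's actual `_choose_param_value`):

```python
def choose_param_value(main_param_name, params, default_value):
    # Mimics a helper that falls back to the default when the key is
    # missing *or* set to None -- the behavior that causes the problem.
    value = params.get(main_param_name)
    return default_value if value is None else value

# Constructor set n_jobs=2; predict() then explicitly passed
# n_jobs=None, which after this PR should mean "use the default
# thread count" -- but the helper silently reuses 2 instead.
predict_params = {"num_threads": None}
resolved = choose_param_value("num_threads", predict_params, default_value=2)
print(resolved)  # 2: the caller's explicit None was ignored
```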

Collaborator:

Ah, I see now! Thanks for the explanation!

Collaborator:

@jameslamb Maybe we can refactor the _choose_param_value() function so that it handles a None value of a param correctly?

Collaborator:

Ok yeah, I think I understand. Will put up a PR shortly.

Collaborator:

Created #5289.

Review threads (outdated, resolved) on tests/python_package_test/test_sklearn.py.
@StrikerRUS (Collaborator)

Sure, will open a separate issue for investigating this.

Created: #5266

david-cortes and others added 2 commits June 5, 2022 (each co-authored-by: Nikita Titov <nekit94-08@mail.ru>)
@StrikerRUS (Collaborator)

StrikerRUS commented Jun 12, 2022

Kindly ping @jameslamb and @jmoralez for #5105 (comment).

@jameslamb (Collaborator) left a review:

Changes look good to me! I have no additional comments, and it's ok with me if we merge this and then later come back and apply the suggestion from https://github.com/microsoft/LightGBM/pull/5105/files#r889631120 after #5289 is merged.

@jameslamb jameslamb changed the title [Python] Use scikit-learn interpretation of negative n_jobs and change default to number of cores [python-package] Use scikit-learn interpretation of negative n_jobs and change default to number of cores Jun 14, 2022
@github-actions (bot)

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023