[R-package] Speed-up lgb.importance() #6364
Conversation
Excellent, thank you for the great work and thorough write-up!!! Many of the CI failures look unrelated to these changes, I will try to help with them (and provide a review here) soon.
I've merged the latest state of
Looks like a great improvement, and your thorough description made it easy to review. Thank you for this!
The one thing that concerns me: this project has very minimal test coverage of this code path. There are just two incidental uses of lgb.model.dt.tree() in other tests:
LightGBM/R-package/tests/testthat/test_Predictor.R, lines 274 to 277 at cb4972e:

```r
trees_dt <- lgb.model.dt.tree(bst)
max_leaf_by_tree_from_dt <- trees_dt[, .(idx = max(leaf_index, na.rm = TRUE)), by = tree_index]$idx
max_leaf_by_tree_from_preds <- apply(preds_leaf_s3_keyword, 2L, max, na.rm = TRUE)
expect_equal(max_leaf_by_tree_from_dt, max_leaf_by_tree_from_preds)
```
And minimal, incidental coverage of lgb.importance():
LightGBM/R-package/tests/testthat/test_parameters.R, lines 30 to 44 at cb4972e:

```r
var_gain <- lapply(bst, function(x) lgb.importance(x)[Feature == var_name, Gain])
var_cover <- lapply(bst, function(x) lgb.importance(x)[Feature == var_name, Cover])
var_freq <- lapply(bst, function(x) lgb.importance(x)[Feature == var_name, Frequency])
# Ensure that feature gain, cover, and frequency decreases with stronger penalties
expect_true(all(diff(unlist(var_gain)) <= 0.0))
expect_true(all(diff(unlist(var_cover)) <= 0.0))
expect_true(all(diff(unlist(var_freq)) <= 0.0))
expect_lt(min(diff(unlist(var_gain))), 0.0)
expect_lt(min(diff(unlist(var_cover))), 0.0)
expect_lt(min(diff(unlist(var_freq))), 0.0)
# Ensure that feature is not used when feature_penalty = 0
expect_length(var_gain[[length(var_gain)]], 0L)
tree_imp <- lgb.importance(model, percentage = TRUE)
```
Since you've been looking at this function closely, could you add some tests? A new file tests/testthat/test_lgb.model.dt.tree.R, with at least one test each for regression, binary classification, multiclass classification, and ranking, checking things like the following in the output of lgb.model.dt.tree():
- expected number of rows
- expected ranges of values in columns
- expected uniqueness of values in columns
- anything else you can think of
And maybe some similar tests on the output of lgb.importance() in tests/testthat/test_lgb.importance.R?
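The kinds of checks suggested above could be sketched roughly as follows. This is a hypothetical illustration only: it uses a small mock data frame (`tree_dt`, with an assumed `n_trees` value) standing in for real `lgb.model.dt.tree()` output, whereas the actual tests would first train a small lightgbm model for each objective.

```r
# Mock stand-in for the output of lgb.model.dt.tree(); a real test would
# train a small lightgbm model and call the function instead.
n_trees <- 3L
tree_dt <- data.frame(
  tree_index = rep(0:(n_trees - 1L), each = 3L),
  split_gain = c(2.0, 1.0, NA, 1.5, 0.5, NA, 1.0, 0.2, NA),
  leaf_index = c(NA, NA, 0L, NA, NA, 0L, NA, NA, 0L)
)

# Expected number of rows: every tree contributes at least one row
stopifnot(nrow(tree_dt) >= n_trees)

# Expected ranges: split gains, where present, are non-negative
stopifnot(all(tree_dt$split_gain >= 0, na.rm = TRUE))

# Expected uniqueness: tree indices cover 0 .. n_trees - 1, each appearing
stopifnot(identical(sort(unique(tree_dt$tree_index)), 0:(n_trees - 1L)))
```

In the real test file, the same assertions would be wrapped in `testthat::test_that()` blocks and repeated per objective.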
If that sounds like too much please do let me know, and I could add such tests in separate PRs. I want to be respectful of your time.
Co-authored-by: James Lamb <jaylamb20@gmail.com>
A couple of unit tests would certainly make sense. I will try to add some.
@jameslamb Only partly what you suggested: regression only, and no checks for
@mayer79 the tests you added look great, thanks! Do you think you'll have time to add more in the next few days? If not, I think we should merge this as-is and then just document the additional testing work in a separate issue.
Great! I will add a loop over some more models fitted with different objectives. Maybe we can add tests for
@jameslamb added a test for
Thanks so much! I really appreciate these thorough and thoughtful tests.
Totally fine with me to defer the lgb.importance() tests until later, and I'd happily leave them for you to do when you have time and interest.
I left a few comments on the latest round of changes for your consideration.
Looks great to me, really appreciate all the thorough tests! Thank you!
Thanks for your help and patience @mayer79 !
lgb.importance() is relatively slow, partly due to data.table::rbindlist().
This PR attacks 2. by modifying the lgb.model.dt.tree() workhorse as follows: instead of rbind-ing every single node, the node information is added to a (growing) list. At the end, a single call to rbindlist() combines the list elements.
Furthermore, I have re-formatted some extremely "wide" code lines.
Speed-up
A demo with 500 trees and 50k training rows reduces the time by more than 50%. The results are unchanged.
Profiler
In the above example, after the change, about 2/3 of the time is spent in JSON parsing. Thus, the calculation part is reduced by a factor of about 6.
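For reference, a breakdown like "2/3 of the time in JSON parsing" can be obtained with base R's sampling profiler. This is only a sketch: `workload()` here is a made-up stand-in doing generic numeric work, not the actual `lgb.importance()` call, and `profile.out` is an arbitrary file name.

```r
# Stand-in workload; a real profiling run would call lgb.importance(model)
workload <- function() {
  x <- matrix(rnorm(400 * 400), 400, 400)
  for (i in 1:20) x <- x %*% x * 1e-3  # enough work for the sampler to see
  invisible(x)
}

Rprof("profile.out")   # start the sampling profiler
workload()
Rprof(NULL)            # stop profiling

# summaryRprof() reports, per function, the share of total time spent there;
# guard against runs too short to collect any samples
prof <- tryCatch(summaryRprof("profile.out"), error = function(e) NULL)
if (!is.null(prof)) print(head(prof$by.self))
```

Looking at the `by.self` column for the JSON-parsing functions versus everything else is one way to arrive at the kind of split described above.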
@jameslamb @david-cortes