[R-package] Speed-up lgb.importance() #6364
Conversation
Excellent, thank you for the great work and thorough write-up!!! Many of the CI failures look unrelated to these changes, I will try to help with them (and provide a review here) soon.
I've merged the latest state of
Looks like a great improvement, and your thorough description made it easy to review. Thank you for this!
The one thing that concerns me: this project has very minimal test coverage of this code path. There are just two incidental uses of lgb.model.dt.tree() in other tests:
LightGBM/R-package/tests/testthat/test_Predictor.R, lines 274 to 277 at cb4972e:

```r
trees_dt <- lgb.model.dt.tree(bst)
max_leaf_by_tree_from_dt <- trees_dt[, .(idx = max(leaf_index, na.rm = TRUE)), by = tree_index]$idx
max_leaf_by_tree_from_preds <- apply(preds_leaf_s3_keyword, 2L, max, na.rm = TRUE)
expect_equal(max_leaf_by_tree_from_dt, max_leaf_by_tree_from_preds)
```
And minimal, incidental coverage of lgb.importance():
LightGBM/R-package/tests/testthat/test_parameters.R, lines 30 to 44 at cb4972e:

```r
var_gain <- lapply(bst, function(x) lgb.importance(x)[Feature == var_name, Gain])
var_cover <- lapply(bst, function(x) lgb.importance(x)[Feature == var_name, Cover])
var_freq <- lapply(bst, function(x) lgb.importance(x)[Feature == var_name, Frequency])
# Ensure that feature gain, cover, and frequency decreases with stronger penalties
expect_true(all(diff(unlist(var_gain)) <= 0.0))
expect_true(all(diff(unlist(var_cover)) <= 0.0))
expect_true(all(diff(unlist(var_freq)) <= 0.0))
expect_lt(min(diff(unlist(var_gain))), 0.0)
expect_lt(min(diff(unlist(var_cover))), 0.0)
expect_lt(min(diff(unlist(var_freq))), 0.0)
# Ensure that feature is not used when feature_penalty = 0
expect_length(var_gain[[length(var_gain)]], 0L)
tree_imp <- lgb.importance(model, percentage = TRUE)
```
Since you've been looking at this function closely, could you add some tests? A new file tests/testthat/test_lgb.model.dt.tree.R, with at least one test each for regression, binary classification, multiclass classification, and ranking, checking things like the following in the output of lgb.model.dt.tree():
- expected number of rows
- expected ranges of values in columns
- expected uniqueness of values in columns
- anything else you can think of
And maybe some similar tests on the output of lgb.importance() in tests/testthat/test_lgb.importance.R?
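The kinds of checks suggested above could be sketched roughly as follows. This is a hypothetical illustration only: it uses a small mock data frame (`tree_dt`, with an assumed `n_trees` value) standing in for real `lgb.model.dt.tree()` output, whereas the actual tests would first train a small lightgbm model for each objective.

```r
# Mock stand-in for the output of lgb.model.dt.tree(); a real test would
# train a small lightgbm model and call the function instead.
n_trees <- 3L
tree_dt <- data.frame(
  tree_index = rep(0:(n_trees - 1L), each = 3L),
  split_gain = c(2.0, 1.0, NA, 1.5, 0.5, NA, 1.0, 0.2, NA),
  leaf_index = c(NA, NA, 0L, NA, NA, 0L, NA, NA, 0L)
)

# Expected number of rows: every tree contributes at least one row
stopifnot(nrow(tree_dt) >= n_trees)

# Expected ranges: split gains, where present, are non-negative
stopifnot(all(tree_dt$split_gain >= 0, na.rm = TRUE))

# Expected uniqueness: tree indices cover 0 .. n_trees - 1, each appearing
stopifnot(identical(sort(unique(tree_dt$tree_index)), 0:(n_trees - 1L)))
```

In the real test file, the same assertions would be wrapped in `testthat::test_that()` blocks and repeated per objective.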
If that sounds like too much please do let me know, and I could add such tests in separate PRs. I want to be respectful of your time.
Co-authored-by: James Lamb <jaylamb20@gmail.com>
A couple of unit tests would certainly make sense. I will try to add some.
@jameslamb Only partly what you suggested: regression only, and no checks for
@mayer79 the tests you added look great, thanks! Do you think you'll have time to add more in the next few days? If not, I think we should merge this as-is and then just document the additional testing work in a separate issue.
Great! I will add a loop over some more models fitted with different objectives. Maybe we can add tests for
@jameslamb added a test for
Thanks so much! I really appreciate these thorough and thoughtful tests.
Totally fine with me to defer the lgb.importance() tests until later, and I'd happily leave them for you to do when you have time and interest.
I left a few comments on the latest round of changes for your consideration.
Looks great to me, really appreciate all the thorough tests! Thank you!
Thanks for your help and patience @mayer79 !
lgb.importance() is relatively slow, partly due to data.table::rbindlist().
This PR attacks 2. by modifying the lgb.model.dt.tree() workhorse as follows: instead of rbind-ing every single node, the node information is added to a (growing) list. At the end, a single call to rbindlist() combines the list elements.
Furthermore, I have re-formatted some extremely "wide" code lines.
Speed-up
A demo with 500 trees and 50k training rows reduces the time by more than 50%. The results are unchanged.
Profiler
In the above example, after the change, about 2/3 of the time is spent in JSON parsing. Thus, the calculation part is reduced by a factor of about 6.
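For reference, a breakdown like "2/3 of the time in JSON parsing" can be obtained with base R's sampling profiler. This is only a sketch: `workload()` here is a made-up stand-in doing generic numeric work, not the actual `lgb.importance()` call, and `profile.out` is an arbitrary file name.

```r
# Stand-in workload; a real profiling run would call lgb.importance(model)
workload <- function() {
  x <- matrix(rnorm(400 * 400), 400, 400)
  for (i in 1:20) x <- x %*% x * 1e-3  # enough work for the sampler to see
  invisible(x)
}

Rprof("profile.out")   # start the sampling profiler
workload()
Rprof(NULL)            # stop profiling

# summaryRprof() reports, per function, the share of total time spent there;
# guard against runs too short to collect any samples
prof <- tryCatch(summaryRprof("profile.out"), error = function(e) NULL)
if (!is.null(prof)) print(head(prof$by.self))
```

Looking at the `by.self` column for the JSON-parsing functions versus everything else is one way to arrive at the kind of split described above.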
@jameslamb @david-cortes