LightGBM cannot find splits with very unbalanced weights #5935
Description
Description
When there exists a split where the weight of one side of the split is much larger than the other side, LightGBM cannot find it.
Reproducible example
import numpy as np
import pandas as pd
import lightgbm as lgb
def outlier_with_weight(w, N=1_000):
vec = np.random.choice([-1, 1], size=(N,))
df = pd.DataFrame({
"x": vec,
"y": vec, # Easiest regression task ever
"w": 1.0,
})
df.at[0, "w"] = w # One outlier with weight w
ds = lgb.Dataset(df[['x']], label=df['y'], weight=df['w'])
params = {
"num_iterations": 1, # We just need ONE TREE
"num_leaves": 2, # and ONE SPLIT
"learning_rate": 1.0,
"boost_from_average": False, # Train from 0
"verbose": -1,
"seed": 1,
"deterministic": True,
}
model = lgb.train(params, train_set=ds)
return np.mean((df["y"] - model.predict(df[["x"]]))**2)
for w in [1, 1_000, 1_000_000]:
print(f"MSE for outlier {w} is {outlier_with_weight(w=w)}")
>>> MSE for outlier 1 is 0.0
>>> MSE for outlier 1000 is 0.0
>>> MSE for outlier 1000000 is 1.0
Environment info
Reproducible in 3.3.5
Additional Comments
When computing histograms, LightGBM only maintains the gradient and hessian, not the number of samples in the bin (presumably because that would increase memory usage by some 50%). Instead the number of samples is reconstructed from the Hessian [1]. However this is done by just dividing the Hessian by the average value of the Hessian across the samples in the leaf.
In the example I gave above, for the last weight we have cnt_factor = 1_000 / (999 + 1_000_000) < 0.001
, meaning that the side of the split without the high weight row will have its number of rows approximated as ~500 * 0.001
which rounds to 0, so LightGBM thinks one side of the split is emtpy.
This is kind of an extreme example, presumably if your problem suffers from this bug then something is wrong. But at the edges, this may result in some cuts being incorrectly accepted or rejected by the min_data_in_leaf
parameter.
[1] https://github.com/microsoft/LightGBM/blob/v3.3.5/src/treelearner/feature_histogram.hpp#L898-L899
Activity