Skip to content

LightGBM cannot find splits with very unbalanced weights #5935

Open
@adfea9c0

Description

Description

When there exists a split where the weight of one side of the split is much larger than the other side, LightGBM cannot find it.

Reproducible example

import numpy as np
import pandas as pd
import lightgbm as lgb

def outlier_with_weight(w, N=1_000):
    vec = np.random.choice([-1, 1], size=(N,))
    df = pd.DataFrame({
        "x": vec,
        "y": vec,  # Easiest regression task ever
        "w": 1.0,
    })
    df.at[0, "w"] = w  # One outlier with weight w

    ds = lgb.Dataset(df[['x']], label=df['y'], weight=df['w'])
    params = {
        "num_iterations": 1,         # We just need ONE TREE
        "num_leaves": 2,             # and ONE SPLIT
        "learning_rate": 1.0,
        "boost_from_average": False, # Train from 0
        "verbose": -1,
        "seed": 1,
        "deterministic": True,
    }
    
    model = lgb.train(params, train_set=ds)
    return np.mean((df["y"] - model.predict(df[["x"]]))**2)

for w in [1, 1_000, 1_000_000]:
    print(f"MSE for outlier {w} is {outlier_with_weight(w=w)}")

>>> MSE for outlier 1 is 0.0
>>> MSE for outlier 1000 is 0.0
>>> MSE for outlier 1000000 is 1.0

Environment info

Reproducible in 3.3.5

Additional Comments

When computing histograms, LightGBM only maintains the gradient and hessian, not the number of samples in the bin (presumably because that would increase memory usage by some 50%). Instead the number of samples is reconstructed from the Hessian [1]. However this is done by just dividing the Hessian by the average value of the Hessian across the samples in the leaf.

In the example I gave above, for the last weight we have cnt_factor = 1_000 / (999 + 1_000_000) < 0.001, meaning that the side of the split without the high weight row will have its number of rows approximated as ~500 * 0.001 which rounds to 0, so LightGBM thinks one side of the split is emtpy.

This is kind of an extreme example, presumably if your problem suffers from this bug then something is wrong. But at the edges, this may result in some cuts being incorrectly accepted or rejected by the min_data_in_leaf parameter.

[1] https://github.com/microsoft/LightGBM/blob/v3.3.5/src/treelearner/feature_histogram.hpp#L898-L899

Activity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions