
[GPU] lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0), lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0) #6469

@Oct4Pie

Description

When training LightGBM on the GPU, training fails with lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) (or the equivalent right_count check). The error occurs when LightGBM attempts to split a node and the chosen split leaves one of the resulting child nodes with zero data points. The same training run completes without error on the CPU.

Reproducible example

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def generate_synthetic_data(n_samples=10000, n_features=50):
    np.random.seed(42)
    X = np.random.rand(n_samples, n_features)
    y = np.sum(X, axis=1) + np.random.randn(n_samples) * 0.1
    return X, y


def check_data_variability(X_train, y_train):
    X_train_df = pd.DataFrame(X_train)
    y_train_series = pd.Series(y_train)

    print("X_train Feature Variability:")
    print(X_train_df.describe().transpose())
    print("\nNumber of unique values in each feature:")
    print(X_train_df.nunique())

    print("\ny_train Target Variability:")
    print(y_train_series.describe())
    print("Number of unique values in target:", y_train_series.nunique())


def initialize_gpu_model():
    params = {
        "boosting_type": "gbdt",
        "objective": "regression",
        "metric": "rmse",
        "learning_rate": 0.01,
        # "num_leaves": 15,
        # "max_depth": 5,
        # "min_child_samples": 1,
        # "min_child_weight": 1e-3,  # Align with min_child_samples
        # "min_split_gain": 0.1,
        "n_estimators": 10000,
        # "subsample": 0.1,
        # "subsample_freq": 1,
        # "colsample_bytree": 0.1,
        # "reg_alpha": 0.1,
        # "reg_lambda": 0.1,
        "verbose": 100,
        "device": "gpu",
    }
    model = lgb.LGBMRegressor(**params)
    print(model.get_params())
    return model


def main():
    X, y = generate_synthetic_data()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    check_data_variability(X_train, y_train)

    model = initialize_gpu_model()

    model.fit(
        X_train,
        y_train,
        eval_set=[(X_test, y_test)],
        eval_metric="rmse",
    )
    y_pred = model.predict(X_test)

    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"RMSE: {rmse}")


if __name__ == "__main__":
    main()
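
To make the GPU/CPU difference easy to compare side by side, the helper below is a small sketch added for illustration (try_device is not part of the script above). It runs the same fit on either device and catches lightgbm.basic.LightGBMError, the exception raised in the GPU case:

import lightgbm as lgb


def try_device(device, X_train, y_train, X_test, y_test):
    # Same parameters as the reproducible example; only the device differs.
    params = {
        "boosting_type": "gbdt",
        "objective": "regression",
        "metric": "rmse",
        "learning_rate": 0.01,
        "n_estimators": 10000,
        "device": device,
    }
    model = lgb.LGBMRegressor(**params)
    try:
        model.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric="rmse")
        print(f"device={device}: training completed")
    except lgb.basic.LightGBMError as err:
        # On my setup only device="gpu" ends up here, with
        # "Check failed: (best_split_info.left_count) > (0)".
        print(f"device={device}: LightGBMError: {err}")


# Usage, after generating the train/test split as in main():
# try_device("cpu", X_train, y_train, X_test, y_test)  # completes
# try_device("gpu", X_train, y_train, X_test, y_test)  # raises the error above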

Environment info

lightgbm versions 4.2.0 and 4.3.0

Command(s) you used to install LightGBM

$ sh build-python.sh install --gpu

run from the release tag branches, and

$ cmake -DUSE_GPU=ON

to build lib_lightgbm.dylib

macOS 14.4.1 (23E224)
Apple Silicon M1
Tested with Python versions 3.10, 3.11, and 3.12, both with and without conda
cmake version 3.29.3
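
A quick way to confirm which LightGBM build Python is actually importing (useful given the conda and non-conda installs above); this is just a verification sketch, not part of the build steps:

import lightgbm as lgb

# Confirm the version (4.2.0 / 4.3.0 above) and the install location,
# e.g. to tell a conda environment apart from a non-conda one.
print(lgb.__version__)
print(lgb.__file__)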

Additional Comments

The error only occurs with GPU training (device: "gpu").
The same parameters work fine when device: "cpu" is used.
Adjusting parameters like num_leaves, min_child_samples, max_depth, etc., to more conservative values did not resolve the issue.
Also, generate_synthetic_data(n_samples, n_features) with n_samples below roughly 2000 does not trigger the error; it only appears once the input data becomes large. Subsampling therefore avoids the crash (see the sketch after these comments), but in my tests it significantly hurt performance and the quality of the boosting.
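
For completeness, this is the kind of subsampling configuration I mean. It is a sketch only; the 0.1 ratio is taken from the commented-out block in the reproducible example, not a recommendation. It sidesteps the crash on GPU but noticeably degrades the boosting:

import lightgbm as lgb

# Same setup as the reproducible example, with row subsampling enabled.
params = {
    "boosting_type": "gbdt",
    "objective": "regression",
    "metric": "rmse",
    "learning_rate": 0.01,
    "n_estimators": 10000,
    "device": "gpu",
    "subsample": 0.1,     # train each tree on 10% of the rows
    "subsample_freq": 1,  # resample at every boosting iteration
}
model = lgb.LGBMRegressor(**params)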
