[GPU] lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0), lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0) #6469
Description
When using LightGBM with GPU training, the error above is raised during training. It occurs when LightGBM attempts to split a node and the chosen split leaves one of the resulting leaves with zero data points. The same run does not fail with CPU training.
Reproducible example
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def generate_synthetic_data(n_samples=10000, n_features=50):
    np.random.seed(42)
    X = np.random.rand(n_samples, n_features)
    y = np.sum(X, axis=1) + np.random.randn(n_samples) * 0.1
    return X, y


def check_data_variability(X_train, y_train):
    X_train_df = pd.DataFrame(X_train)
    y_train_series = pd.Series(y_train)
    print("X_train Feature Variability:")
    print(X_train_df.describe().transpose())
    print("\nNumber of unique values in each feature:")
    print(X_train_df.nunique())
    print("\ny_train Target Variability:")
    print(y_train_series.describe())
    print("Number of unique values in target:", y_train_series.nunique())


def initialize_gpu_model():
    params = {
        "boosting_type": "gbdt",
        "objective": "regression",
        "metric": "rmse",
        "learning_rate": 0.01,
        # "num_leaves": 15,
        # "max_depth": 5,
        # "min_child_samples": 1,
        # "min_child_weight": 1e-3,  # Align with min_child_samples
        # "min_split_gain": 0.1,
        "n_estimators": 10000,
        # "subsample": 0.1,
        # "subsample_freq": 1,
        # "colsample_bytree": 0.1,
        # "reg_alpha": 0.1,
        # "reg_lambda": 0.1,
        "verbose": 100,
        "device": "gpu",
    }
    model = lgb.LGBMRegressor(**params)
    print(model.get_params())
    return model


def main():
    X, y = generate_synthetic_data()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    check_data_variability(X_train, y_train)
    model = initialize_gpu_model()
    model.fit(
        X_train,
        y_train,
        eval_set=[(X_test, y_test)],
        eval_metric="rmse",
    )
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"RMSE: {rmse}")


if __name__ == "__main__":
    main()
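As noted in the Additional Comments, the same parameters train without error on the CPU. A minimal sketch of that control run, reusing the functions from the script above (the set_params override is my own illustration and is not part of the original report):

# Control run: identical data and parameters, with only "device" switched to "cpu".
X, y = generate_synthetic_data()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

cpu_model = initialize_gpu_model()   # same parameters as the failing GPU run
cpu_model.set_params(device="cpu")   # only change: train on CPU
cpu_model.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric="rmse")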
Environment info
LightGBM versions: 4.2.0 and 4.3.0 (built from the release tag branches)

Command(s) you used to install LightGBM:

$ sh build-python.sh install --gpu
$ cmake -DUSE_GPU=ON      (to build lib_lightgbm.dylib)

OS: macOS 14.4.1 (23E224), Apple Silicon M1
Python: 3.10, 3.11, and 3.12 (tested with and without conda)
CMake: 3.29.3
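For reference, a small snippet (my own, not from the original report) that prints the version details listed above:

import platform

import lightgbm as lgb

# Print the library, Python, and OS versions for the bug report.
print("lightgbm:", lgb.__version__)
print("python:  ", platform.python_version())
print("platform:", platform.platform())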
Additional Comments
The error only occurs with GPU training (device: "gpu").
The same parameters work fine when device: "cpu" is used.
Adjusting parameters like num_leaves, min_child_samples, max_depth, etc., to more conservative values did not resolve the issue.
Also worth mentioning: generate_synthetic_data(n_samples, n_features) with n_samples below roughly 2000 does not trigger the issue. It only appears once the input data becomes large, so subsampling the rows works around it, but in my tests that significantly hurts performance and boosting quality. A sketch of that workaround is shown below.
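A rough sketch of the row-subsampling workaround just mentioned, reusing the names from the reproducible example above. The 1500-row subset size is only an illustration of staying below the ~2000-row threshold I observed, not a recommended setting.

# Illustrative workaround: fit the GPU model on a random subset of rows,
# staying below the ~2000-row size at which the check failure appeared.
# Note: in my tests this noticeably degraded boosting quality.
rng = np.random.default_rng(42)
subset = rng.choice(len(X_train), size=1500, replace=False)

gpu_model = initialize_gpu_model()
gpu_model.fit(
    X_train[subset],
    y_train[subset],
    eval_set=[(X_test, y_test)],
    eval_metric="rmse",
)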