Hi!
First up, thanks for maintaining this amazing package!
I noticed something weird while training a model with three predictors and 1/2 < `feature_fraction` < 5/6: one feature always gets ignored. I think this is because of the per-tree RNG that determines which features are available. As far as I can tell:

- `NextInt` in random.h gives alternating even and odd numbers, which gets weird when we take the result modulo an even number, particularly modulo 2 (see the sketch below).
- There is an off-by-one here? I think it should be `r + 1`, but I could be wrong.
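
To make the alternation concrete, here's a little Python re-implementation of what I believe the generator behind `RandInt32` does. The `214013` / `2531011` constants and the low-31-bit masking are from my reading of random.h, so treat this as a sketch under those assumptions, not LightGBM's actual code:

```python
def make_rand_int32(seed=0):
    state = seed

    def rand_int32():
        nonlocal state
        state = (214013 * state + 2531011) & 0xFFFFFFFF  # one 32-bit LCG step
        return state & 0x7FFFFFFF  # low 31 bits, as I believe RandInt32 returns

    return rand_int32


rng = make_rand_int32(seed=42)
# Both LCG constants are odd, so the low bit of the state flips on every call
# and the draws alternate strictly between odd and even:
print([rng() % 2 for _ in range(10)])  # [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
```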
Repro
```python
import lightgbm as lgb
import numpy as np
import pandas as pd

# Make some fake data
N = 1_000
p = 3
seed = 42
np.random.seed(seed)
df = pd.DataFrame(dict(y=np.random.normal(size=N), w=np.random.uniform(size=N)))
features = []
for i in range(p):
    df[str(i)] = np.random.normal(size=N)
    features.append(str(i))
X = lgb.Dataset(df[features], label=df.y)

# Fit a model
params = dict(
    num_iterations=1_000,
    max_depth=3,
    feature_fraction=0.8,
    learning_rate=0.2,
    objective="l2",
    verbosity=-1,
)
regr = lgb.train(params=params, train_set=X)

# One feature is ignored completely: it is featured in no splits
regr.feature_importance()
```
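
For concreteness, the symptom is that one entry of the importance array (split counts) is exactly zero:

```python
importances = regr.feature_importance()
print(importances)               # one of the three split counts is 0
assert (importances == 0).any()  # i.e. one feature is never used in a split
```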
Why I think this is happening
In the select-two-from-three case, we go around this loop exactly twice per tree, and two bad things happen:

The first go-around rolls `NextInt(0, 1)`, which returns 0 in all cases, so the zeroth variable always gets included. I think the roll should be `NextInt(0, 2)`, so that the zeroth variable has a chance to be left out.

The second go-around rolls `NextInt(0, 2)`, which returns either 0 or 1 depending on the seed. This is doing `RandInt32() % 2` under the hood. Since this is the 2nd, 4th, 6th, 8th, ... call to the RNG, and `RandInt32` alternates between even and odd numbers, it hits the same value for every tree.
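
Here's a toy simulation of how I read that loop (Floyd-style sampling with an exclusive upper bound), reusing `make_rand_int32` from the sketch above; `floyd_sample_buggy` is my own name for it, not LightGBM's:

```python
def floyd_sample_buggy(rand_int32, n, k):
    """Pick k of n feature indices, mirroring my reading of the loop:
    for r in [n - k, n), roll v = NextInt(0, r) (exclusive of r, which is
    where I suspect the off-by-one is) and insert r itself on collision."""
    chosen = set()
    for r in range(n - k, n):
        v = rand_int32() % r  # NextInt(0, r) is RandInt32() % r under the hood
        chosen.add(r if v in chosen else v)
    return chosen


rng = make_rand_int32(seed=42)
# Exactly two RNG calls per tree, so the even/odd pattern lines up the same
# way for every tree: with this seed each tree picks {0, 2} and feature 1
# never appears.
print([sorted(floyd_sample_buggy(rng, 3, 2)) for _ in range(5)])
```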
This doesn't just affect the select-two-from-three case, but the general picture gets more complicated to think about. Things get interesting whenever a roll works out to `something_always_even % another_small_even_number`, since the result can then only ever be even. You can construct other combinations of the number of features and `feature_fraction` where the variable in the last or second-to-last position is always excluded, where the model has a significant bias towards using even-indexed variables, and so on.
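
For instance, with the same toy sampler, picking four of five features makes the even-index bias visible (again, this is my re-implementation, so the exact counts are illustrative):

```python
from collections import Counter

rng = make_rand_int32(seed=42)
excluded = Counter()
for _ in range(1_000):
    excluded.update({0, 1, 2, 3, 4} - floyd_sample_buggy(rng, 5, 4))
# The % 2 and % 4 rolls land on the always-even calls, so features 0, 2
# and 4 are kept in every tree; only feature 1 or 3 ever gets dropped.
print(excluded)
```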
Fixes?
- There are two branches for determining the feature fraction, and the other one looks fine, so just use that one every time? I assume there is a good reason for having two branches, though (and I'd be interested to understand it).
- Do something smart with `RandInt32` so it doesn't have the even/odd alternation?
- Use `RandInt16` for `NextInt` (see the sketch below)?
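
On the third option: in a modulus-2^32 LCG, bit 0 of the state has period 2 (hence the alternation), while bit 16 has period 2^17, so taking the high bits, as I believe `RandInt16` already does, can't lock onto a period-2 pattern. A sketch in the same toy setting:

```python
def make_rand_int16(seed=0):
    state = seed

    def rand_int16():
        nonlocal state
        state = (214013 * state + 2531011) & 0xFFFFFFFF  # same assumed LCG
        return (state >> 16) & 0x7FFF  # bits 16-30 instead of the low bits

    return rand_int16


rng16 = make_rand_int16(seed=42)
print([rng16() % 2 for _ in range(12)])  # no longer forced to alternate
```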
Happy to submit a PR for the first or third one if you think they're OK solutions.