
feature_fraction RNG looks broken #4371

Closed

Description

Hi!

First up, thanks for maintaining this amazing package!

I noticed something weird while training a model with three predictors and 1/2 < feature_fraction < 5/6 (i.e. two of the three features are sampled per tree). One feature always gets ignored. I think this is because of the per-tree RNG that determines which features are available. As far as I can tell:

  1. NextInt in random.h produces alternating even and odd numbers, which goes wrong whenever we take it modulo an even number, particularly modulo 2 (see the sketch after this list).
  2. There may be an off-by-one here: I think the bound should be r+1, but I could be wrong.
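
To see why the alternation matters, here is a minimal Python sketch. The constants below are illustrative (I haven't verified they match random.h exactly), but any LCG of the form x <- (a*x + c) mod 2**32 with odd a and odd c flips the low bit of its state on every call, so state % 2 alternates deterministically:

a, c, m = 214013, 2531011, 2**32  # illustrative LCG constants, assumed similar to random.h's
x = 42  # any seed

parities = []
for _ in range(8):
    x = (a * x + c) % m  # one RNG step
    parities.append(x % 2)

print(parities)  # strictly alternating, e.g. [1, 0, 1, 0, 1, 0, 1, 0]

Any consumer that computes something % 2 on every other call will therefore see the same bit forever.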

Repro

import lightgbm as lgb
import numpy as np
import pandas as pd

# Make some fake data
N = 1_000
p = 3
seed = 42

np.random.seed(seed)

df = pd.DataFrame(dict(y=np.random.normal(size=N), w=np.random.uniform(size=N)))

features = []
for i in range(p):
    df[str(i)] = np.random.normal(size=N)
    features.append(str(i))

X = lgb.Dataset(df[features], label=df.y)

# Fit a model
params = dict(
    num_iterations=1_000,
    max_depth=3,
    feature_fraction=0.8,
    learning_rate=0.2,
    objective="l2",
    verbosity=-1,
)

regr = lgb.train(params=params, train_set=X)

# One feature is ignored completely: it appears in no splits
print(regr.feature_importance())

Why I think this is happening

In the select-two-from-three case, we go around this loop exactly twice per tree, and two bad things happen. On the first pass the loop rolls NextInt(0, 1), which returns 0 in all cases (anything modulo 1 is 0), so the zeroth feature is always included. I think the roll should be NextInt(0, 2), so that the zeroth feature has a chance to be left out. A mock of the loop as I read it follows.
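
Here is a hedged Python mock (the names are mine, not LightGBM's; next_int(lo, hi) mimics NextInt's exclusive upper bound, i.e. rand % (hi - lo) + lo):

def sample_k_of_n(n, k, next_int):
    """Mock of the set-based sampling branch, as I understand it."""
    chosen = set()
    for r in range(n - k, n):
        v = next_int(0, r)  # suspected off-by-one: should this be r + 1?
        chosen.add(r if v in chosen else v)  # if v is already taken, fall back to r
    return chosen

# Selecting two of three features: the first pass calls next_int(0, 1),
# which is rand % 1 == 0 regardless of the RNG, so feature 0 is always in.
print(sample_k_of_n(3, 2, lambda lo, hi: 7 % (hi - lo) + lo))  # {0, 1}

With an r + 1 bound, the first pass would roll next_int(0, 2) instead, and feature 0 could be displaced.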

The second pass rolls NextInt(0, 2), which returns either 0 or 1 depending on the seed. Under the hood this is RandInt32() % 2. Since this is the 2nd, 4th, 6th, 8th, ... call to the RNG, and RandInt32 alternates between even and odd numbers, it lands on the same value for every tree.

This doesn't only affect the select-two-from-three case, but the general case is harder to reason about. Even-indexed variables get interesting when they come up against a roll of something_always_even % another_small_even_number. You can construct other combinations of number of features and feature_fraction where variables in the last or second-to-last position are always excluded, or where the model has a significant bias towards using even-indexed variables, and so on. The simulation below puts the two problems together and reproduces the two-of-three case end to end.
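
(Again, the LCG constants are illustrative and are assumed to share random.h's even/odd alternation, not its exact output; sample_k_of_n is the same mock as above.)

class MockRNG:
    def __init__(self, seed=123456789):
        self.x = seed

    def next_int(self, lo, hi):  # mimics NextInt: rand % (hi - lo) + lo
        self.x = (214013 * self.x + 2531011) % 2**32
        return self.x % (hi - lo) + lo

def sample_k_of_n(n, k, next_int):  # same mock as above
    chosen = set()
    for r in range(n - k, n):
        v = next_int(0, r)
        chosen.add(r if v in chosen else v)
    return chosen

rng = MockRNG()
per_tree = {frozenset(sample_k_of_n(3, 2, rng.next_int)) for _ in range(1000)}
print(per_tree)  # a single pair across all 1000 trees: feature 2 never shows up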

Fixes?

  • There are two branches for determining the feature subset, and the other one (sketched after this list) looks fine, so just use that one every time? I assume there is a good reason to have two branches, though (and I'd be interested to understand it).
  • Do something smart with RandInt32 so it doesn't have even/odd alternation?
  • Use RandInt16 for NextInt?
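
For context on the first option, here is my reading of the other branch, translated to Python (hedged: the names are mine, and this is what I believe random.h does rather than a verified transcription). It walks the features once and includes each with probability (remaining to pick) / (remaining to see), so it makes one draw per feature and never takes a small even modulus:

import random

def sample_k_of_n_scan(n, k, next_unit_float):
    # next_unit_float() is assumed to return a float in [0, 1)
    chosen = []
    for i in range(n):
        prob = (k - len(chosen)) / (n - i)  # hits 1.0 when we must take all the rest
        if next_unit_float() < prob:
            chosen.append(i)
    return chosen  # always exactly k indices

print(sample_k_of_n_scan(3, 2, random.random))  # e.g. [0, 2]

This variant always returns exactly k features but costs one draw per feature instead of one per selected feature, which may be why the set-based branch exists for small subsets.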

Happy to submit a PR for the first or third one if you think they're OK solutions.
