
feature_fraction RNG looks broken #4371

Closed

Description

Hi!

First up, thanks for maintaining this amazing package!

I noticed something weird while training a model with three predictors and 1/2 < feature_fraction < 5/6 (i.e. two of the three features are sampled per tree). One feature always gets ignored. I think this is because of the per-tree RNG that determines which features are available. As far as I can tell:

  1. NextInt in random.h produces alternating even and odd numbers, which goes wrong whenever we take it modulo an even number, particularly modulo 2 (see the sketch after this list).
  2. There may be an off-by-one here: I think the bound should be r+1, but I could be wrong.
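
To see why the alternation matters, here is a minimal Python sketch. The constants below are illustrative (I haven't verified they match random.h exactly), but any LCG of the form x <- (a*x + c) mod 2**32 with odd a and odd c flips the low bit of its state on every call, so state % 2 alternates deterministically:

a, c, m = 214013, 2531011, 2**32  # illustrative LCG constants, assumed similar to random.h's
x = 42  # any seed

parities = []
for _ in range(8):
    x = (a * x + c) % m  # one RNG step
    parities.append(x % 2)

print(parities)  # strictly alternating, e.g. [1, 0, 1, 0, 1, 0, 1, 0]

Any consumer that computes something % 2 on every other call will therefore see the same bit forever.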

Repro

import lightgbm as lgb
import numpy as np
import pandas as pd

# Make some fake data
N = 1_000
p = 3
seed = 42

np.random.seed(seed)

df = pd.DataFrame(dict(y=np.random.normal(size=N), w=np.random.uniform(size=N)))

features = []
for i in range(p):
    df[str(i)] = np.random.normal(size=N)
    features.append(str(i))

X = lgb.Dataset(df[features], label=df.y)

# Fit a model
params = dict(
    num_iterations=1_000,
    max_depth=3,
    feature_fraction=0.8,
    learning_rate=0.2,
    objective="l2",
    verbosity=-1,
)

regr = lgb.train(params=params, train_set=X)

# One feature is ignored completely: it appears in no splits
print(regr.feature_importance())

Why I think this is happening

In the select-two-from-three case, we go around this loop exactly twice per tree, and two bad things happen. On the first pass the loop rolls NextInt(0, 1), which returns 0 in all cases (anything modulo 1 is 0), so the zeroth feature is always included. I think the roll should be NextInt(0, 2), so that the zeroth feature has a chance to be left out. A mock of the loop as I read it follows.
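
Here is a hedged Python mock (the names are mine, not LightGBM's; next_int(lo, hi) mimics NextInt's exclusive upper bound, i.e. rand % (hi - lo) + lo):

def sample_k_of_n(n, k, next_int):
    """Mock of the set-based sampling branch, as I understand it."""
    chosen = set()
    for r in range(n - k, n):
        v = next_int(0, r)  # suspected off-by-one: should this be r + 1?
        chosen.add(r if v in chosen else v)  # if v is already taken, fall back to r
    return chosen

# Selecting two of three features: the first pass calls next_int(0, 1),
# which is rand % 1 == 0 regardless of the RNG, so feature 0 is always in.
print(sample_k_of_n(3, 2, lambda lo, hi: 7 % (hi - lo) + lo))  # {0, 1}

With an r + 1 bound, the first pass would roll next_int(0, 2) instead, and feature 0 could be displaced.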

The second pass rolls NextInt(0, 2), which returns either 0 or 1 depending on the seed. Under the hood this is RandInt32() % 2. Since this is the 2nd, 4th, 6th, 8th, ... call to the RNG, and RandInt32 alternates between even and odd numbers, it lands on the same value for every tree.

This doesn't only affect the select-two-from-three case, but the general case is harder to reason about. Even-indexed variables get interesting when they come up against a roll of something_always_even % another_small_even_number. You can construct other combinations of number of features and feature_fraction where variables in the last or second-to-last position are always excluded, or where the model has a significant bias towards using even-indexed variables, and so on. The simulation below puts the two problems together and reproduces the two-of-three case end to end.
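
(Again, the LCG constants are illustrative and are assumed to share random.h's even/odd alternation, not its exact output; sample_k_of_n is the same mock as above.)

class MockRNG:
    def __init__(self, seed=123456789):
        self.x = seed

    def next_int(self, lo, hi):  # mimics NextInt: rand % (hi - lo) + lo
        self.x = (214013 * self.x + 2531011) % 2**32
        return self.x % (hi - lo) + lo

def sample_k_of_n(n, k, next_int):  # same mock as above
    chosen = set()
    for r in range(n - k, n):
        v = next_int(0, r)
        chosen.add(r if v in chosen else v)
    return chosen

rng = MockRNG()
per_tree = {frozenset(sample_k_of_n(3, 2, rng.next_int)) for _ in range(1000)}
print(per_tree)  # a single pair across all 1000 trees: feature 2 never shows up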

Fixes?

  • There are two branches for determining the feature subset, and the other one (sketched after this list) looks fine, so just use that one every time? I assume there is a good reason to have two branches, though (and I'd be interested to understand it).
  • Do something smart with RandInt32 so it doesn't have even/odd alternation?
  • Use RandInt16 for NextInt?
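
For context on the first option, here is my reading of the other branch, translated to Python (hedged: the names are mine, and this is what I believe random.h does rather than a verified transcription). It walks the features once and includes each with probability (remaining to pick) / (remaining to see), so it makes one draw per feature and never takes a small even modulus:

import random

def sample_k_of_n_scan(n, k, next_unit_float):
    # next_unit_float() is assumed to return a float in [0, 1)
    chosen = []
    for i in range(n):
        prob = (k - len(chosen)) / (n - i)  # hits 1.0 when we must take all the rest
        if next_unit_float() < prob:
            chosen.append(i)
    return chosen  # always exactly k indices

print(sample_k_of_n_scan(3, 2, random.random))  # e.g. [0, 2]

This variant always returns exactly k features but costs one draw per feature instead of one per selected feature, which may be why the set-based branch exists for small subsets.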

Happy to submit a PR for the first or third one if you think they're OK solutions.
