Description
LightGBM with device='cuda' crashes with SIGFPE (Floating point exception) when training on discrete data where the product of unique values and number of features exceeds a certain threshold.
Reproducible Example
import lightgbm as lgb
import numpy as np
# FAIL: 5 discrete values × 600 features
X = np.random.randint(0, 5, (50000, 600)).astype(np.float32)
y = np.random.uniform(0, 1, 50000).astype(np.float32)
model = lgb.LGBMRegressor(device='cuda', n_estimators=10, verbose=-1)
model.fit(X, y)  # SIGFPE: Floating point exception (core dumped)
Test Results
Tested with 50,000 rows (a sweep sketch that reproduces this grid follows the table):
| Unique Values | 500 cols | 600 cols | 700 cols |
|---|---|---|---|
| 2 | Pass | Pass | Pass |
| 3 | Pass | Pass | SIGFPE |
| 4 | Pass | Pass | SIGFPE |
| 5 | Pass | SIGFPE | SIGFPE |
| 6 | Pass | SIGFPE | SIGFPE |
| 7 | Pass | SIGFPE | SIGFPE |
| 8 | Pass | SIGFPE | SIGFPE |
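For reference, a sweep along these lines reproduces the pass/fail grid above. This is a sketch, not the exact harness used for the table; it assumes a CUDA-enabled LightGBM build and runs each combination in a subprocess so a SIGFPE in one cell does not kill the whole sweep.
import subprocess
import sys
import textwrap

n_rows = 50_000
for n_unique in [2, 3, 4, 5, 6, 7, 8]:
    for n_cols in [500, 600, 700]:
        # Build a small child script for one (n_unique, n_cols) combination.
        child_code = textwrap.dedent(f"""
            import numpy as np
            import lightgbm as lgb
            X = np.random.randint(0, {n_unique}, ({n_rows}, {n_cols})).astype(np.float32)
            y = np.random.uniform(0, 1, {n_rows}).astype(np.float32)
            lgb.LGBMRegressor(device='cuda', n_estimators=10, verbose=-1).fit(X, y)
        """)
        result = subprocess.run([sys.executable, "-c", child_code])
        status = "Pass" if result.returncode == 0 else f"Crash (return code {result.returncode})"
        print(f"n_unique={n_unique}, n_cols={n_cols}: {status}")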
Observed Pattern
The crash threshold depends on both the number of unique values and the number of features:
| Unique Values | Approximate Safe Column Limit |
|---|---|
| 2 | 700+ |
| 3-4 | 600-700 |
| 5+ | 500 |
This suggests a relationship between n_unique * n_features and available CUDA histogram bins.
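To make the suspected product explicit, here is a quick illustrative computation of n_unique * n_features at the boundary of the results table. It only restates the measurements above; the actual bin-allocation logic in the CUDA code has not been inspected here.
# Products n_unique * n_features at the pass/fail boundary observed above.
boundary = {
    (3, 600): "Pass", (3, 700): "SIGFPE",
    (5, 500): "Pass", (5, 600): "SIGFPE",
    (8, 500): "Pass", (8, 600): "SIGFPE",
}
for (n_unique, n_cols), outcome in boundary.items():
    print(f"{n_unique} x {n_cols} = {n_unique * n_cols}: {outcome}")
# Passing products: 1800, 2500, 4000. Failing products: 2100, 3000, 4800.
# The passing and failing products overlap, so the dependence is probably not a
# single fixed product threshold, even though both factors clearly matter.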
Workaround
Adding tiny noise converts discrete values to continuous and avoids the crash:
X = X.astype(np.float32)  # cast first if X is integer-typed (e.g. int8)
X += np.random.uniform(-1e-6, 1e-6, X.shape).astype(np.float32)
# Training now succeeds
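For integer-typed inputs such as the int8 data described in the next section, the same idea can be wrapped in a small helper. This is an illustrative sketch only; add_jitter is a hypothetical name, not part of the LightGBM API.
import numpy as np

def add_jitter(X, scale=1e-6, seed=0):
    # Cast to float32 and add tiny uniform noise so the values are no longer
    # exactly discrete; purely a workaround for the crash described above.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=np.float32)
    return X + rng.uniform(-scale, scale, X.shape).astype(np.float32)

# Example with int8 features, as in the real-world case below:
X_int8 = np.random.randint(0, 5, (1000, 600)).astype(np.int8)
X_train = add_jitter(X_int8)
The noise only needs to be small relative to the spacing between the discrete values (1 here), so it should have a negligible effect on the fitted trees.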
Real-World Impact
This bug affects Numerai tournament data:
- 2.7M rows × 2376 features
- int8 dtype with values {0, 1, 2, 3, 4} (5 discrete values)
- Always triggers SIGFPE with CUDA
Environment
- LightGBM version: 4.6.0.99 (source-built with GCC 10)
- CUDA version: 12.6
- GPU: NVIDIA RTX 5000 Ada Generation (Compute Capability 8.9)
- Driver: 572.16
- OS: Windows 11 + WSL2 (Ubuntu) + Docker (nvidia-docker)
- Python: 3.10
Build Command
cmake -B build -S . \
-DUSE_CUDA=1 \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=gcc-10 \
-DCMAKE_CXX_COMPILER=g++-10
cmake --build build -j$(nproc)
cd python-package && pip install .
Notes
- CPU training works fine with the same data (a CPU counterpart of the repro is sketched after this list)
- The issue appears to be in the CUDA histogram binning logic
- Row count does not significantly affect the threshold
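For completeness, a CPU counterpart of the reproducible example above, which trains without error (a minimal sketch; only the device parameter differs from the failing CUDA run):
import numpy as np
import lightgbm as lgb

# Same data as the reproducible example, but with device='cpu'.
X = np.random.randint(0, 5, (50000, 600)).astype(np.float32)
y = np.random.uniform(0, 1, 50000).astype(np.float32)
model = lgb.LGBMRegressor(device='cpu', n_estimators=10, verbose=-1)
model.fit(X, y)  # completes normally on CPU
print("CPU training finished")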