We conduct a comprehensive simulation study to evaluate the finite-sample performance of the proposed Poisson and Negative Binomial random forest methods. The study is designed to assess performance across a range of challenging data-generating processes (DGPs) common in high-dimensional settings, including overdispersion, nonlinearity, zero-inflation, measurement error, and ultra-high dimensionality. We compare our methods against established benchmarks: $\ell_1$-penalized regression (lasso), the elastic net ($\alpha = 0.5$), gradient boosting (XGBoost), and a standard random forest applied to $\log(Y+1)$ as a pseudo-response.
\subsection{Simulation Design and Data Generation}
The core structure of our simulations is as follows. For each replication, we generate a training and a test set from a specified DGP. Unless otherwise noted, the baseline parameters are a sample size of $n = 100$ and an ambient dimension of $p = 1000$. Predictors are drawn from a block-correlated multivariate normal distribution, $\mathbf{X}_i \sim \mathcal{N}_p(\mathbf{0}, \Sigma)$, where $\Sigma$ is a block-diagonal matrix. Each block has a size of 50, and the within-block correlation is set to $\rho = 0.5$.
The true regression vector $\boldsymbol{\beta}0 \in \mathbb{R}^p$ is sparse, with only $s = 10$ nonzero coefficients. The active set $\mathcal{S} = {j_1, \ldots, j{10}}$ is chosen uniformly at random. The values of the non-zero coefficients are drawn from a uniform distribution, $\beta_{0j} \sim \mathrm{Unif}(-0.5, 0.5)$ for $j \in \mathcal{S}$, and are zero otherwise. An intercept $\beta_0 = 1$ is included in all models.
We define two broad classes of DGPs based on the structure of the predictor function $f_0(\mathbf{X}_i)$.
\subsubsection{Example 1: Linear Mean Structure}
In this example, the predictor is a linear combination of the true signals:
[
f_0^{\text{(linear)}}(\mathbf{X}_i) = \beta_0 + \mathbf{X}_i^\top \boldsymbol{\beta}_0.
]
We consider several distributional scenarios built upon this linear structure:
\begin{itemize}
\item \textbf{Poisson (Equidispersed):} $Y_i \mid \mathbf{X}_i \sim \text{Poisson}(\mu_i)$, where $\mu_i = \exp(f_0^{\text{(linear)}}(\mathbf{X}_i))$.
\item \textbf{Negative Binomial (Overdispersed):} $Y_i \mid \mathbf{X}_i \sim \text{NB}(\mu_i, \theta)$, where $\mu_i = \exp(f_0^{\text{(linear)}}(\mathbf{X}_i))$ and the dispersion parameter is set to $\theta = 1.5$, giving $\mathrm{Var}(Y_i \mid \mathbf{X}i) = \mu_i + \mu_i^2/\theta$.
\item \textbf{Zero-Inflated Poisson (ZIP):} Data are generated from a mixture distribution,
[
Y_i \sim
\begin{cases}
0 & \text{with probability } \pi_i, \
\text{Poisson}(\mu_i) & \text{with probability } 1-\pi_i,
\end{cases}
]
where $\mu_i = \exp(f_0^{\text{(linear)}}(\mathbf{X}i))$. The zero-inflation probability $\pi_i$ is modeled as $\pi_i = \mathrm{logit}^{-1}(\mathbf{Z}i^\top \boldsymbol{\gamma})$, with $\mathbf{Z}i$ being the first three active predictors and $\boldsymbol{\gamma} = (1, -1, 0.5)^\top$, resulting in approximately 20% excess zeros.
\item \textbf{Measurement Error:} The linear predictor is used, but the features available to the models are corrupted versions $X{ij}^{\ast} = X{ij} + \epsilon{ij}$, where $\epsilon{ij} \sim \mathcal{N}(0, \sigma^2)$ and $\sigma = 0.2$.
\end{itemize}
\subsubsection{Example 2: Nonlinear Mean Structure}
In this example, the predictor function includes nonlinear effects to assess the ability of methods to capture complex signal relationships:
[
f_0^{\text{(nonlinear)}}(\mathbf{X}_i) = \beta_0 + \mathbf{X}i^\top \boldsymbol{\beta}0 + \gamma_1 X{i,j_1}^2 + \gamma_2 X{i,j_2}^2,
]
where $j_1$ and $j_2$ are the first two features in the active set $\mathcal{S}$, and $\gamma_1 = \gamma_2 = 0.5$.
\begin{itemize}
\item \textbf{Nonlinear Poisson:} $Y_i \mid \mathbf{X}_i \sim \text{Poisson}(\mu_i)$, where $\mu_i = \exp(f_0^{\text{(nonlinear)}}(\mathbf{X}_i))$.
\item \textbf{Nonlinear Negative Binomial:} $Y_i \mid \mathbf{X}_i \sim \text{NB}(\mu_i, \theta)$, where $\mu_i = \exp(f_0^{\text{(nonlinear)}}(\mathbf{X}_i))$ and $\theta = 1.5$.
\item \textbf{Nonlinear Zero-Inflated Poisson (ZIP):} The nonlinear predictor defines the Poisson mean in the ZIP model described above.
\end{itemize}
\subsubsection{Ultra-High-Dimensional Regime}
To stress-test the algorithms, we include a scenario with $n = 50$ and $p = 5000$, preserving sparsity ($s=10$). Data are generated using the linear Poisson DGP.