Skip to content

rid4stat/RFCount-Regression

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 

Repository files navigation

We conduct a comprehensive simulation study to evaluate the finite-sample performance of the proposed Poisson and Negative Binomial random forest methods. The study is designed to assess performance across a range of challenging data-generating processes (DGPs) common in high-dimensional settings, including overdispersion, nonlinearity, zero-inflation, measurement error, and ultra-high dimensionality. We compare our methods against established benchmarks: $\ell_1$-penalized regression (lasso), the elastic net ($\alpha = 0.5$), gradient boosting (XGBoost), and a standard random forest applied to $\log(Y+1)$ as a pseudo-response.

\subsection{Simulation Design and Data Generation}

The core structure of our simulations is as follows. For each replication, we generate a training and a test set from a specified DGP. Unless otherwise noted, the baseline parameters are a sample size of $n = 100$ and an ambient dimension of $p = 1000$. Predictors are drawn from a block-correlated multivariate normal distribution, $\mathbf{X}_i \sim \mathcal{N}_p(\mathbf{0}, \Sigma)$, where $\Sigma$ is a block-diagonal matrix. Each block has a size of 50, and the within-block correlation is set to $\rho = 0.5$.

The true regression vector $\boldsymbol{\beta}0 \in \mathbb{R}^p$ is sparse, with only $s = 10$ nonzero coefficients. The active set $\mathcal{S} = {j_1, \ldots, j{10}}$ is chosen uniformly at random. The values of the non-zero coefficients are drawn from a uniform distribution, $\beta_{0j} \sim \mathrm{Unif}(-0.5, 0.5)$ for $j \in \mathcal{S}$, and are zero otherwise. An intercept $\beta_0 = 1$ is included in all models.

We define two broad classes of DGPs based on the structure of the predictor function $f_0(\mathbf{X}_i)$.

\subsubsection{Example 1: Linear Mean Structure} In this example, the predictor is a linear combination of the true signals: [ f_0^{\text{(linear)}}(\mathbf{X}_i) = \beta_0 + \mathbf{X}_i^\top \boldsymbol{\beta}_0. ] We consider several distributional scenarios built upon this linear structure: \begin{itemize} \item \textbf{Poisson (Equidispersed):} $Y_i \mid \mathbf{X}_i \sim \text{Poisson}(\mu_i)$, where $\mu_i = \exp(f_0^{\text{(linear)}}(\mathbf{X}_i))$. \item \textbf{Negative Binomial (Overdispersed):} $Y_i \mid \mathbf{X}_i \sim \text{NB}(\mu_i, \theta)$, where $\mu_i = \exp(f_0^{\text{(linear)}}(\mathbf{X}_i))$ and the dispersion parameter is set to $\theta = 1.5$, giving $\mathrm{Var}(Y_i \mid \mathbf{X}i) = \mu_i + \mu_i^2/\theta$. \item \textbf{Zero-Inflated Poisson (ZIP):} Data are generated from a mixture distribution, [ Y_i \sim \begin{cases} 0 & \text{with probability } \pi_i, \ \text{Poisson}(\mu_i) & \text{with probability } 1-\pi_i, \end{cases} ] where $\mu_i = \exp(f_0^{\text{(linear)}}(\mathbf{X}i))$. The zero-inflation probability $\pi_i$ is modeled as $\pi_i = \mathrm{logit}^{-1}(\mathbf{Z}i^\top \boldsymbol{\gamma})$, with $\mathbf{Z}i$ being the first three active predictors and $\boldsymbol{\gamma} = (1, -1, 0.5)^\top$, resulting in approximately 20% excess zeros. \item \textbf{Measurement Error:} The linear predictor is used, but the features available to the models are corrupted versions $X{ij}^{\ast} = X{ij} + \epsilon{ij}$, where $\epsilon{ij} \sim \mathcal{N}(0, \sigma^2)$ and $\sigma = 0.2$. \end{itemize} \subsubsection{Example 2: Nonlinear Mean Structure} In this example, the predictor function includes nonlinear effects to assess the ability of methods to capture complex signal relationships: [ f_0^{\text{(nonlinear)}}(\mathbf{X}_i) = \beta_0 + \mathbf{X}i^\top \boldsymbol{\beta}0 + \gamma_1 X{i,j_1}^2 + \gamma_2 X{i,j_2}^2, ] where $j_1$ and $j_2$ are the first two features in the active set $\mathcal{S}$, and $\gamma_1 = \gamma_2 = 0.5$. \begin{itemize} \item \textbf{Nonlinear Poisson:} $Y_i \mid \mathbf{X}_i \sim \text{Poisson}(\mu_i)$, where $\mu_i = \exp(f_0^{\text{(nonlinear)}}(\mathbf{X}_i))$. \item \textbf{Nonlinear Negative Binomial:} $Y_i \mid \mathbf{X}_i \sim \text{NB}(\mu_i, \theta)$, where $\mu_i = \exp(f_0^{\text{(nonlinear)}}(\mathbf{X}_i))$ and $\theta = 1.5$. \item \textbf{Nonlinear Zero-Inflated Poisson (ZIP):} The nonlinear predictor defines the Poisson mean in the ZIP model described above. \end{itemize}

\subsubsection{Ultra-High-Dimensional Regime} To stress-test the algorithms, we include a scenario with $n = 50$ and $p = 5000$, preserving sparsity ($s=10$). Data are generated using the linear Poisson DGP.

About

We conduct a comprehensive simulation study to evaluate the finite-sample performance of the proposed Poisson and Negative Binomial random forest methods. The study is designed to assess performance across a range of challenging data-generating processes (DGPs) common in high-dimensional settings.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages