GitHub - rid4stat/RFCount-Regression: We conduct a comprehensive simulation study to evaluate the finite-sample performance of the proposed Poisson and Negative Binomial random forest methods. The study is designed to assess performance across a range of challenging data-generating processes (DGPs) common in high-dimensional settings.

We conduct a comprehensive simulation study to evaluate the finite-sample performance of the proposed Poisson and Negative Binomial random forest methods. The study is designed to assess performance across a range of challenging data-generating processes (DGPs) common in high-dimensional settings, including overdispersion, nonlinearity, zero-inflation, measurement error, and ultra-high dimensionality. We compare our methods against established benchmarks: $\ell_1$-penalized regression (lasso), the elastic net ($\alpha = 0.5$), gradient boosting (XGBoost), and a standard random forest applied to $\log(Y+1)$ as a pseudo-response.

\subsection{Simulation Design and Data Generation}

The core structure of our simulations is as follows. For each replication, we generate a training and a test set from a specified DGP. Unless otherwise noted, the baseline parameters are a sample size of $n = 100$ and an ambient dimension of $p = 1000$. Predictors are drawn from a block-correlated multivariate normal distribution, $\mathbf{X}_i \sim \mathcal{N}_p(\mathbf{0}, \Sigma)$, where $\Sigma$ is a block-diagonal matrix. Each block has a size of 50, and the within-block correlation is set to $\rho = 0.5$.

The true regression vector $\boldsymbol{\beta}0 \in \mathbb{R}^p$ is sparse, with only $s = 10$ nonzero coefficients. The active set $\mathcal{S} = {j_1, \ldots, j{10}}$ is chosen uniformly at random. The values of the non-zero coefficients are drawn from a uniform distribution, $\beta_{0j} \sim \mathrm{Unif}(-0.5, 0.5)$ for $j \in \mathcal{S}$, and are zero otherwise. An intercept $\beta_0 = 1$ is included in all models.

We define two broad classes of DGPs based on the structure of the predictor function $f_0(\mathbf{X}_i)$.

\subsubsection{Example 1: Linear Mean Structure} In this example, the predictor is a linear combination of the true signals: [ f_0^{\text{(linear)}}(\mathbf{X}_i) = \beta_0 + \mathbf{X}_i^\top \boldsymbol{\beta}_0. ] We consider several distributional scenarios built upon this linear structure: \begin{itemize} \item \textbf{Poisson (Equidispersed):} $Y_i \mid \mathbf{X}_i \sim \text{Poisson}(\mu_i)$, where $\mu_i = \exp(f_0^{\text{(linear)}}(\mathbf{X}_i))$. \item \textbf{Negative Binomial (Overdispersed):} $Y_i \mid \mathbf{X}_i \sim \text{NB}(\mu_i, \theta)$, where $\mu_i = \exp(f_0^{\text{(linear)}}(\mathbf{X}_i))$ and the dispersion parameter is set to $\theta = 1.5$, giving $\mathrm{Var}(Y_i \mid \mathbf{X}i) = \mu_i + \mu_i^2/\theta$. \item \textbf{Zero-Inflated Poisson (ZIP):} Data are generated from a mixture distribution, [ Y_i \sim \begin{cases} 0 & \text{with probability } \pi_i, \ \text{Poisson}(\mu_i) & \text{with probability } 1-\pi_i, \end{cases} ] where $\mu_i = \exp(f_0^{\text{(linear)}}(\mathbf{X}i))$. The zero-inflation probability $\pi_i$ is modeled as $\pi_i = \mathrm{logit}^{-1}(\mathbf{Z}i^\top \boldsymbol{\gamma})$, with $\mathbf{Z}i$ being the first three active predictors and $\boldsymbol{\gamma} = (1, -1, 0.5)^\top$, resulting in approximately 20% excess zeros. \item \textbf{Measurement Error:} The linear predictor is used, but the features available to the models are corrupted versions $X{ij}^{\ast} = X{ij} + \epsilon{ij}$, where $\epsilon{ij} \sim \mathcal{N}(0, \sigma^2)$ and $\sigma = 0.2$. \end{itemize} \subsubsection{Example 2: Nonlinear Mean Structure} In this example, the predictor function includes nonlinear effects to assess the ability of methods to capture complex signal relationships: [ f_0^{\text{(nonlinear)}}(\mathbf{X}_i) = \beta_0 + \mathbf{X}i^\top \boldsymbol{\beta}0 + \gamma_1 X{i,j_1}^2 + \gamma_2 X{i,j_2}^2, ] where $j_1$ and $j_2$ are the first two features in the active set $\mathcal{S}$, and $\gamma_1 = \gamma_2 = 0.5$. \begin{itemize} \item \textbf{Nonlinear Poisson:} $Y_i \mid \mathbf{X}_i \sim \text{Poisson}(\mu_i)$, where $\mu_i = \exp(f_0^{\text{(nonlinear)}}(\mathbf{X}_i))$. \item \textbf{Nonlinear Negative Binomial:} $Y_i \mid \mathbf{X}_i \sim \text{NB}(\mu_i, \theta)$, where $\mu_i = \exp(f_0^{\text{(nonlinear)}}(\mathbf{X}_i))$ and $\theta = 1.5$. \item \textbf{Nonlinear Zero-Inflated Poisson (ZIP):} The nonlinear predictor defines the Poisson mean in the ZIP model described above. \end{itemize}

\subsubsection{Ultra-High-Dimensional Regime} To stress-test the algorithms, we include a scenario with $n = 50$ and $p = 5000$, preserving sparsity ($s=10$). Data are generated using the linear Poisson DGP.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
HD count regression RF code.R		HD count regression RF code.R
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages