zi.Rmd

# Zero-Inflated Model


Log likelihood function to estimate parameters for a Zero-inflated Poisson model. With examples
and comparison to <span class="pack" style = "">pscl</span> package output. Also includes approach based on Hilbe GLM text.

Zero-inflated models are applied to situations in which target data has relatively many of one value, usually zero, to go along with the other observed values. They are two-part models, a logistic model for whether an observation is zero or not, and a count model for the other part. The key distinction from hurdle count models is that the count distribution contributes to the excess zeros. While the typical application is count data, the approach can be applied to any distribution in theory.


## Poisson

### Data Setup

Get the data.

```{r zip-setup}
library(tidyverse)

fish = haven::read_dta("http://www.stata-press.com/data/r11/fish.dta")
```

### Function

The log likelihood function.

```{r zip-ll}
zip_ll <- function(y, X, par) {
  # arguments are response y, predictor matrix X, and parameter named starting points of 'logit' and 'pois'
  
  # Extract parameters
  logitpars = par[grep('logit', names(par))]   
  poispars  = par[grep('pois', names(par))]     
  
  # Logit part; in this function Xlogit = Xpois but one could split X argument into Xlogi and Xpois for example
  Xlogit  = X
  LPlogit = Xlogit %*% logitpars
  logi0   = plogis(LPlogit)  # alternative 1/(1+exp(-LPlogit))
    
  # Poisson part
  Xpois  = X
  mupois = exp(Xpois %*% poispars)
  
  # LLs
  logliklogit = log( logi0 + exp(log(1 - logi0) - mupois) )
  loglikpois  = log(1 - logi0) + dpois(y, lambda = mupois, log = TRUE)
  
  # Hilbe formulation
  # logliklogit = log(logi0 + (1 - logi0)*exp(- mupois) )
  # loglikpois = log(1-logi0) -mupois + log(mupois)*y     #not necessary: - log(gamma(y+1))
    
  y0 = y == 0  # 0 values
  yc = y > 0   # Count part

  loglik = sum(logliklogit[y0]) + sum(loglikpois[yc])
  -loglik
}
```


### Estimation


Get starting values or simply do zeros.

```{r zip-starts}
# for zip: need 'logit', 'pois'
initial_model = glm(
  count ~ persons + livebait,
  data = fish,
  x = TRUE,
  y = TRUE,
  "poisson"
)

# starts = c(logit = coef(initial_model), pois = coef(initial_model))  
starts = c(rep(0, 3), rep(0, 3))

names(starts) = c(paste0('pois.', names(coef(initial_model))),
                  paste0('logit.', names(coef(initial_model))))
```


Estimate with <span class="func" style = "">optim</span>.

```{r zip-est}
fit = optim(
  par = starts ,
  fn  = zip_ll,
  X   = initial_model$x,
  y   = initial_model$y,
  method  = "BFGS",
  control = list(maxit = 5000, reltol = 1e-12),
  hessian = TRUE
)

# fit
```


### Comparison

Extract for clean display.

```{r zip-extract}
B  = fit$par
se = sqrt(diag(solve((fit$hessian))))
Z  = B/se
p  = pnorm(abs(Z), lower = FALSE)*2
```

Results from <span class="pack" style = "">pscl</span>.

```{r zip-pscl}
library(pscl)

fit_pscl = zeroinfl(count ~ persons + livebait, data = fish, dist = "poisson") 
```

Compare.

```{r zip-compare-show, echo=FALSE}
init1 = purrr::map(summary(fit_pscl)$coefficients, function(x) {
  data.frame(x)
  colnames(x) = colnames(summary_table)
  x
}) %>%
  do.call(rbind, .) %>%
  as.data.frame() %>%
  rownames_to_column('coef') 

init2 = data.frame(B, se, Z, p) %>% rownames_to_column('coef')

bind_rows(list(pscl = init1, zip_poisson_ll = init2), .id = '') %>% 
  kable_df()
```


## Negative Binomial

### Function

```{r zinb-ll}
zinb_ll <- function(y, X, par) {
  # arguments are response y, predictor matrix X, and parameter named starting points of 'logit', 'negbin', and 'theta'
  
  # Extract parameters
  logitpars  = par[grep('logit', names(par))]
  negbinpars = par[grep('negbin', names(par))]
  theta = exp(par[grep('theta', names(par))])
  
  # Logit part; in this function Xlogit = Xnegbin but one could split X argument into Xlogit and Xnegbin for example
  Xlogit  = X
  LPlogit = Xlogit %*% logitpars
  logi0   = plogis(LPlogit) 
  
  # Negbin part
  Xnegbin = X
  munb = exp(Xnegbin %*% negbinpars)
  
  # LLs
  logliklogit  = 
    log( 
      logi0 + 
        exp(
          log(1 - logi0) + 
            suppressWarnings(dnbinom(0, size = theta, mu   = munb, log  = TRUE))
        )
    )
  
  logliknegbin = log(1 - logi0) + 
    suppressWarnings(dnbinom(y, 
                             size = theta, 
                             mu   = munb, 
                             log  = TRUE))
  
  # Hilbe formulation
  # theta part 
  # alpha = 1/theta  
  # m = 1/alpha
  # p = 1/(1 + alpha*munb)
  
  # logliklogit = log( logi0 + (1 - logi0)*(p^m) )
  # logliknegbin = log(1-logi0) + log(gamma(m+y)) - log(gamma(m)) + m*log(p) + y*log(1-p)   # gamma(y+1) not needed
  
  y0 = y == 0   # 0 values
  yc = y > 0    # Count part
  
  loglik = sum(logliklogit[y0]) + sum(logliknegbin[yc])
  -loglik
}
```


### Estimation

Get starting values or simply do zeros.

```{r zinb-starts}
# for zinb: 'logit', 'negbin', 'theta'
initial_model = model.matrix(count ~ persons + livebait, data = fish) # to get X matrix

startlogi  = glm(count == 0 ~ persons + livebait, data = fish, family = "binomial")
startcount = glm(count ~ persons + livebait, data = fish, family = "poisson")

starts = c(
  negbin = coef(startcount),
  logit = coef(startlogi),
  theta = 1
)  
# starts = c(negbin = rep(0, 3),
#            logit = rep(0, 3),
#            theta = log(1))
```


Estimate with <span class="func" style = "">optim</span>.

```{r zinb-est}
fit_nb = optim(
  par = starts ,
  fn  = zinb_ll,
  X   = initial_model,
  y   = fish$count,
  method  = "BFGS",
  control = list(maxit = 5000, reltol = 1e-12),
  hessian = TRUE
)
# fit_nb
```


### Comparison

Extract for clean display.

```{r zinb-extract}
B  = fit_nb$par
se = sqrt(diag(solve((fit_nb$hessian))))
Z  = B/se
p  = pnorm(abs(Z), lower = FALSE)*2
```

Results from <span class="pack" style = "">pscl</span>.

```{r zinb-pscl}
# pscl results
library(pscl)

fit_pscl = zeroinfl(count ~ persons + livebait, data = fish, dist = "negbin")
```

Compare.


```{r zinb-compare-show, echo=FALSE}
init1 = purrr::map(summary(fit_pscl)$coefficients, function(x) {
  data.frame(x)
  colnames(x) = colnames(summary_table)
  x
}) %>%
  do.call(rbind, .) %>%
  as.data.frame() %>%
  rownames_to_column('coef') 

init2 = data.frame(B, se, Z, p) %>% rownames_to_column('coef')

bind_rows(list(pscl = init1, zinb_ll = init2), .id = '') %>% 
  kable_df()
```


## Supplemental Example

This supplemental example uses the bioChemists data. It contains a sample of 915 biochemistry graduate students with the following information:

- **art**: count of articles produced during last 3 years of Ph.D.
- **fem**: factor indicating gender of student, with levels Men and Women
- **mar**: factor indicating marital status of student, with levels Single and Married
- **kid5**: number of children aged 5 or younger
- **phd**: prestige of Ph.D. department
- **ment**: count of articles produced by Ph.D. mentor during last 3 years


```{r zinb-supplemental}
data("bioChemists", package = "pscl")

initial_model   = model.matrix(art ~ fem + mar + kid5 + phd + ment, data = bioChemists) # to get X matrix
startlogi  = glm(art==0 ~ fem + mar + kid5 + phd + ment, data = bioChemists, family = "binomial")
startcount = glm(art ~ fem + mar + kid5 + phd + ment, data = bioChemists, family = "quasipoisson")

starts = c(
  negbin = coef(startcount),
  logit  = coef(startlogi),
  theta  = summary(startcount)$dispersion
)  

# starts = c(negbin = rep(0, 6),
#            logit = rep(0, 6),
#            theta = 1)


fit_nb_pub = optim(
  par = starts ,
  fn  = zinb_ll,
  X   = initial_model,
  y   = bioChemists$art,
  method  = "BFGS",
  control = list(maxit = 5000, reltol = 1e-12),
  hessian = TRUE
)
# fit_nb_pub


B  = fit_nb_pub$par
se = sqrt(diag(solve((fit_nb_pub$hessian))))
Z  = B/se
p  = pnorm(abs(Z), lower = FALSE)*2


library(pscl)
fit_pscl = zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")

summary(fit_pscl)$coefficients
round(data.frame(B, se, Z, p), 4)
```


## Source

Original code for zip_ll found at
https://github.com/m-clark/Miscellaneous-R-Code/blob/master/ModelFitting/poiszeroinfl.R

Original code for zinb_ll found at https://github.com/m-clark/Miscellaneous-R-Code/blob/master/ModelFitting/NBzeroinfl.R