Jump to: Readme | ppmlhdfe Paper | Separation Paper | Help File | Separation Primer | Separation Benchmarks | Undocumented Options

Sections: Stata | R | Julia | Matlab | Extra
To the best of our knowledge, no existing statistical software addresses the separation problem in a robust way, even more so when working with fixed effects. In this post, we study a few simple examples of separation and show how they affect some of the most popular statistical packages.

We also include 18 example datasets that can be used by package developers to test the correctness of their programs, and we invite further additions to this list.

Note that this is in no way a critique of the packages discussed below, which are in our opinion of excellent quality. We are merely using them to show that separation is not only a theoretical problem, but a practical one.

Also, although we focus here on Poisson models, separation is an issue for most generalized linear models (though it is a better-understood problem in binary-response models).
## Stata

The table below shows a dataset with 10 observations and three regressors. Observations 1 and 5 are separated, because `y = 0` and `z = x2 - x1` is positive in those observations and zero otherwise.

y | x1 | x2 | x3 |
---|---|---|---|
0 | 0 | 1 | 0 |
0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 |
0 | 1 | 9 | 0 |
2 | 21 | 21 | 21 |
3 | 0 | 0 | 0 |
5 | 0 | 0 | 0 |
7 | 0 | 0 | 0 |
10 | -18 | -18 | 0 |
Let's try to run a Poisson regression on this dataset, exploring the following alternatives:

- The built-in `poisson` command.
- The `glm` command with the `poisson` family, using maximum-likelihood optimization. `glm` is more modern than the `poisson` command, and often works if the former has convergence issues.
- The `glm` command with the `poisson` family, estimated via iteratively reweighted least squares (IRLS).
- The `ppml` command, by Silvana Tenreyro and João Santos Silva. `ppml` uses `glm` internally, but first performs a few checks to identify separated observations and the regressors causing them (see also their Economics Letters article):
  - It first checks if there are collinear regressors on the sample where `y > 0`. If there are none, there is no separation.
  - Otherwise, it only keeps a regressor `x` if the following overlap condition holds: `min(x | y=0) ≤ mean(x | y>0) ≤ max(x | y=0)` (a sketch of this check appears right after this list). Note that because the overlap condition is not the same as the condition for separation, in some cases separated observations are not detected, and in other cases non-separated observations are flagged as separated.
- The `ppml` command with the `strict` option.
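To make the overlap condition concrete, here is a minimal sketch of the check for a single regressor `x` (our own illustration, not `ppml`'s actual code; it assumes variables `y` and `x` are already in memory):

```stata
* Overlap check for one regressor (sketch):
* keep x only if min(x | y=0) <= mean(x | y>0) <= max(x | y=0)
summarize x if y == 0, meanonly
local lo = r(min)
local hi = r(max)
summarize x if y > 0, meanonly
local keep = (`lo' <= r(mean)) & (r(mean) <= `hi')
display "overlap condition holds (keep x): `keep'"
```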
```stata
clear
input int(y x1 x2 x3)
0 0 1 0
0 0 0 0
0 0 0 0
0 0 0 0
0 1 9 0
2 21 21 21
3 0 0 0
5 0 0 0
7 0 0 0
10 -18 -18 0
end

poisson y x*, iter(1000)                   // does not converge, but deviance and LL are close to correct values (see figure)
glm y x*, family(poisson) ml iter(1000)    // does not converge, but deviance and LL are close to correct values
glm y x*, family(poisson) irls iter(1000)  // does not converge, but deviance and LL are close to correct values
ppml y x*                                  // does not converge, but deviance and LL are close to correct values
ppml y x*, strict                          // converges to an incorrect solution, with b[x3] = 0.961 and deviance and log-likelihood far from the correct values
ppmlhdfe y x*, sep(none) maxiter(100)      // if we disable separation checks, ppmlhdfe will also fail to converge
ppmlhdfe y x*                              // the only command that converges to the correct estimate; b[x3] = 0.066
```
As we can see if we run the code snippet, all of the standard packages either fail to converge or converge to an incorrect result. For instance, the iteration log of the `poisson` command shows the deviance and log-likelihood approaching their correct values without ever converging.
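As a sanity check, we can apply the separation logic by hand right after running the snippet above: build the certificate `z` (our own auxiliary variable, not part of the dataset), confirm that it flags observations 1 and 5, drop them, and re-run `poisson`:

```stata
* Sketch: verify the separation condition by hand
gen z = x2 - x1
list y z if z > 0   // only observations 1 and 5; both have y = 0
drop if z > 0       // drop the two separated observations
poisson y x*        // now converges; one of x1/x2 is omitted as collinear
```

On the remaining sample `x1` and `x2` are identical, so Stata omits one of them, and the estimate of `b[x3]` should match the one reported by `ppmlhdfe`.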
The examples above mostly showed cases where the commands ran forever and failed to converge. However, more problematic than a failure to converge -- which gives no output -- is convergence to an incorrect result, where the user might not even be aware of the misleading estimates. The example below shows one such case, in a setup very similar to the one above:
y | x1 | x2 | x3 | x4 |
---|---|---|---|---|
0 | 0 | 0 | 2 | 10 |
0 | 0 | 0 | 0 | -2 |
0 | 0 | 0 | 0 | 6 |
0 | 0 | 0 | 4 | 5 |
0 | 1 | 0 | 0 | 3 |
2 | 0 | 0 | 0 | 3 |
2 | 0 | 0 | 0 | 4 |
2 | 0 | 0 | -2 | 15 |
2 | 0 | 0 | 0 | -7 |
4 | 0 | 0 | -3 | 15 |
6 | -3 | -3 | 0 | 4 |
6 | 0 | 0 | 0 | 4 |
Here, the fifth observation is separated by the linear combination `z = x1 - x2`.
```stata
loc url "https://raw.githubusercontent.com/sergiocorreia/ppmlhdfe/master/guides/csv"
insheet using "`url'/example1.csv", clear
ppmlhdfe y x*
poisson y x*, iter(100) nolog
cap noi glm y x*, family(poisson) ml iter(100) nolog
cap noi glm y x*, family(poisson) irls iter(100) nolog
ppml y x*
ppml y x*, strict
```
Because a linear combination of x1 and x2 is causing the separation, once we remove the separated observation x1 and x2 become equal. Thus, one of these two variables will be dropped due to collinearity. In the case of `ppmlhdfe`, x2 was the variable dropped.
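We can verify the collinearity claim directly; a minimal sketch (observation 5 is the separated one):

```stata
* After dropping the separated observation, x1 and x2 are identical
loc url "https://raw.githubusercontent.com/sergiocorreia/ppmlhdfe/master/guides/csv"
insheet using "`url'/example1.csv", clear
drop in 5          // the separated observation
assert x1 == x2    // collinear on the remaining sample
```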
Note, however, that all other methods report differing results:

- `poisson` and `glm` report an arbitrarily low number for b1 and a high number for b2. For instance, `b1 = -15.18` with `poisson`, -16.35 with `glm, ml`, and -16.47 with `glm, irls`.
- The estimates for the regressors not involved in the separation also differ across packages:

Method | _b[_cons] | Numerical Error |
---|---|---|
ppmlhdfe (correct) | 0.59095 | |
poisson | 0.59105 | -1.0E-04 |
glm, ml | 0.59086 | 8.5E-05 |
glm, irls | 0.59095 | -3.7E-14 |
ppml | 0.49919 | 9.2E-02 |
ppml, strict | 0.49919 | 9.2E-02 |
The difference here is not extremely large, but a priori it is impossible to know how big the error introduced by the statistical package will be.
## R

We first test the widely used `glm` function:

y | x1 | x2 | x3 | x4 |
---|---|---|---|---|
0 | 14 | 4 | 0 | -12 |
0 | 0 | 35 | 34 | 12 |
0 | 0 | 3 | 0 | 17 |
0 | 0 | 0 | 0 | 1 |
0 | 0 | 0 | -2 | 7 |
0 | 0 | 25 | 0 | 12 |
0 | 0 | 13 | 0 | 60 |
0 | 0 | 15 | 0 | 7 |
0 | 1 | 0 | 7 | -24 |
9 | 0 | 0 | 0 | 18 |
4 | 2 | 0 | 6 | -1 |
2 | 1 | 0 | 0 | -7 |
> data <- read.csv(file="example2.csv", header=TRUE, sep=",")
> formula <- y ~ x1 + x2 + x3 + x4
> mod <- glm(formula, data, family=poisson())
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted rates numerically 0 occurred
> summary(mod)
Call:
glm(formula = formula, family = poisson(), data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.109e-03 -2.000e-08 -2.000e-08 -2.000e-08 2.000e-08
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -367.83 7448.51 -0.049 0.961
x1 512.42 10345.15 0.050 0.960
x2 -1644.86 33284.96 -0.049 0.961
x3 -105.85 2138.00 -0.050 0.961
x4 20.56 413.81 0.050 0.960
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4.6719e+01 on 11 degrees of freedom
Residual deviance: 9.6716e-06 on 7 degrees of freedom
AIC: 19.933
Number of Fisher Scoring iterations: 25
As expected, the results don't converge, but at least `glm` gives us a warning that some observations were perfectly fitted at zero. Note, however, that these warnings are not always given:

y | x1 | x2 | x3 | x4 |
---|---|---|---|---|
0 | 0 | 0 | 2 | 10 |
0 | 0 | 0 | 0 | -2 |
0 | 0 | 0 | 0 | 6 |
0 | 0 | 0 | 4 | 5 |
0 | 1 | 0 | 0 | 3 |
2 | 0 | 0 | 0 | 3 |
2 | 0 | 0 | 0 | 4 |
2 | 0 | 0 | -2 | 15 |
2 | 0 | 0 | 0 | -7 |
4 | 0 | 0 | -3 | 15 |
6 | -3 | -3 | 0 | 4 |
6 | 0 | 0 | 0 | 4 |
> data <- read.csv(file="example1.csv", header=TRUE, sep=",")
> formula <- y ~ x1 + x2 + x3 + x4
> mod <- glm(formula, data, family=poisson())
> summary(mod)
Call:
glm(formula = formula, family = poisson(), data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.97355 -0.75131 -0.16879 0.07357 2.70708
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.59095 0.30291 1.951 0.0511 .
x1 -17.78017 3467.85854 -0.005 0.9959
x2 17.32952 3467.85856 0.005 0.9960
x3 -0.47085 0.23117 -2.037 0.0417 *
x4 -0.03779 0.04375 -0.864 0.3878
---
Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 31.912 on 11 degrees of freedom
Residual deviance: 15.956 on 7 degrees of freedom
AIC: 46.991
Number of Fisher Scoring iterations: 15
Here `x1 - x2` equals 1 on the 5th observation and zero otherwise. This explains why x1 and x2 have very large negative and positive estimates, respectively. However, unlike the previous case, no warning is given.
An alternative to `glm` is the `glm2` package by Ian Marschner and Mark W. Donoghoe. As described here, it is aimed at models where `glm` has difficulty converging, such as GLMs with `family=poisson` and `link=identity` (instead of the more common `link=log`, which corresponds to a Poisson regression). It works by applying step-halving to the iteratively reweighted least squares (IRLS) updates whenever an iteration seems to be having trouble converging.
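To illustrate the idea (and only the idea: this is our own toy sketch on simulated data, not `glm2`'s actual code), here is a minimal IRLS loop for a one-regressor Poisson model where a step is halved whenever it would increase the deviance:

```stata
* Toy illustration of IRLS with step-halving (the idea behind glm2)
clear
set obs 200
set seed 42
gen double x = rnormal()
gen y = rpoisson(exp(0.3 + 0.7 * x))
gen double eta = ln(y + 0.5)              // crude starting values
gen double mu  = exp(eta)
local dev = 1e300
forvalues it = 1/25 {
    gen double z = eta + (y - mu) / mu    // IRLS working variable
    quietly regress z x [iw = mu]         // weighted least-squares step
    drop z
    predict double eta_new, xb
    gen double mu_new = exp(eta_new)
    gen double d = cond(y == 0, mu_new, y * ln(y / mu_new) - (y - mu_new))
    quietly summarize d, meanonly
    local dev_new = 2 * r(sum)
    * step-halving: if the deviance increased, move halfway back and retry
    local k = 0
    while (`dev_new' > `dev') & (`k' < 20) {
        replace eta_new = 0.5 * (eta + eta_new)
        replace mu_new = exp(eta_new)
        replace d = cond(y == 0, mu_new, y * ln(y / mu_new) - (y - mu_new))
        quietly summarize d, meanonly
        local dev_new = 2 * r(sum)
        local ++k
    }
    replace eta = eta_new
    replace mu  = mu_new
    local dev = `dev_new'
    drop eta_new mu_new d
    display "iteration `it': deviance = `dev'"
}
```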
Because the innovation in `glm2` is unrelated to the separation problem, we don't expect `glm2` to perform better than `glm` in examples where separation is present. As shown below, that is indeed the case: in both examples, the output is identical to that of `glm`:
> data <- read.csv(file="csv/example2.csv", header=TRUE, sep=",")
> formula <- y ~ x1 + x2 + x3 + x4
> mod <- glm2(formula, data, family=poisson())
Warning messages:
1: glm.fit2: algorithm did not converge. Try increasing the maximum iterations
2: glm.fit2: fitted rates numerically 0 occurred
> summary(mod)
Call:
glm2(formula = formula, family = poisson(), data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.109e-03 -2.000e-08 -2.000e-08 -2.000e-08 2.000e-08
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -367.83 7448.51 -0.049 0.961
x1 512.42 10345.15 0.050 0.960
x2 -1644.86 33284.96 -0.049 0.961
x3 -105.85 2138.00 -0.050 0.961
x4 20.56 413.81 0.050 0.960
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4.6719e+01 on 11 degrees of freedom
Residual deviance: 9.6716e-06 on 7 degrees of freedom
AIC: 19.933
Number of Fisher Scoring iterations: 25
> data <- read.csv(file="csv/example1.csv", header=TRUE, sep=",")
> formula <- y ~ x1 + x2 + x3 + x4
> mod <- glm2(formula, data, family=poisson())
> summary(mod)
Call:
glm2(formula = formula, family = poisson(), data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.97355 -0.75131 -0.16879 0.07357 2.70708
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.59095 0.30291 1.951 0.0511 .
x1 -17.78017 3467.85856 -0.005 0.9959
x2 17.32952 3467.85857 0.005 0.9960
x3 -0.47085 0.23117 -2.037 0.0417 *
x4 -0.03779 0.04375 -0.864 0.3878
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 31.912 on 11 degrees of freedom
Residual deviance: 15.956 on 7 degrees of freedom
AIC: 46.991
Number of Fisher Scoring iterations: 15
We will now test Amrei Stammann's great `alpaca` package, which allows for fixed effects in Poisson models (repo and paper; seriously, try it out if you are an R user!).
First, using fe1.csv we can observe that b2=-18.99:
```r
> # install.packages("alpaca")
> data <- read.csv(file="fe1.csv", header=TRUE, sep=",")
> library(alpaca)
> formula <- y ~ x1 + x2 | i + j
> mod <- feglm(formula, data, family=poisson())
> summary(mod)
poisson
y ~ x1 + x2 | i + j
l= [4, 4], n= 18, deviance= 13.3644

Structural parameter(s):
    Estimate Std. error z value Pr(> |z|)
x1   -0.4845     1.2439  -0.390     0.697
x2  -18.9906  3934.0725  -0.005     0.996
```
However, if we instead run this with `ppmlhdfe`, we would observe that 4 out of the 13 observations are separated, and that the estimate for b2 is not -18.99:
insheet using "csv/fe1.csv", clear
ppmlhdfe y x*, a(i j)
Another possible outcome, as we see with fe2.csv, is that the package crashes:
> data <- read.csv(file="fe2.csv", header=TRUE, sep=",")
> library(alpaca)
> formula <- y ~ x1 | i + j
> mod <- feglm(formula, data, family=poisson())
Error in feglm(formula, data, family = poisson()) :
Backtracking (step-halving) failed.
Finally, as shown with fe3.csv, in some situations `alpaca` detects some of the separated observations, although not all of them:
> data <- read.csv(file="fe3.csv", header=TRUE, sep=",")
> library(alpaca)
> formula <- y ~ x1 | i + j
> mod <- feglm(formula, data, family=poisson())
> summary(mod)
poisson
y ~ x1 | i + j
l= [2, 2], n= 9, deviance= 13.9161
Structural parameter(s):
Estimate Std. error z value Pr(> |z|)
x1 -0.2995 0.3101 -0.966 0.334
( 7 observation(s) deleted due to perfect classification )`
Here, `alpaca` dropped 7 out of the 8 separated observations, failing to drop the 10th observation. We can verify this with Stata:
insheet using "csv/fe3.csv", clear
ppmlhdfe y x1, a(i j) // drops 8 separated obs.
gen z = 2 * (i==4) - 2 * (i==1) - 2 * (i==6) - 6 * (j==1) - 6 * (j==3) - 2 * (j==4) - 1 * (j==5) + 8 // we can verify that the observations are indeed separated
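If `z` as defined above is a valid certificate of separation, the following checks (a sketch, to be run right after the snippet above) should all pass:

```stata
assert z == 0 if y > 0   // the certificate vanishes on the positive-y sample
assert z >= 0            // and is non-negative where y = 0
count if z > 0           // the separated observations flagged by z
```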
## Julia

The code below runs a GLM regression in Julia, with the data in example1.csv.
```julia
import Pkg
# Pkg.add("CSV")
# Pkg.add("DataFrames")
# Pkg.add("GLM")
# Pkg.add("CategoricalArrays")
using CSV, DataFrames, GLM

data = CSV.File("./csv/example1.csv") |> DataFrame
ans = fit(GeneralizedLinearModel, @formula(y ~ x1 + x2 + x3 + x4), data, Poisson())
round(deviance(ans), digits=5)
```
As in the case with R's GLM, b1 is arbitrarily low (-12.7) and b2 arbitrarily high (12.3).
With example2.csv instead, the package raises an `AssertionError`:
```julia
import Pkg
using CSV, DataFrames, GLM

data = CSV.File("./csv/example2.csv") |> DataFrame
ans = fit(GeneralizedLinearModel, @formula(y ~ x1 + x2 + x3 + x4), data, Poisson())
round(deviance(ans), digits=5)
```
Finally, similar issues also occur in models with fixed effects (note the -13.5 estimate for b2):
```julia
import Pkg
using CSV, DataFrames, GLM, CategoricalArrays

data = CSV.File("csv/fe1.csv") |> DataFrame
data.id1 = CategoricalArray(data.i)
data.id2 = CategoricalArray(data.j)
ans = fit(GeneralizedLinearModel, @formula(y ~ x1 + x2 + id1 + id2), data, Poisson())
round(deviance(ans), digits=5)
```
(Also see this issue on Julia's GLM repo)
## Matlab

We first run the data in example1.csv using Matlab's `glmfit` command.
```matlab
data = readtable('example1.csv');
X = [data.x1 data.x2 data.x3 data.x4];
Y = data.y;
[b, dev, stats] = glmfit(X, Y, 'poisson');
b'
dev
```
In this case, `glmfit` failed to converge, although it warned about possible data scaling issues.
Similarly, the `fitglm` command also fails to converge, as we can see with example2.csv:
```matlab
data = readtable('example2.csv');
X = [data.x1 data.x2 data.x3 data.x4];
Y = data.y;
model = fitglm(X, Y, 'linear', 'Distribution', 'Poisson');
```
## Extra

- Python: StatsModels
- Spark