Skip to content

Latest commit

 

History

History
407 lines (332 loc) · 18.9 KB

contrasts-and-formula.org

File metadata and controls

407 lines (332 loc) · 18.9 KB

Contrasts and @formula in StatsModels.jl

Contrast coding

What and why?

To fit any kind of statistical model you need some kind of numerical representation of your data. Data often comes in a table, a named collection of variables of different types of data. Some of that data is “continuous”, or basically numeric. But often our data is not numeric (or continuous), but “categorical”, having a finite number of distinct levels.

For instance, let’s look at the KB07 dataset:

using DataFrames, DataFramesMeta
using MixedModels, GLM

kb07 = MixedModels.dataset(:kb07)

Here :spkr, :prec, and :load are categortical variables, each of which takes on two different values. If we fit a regression using this dataset, we end up with predictors that refer to specific levels:

f = @formula(rt_trunc ~ 1 + spkr + prec + spkr&prec + (1 | subj))
mod = fit(MixedModel, f, kb07)

Let’s look at a few rows of the fixed effects design matrix that’s generated for this model:

Int.(mod.X)[1:5, :]

A few things to note: all the values are 0 or 1, and there’s one column of all 1s at the start (that’s the (Intercept) term). Columns 2 and 3 correspond to spkr and prec: there’s a 0 where spkr == "new" and a 1 for "old". Note that the coefficient name for this column is spkr: old, which indicates that this predictor indicates the presence of “old”, relative to the (implicit) baseline of “new”. Similarly for prec: maintain.

The last column is the interaction term spkr&prec, and it’s the elementwise product of the columns for spkr: new and pred: maintain.

kb07[1:5, :]

How to take control

You can set your own contrasts via the contrasts= keyword argument in fit, with the variable you want to code as the key and contrasts as the value:

using StatsModels

contrasts = Dict(
    :spkr => EffectsCoding(base = "old"),
    :prec => DummyCoding(levels = ["maintain", "break"])
)

mod2 = fit(MixedModel, f, kb07, contrasts=contrasts)

This example illustrates two ways to control the ordering of levels used to compute the contrasts:

  1. you can use base= to determine the baseline level
  2. you can use levels= to indicate all the levels that are used in the contrasts, the first of which is automatically set as the baseline.

Built in contrast coding schemes

StatsModels.jl provides a few commonly used contrast coding schemes, some less-commonly used schemes, and structs that allow you to manually specify your own, custom schemes.

All are subtypes of the AbstractContrasts type:

subtypes(AbstractContrasts)

And all have fairly extensive documentation via the normal help system. For instance:

?SeqDiffCoding

Standard contrasts

The most commonly used contrasts are DummyCoding and EffectsCoding (which are similar to contr.treatment and contr.sum in R, respectively).

“Exotic” contrasts

We also provide HelmertCoding and SeqDiffCoding (corresponding to base R’s contr.helmert and MASS’s contr.sdiff).

Manual contrasts

There are two ways to manually specify contrasts. First, you can specify them directly via ContrastsCoding. If you do, it’s good practice to specify the levels corresponding to the rows of the matrix, although they can be omitted in which case they’ll be inferred from the data.

For instance, here’s a weird set of contrasts for :spkr:

cs = Matrix([-1/3 2/3]')
contr_manual = Dict(:spkr => StatsModels.ContrastsCoding(cs, levels=["old", "new"]))
mod3 = fit(MixedModel, f, kb07, contrasts=contr_manual)

(Note that the estimates and even the signs of the fixed effect βs change when we change the contrasts, but the overall log-likelihood doesn’t).

We can see that the values from the contrasts matrix we specified are plugged directly in to the fixed effects matrix, and are also used in computing the predictor for the interaction:

mod3.X[1:5, :]

Contrasts from hypotheses

A better way to specify manual contrasts is via HypothesisCoding, where each row of the matrix corresponds to the weights given to the cell means of the levels corresponding to each column (see Schad et al. 2020 for more information). This is less interesting with only two levels, so let’s look at a scenario where we combine :spkr and :prec into a single, 4-level predictor, and want to test some strange hypotheses.

kb07ex = @transform(kb07, spkr_prec = :spkr .* "-" .* :prec);

f2 = @formula(rt_trunc ~ 1 + spkr_prec + (1 | subj))
mod4 = fit(MixedModel, f2, kb07ex)

Let’s say we want to test whether the effect of :prec depends on whether :spkr is old vs. new. We need one contrast to test the hypothesis that "maintain" != "break" for “new”, and another for “old”. That leaves one over, to test the overall difference between “new” and “old”.

levels = ["new-break", "new-maintain", "old-break", "old-maintain"]

prec_old = (levels .== "old-break") .- (levels .== "old-maintain")
prec_new = (levels .== "new-break") .- (levels .== "new-maintain")
old_new = (abs.(prec_old) .- abs.(prec_new)) ./ 2

hypotheses = Matrix(hcat(old_new, prec_old, prec_new)')
contr_hyp = Dict(:spkr_prec => HypothesisCoding(hypotheses,
                                                labels=["old", "(old) break", "(old) maint"]))
mod5 = fit(MixedModel, f2, kb07ex, contrasts = contr_hyp)