To fit any kind of statistical model you need some kind of numerical representation of your data. Data often comes in a table, a named collection of variables of different types of data. Some of that data is “continuous”, or basically numeric. But often our data is not numeric (or continuous), but “categorical”, having a finite number of distinct levels.
For instance, let’s look at the KB07 dataset:
using DataFrames, DataFramesMeta
using MixedModels, GLM
kb07 = MixedModels.dataset(:kb07)
Here :spkr
, :prec
, and :load
are categortical variables, each of which
takes on two different values. If we fit a regression using this dataset, we
end up with predictors that refer to specific levels:
f = @formula(rt_trunc ~ 1 + spkr + prec + spkr&prec + (1 | subj))
mod = fit(MixedModel, f, kb07)
Let’s look at a few rows of the fixed effects design matrix that’s generated for this model:
Int.(mod.X)[1:5, :]
A few things to note: all the values are 0 or 1, and there’s one column of all
1s at the start (that’s the (Intercept)
term). Columns 2 and 3 correspond to
spkr
and prec
: there’s a 0 where spkr == "new"
and a 1 for "old"
.
Note that the coefficient name for this column is spkr: old
, which indicates
that this predictor indicates the presence of “old”, relative to the (implicit)
baseline of “new”. Similarly for prec: maintain
.
The last column is the interaction term spkr&prec
, and it’s the elementwise
product of the columns for spkr: new
and pred: maintain
.
kb07[1:5, :]
You can set your own contrasts via the contrasts=
keyword argument in fit
,
with the variable you want to code as the key and contrasts as the value:
using StatsModels
contrasts = Dict(
:spkr => EffectsCoding(base = "old"),
:prec => DummyCoding(levels = ["maintain", "break"])
)
mod2 = fit(MixedModel, f, kb07, contrasts=contrasts)
This example illustrates two ways to control the ordering of levels used to compute the contrasts:
- you can use
base=
to determine the baseline level - you can use
levels=
to indicate all the levels that are used in the contrasts, the first of which is automatically set as the baseline.
StatsModels.jl provides a few commonly used contrast coding schemes, some less-commonly used schemes, and structs that allow you to manually specify your own, custom schemes.
All are subtypes of the AbstractContrasts
type:
subtypes(AbstractContrasts)
And all have fairly extensive documentation via the normal help system. For instance:
?SeqDiffCoding
The most commonly used contrasts are DummyCoding
and EffectsCoding
(which
are similar to contr.treatment
and contr.sum
in R, respectively).
We also provide HelmertCoding
and SeqDiffCoding
(corresponding to base R’s
contr.helmert
and MASS’s contr.sdiff
).
There are two ways to manually specify contrasts. First, you can specify them
directly via ContrastsCoding
. If you do, it’s good practice to specify the
levels corresponding to the rows of the matrix, although they can be omitted in
which case they’ll be inferred from the data.
For instance, here’s a weird set of contrasts for :spkr
:
cs = Matrix([-1/3 2/3]')
contr_manual = Dict(:spkr => StatsModels.ContrastsCoding(cs, levels=["old", "new"]))
mod3 = fit(MixedModel, f, kb07, contrasts=contr_manual)
(Note that the estimates and even the signs of the fixed effect βs change when we change the contrasts, but the overall log-likelihood doesn’t).
We can see that the values from the contrasts matrix we specified are plugged directly in to the fixed effects matrix, and are also used in computing the predictor for the interaction:
mod3.X[1:5, :]
A better way to specify manual contrasts is via HypothesisCoding
, where each
row of the matrix corresponds to the weights given to the cell means of the
levels corresponding to each column (see Schad et al. 2020 for more
information). This is less interesting with only two levels, so let’s look at a
scenario where we combine :spkr
and :prec
into a single, 4-level predictor,
and want to test some strange hypotheses.
kb07ex = @transform(kb07, spkr_prec = :spkr .* "-" .* :prec);
f2 = @formula(rt_trunc ~ 1 + spkr_prec + (1 | subj))
mod4 = fit(MixedModel, f2, kb07ex)
Let’s say we want to test whether the effect of :prec
depends on whether
:spkr
is old vs. new. We need one contrast to test the hypothesis that
"maintain" != "break"
for “new”, and another for “old”. That leaves one over,
to test the overall difference between “new” and “old”.
levels = ["new-break", "new-maintain", "old-break", "old-maintain"]
prec_old = (levels .== "old-break") .- (levels .== "old-maintain")
prec_new = (levels .== "new-break") .- (levels .== "new-maintain")
old_new = (abs.(prec_old) .- abs.(prec_new)) ./ 2
hypotheses = Matrix(hcat(old_new, prec_old, prec_new)')
contr_hyp = Dict(:spkr_prec => HypothesisCoding(hypotheses,
labels=["old", "(old) break", "(old) maint"]))
mod5 = fit(MixedModel, f2, kb07ex, contrasts = contr_hyp)