
Comments on JOSS paper #296

Open · dill opened this issue Jul 15, 2024 · 7 comments
dill commented Jul 15, 2024

(Part of review here.)

This was fun to read and I learnt some things :)

Some minor statistical/editorial comments on the paper:

  • The author name is shown as "¿citation_author?" in the pdf on the gratia github. Not sure if this is normal.
  • The formulation for the GAM (first display equation) doesn't allow for multivariate smooths. This is always a pain, and I've struggled to come up with a general form. In the end I did what Simon Wood does in his book and just gave an example of what can be done, rather than trying to write a general definition. My version of this is in Miller (2019), in case that's helpful; a sketch of one possible general form is given after this list. If you did adapt this bit, there would be some follow-on changes needed in the subsequent text (though I think just giving an example of a univariate spline in the following text is totally appropriate).
  • In addition you only talk about scale parameters as the other parameters of the distribution. That's fine for exponential family distributions, but for the extended exponential case (as you show with the twlss() model), there are more parameters to consider. You might not want to address this in the intro -- that's fine!
  • "which, with the default penalty, is the integrated squared second derivative" -- I don't think this is quite right. Do you mean the default basis-penalty combination, thin plate regression splines? (As you go on to talk about.)
  • Is Figure 1a the basis functions or the basis functions multiplied by their corresponding coefficients? I think it's the latter?
  • It would be nice to have a legend in Figure 1 so readers can relate the basis functions to the values in the penalty matrix (and e.g., see that the most wiggly function has the highest penalty)? Open to push-back on this, if you want to just give default plots from gratia.
  • "GAMs fitted by mgcv are an empirical Bayesian model with an improper multivariate normal prior on the basis function coeficients." Is that always the case? One might argue that the intercept is always in the nullspace of every penalty, but one could remove that from the model and have a proper MVN prior? Could just put the improper in brackets with the word "usually"?
  • In previous snippet, coefficients, not "coeficients"
  • Miller (2019) should be Miller (2021) -- I updated it on arXiv.
  • "The response is assumed to be conditionally distributed Tweedie" -> "The response is assumed to be conditionally Tweedie distributed"?
  • "Model coefficients and smoothing parameters are estimated using restricted maximum likelihood (Wood, 2011)" missing period at end of sentence?
  • Explanation of ctrl <- gam.control(nthreads = 10)?
  • "which show significant heteroscedasticity and departure from the condtional distribution of the response given the model" -- could highlight that we see this due to the increasing spread in the deviance residuals wrt values of the linear predictor in the top right plot?
  • In the previous snippet "condtional" -> "conditional"
  • "given the absence of important effects in the model" not sure what this means?
  • In the Tweedie LSS model lead-in, it looks like you're using \varphi rather than \phi for the scale parameter (as you did in the intro), not sure if this was intentional?
  • "little data wrangling, we can produce an uncertaintry estimate ,using fitted_samples() to" misplaces comma.
  • In the references:
    • Not sure how the JOSS wants you to style "brms" but "Brms" looks weird.
    • "Duchon, J. (1977). Splines minimizing rotation-invariant semi-norms in sobolev spaces" Sobolev should be capitalised.
    • "Wood, S. N. (2003). Thin plate regression splines: Thin plate regression splines." I think it just has one title? ;)
gavinsimpson (Owner) commented

The author name is shown as "¿citation_author?" in the pdf on the gratia github. Not sure if this is normal.

This seems to be OK in the version generated by the review system itself (I think I have something wrong in the YAML metadata for the citation_author element, which is causing the problem there). So I'll leave this for now (beyond tweaking citation_author in the YAML) and figure it out if the issue presents later.

dill commented Jul 16, 2024

Further to our conversation, some thoughts on the improper prior issue:

# get some data
library(mgcv)
set.seed(2)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)

# tprs has a null space dimension of 1 -- cool
gam(y~s(x0), data=dat)$smooth[[1]]$null.space.dim

# ts has a null space dimension of 0 -- also cool
gam(y~s(x0, bs="ts"), data=dat)$smooth[[1]]$null.space.dim

# what about B-splines with a continuous penalty? hmmm
gam(y~s(x0, bs="bs"), data=dat)$smooth[[1]]$null.space.dim


# now using code from the paper ;)
library(gratia)
library(dplyr)
library(ggplot2)

# basis get
bfs <- basis(gam(y~s(x0, bs="bs"), data=dat), select = "s(x0)")

# plot the basis functions
bfs |>
  ggplot(aes(x = x0, y = .value)) +
  geom_line(aes(colour = .bf)) +
  guides(colour = "none") +
  labs(x = "x", y = "y", title = "s(x)")

Well that gives us this:

[Screenshot: plot of the B-spline basis functions for s(x)]

which I don't see any "linear" terms in... ¯\_(ツ)_/¯

Going back to the boring situation in which we "read the documentation", in ?smooth.construct we have:

null.space.dim: The dimension of the penalty null space (before centering).

So I think this is before identifiability constraints are applied (since they need to be applied at the model, not term, level), and therefore the null space just consists of the intercept in that case.

Doing a tiny bit more digging:

jagam(y~s(x0, bs="bs"), data=dat, centred=FALSE, file="m1.jags")
jagam(y~s(x0, bs="bs")-1, data=dat, centred=FALSE, file="m0.jags")

Produces:

model {
  mu <- X %*% b ## expected response
  for (i in 1:n) { y[i] ~ dnorm(mu[i],tau) } ## response 
  scale <- 1/tau ## convert tau to standard GLM scale
  tau ~ dgamma(.05,.005) ## precision parameter prior 
  ## prior for s(x0)... 
  K1 <- S1[1:10,1:10] * lambda[1]  + S1[1:10,11:20] * lambda[2]
  b[1:10] ~ dmnorm(zero[1:10],K1) 
  ## smoothing parameter priors CHECK...
  for (i in 1:2) {
    lambda[i] ~ dgamma(.05,.005)
    rho[i] <- log(lambda[i])
  }
}
model {
  mu <- X %*% b ## expected response
  for (i in 1:n) { y[i] ~ dnorm(mu[i],tau) } ## response 
  scale <- 1/tau ## convert tau to standard GLM scale
  tau ~ dgamma(.05,.005) ## precision parameter prior 
  ## Parametric effect priors CHECK tau=1/79^2 is appropriate!
  for (i in 1:1) { b[i] ~ dnorm(0,0.00016) }
  ## prior for s(x0)... 
  K1 <- S1[1:10,1:10] * lambda[1]  + S1[1:10,11:20] * lambda[2]
  b[2:11] ~ dmnorm(zero[2:11],K1) 
  ## smoothing parameter priors CHECK...
  for (i in 1:2) {
    lambda[i] ~ dgamma(.05,.005)
    rho[i] <- log(lambda[i])
  }
}

So I think that makes sense? (Running to stand in front of my poster so I will think about this more in a wee bit!)

gavinsimpson (Owner) commented

Going back to the boring situation in which we "read the documentation", in ?smooth.construct we have:

null.space.dim: The dimension of the penalty null space (before centering).
So I think this is before identifiability constraints are applied (since they need to be applied at the model, not term, level), and therefore the null space just consists of the intercept in that case.

But for that to be true, wouldn't the null space dimension for a TPRS (the default) be 2, because it has both constant and linear basis functions? Yet, as you show, it is 1.

If we do

library("mgcv")
library("gratia")

df <- data_sim("gwf2", seed = 2)
m <- gam(y ~ s(x, bs = "cr"), data = df)

# extract penalty
S <- m$smooth[[1]]$S[[1]]

# eigen decomposition
eigen(S)

producing

> eigen(S)
eigen() decomposition
$values
[1]  2.201890e+00  1.420716e+00  8.582862e-01  4.449231e-01  1.969655e-01
[6]  7.046850e-02  1.825633e-02  2.373200e-03 -6.021430e-17

where one eigenvalue is numerically zero and the matrix is not invertible by normal means.

> solve(S)
Error in solve.default(S) :
  system is computationally singular: reciprocal condition number = 4.66992e-19

And this is what I would expect, because a linear function is still in the span of the basis, even if there is no explicit linear basis function.
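A quick check of this (a sketch, reusing m, S, and df from the snippets above): the eigenvector paired with the numerically zero eigenvalue spans the penalty null space, and the function it defines is linear in x.

e <- eigen(S)
# the eigenvector associated with the (numerically) zero eigenvalue
null_vec <- e$vectors[, which.min(abs(e$values))]

# evaluate that function of the basis at the observed data
X <- predict(m, type = "lpmatrix")[, -1] # drop the intercept column
plot(df$x, X %*% null_vec)               # an (approximately) straight line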

Now consider the same model but with select = TRUE, where we include an additional penalty on the null space:

m <- gam(y ~ s(x, bs = "cr"), data = df, select = TRUE)

Now the null space of this smooth is 0, as it should be because of the extra penalty:

> m$smooth[[1]]$null.space.dim
[1] 0

Some more discussion is needed here...

gavinsimpson commented Jul 16, 2024

Some further reading has helped, if we're just talking about the null space of a penalty. The key point, I think, is that the null space is the span of functions representable by the basis on which the penalty has no effect (and hence its size is the number of such representable functions that are unaffected by the penalty). The null space doesn't refer to specific basis functions but to the functions representable by that basis.

Section 5.4.3 of Simon's GAM book has:

[Screenshot: Section 5.4.3 of Wood's GAM book, on penalty null spaces]

which I think is clear. This explains the findings in my comment above. I have assumed that the rank deficiency of $S$ is what leads to the improper prior when we view penalized smoothing from a Bayesian perspective. But I should read a little more about that...
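To make that concrete: a smooth $f(x) = \sum_k \beta_k b_k(x)$ is penalized through the quadratic form $\lambda \boldsymbol{\beta}^{\mathsf{T}} \mathbf{S} \boldsymbol{\beta}$, so any coefficient vector $\boldsymbol{\beta}$ in the null space of $\mathbf{S}$ gives a representable function that the penalty cannot shrink, however large $\lambda$ gets; the dimension of that null space is the number of zero eigenvalues of $\mathbf{S}$.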

gavinsimpson (Owner) commented

Here's some clearer evidence that the smooths have a null space. This needs the very latest version of gratia from GitHub, but equivalent results can be achieved with mgcv::smoothCon(..., diagonal.penalty = TRUE, absorb.cons = TRUE).

We can rotate the basis functions using an eigendecomposition to reparameterize the smooth so that it has a diagonal (identity) penalty. The null space is then easy to spot, because the diagonal elements of the penalty are zero for the null-space functions.

remotes::install_github("gavinsimpson/gratia")
library("mgcv")
library("gratia")
library("patchwork")

df <- data_sim("gwf2", seed = 2)

p1 <- basis(s(x, bs = "bs"), data = df,
    constraints = TRUE, diagonalize = TRUE) |>
  draw()

p2 <- penalty(s(x, bs = "bs"), data = df,
    constraints = TRUE, diagonalize = TRUE) |>
  draw()

p1 + p2 + plot_layout(ncol = 2, nrow = 1)

[Figure: the reparameterized B-spline basis functions (left) and the now-diagonal penalty matrix (right); a linear basis function appears, with a zero row/column in the penalty]

Here, we're using a B-spline basis and have absorbed the sum-to-zero identifiability constraint into the basis, then reparameterized to diagonalize the penalty. Absorbing the identifiability constraint means the constant function is no longer in the span of the basis, but linear functions remain in the span. This is clear from the reparameterization shown above, where we see an actual linear basis function and a zero row/column in the penalty matrix.

The reparameterization doesn't change anything with respect to the original basis; it just highlights the null space of the basis more clearly.
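For completeness, a minimal sketch of the mgcv-only route mentioned above (assuming the df simulated earlier):

library("mgcv")

# reparameterize the B-spline smooth: absorb the sum-to-zero constraint,
# then diagonalize the penalty
sm <- smoothCon(s(x, bs = "bs"), data = df,
                absorb.cons = TRUE, diagonal.penalty = TRUE)[[1]]

# zero entries on the diagonal mark the penalty null space
round(diag(sm$S[[1]]), 6)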

dill commented Jul 29, 2024

After some extensive chat here, online and offline (by a nice coincidence, at the same conference), we've got somewhere on this (the improper prior business)... I'll write this up collectively for both our findings, but if @gavinsimpson needs to correct me anywhere, I trust that will happen.

  • if we have a 2nd order derivative penalty, functions of the form intercept + slope are in the null space of the penalty
  • that doesn't mean that intercept + slope have to be explicitly in the basis definition, just that whenever the spline looks like a slope, it will not be penalized.
    • you can observe this by taking a B-spline basis (which doesn't have a slope term) and fitting a model where you specify a numerically infinite smoothing parameter (sp=10e10, e.g.) -- we get a straight line, even though we don't have an explicit slope term (see the sketch after this list)
  • when we absorb the intercept for each term into the "global" intercept (which we must do for identifiability in the non-fully-Bayesian, informative-priors case), we remove that intercept from each term, BUT there still remains the slope null space
  • improper priors occur when we have non-finite variances (amongst other reasons); these occur when we have approximately zero penalties (since the covariance is the inverse of the penalty: large variance -> small penalty), which happens when we have a rank-deficient penalty -- when there is a null space.
  • there is a little subtlety about whether we talk about the prior for an individual smooth term vs. the multivariate normal prior for the whole model (taking a kind of Gaussian process view of the model, each bit is Gaussian, so the whole is one big Gaussian).
  • there are some limited cases where we have a proper prior on individual smooths, for example:
    • when using the ts and cs bases (where there is a small penalty added to the null space terms)
    • when using select=TRUE, which adds additional penalty matrices for the null space terms
    • when using a lower-order penalty, with certain bases (tp, ps, bs, maybe others), where the slope is not in the null space and the intercept is removed with absorbing constraints.
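Here's a minimal sketch of the "numerically infinite smoothing parameter" point from the list above (object names are illustrative):

library(mgcv)
set.seed(2)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)

# B-spline basis (no explicit slope term) with an effectively infinite
# smoothing parameter, so the penalty dominates the fit
m_inf <- gam(y ~ s(x0, bs = "bs"), data = dat, sp = 10e10)

# the fitted smooth collapses to the penalty null space: a straight line
plot(m_inf)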

Does that cover everything @gavinsimpson?

dill commented Jul 29, 2024

Then two additional comments because @gavinsimpson mistakenly asked me to look more closely at his equations...

  • Generally, I think folks use $\mathcal{L}$ to be the likelihood and define $\ell = \log_e\mathcal{L}$, so it might be "better" to write your $\ell$ as $\ell_p$ or somesuch? (See the sketch after this list.)

(this list interrupted by my realising that github now renders $\LaTeX$, wow! That's going to make the next bit easier...)

  • on combining the $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}_j$ into $\boldsymbol{\beta}$: I think this is fine; you can say you do that, and that $\boldsymbol{\beta}_j$ refers to the appropriate parameters relating to term $j$ in the model, and then I think that's clear enough.
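On the first point, the conventional usage would be something like this (a sketch; the $\lambda_j$ and $\mathbf{S}_j$ are the smoothing parameters and penalty matrices):

$$\ell_p(\boldsymbol{\beta}) = \ell(\boldsymbol{\beta}) - \frac{1}{2} \sum_j \lambda_j \boldsymbol{\beta}^{\mathsf{T}} \mathbf{S}_j \boldsymbol{\beta},$$

with $\ell = \log_e \mathcal{L}$ the log-likelihood and $\ell_p$ its penalized counterpart.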
