
I am trying to fit a GAM with mgcv that has a mix of smooth and parametric terms. The model is for count data on fish catches. I am modelling variation in location and time, but also differences among individual operators. I am specifically interested in the differences among operators, and so am treating them as a fixed effect in my model. While there are solutions to my problem in a GLM context (see the related questions below), I need to work with a GAM because a two-dimensional spatial surface is a key feature of the model.

My challenge is that some levels of the vessel fixed effect have responses that are all zero, so their coefficient estimates diverge to negative infinity under the log link. This leads to convergence failures.

I am looking for ways to solve this issue. One obvious solution is to move to a Bayesian approach and use weakly informative priors on the fixed effect terms. I have implemented this in brms, using the same model structure. It works, but the complexity of the models makes fitting much slower, and I have a number of data sets to analyze.
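For reference, here is a minimal sketch of the kind of brms call I mean, using the example data constructed below. Two hedges: brms has no built-in Tweedie family, so a negative binomial stand-in is used here, and the normal(0, 2) prior scale is purely illustrative.

library(brms)

# weakly informative normal(0, 2) priors on the parametric coefficients keep
# the separated vessel's estimate finite; negbinomial() stands in for Tweedie,
# which brms does not provide out of the box
mb <- brm(Catch ~ s(Lat, Lon) + s(Year) + Vessel,
          family = negbinomial(), data = dt,
          prior = set_prior("normal(0, 2)", class = "b"))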

I have found a few related questions: Dealing with quasi-complete separation in General Additive Model?
and How to deal with perfect separation in logistic regression?
but the first, which dealt with GAMs, wasn't resolved, and the second didn't appear to have any viable solutions that I haven't already tried.

I would like to find a way to address this using GAMs in mgcv. It seems there should be two possibilities: putting a penalty of some sort on the fixed effect term, or imposing a constraint on it. After a lot of reading, I have not been able to find a good example of the penalty approach, and I am struggling to operationalize the pcls example from the mgcv help.

The help for pcls in mgcv is at https://rdrr.io/cran/mgcv/man/pcls.html, and the first example is the closest to what I am trying to do.

I have included an example dataset to illustrate the problem, any suggestions are greatly appreciated.

#set up data
library(mgcv)

Lat <- runif(100,1,20)
Lon <- runif(100,1,20)
Year <- runif(100,1,10)
Vessel <- rep((1:5)/10, each = 20)
Lambda <- exp(0.1*Lat - 0.01*Lat^2 + 0.1*Lon + 0.01*Lon^2 + 0.05*Year + Vessel)
Vessel <- as.factor(Vessel)
Catch <- apply(matrix(Lambda), 1, rpois, n = 1) # one Poisson draw per observation
dt <- data.frame(Lat, Lon, Year, Vessel, Catch)

#now fit gam
M <- gam(Catch ~ s(Lat,Lon) + s(Year) + Vessel, family = tw, data = dt)

#create separation, all observations zero for one vessel
dt$Catch[dt$Vessel == levels(dt$Vessel)[2]] <- 0

#now fit gam again with separation
M <- gam(Catch ~ s(Lat,Lon) + s(Year) + Vessel, family = tw, data = dt)
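A quick way to confirm which levels are driving the problem is to check for vessels whose responses are all zero (a small base-R check, nothing mgcv-specific):

#TRUE for any vessel whose catches are all zero
with(dt, tapply(Catch, Vessel, function(x) all(x == 0)))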

chris

  • I would consider using mgcv random effects. They are fit with penalized likelihood.

    If you are excluding that approach out of concern for your p-values, consider that any other solution to your problem will also affect the validity of your p-values, and there are other ways to get p-values, like the bootstrap.

    – Paul Oct 22 '22 at 02:21
  • Thanks Paul. The issue is that I am interested in the estimates of the vessel coefficients themselves, so a random effect isn't appropriate. – chris Oct 29 '22 at 01:55

2 Answers


I think you'd be better off putting a ridge penalty on the Vessel parametric term using the paraPen argument to gam(), rather than trying to squish the whole model into pcls().

library("mgcv")
#set up data
set.seed(1) # repeatable
Lat <- runif(100,1,20)
Lon <- runif(100,1,20)
Year <- runif(100,1,10)
Vessel <- rep((1:5)/10,each = 20)
Lambda <- exp(0.1*Lat - 0.01*Lat^2 + 0.1*Lon + 0.01*Lon^2 + 0.05*Year + Vessel)
Vessel <- as.factor(Vessel)
Catch <- apply(matrix(Lambda),1,rpois,n=1)
dt <- data.frame(Lat,Lon,Year,Vessel,Catch)

Create the separation (all observations zero for one vessel):

dt$Catch[dt$Vessel == levels(dt$Vessel)[2]] <- 0

Now fit the GAM with the separation problem:

m1 <- gam(Catch ~ s(Lat,Lon) + s(Year) + Vessel, family = tw, data = dt)

With the paraPen argument we can put whatever penalty we want on the individual parametric terms. A ridge penalty is an identity penalty matrix, so we can do

pp <- list(Vessel = list(rank = 4, diag(4)))
m2 <- gam(Catch ~ s(Lat,Lon) + s(Year) + Vessel, family = tw, data = dt,
          paraPen = pp, method = "REML")

We need a rank-4 penalty because the Vessel term will become 4 columns in the model matrix with the default contrasts:

> contrasts(dt$Vessel)                                                        
    0.2 0.3 0.4 0.5
0.1   0   0   0   0
0.2   1   0   0   0
0.3   0   1   0   0
0.4   0   0   1   0
0.5   0   0   0   1

so we need an identity matrix that is 4x4.
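As an aside, if the number of vessels differs across data sets, the same ridge penalty can be built programmatically; a sketch, assuming the default treatment contrasts (so nlevels - 1 model matrix columns):

#build the identity (ridge) penalty for any number of factor levels
nlev <- nlevels(dt$Vessel)
pp <- list(Vessel = list(rank = nlev - 1, diag(nlev - 1)))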

Now the coefficient is reasonable and the Wald-like test indicates the importance of this term (though note Simon's warning in ?summary.gam about the default p values for paraPen terms, so we might use freq = TRUE in the summary call):

> summary(m2, freq = TRUE)

Family: Tweedie(p=1.01) 
Link function: log 

Formula:
Catch ~ s(Lat, Lon) + s(Year) + Vessel

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.61557    0.06388  40.947  < 2e-16 ***
Vessel0.2   -7.88552    1.41870  -5.558 3.66e-07 ***
Vessel0.3    0.16667    0.05882   2.833  0.00586 ** 
Vessel0.4    0.26836    0.05867   4.574 1.77e-05 ***
Vessel0.5    0.36089    0.06286   5.741 1.72e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
              edf Ref.df      F p-value    
s(Lat,Lon) 15.385 19.139 259.80  <2e-16 ***
s(Year)     1.712  2.125  24.76  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.996   Deviance explained = 99.4%
-REML = 242.41  Scale est. = 0.99247   n = 100

However, this is just a common-or-garden random intercept, so we can do it much more easily using the random effect basis:

m3 <- gam(Catch ~ s(Lat,Lon) + s(Year) + s(Vessel, bs = "re"),
          family = tw, data = dt, method = "REML")
> summary(m3)

Family: Tweedie(p=1.01) 
Link function: log 

Formula:
Catch ~ s(Lat, Lon) + s(Year) + s(Vessel, bs = "re")

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.192      1.870   0.637    0.526

Approximate significance of smooth terms:
              edf Ref.df      F p-value    
s(Lat,Lon) 15.383 19.138 259.81  <2e-16 ***
s(Year)     1.716  2.131  24.73  <2e-16 ***
s(Vessel)   3.890  4.000  16.58  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.996   Deviance explained = 99.4%
-REML = 242.86  Scale est. = 0.99247   n = 100

We can extract the random effects relatively easily:

library("gratia")
> smooth_coefs(m3, "s(Vessel)")
s(Vessel).1 s(Vessel).2 s(Vessel).3 s(Vessel).4 s(Vessel).5 
   1.424014   -6.489148    1.589729    1.691450    1.783955 

remembering that these are centred around the overall model constant term (the (Intercept) bit). And they match (approximately) the estimates from the paraPen model:

> sc <- smooth_coefs(m3, "s(Vessel)")
> sc[-1] - sc[1]
s(Vessel).2 s(Vessel).3 s(Vessel).4 s(Vessel).5 
 -7.9131616   0.1657155   0.2674363   0.3599419
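If you want to line the two fits up directly, one option (a sketch; the covariate values here are arbitrary and the output is not shown) is to predict on the link scale for each vessel at a fixed location and year:

#per-vessel predicted log-catch at a fixed location and year, from both models
nd <- data.frame(Lat = 10, Lon = 10, Year = 5, Vessel = levels(dt$Vessel))
cbind(paraPen = predict(m2, newdata = nd),
      ranef   = predict(m3, newdata = nd))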
Gavin Simpson
  • Good to see you pop in :) I was wondering about paraPen, but my intuition was that there'd be no meaningful difference from random effects. Sounds like you agree? I wonder what's driving the small difference in outcome here. Probably the contrasts? Would paraPen get the same result as RE if you forced it to use identity contrasts rather than the default treatment contrasts? – Paul Oct 24 '22 at 12:04
  • I doubt the difference is statistically or biologically important, but yes, I suspect the cause of the difference is the difference in parameterization. But I wouldn't switch the parameterization just to get the two to conform exactly; I'd just use the ranef smooth basis version. – Gavin Simpson Oct 24 '22 at 12:08
  • I commented on Paul's random effect suggestion above; my understanding had been that inference on the terms emerging from the re model was bad practice, as the focus of the estimation is on the variance of the random effect term, not the means. An unrelated issue is that the vessel-level estimates can be significantly non-normal, so a model treating the vessel effect as a random effect is worse than one using a fixed effect based on AIC. I am assuming this is due to the constraint of a shared variance, as opposed to allowing the estimates of the means to move to their best values. – chris Oct 29 '22 at 01:58
  • For those interested in how to constrain parameters, just for completeness I was able to solve the problem. – chris Oct 29 '22 at 02:26
  • @chris I think that advice stems from "old style" random effects. With "new style" random effects (which include splines) we are just using the random effect as a computational tool to get estimates of the things we want. I don't think the model forces the estimated random effects to be Gaussian - a QQ-plot of the ranefs often shows strong departures from the Gaussian assumption. AIC is one metric for assessing models - if you're interested in prediction I would look at AIC, but for inference I would choose the model based on other considerations. – Gavin Simpson Oct 29 '22 at 11:44
  • @GavinSimpson thanks for the clarification. I take that to mean that they do not rely on calculating a likelihood (or some proxy thereof) assuming N(0,sigma(i)). I thought that was the issue, the focus was on estimating the variance given some grouping variable (i in this case), and that there was a central position for each group that allowed a mean of 0 around that central position, but that one could not treat that as if it was N(mu(i), sigma(i)). So, you cannot interpret the mu(i) as an estimated parameter. – chris Nov 01 '22 at 22:43
  • @GavinSimpson To be honest, I have always found this confusing, as if one fit the same model in a Bayesian framework, to my knowledge there is no issue with fitting N(mu(i), sigma(i)) and interpreting both parameters as estimates. I'd be curious to read a bit about it if you have any suggestions for good references from the frequentist side. – chris Nov 01 '22 at 22:45
  • @chris When you fit with s(f, bs = "re") you are using exactly the same justification for estimating the smooth functions in your model, like s(Year). We do not believe that the coefs for s(Year) are truly draws from a population of values (as one would for old-style ranefs) yet we can use ranefs to estimate them. You can't have it both ways; you either have to reject the estimation of the smooth of Year if you reject the s(f, bs = "re"), or you accept them both. This is what I mean by new-style random effects. We are using them as a computational vehicle to estimate what we want – Gavin Simpson Nov 02 '22 at 13:23
  • These ideas are expressed in this preprint by Hodges and Clayton http://www.biostat.umn.edu/~hodges/PubH8492/Hodges-ClaytonREONsubToStatSci.pdf which never ended up being published, as the ideas went into Hodges' book (Richly Parameterized Linear Models). Simon Wood also discusses this in the second edition of his book. – Gavin Simpson Nov 02 '22 at 13:34
  • @chris reading my comment above back this morning, it comes across as argumentative or combative. I intended "you can't have it both ways" to be lighthearted or humorous, pointing out that smooths and ranefs are two sides of the same coin. Apologies for that. Basically, what I was trying to say is that if you accept the "estimates" for the smooths (which were done by treating the coefs for wiggly basis functions as random effects) then you also have to accept the "estimates" for the random effects of 'Vessel'. The Hodges & Clayton preprint goes into these ideas in detail. – Gavin Simpson Nov 03 '22 at 08:48
  • @GavinSimpson haha... I didn't take it as combative at all; I really appreciate your time. Thanks for passing along the reference, I will have a look at it. It's been a long time since I read Simon's book, but it sounds like it might be time to pick it back up again. – chris Nov 04 '22 at 00:55

The suggestion from Gavin works a treat. Just for completeness, I thought I would also provide the answer for how to make the constrained approach to the problem work.

#set up data as before
library(mgcv)

Lat <- runif(100,1,20)
Lon <- runif(100,1,20)
Year <- runif(100,1,10)
Vessel <- rep((1:5)/10,each = 20)
Lambda <- exp(0.1*Lat - 0.01*Lat^2 + 0.1*Lon + 0.01*Lon^2 + 0.05*Year + Vessel)
Vessel <- as.factor(Vessel)
Catch <- apply(matrix(Lambda),1,rpois,n=1)
dt <- data.frame(Lat,Lon,Year,Vessel,Catch)

#create separation, all observations zero for one vessel
dt$Catch[dt$Vessel == levels(dt$Vessel)[2]] <- 0

#now fit gam again with separation
M <- gam(Catch ~ s(Lat,Lon) + s(Year) + Vessel, family = tw, data = dt)

Now we want to impose constraints on the parametric terms for the vessel effect.

#get the number of vessel (fixed effect) coefficients and the total number of coefficients
coefs <- coef(M)
NG <- nlevels(dt$Vessel) - 1
NC <- length(coefs)

Set up a model but don't fit it, so we can build the necessary inputs:

mf <- Catch ~ s(Lat,Lon) + s(Year) + Vessel
mgam <- gam(mf, family=poisson, data=dt, fit=FALSE)

Inequality constraints: bound each vessel coefficient below (here each must exceed -100, which stops the drift towards negative infinity). We want zeros in all positions of the constraint matrix not occupied by the terms we want to constrain.

mgam$Ain <- cbind(matrix(0, NG, 1),              # intercept: unconstrained
                  diag(NG),                      # one row per vessel coefficient
                  matrix(0, NG, NC - (1 + NG)))  # smooth terms: unconstrained
#Ain must have as many columns as the design matrix and one row per constrained parameter
mgam$bin <- rep(-100, NG)   # lower bounds, enforcing Ain %*% p >= bin
mgam$sp <- M$sp             # smoothing parameters from the unconstrained fit
mgam$p <- coefs
mgam$p[2:(NG+1)] <- 1       # make initial parameter values satisfy the constraints
mgam$C <- matrix(0, 0, 0)   # no equality constraints

#fit with pcls; the return value is the vector of constrained coefficients
p <- pcls(mgam)
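From there, the constrained coefficients can be pushed through the stored model matrix to get fitted values; a sketch, assuming the unfitted gam object carries its model matrix as mgam$X (which pcls itself relies on):

#fitted values on the response scale (log link) from the constrained coefficients
eta <- drop(mgam$X %*% p)
fitted_constrained <- exp(eta)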

chris
  • Nice; just be aware that this is doing a least squares fit, so you'll have to accept using some transformation of the response to make it usable with your fisheries catch data – Gavin Simpson Oct 29 '22 at 11:39