
EDITED: I'm new to GAMs and am looking for advice on whether I can trust my model or not. My dataset (n = 200) examines how a suite of 15 environmental variables affects the likelihood of coral bleaching. My data are strongly positively skewed with many zeros (77), because there are many situations where bleaching has been checked for and reefs have been found to be in a healthy, non-bleached condition. The rest of the data range from 0.25 to 100 as the percentage of coral colonies that are bleached. Here's a brief subset with latitude, longitude, latitude relative to the equator, average proportion of coral colonies bleached, proportion bleached scaled to 1, the proportion of seafloor covered in coral, followed by some metrics for warming events.

[Image: sample rows of the dataset showing the columns described above]

I have run a bunch of GAMs in R using variations of the code below:

    Bleach_GAM.1 = gam(Bl_scaled ~ s(dist_eq) + s(rate_decline) + s(Total_dur) +
                         Av_cover + Dur + i_mean + i_max + i_cum + rate_onset +
                         Dur_max + i_max_max + above_thresh + total_icum,
                       data = GAM_Bleaching, family = quasibinomial)
    summary(Bleach_GAM.1)

I have found that if I specify the family, I get results like the following:

    Family: quasibinomial 
    Link function: logit 

    Formula:
    Bl_scaled ~ s(dist_eq) + s(rate_decline) + s(Total_dur) + Av_cover + 
        Dur + i_mean + i_max + i_cum + rate_onset + Dur_max + i_max_max + 
        above_thresh + total_icum

    Parametric coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
    (Intercept)    5.070957   2.567574   1.975 0.049884 *  
    Av_cover       0.005532   0.002944   1.879 0.061930 .  
    Dur           -1.227598   0.155837  -7.877 3.76e-13 ***
    i_mean        -3.977210   1.033315  -3.849 0.000168 ***
    i_max        -12.843586   1.745832  -7.357 7.61e-12 ***
    i_cum          0.710637   0.114201   6.223 3.67e-09 ***
    rate_onset    -8.488058   0.694178 -12.227  < 2e-16 ***
    Dur_max        0.507960   0.051236   9.914  < 2e-16 ***
    i_max_max     15.138327   1.850254   8.182 6.22e-14 ***
    above_thresh   1.698448   0.479665   3.541 0.000514 ***
    total_icum    -0.523578   0.095398  -5.488 1.45e-07 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Approximate significance of smooth terms:
                      edf Ref.df      F p-value    
    s(dist_eq)      8.943  8.997  38.60  <2e-16 ***
    s(rate_decline) 1.000  1.000 110.21  <2e-16 ***
    s(Total_dur)    8.698  8.971  24.55  <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    R-sq.(adj) =  0.669   Deviance explained = 71.2%
    GCV = 0.18404  Scale est. = 0.032065  n = 200

I then plotted the model's predicted values against the original data using the code below:

    pred = predict(Bleach_GAM.1, newdata = GAM_Bleaching[, -6])
    x = 1:nrow(GAM_Bleaching)
    plot(x, GAM_Bleaching$Bl_scaled, col = "blue", type = "l")
    lines(x, pred, col = "red", type = "l")
    legend("bottomleft", legend = c("y-fitted", "y-original"),
           col = c("red", "blue"), lty = 1, cex = 0.7)

This produced the following plot. As you can see, the predicted values bear no resemblance to the original data.
[Image: observed (blue) and predicted (red) Bl_scaled values plotted by row index for the quasibinomial model]

If I do not specify family in the model, the best results I can achieve are (unsurprisingly) worse:

R-sq.(adj) =   0.48   Deviance explained = 54.4%
GCV = 424.66  Scale est. = 370.23    n = 200 

but the plotted data match the original much better:

[Image: observed and predicted values for the default (Gaussian) model]

NOTE: I realise not specifying family is the wrong approach with heavily skewed data, but I am confused about why this model gives a better likeness to the original data.

Does this mean that the model using quasibinomial is actually a poor fit? Or is there a better way to check whether a model fits the data well?

Any advice gratefully received!

PS: sorry, this is the first time I have posted a question. If it would be beneficial to give an example of the code I am using, please let me know!

  • What's critical here (as in all modeling) is the nature of your outcome variable. Is it simply a binary bleached or not-bleached outcome, or are there levels of bleaching that you evaluate? Of your 200 data points, how many are bleached? How many environmental variables are you modeling? Please provide that information by editing the question, as comments are easy to overlook and can be deleted. – EdM Jul 01 '22 at 14:18
  • It does not make any sense not to specify family in the model! If not specified, the default Gaussian (normal) family is used, which probably does not make much sense for your case – kjetil b halvorsen Jul 01 '22 at 14:58
  • Thanks EdM and kjetil b halvorsen for the comments, I have edited the original post. – ktsmith39 Jul 02 '22 at 09:59
  • Yes, to get good advice you should provide specifics; otherwise you'll get advice about best practices in general. Relevant information includes: a sample and description of your data, the models that you are considering, the code you use to create the visualizations, and last but not least the plots themselves, so that we understand what you mean by reasonable likeliness. – dipetkov Jul 02 '22 at 14:45
  • Also please explain what you mean with: "plot my modelled data against a predicted dataset based on the original". This suggests that you have another model -- or more likely a heuristic -- to simulate data. This changes the situation quite a bit. – dipetkov Jul 02 '22 at 14:48
  • Two thoughts on the plots. First, it's more helpful to order the cases by the predicted values, then plot one point for each observed versus predicted value. Just putting the cases in the same order as in your dataframe, as you did, makes it hard to see things like trends in deviations from predictions. – EdM Jul 04 '22 at 13:26
  • Second, the predict() function applied to the gam object with default settings gives results in terms of the linear-predictor values, which are not necessarily the response values. For the Gaussian family (the default when you don't specify a family), that's the same as the response. For the quasibinomial model the linear predictor is not transformed to the response scale by default, so that plot isn't appropriate for evaluating the model fit. – EdM Jul 04 '22 at 13:30
  • I added a bit to the answer to reflect the new information in the question. – EdM Jul 04 '22 at 14:00
  • Thank you EdM - very much appreciate your thoughts and input! – ktsmith39 Jul 04 '22 at 14:03

1 Answer


Your original data are counts of bleached versus unbleached colonies. It's typically best to work close to the original data. In this case, you could use a binomial model based on those counts directly. That takes into account the fact that a proportion based on 10 colonies is a lot less reliable than a proportion based on 1000 colonies, something that the proportions themselves don't represent.

With glm() in R, you can use a 2-column matrix of outcomes for such data, one column for the number of "successes" (e.g., bleached colonies) and one for the number of "failures" (e.g., non-bleached colonies). Alternatively, you can specify the fraction of bleached colonies as the response and provide the total number of colonies examined in a weights argument to the generalized linear model. A quick check of the gam() function in the R mgcv package indicates that such handling of observation numbers works there, too. The way you have coded the data, omitting weights, each row of your data is weighted equally; make sure that is fair treatment of your data.
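Here is a minimal sketch of both forms with mgcv, assuming hypothetical columns n_bleached and n_total (bleached and total colonies per survey; these are not in your posted data), and keeping only the smooth terms for brevity:

    library(mgcv)

    ## (1) Two-column (successes, failures) response
    fit_counts <- gam(cbind(n_bleached, n_total - n_bleached) ~
                        s(dist_eq) + s(rate_decline) + s(Total_dur),
                      data = GAM_Bleaching, family = binomial)

    ## (2) Proportion response, with the survey size as prior weights
    fit_props <- gam(n_bleached / n_total ~
                       s(dist_eq) + s(rate_decline) + s(Total_dur),
                     data = GAM_Bleaching, family = binomial,
                     weights = n_total)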

Pay attention to whether you might be close to overfitting your data with so many predictors. The coefficients for variables in smooths are penalized by the model fitting. But unless you specify otherwise, the other "parametric" terms aren't penalized. There's a danger that you might end up with an overfitted model that fits your data but doesn't generalize well. With about 19 degrees of freedom used up in the smooths, according to the model summary, and 10 more used up for the parametric coefficients, you've used up 29 degrees of freedom with 200 observations. That's only 7 observations per degree of freedom, perhaps a bit on the low side. Consider penalizing some of the "parametric" coefficients as well.
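As a sketch of one way to do that (the choice of which terms to smooth here is purely illustrative): wrap the otherwise-linear terms in low-basis-dimension smooths and set select = TRUE, which adds a second penalty that can shrink whole terms out of the model.

    ## Illustrative only: low-k smooths for previously parametric terms,
    ## with double-penalty shrinkage so whole terms can go to zero
    fit_shrunk <- gam(Bl_scaled ~ s(dist_eq) + s(rate_decline) + s(Total_dur) +
                        s(Dur, k = 3) + s(i_max, k = 3) + s(rate_onset, k = 3),
                      data = GAM_Bleaching, family = quasibinomial,
                      select = TRUE)

    ## Total effective degrees of freedom used, to compare against n = 200
    sum(fit_shrunk$edf)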

In terms of visualizing model-predicted versus observed values, it wasn't originally clear what plots you generated. Now that the plots have been added to the question, it's clear that the plot for the quasibinomial model isn't appropriate. The output of the predict() function applied to a gam object (with the defaults implied by your code) is on the scale of the linear predictor, not the response; for a quasibinomial model the linear predictor is the logit of the response. So the observed values and the default "predicted" values shouldn't agree for the quasibinomial model. For the Gaussian model (when no family is specified), the linear predictor is the same as the predicted response.
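A minimal sketch of a more appropriate check, putting predictions on the response scale with type = "response" and ordering the cases by predicted value as suggested in the comments:

    ## Predictions on the response (proportion) scale, not the logit scale
    pred_resp <- predict(Bleach_GAM.1, type = "response")

    ## Observed versus predicted, sorted by predicted value so trends
    ## in the deviations are easier to see
    ord <- order(pred_resp)
    plot(pred_resp[ord], GAM_Bleaching$Bl_scaled[ord],
         xlab = "Predicted proportion bleached",
         ylab = "Observed proportion bleached")
    abline(0, 1, col = "red")  # line of perfect agreement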

For visualization, the R gratia package, designed to complement gam() models, might be helpful; the package author frequently visits this site. For further guidance, start with the nearly 800 questions tagged on this site with generalized-additive-model and refine the search with the things you'd like to learn more about.
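For example, a brief sketch of the basic gratia workflow, applied to the model from your question:

    library(gratia)

    draw(Bleach_GAM.1)      # partial-effect plots of the smooth terms
    appraise(Bleach_GAM.1)  # standard residual diagnostic plots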

EdM
  • Thanks EdM - really appreciate all your input. To clarify, I think that you are suggesting to increase the number of variables in smooths? Is this right? I'm playing with the model adding other terms to smooths, and it is vastly increasing my R-sq and deviance explained. I wanted to check I am doing what you mean though? – ktsmith39 Jul 04 '22 at 14:28
  • To give you an idea:

    Formula: Bl_scaled ~ s(dist_eq) + s(rate_decline) + s(Total_dur) + Av_cover + Depth + s(Dur) + s(i_max) + s(rate_onset) + Dur_max + i_max_max + above_thresh

    – ktsmith39 Jul 04 '22 at 14:32
  • Output: R-sq.(adj) = 0.832   Deviance explained = 83.4%   GCV = 0.13092   Scale est. = 0.047091   n = 200 – ktsmith39 Jul 04 '22 at 14:33
  • @ktsmith39 you might find the links from Simon Wood's mgcv page to be helpful. – EdM Jul 04 '22 at 15:30
  • @ktsmith39 Gavin Simpson suggests using smooths with small numbers of knots for penalizing otherwise parametric coefficients, although there is a paraPen argument that can penalize them as linear. See this page and this page. I fear you might otherwise overfit your current data with what seem to be superb Rsq etc, but get a model that doesn't fit new data well. Add up the edf values for the smooths plus the number of parametric coefficients: are they much beyond 20-30 in total? – EdM Jul 04 '22 at 16:08
  • Thank you so much for all this guidance! Yes, the earlier model I ran amounted to combined edfs far greater than 30, so I have cut back to smoothing only 1-3 parameters and am playing around with different k values. I have tried the select argument and will have a look at the paraPen argument. – ktsmith39 Jul 04 '22 at 19:08