EDITED: I'm new to GAM's and am looking for some advice on if I can trust my model or not. My dataset (n=200) is examining how a whole suite of environmental variables (15) impact the likelihood of coral bleaching. My data are very positively skewed with lots of zero's (77), because there are many situations where bleaching has been checked for and reefs have been found to be in healthy, non-bleached condition. The rest of the data range from 0.25 to 100 as a proportion of the coral colonies that are bleached. Here's a brief subset with latitude, longitude, latitude relative to the equator, average proportion of coral colonies bleached, proportion bleached scaled to 1, the proportion of seafloor covered in coral, followed by some metrics for warming events.
I have run a bunch of GAMs in R using variations of the below code:
Bleach_GAM.1 = gam(Bl_scaled ~ s(dist_eq) + s(rate_decline) + s(Total_dur)+Av_cover+
Dur + i_mean + i_max + i_cum+rate_onset +
Dur_max + i_max_max + above_thresh + total_icum,
data = GAM_Bleaching, family = quasibinomial)
summary(Bleach_GAM.1)
I have found if I specify family I achieve results similar to the below:
Family: quasibinomial
Link function: logit
Formula:
Bl_scaled ~ s(dist_eq) + s(rate_decline) + s(Total_dur) + Av_cover +
Dur + i_mean + i_max + i_cum + rate_onset + Dur_max + i_max_max +
above_thresh + total_icum
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.070957 2.567574 1.975 0.049884 *
Av_cover 0.005532 0.002944 1.879 0.061930 .
Dur -1.227598 0.155837 -7.877 3.76e-13 ***
i_mean -3.977210 1.033315 -3.849 0.000168 ***
i_max -12.843586 1.745832 -7.357 7.61e-12 ***
i_cum 0.710637 0.114201 6.223 3.67e-09 ***
rate_onset -8.488058 0.694178 -12.227 < 2e-16 ***
Dur_max 0.507960 0.051236 9.914 < 2e-16 ***
i_max_max 15.138327 1.850254 8.182 6.22e-14 ***
above_thresh 1.698448 0.479665 3.541 0.000514 ***
total_icum -0.523578 0.095398 -5.488 1.45e-07 ***
Signif. codes: 0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(dist_eq) 8.943 8.997 38.60 <2e-16 ***
s(rate_decline) 1.000 1.000 110.21 <2e-16 ***
s(Total_dur) 8.698 8.971 24.55 <2e-16 ***
Signif. codes: 0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.669 Deviance explained = 71.2%
GCV = 0.18404 Scale est. = 0.032065 n = 200
I then plotted the model against a predicted dataset based on the original data using the below code:
pred = predict(Bleach_GAM.1, newdata = GAM_Bleaching[,-6])
x = 1:nrow(GAM_Bleaching)
plot(x, GAM_Bleaching$Bl_scaled, col="blue", type = "l")
lines(x, pred, col="red", type = "l" )
legend("bottomleft", legend=c("y-fitted", "y-origianl"),
col=c("red", "blue"), lty=1, cex=0.7)
Which resulted in the following plot. As you can see, the modelled data bear no resemblance to the predicted dataset.

If I do not specify 'family' in the terms of the model, the best results I can achieve are (unsurprisingly) worse:
R-sq.(adj) = 0.48 Deviance explained = 54.4%
GCV = 424.66 Scale est. = 370.23 n = 200
but the plotted data are better:

NOTE: I realise not specifying family is the wrong approach with heavily skewed data, but I am confused about why this model gives a better likeness to the predicted dataset.
Does this mean that the model using quasibinomial is actually is a poor fit? Or is there a better way to check a model is a good fit for the data?
Any advice gratefully received!
PS: sorry, this is the first time I have posted a question. If it would be beneficial to give an example of the code I am using, please let me know!

familyin the model! If not specified, the default value is used, which is the normal distribution family, which probably does not make much sense for your case – kjetil b halvorsen Jul 01 '22 at 14:58predict()function applied to thegamobject with default settings gives results in terms of the linear-predictor values, which are not necessarily the response values. For the Gaussian family (the default when you don't specify afamily), that's the same as the response. For the quasibinomial model the linear predictor is not transformed to the response scale by default, so that plot isn't appropriate for evaluating the model fit. – EdM Jul 04 '22 at 13:30