GAM: Find a good distribution for the monthly data sums?

Question

I am new to GAM modelling. I would like to find a family that fits my response variable. I am using monthly sums of beetle counts, collected from beetle traps at roughly two-week intervals (this can vary between traps and years) at diverse locations across Germany. My dataset contains some zeros, but not too many. Also, as I have sums of counts, my data are always non-negative integers. As I am moving from counts (discrete) to sums, are my sums now continuous?

In short, my dataset contains some zeros (~10% of the data) and also extreme values, while many traps have very low counts. Here is what my monthly sums look like:

I have found that the Tweedie distribution can account for the zeros and is often used in GAMs in similar studies. But when I apply it to my data, I get an almost perfect fit, with a weird pattern exactly at the low/zero values:

I am not sure how I can account for this chunk of the data in my family. I have tried different combinations of the a, b, and theta parameters, and tested several families in the mgcv package (as I will later use bam). The nb family fits well, but again shows a weird pattern at low values. Do you have any suggestions for how I can fit my data better? Thank you!

Here is my code, using gam(y ~ 1) to fit only the distribution of y, without any predictors:

library(mgcv)
m <- gam(count_sum ~ 1, data = dat, family = tw(link = 'log', theta = 1.85, a = 1.82, b = 1.99))
gratia::appraise(m)

I wonder whether this can be done just by adjusting the parameters within the family, or whether I should move to a completely different family. Thank you for your thoughts.

My study design is very similar to Irregular time series data including long-term trends, and is spatially varying (e.g. the share of forest surrounding each trap). Following @Gavin Simpson's comment, I expect that the trap counts depend on location (XY) and time (variation between months and between years). As suggested by @Gavin Simpson, should I move from using a single distribution to using a different distribution for each trap? How can that be implemented?

score 7 · Answer 1 · answered Aug 02 '22 at 10:33

You have more zeros than your distributional assumptions can account for. This is a common occurrence and is called zero inflation.

Common remedies involve using mixture distributions, such as zero-inflated Poisson or negative binomial distributions (I don't think I have ever seen a zero-inflated Tweedie - it could make sense in theory, but there would be quite a lot of parameters to estimate), or hurdle models. Take a look at our zero-inflation tag.

R has quite a number of packages that can help with zero-inflated data, for example pscl with its zeroinfl() function; see also this tutorial.
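One informal way to check for excess zeros is to compare the observed share of zeros with the share a fitted count model expects. The following is a minimal sketch in base R on simulated data; the 30% structural-zero rate and the Poisson mean of 4 are assumptions chosen for illustration, not values from the question.

```r
set.seed(1)
n <- 2000

# Simulated zero-inflated Poisson data (assumed values for illustration):
# 30% structural zeros, Poisson mean 4 otherwise
structural_zero <- rbinom(n, size = 1, prob = 0.3)
y <- ifelse(structural_zero == 1, 0L, rpois(n, lambda = 4))

# Intercept-only Poisson fit, analogous to count_sum ~ 1
fit <- glm(y ~ 1, family = poisson())

# Share of zeros in the data versus the share implied by the fitted model
observed_zero <- mean(y == 0)
expected_zero <- mean(dpois(0, lambda = fitted(fit)))
c(observed = observed_zero, expected = expected_zero)
```

If the observed share is clearly larger than the expected one, the model cannot reproduce the zeros. The same comparison can be made with other count distributions by swapping dpois() for the matching density function.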

Stephan Kolassa

I'd say a more pressing problem is assuming that all the locations/samples have the same expected value, which seems highly unlikely given the distribution of the data and the subject area – Gavin Simpson Aug 04 '22 at 14:43
Thank you for a great example @Stephan Kolassa, I will check the tutorial. What I hoped for was a way to fit this new distribution within the gam formula framework (gam(y ~ s(x), family = pscl::zeroinfl)), rather than moving the whole model into pscl::zeroinfl(z ~ s(x1) + ...). Do you think this is doable somehow? The two approaches seem to use different specifications for model effects (random, cyclic, ...), factors, etc., and different ways to plot models. As I am quite new to modelling, I am worried that moving from gam() to zeroinfl() will result in more errors... Thank you! – maycca Aug 05 '22 at 09:17
Hm. Unfortunately, I am not all that experienced with gam. Yes, both tools use different assumptions, but my best guess would be that moving to zeroinfl altogether might be the best solution. Or follow Gavin's proposal. – Stephan Kolassa Aug 17 '22 at 07:59
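On staying within gam(): mgcv itself ships zero-inflated Poisson families, ziP() and ziplss(), which may be one option here. The following is only a sketch on simulated data. The data frame dat_zi and its columns are hypothetical, and ziplss() is a two-part model (a presence part plus a zero-truncated Poisson count part), so it is not the same model as pscl::zeroinfl().

```r
library(mgcv)

set.seed(2)
n <- 500
x <- runif(n)

# Hypothetical data: presence probability and mean count both depend on x
present <- rbinom(n, size = 1, prob = plogis(-0.5 + 2 * x))
y <- present * rpois(n, lambda = exp(1 + sin(2 * pi * x)))
dat_zi <- data.frame(y = y, x = x)

# First formula: linear predictor for the Poisson count part
# Second (one-sided) formula: linear predictor for the probability of presence
m_zi <- gam(list(y ~ s(x), ~ s(x)), family = ziplss(), data = dat_zi)
summary(m_zi)
```

Because this is an ordinary gam() fit, the usual smooth terms and plotting tools remain available. Whether the two-part structure suits the beetle data would need to be checked against the residual diagnostics.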

score 3 · Answer 2 · answered Aug 04 '22 at 15:22

While Stephan's answer explains and discusses the potential zero inflation in your data, there are several other considerations, including your use of the tw() family and the underlying assumptions of the model you fitted.

Firstly, with the tw() family you do not have to specify any of the parameters; the fitting algorithm will then estimate the power parameter of the Tweedie distribution for you.

The other arguments, a and b, are only used if you want to limit the search interval over which values of the power parameter are considered. It doesn't make much sense to restrict the search interval as tightly as you are doing without good justification.
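As a minimal sketch of letting tw() estimate the power parameter, the example below fits an intercept-only model to simulated Tweedie data and reads back the estimate. The data frame dat_tw and the simulation settings (mean 5, p = 1.5, phi = 2) are assumptions standing in for the real count_sum data.

```r
library(mgcv)

set.seed(3)
# Simulated stand-in for count_sum: Tweedie draws with true power p = 1.5
dat_tw <- data.frame(count_sum = rTweedie(rep(5, 1000), p = 1.5, phi = 2))

# No theta, a, or b supplied: p is estimated within the default
# search interval (a = 1.01, b = 1.99)
m_tw <- gam(count_sum ~ 1, data = dat_tw, family = tw(link = "log"),
            method = "REML")

# getTheta(TRUE) returns p itself rather than the internal working parameter
p_hat <- m_tw$family$getTheta(TRUE)
p_hat
```

The estimated p is also shown in the family line when the model is printed or summarised.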

I think it is highly unlikely that all your sampling locations are so homogeneous in space and time that you can model them with a single distribution, Tweedie or otherwise.

You mention monthly counts (I presume this is where the sums come from - you are summing the two-weekly samples into monthly counts? - some additional detail on your setup, how you are summing etc would be useful) so perhaps the summed counts vary over time reflecting some phenological process? Are your traps all in the same location such that it is reasonable to consider them all to have the same expected count? If the environments around your traps differ such that we might expect more or fewer beetles in general, or of certain species, then using a single distribution to model all the summed counts would be too restrictive.

If it really is reasonable to model the observations with a single Tweedie (and summed counts seem to be a reasonable justification for the Tweedie, assuming that there are many such counts in each sum?), then you will likely be better off fitting the Tweedie using the Tweedie package, as it considers a wider family of Tweedie distributions, whereas the tw() family is limited to powers in the range 1–2. Zero-inflated or hurdle models would also be appropriate, as Stephan suggested; consider zero-inflated or hurdle versions of the negative binomial as well as the Poisson.

@Gavin Simpson, thank you for your very insightful answer. Indeed, I would like to investigate the effect of different environmental predictors on beetle counts, so I expect a strong dependence on varying local conditions, on the trapping time (varying between months and over years), and (likely) on the previous year's counts. Are you suggesting that this requires fitting a different distribution to each trap? If so, how can this be implemented? For now, I will try zero-inflated/hurdle versions of the Tweedie, negative binomial and Poisson distributions. Thank you again! – maycca, Aug 05 '22 at 09:40
One fits a different distribution by adding terms to the linear predictor for the covariates you mentioned. In R and gam(), you do this via the formula; you used count_sum ~ 1, where the 1 is the intercept or constant term (which is why you fitted a single distribution to all traps). If you used something like count_sum ~ s(x,y) + s(year) + s(month), you would be saying that the distribution of summed counts varies in space and over time (note the intercept is implied here), and hence you get a different distribution for each unique combination of x,y coordinate pair, year, and month. – Gavin Simpson, Aug 08 '22 at 19:37
Thank you @Gavin Simpson! Indeed, now it is clearer. My whole model looks very similar to your previous post: https://stats.stackexchange.com/questions/244042/trend-in-irregular-time-series-data, which is why I am considering different distributions. So maybe, instead of trying to fit one distribution to y (what I intended with count_sum ~ 1), it is better to fit y using all the assumed predictors? Thank you! – maycca, Aug 10 '22 at 07:57
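To illustrate the point about the formula made in the comments above, here is a minimal sketch on simulated data comparing an intercept-only fit with a fit that has smooths of location, year, and month. The data frame dat_sim, its column names, the simulated effects, and the choice of the nb() family are all assumptions for illustration.

```r
library(mgcv)

set.seed(4)
n <- 600
# Hypothetical trap data; column names x, y, year, month are assumptions
dat_sim <- data.frame(
  x = runif(n), y = runif(n),
  year = sample(2015:2021, n, replace = TRUE),
  month = sample(4:10, n, replace = TRUE)
)
# Simulated expected count varies in space (x) and with month
mu <- exp(1 + dat_sim$x + 0.5 * sin(2 * pi * dat_sim$month / 12))
dat_sim$count_sum <- rnbinom(n, mu = mu, size = 2)

# k is kept small for year and month because each has only 7 unique values
m_full <- gam(count_sum ~ s(x, y) + s(year, k = 5) + s(month, k = 5),
              data = dat_sim, family = nb(), method = "REML")
m_null <- gam(count_sum ~ 1, data = dat_sim, family = nb(), method = "REML")

length(unique(round(fitted(m_null), 6)))  # intercept only: one shared expected count
range(fitted(m_full))                     # with covariates: expected count varies by row
```

In the intercept-only model every observation shares the same fitted mean, whereas in the second model each combination of location, year, and month gets its own fitted mean and therefore its own conditional distribution.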
