reasonable distributions for non-negative real-valued data with many zeroes, GAMs

Question

Lots of data are real-valued, non-negative, and replete with 0s. This happens, for example, any time count data are normalized by something: perhaps it makes sense to normalize disease case counts taken over different intervals by time. You end up with something that has 0s like count data but is no longer counts. What are reasonable error distributions or link functions for GLMs or generalized additive models (GAMs) when we have normalized count data, or other real-valued non-negative data, with lots of 0s?

I'm most interested in fitting a GAM smoother to normalized disease case counts, but I realize I don't know much about reasonable distributions for these kinds of data in general.

Here's some example data:

x <- 1:100
y_orig <- rnbinom(100, size = 0.2, mu = x / 2)
y_scaled <- y_orig / sample(5:7, 100, replace = TRUE)
plot(x, y_scaled)

# throws error b/c of 0s
use.gamma <- glm(y_orig ~ x, family = Gamma)
plot(use.gamma)

# ditto
gamma_gam <- mgcv::gam(y_scaled ~ s(x), family = Gamma)
plot(gamma_gam, residuals = TRUE)

# maybe I can use a poisson family even though it's not counts?
use.pois <- glm(y_scaled ~ x, family = poisson)
# no, this looks terrible
plot(use.pois)

# GAM not better: negative values and poor fit
pois_gam <- mgcv::gam(y_scaled ~ s(x), family = poisson)
plot(pois_gam, residuals = TRUE)

# quasi poisson?
use.qp <- glm(y_scaled ~ x, family = quasipoisson)
plot(use.qp)
qp_gam <- mgcv::gam(y_scaled ~ s(x), family = quasipoisson)
plot(qp_gam, residuals = TRUE)

# try gaussian
gaus <- lm(y_scaled ~ x)
plot(gaus)
# problems, q-q plot is a disaster
gaus_gam <- mgcv::gam(y_scaled ~ s(x), family = gaussian)
# not worse than other options?
plot(gaus_gam, residuals = TRUE)

# how about tweedie??
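
For the Tweedie idea, a minimal sketch of what such a fit might look like with mgcv's `tw()` family, which estimates the power parameter p (restricted to between 1 and 2) and so allows a continuous non-negative response with exact zeros. The seed, the log link, and `method = "REML"` are assumptions added for illustration, and whether the fit is actually adequate for these data would still need checking.

```r
library(mgcv)

# Same kind of simulated data as above; the seed is arbitrary and only
# added so the sketch is reproducible
set.seed(1)
x <- 1:100
y_orig <- rnbinom(100, size = 0.2, mu = x / 2)
y_scaled <- y_orig / sample(5:7, 100, replace = TRUE)

# tw() estimates the Tweedie power parameter along with the smooth;
# with 1 < p < 2 the distribution has a point mass at zero
tw_gam <- gam(y_scaled ~ s(x), family = tw(link = "log"), method = "REML")

summary(tw_gam)                 # the family line reports the estimated p
plot(tw_gam, residuals = TRUE)  # smooth on the link (log) scale
```

`mgcv::gam.check(tw_gam)` would be a natural next step for judging the residuals.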

EDIT Here's a simulation that might be a bit more transparent:

set.seed(666)
x <- 1:100
scale_vec <- sample(c(7, 35), 100, replace = TRUE)
y_orig <- rnbinom(100, size = 0.2, mu = scale_vec * x / 10)
y_scaled <- y_orig / scale_vec

I'm aware of "hurdle" models and other mixture approaches, but wondering if there's anything that makes sense to use in a GAM — Michael Roswell, Nov 30 '22 at 04:50
relevant discussion: https://stats.stackexchange.com/q/1444/171222 — Michael Roswell, Nov 30 '22 at 04:51
Also looks relevant: https://stackoverflow.com/q/65745148/8400969 — Michael Roswell, Nov 30 '22 at 12:17
Why not use an offset in the count model, which would handle the normalisation? In the example you would add offset(log(time_period)) to the formula, where time_period is a variable containing the amount of time that each observation's counts represent or were collected over. — Gavin Simpson, Nov 30 '22 at 15:52
@GavinSimpson Thanks for this! After chatting with a collaborator, I was just converging on this answer... in cases where the time_period values are known (my use case), a better approach is modelling the counts as the response rather than the normalized values, using an offset as you suggested. Thanks! — Michael Roswell, Nov 30 '22 at 16:42

score 1 · Answer 1 · answered Nov 30 '22 at 17:11

As @GavinSimpson points out, in a use case like mine (where the values I rescale the counts by are known to me), it is more sensible to model the counts directly with an appropriate distribution, and use the scaling values as offsets.

e.g.

# generate simulation data
set.seed(666)
x <- 1:100
scale_vec <- sample(c(7, 35), 100, replace = TRUE)
y_orig <- rnbinom(100, size = 0.2, mu = scale_vec * x / 10)
y_scaled <- y_orig / scale_vec

# looks like we can fit this gam and it looks kinda reasonable
osg <- mgcv::gam(y_orig ~ s(x) + offset(log(scale_vec)), family = mgcv::nb())
plot(osg, residuals = TRUE, trans = exp)
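
To get fitted values back on the normalized scale, one option is to predict with the offset variable set to 1. With a log link the model is log E[count] = log(scale_vec) + s(x), so an offset of log(1) = 0 leaves the expected count per unit of scale_vec. This is a sketch: `new_dat`, `rate_hat`, and `method = "REML"` are illustrative choices added here. It relies on the offset being part of the formula, which `predict.gam` then takes from `newdata`.

```r
library(mgcv)

# Same simulated data as in the answer
set.seed(666)
x <- 1:100
scale_vec <- sample(c(7, 35), 100, replace = TRUE)
y_orig <- rnbinom(100, size = 0.2, mu = scale_vec * x / 10)

# Negative binomial GAM on the raw counts, log(scale_vec) as offset
osg <- gam(y_orig ~ s(x) + offset(log(scale_vec)),
           family = nb(), method = "REML")

# scale_vec = 1 makes the offset log(1) = 0, so predictions are expected
# counts per unit of scale_vec, i.e. on the same scale as y_scaled
new_dat <- data.frame(x = x, scale_vec = 1)
rate_hat <- predict(osg, newdata = new_dat, type = "response")

# Overlay the fitted rate on the normalized observations
plot(x, y_orig / scale_vec, ylab = "count / scale_vec")
lines(x, rate_hat, lwd = 2)
```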
