Lots of data is real-valued, non-negative, and replete with 0s... for example any time count data are normalized by something: e.g., perhaps it makes sense normalize disease case counts taken over different intervals by time. Then you end up with something that has 0s like count data, but is no longer counts. What are reasonable error distributions or link functions for GLMs or generalized additive models when we have normalized count data/ real-valued non-negative data with lots of 0s?
I'm most interested in fitting a GAM smoother to normalized disease case counts, but I realize I don't know much about reasonable distributions for these kinds of data in general.
Here's some example data:
x <- 1:100
y_orig <- rnbinom(100, size = 0.2, mu = x/2)
y_scaled <- y_orig / sample(5:7, 100, replace = TRUE)
plot(x, y_scaled)
throws error b/c of 0s
use.gamma <- glm(y_orig~x, family = Gamma)
plot(use.gamma)
ditto
gamma_gam <- mgcv::gam(y_scaled~s(x), family = Gamma)
plot(gamma_gam, residuals = TRUE)
#maybe I can use a poisson family even though it's not counts?
use.pois<- glm(y_scaled~x, family = poisson)
no, this looks terrible
plot(use.pois)
GAM not better: negative values and poor fit
pois_gam <- mgcv::gam(y_scaled~s(x), family = poisson)
plot(pois_gam, residuals = TRUE)
quasi poisson?
use.qp<- glm(y_scaled~x, family = quasipoisson)
plot(use.qp)
qp_gam <- mgcv::gam(y_scaled~s(x), family = quasipoisson)
plot(qp_gam, residuals = TRUE)
try gaussian
gaus <- lm(y_scaled~x)
plot(gaus)
problems, q-q plot is a disaster
gaus_gam <- mgcv::gam(y_scaled ~ s(x), family = gaussian)
not worse than other options?
plot(gaus_gam, residuals = TRUE)
how about tweedie??
EDIT Here's a simulation that might be a bit more transparent:
set.seed(666)
x <- 1:100
scale_vec <- sample(c(7, 35), 100, replace = TRUE)
y_orig <- rnbinom(100, size = 0.2, mu = scale_vec*x/10 )
y_scaled <- y_orig / scale_vec
offset(log(time_period))to the formula wheretime_periodis a variable contains the amount of time that each observations counts represent or were collected over. – Gavin Simpson Nov 30 '22 at 15:52time_periodvalues are known (my use case), a better approach is modelling the counts as the response rather than the normalized values, using an offset as you suggested. Thanks! – Michael Roswell Nov 30 '22 at 16:42