3

I am trying to model the effect of heat (in particular of its constituent components air temperature, humidity, solar radiation and wind speed) on daily number of deaths using a generalized additive model (GAM). Because deaths may occur not just on the same day but also over a period of time following particularly high exposure, distributed lag needs to be included in the model. The effects on death are likely to be non-linear and there are potentially also non-linear interactions between these 4 weather/heat variables. In addition there is rain fall, which is thought to be a confounder because if affects both heat and deaths, and so needs to be adjusted for. The aim of the modelling is to improve understanding of how day-to-day variability in these 4 heat variables are related to variability in death (rather than pure prediction performance). I am not interested in understanding the effects of seasonality and long-term trend and am trying to "remove" these.

The daily time series data (5,113 days), spanning 14 years from 2001-2014, looks as follows:

           date doy year time heap heap_bin deaths  temp    sh    rh  solar wind2m precip_hourly precip_daily_total
1016 2003-10-13 286 2003 1016    0        0      3 32.37 19.10 61.19 701.45   0.50          0.43               4.58
1017 2003-10-14 287 2003 1017    0        0      3 29.97 19.53 71.69 773.80   2.02          0.68               4.60
1018 2003-10-15 288 2003 1018   34        1     21 29.06 18.43 71.56 852.81   1.59          0.73               3.47
1019 2003-10-16 289 2003 1019    0        0      4 27.54 18.01 76.38 701.43   1.57          0.76               6.32
1020 2003-10-17 290 2003 1020    0        0      2 29.86 17.76 65.88 669.89   2.15          0.36               3.49
1021 2003-10-18 291 2003 1021    0        0      5 31.03 16.91 58.56 857.88   3.14          0.03               0.21

Explanation of some variables: doy = day of year, running from 1-366, time = a counter of days running from 1 – 5,113, temp = temperature, sh = specific humidity, solar = solar irradiance, wind2m = wind speed and precip_daily_total = total daily rain fall. Heap and heap_bin are explained below.

The data set (.rds and .csv) can be found on my (newly created) GitHub site.

I am struggling with two issues, which may impact on each other, hence a single post.

Problem 1: large measurement error in deaths

All deaths for which no specific date was known, were assigned to the 15th day of the month in which the deaths were thought to have occurred, a so-called heaping effect. The median number of deaths per day is 2 (data from a small population), but on the 15th of each month there is a substantial increase (max = 48) which can account for about a third of all deaths in a month, depending on the month. (The data is from a low-income country without resources for adequate registration of deaths and births. Data were collected via household interviews approx. 1-3 times per year). I have coded heaping in the data in two ways:

  • as a binary categorical variable called heap_bin (0 for non-heaping day, 1 for heaping day)
  • as a categorical variable with 169 levels called heap (0 for all non-heaping days, 1 for the first heaping day, 2 for the second … 168 for the final heaping day over the 14 year period). This variable has a non-linear relationship with deaths (see image below). Note that level 0 has 4,945 days, whereas levels 1-168 each only have one day.

enter image description here

I am unsure how to model this heaping. I see at least 4 options:

  1. as a binary (linear) variable (heap_bin) – though this seems too simplistic - e.g. as in R/ mgcv pseudo code:
log(deaths) =  te(year, day) + heap_bin + 
     te(temp, lag) + …. 
  1. as a linear categorical variable with 169 levels (heap), e.g.
log(deaths) =  te(year, day) + heap + 
       te(temp,lag) + ….
  1. as a factor smooth interaction using heap_bin as a by variable, e.g.
log(deaths) =  te(year, day, by = heap_bin) + 
       heap_bin + te(temp, lag) + ...
  1. with a factor smooth, along the lines
s(year, doy, heap, bs="fs")

or

s(time, heap, bs="fs")

? This feels to me as the most appropriate way.

But
a. is it possible to have 3 variables in this smooth and have different splines for each? I'd like to model doy with a cyclic spline
b. is it possible to have such a smooth in the model alongside te() or ti() interactions?
c. does this work with REML,
d. can such models be compared with AIC and
e. how data-hungry is this approach, given that besides modelling time, I need to specify 5 lagged "main effects" (i.e. lagged temperature , lagged humidity, lagged solar, lagged wind and lagged precipitation) and explore interactions between these, potentially including a 5-way interaction (one can dream...)

  1. some other way?

Problem 2: specifying interactions, including those with distributed lag

I want to find out if besides main effects for (lagged variables) temp, sh, wind2m and solar, there are also interactions between some or indeed all of these 4 weather variables. Interactions can be specified excluding marginal/ main effects (using ti() in mgcv in R) or including main effects (using te()). I am unsure what the best approach is in this case.

Simon Wood provides an example on air pollution related deaths (2017, Generalized Additive models, p. 352, see also p.55 of accompanying code/ documentation in gamair package manual). He models distributed lag as an interaction which includes the main effect, i.e. as

log(deaths) = s(time) + te(particulate_matter, 
       lag) + te(ozone, temperature, lag)

So following that example (and ignoring the heaping issue above for a moment), my model might look like:

log(deaths) = te(year, doy) + te(temp, lag) + 
      te(sh, lag) + te(wind2m, lag) + 
      te(solar, lag) + 
      te(temp, sh, wind2m, solar, lag)

Note that I have modelled te(year, doy) instead of s(time) to allow for an interaction between the seasonality pattern doy and long-term trend year to account for El-Nino/ la Nina effects.

However, am I “allowed” to specify te() interactions that include terms that are also used in other te() interactions in the same model? Don’t the main effects for temp, lag and sh get “duplicated” in this way? In other words, can I specify:

log(deaths) = te(temp, lag) + te(temp, sh, lag) + 
      te(temp, sh, wind2m, solar, lag) + …

If the above approach is not allowed, I guess I would have to model with ti() interactions that separate the main effect from the interaction effect, something like:

log(deaths) = s(year) + s(doy) + ti(year, doy) + 
    s(temp) + s(lag) + ti(temp, lag) + s(sh) + 
    ti(sh, lag) + s(wind2m) + ti(wind2m, lag) + 
    s(solar) + ti(solar, lag) + 
    ti(temp, sh, wind2m, solar, lag)

Gavin Simpson explains that this approach can be useful for establishing whether or not you need certain interactions. He also provides a great example on Galveston Bay, using this type of interaction. But I’m not sure how I should interpret the s(lag) term in this case. What would “main effect of lag” (i.e. the term s(lag)) mean??

Any help would really be appreciated, as I am struggling quite a bit with this.

For anyone interested, this is (part of) my code so far:

library(readr)
library(mgcv)

df <- read_rds("data_crossvalidated_post.rds")

Create matrices for 7-day lagged variables based on example by Simon Wood (ref in post)

lagard <- function(x,n.lag=7) { n <- length(x); X <- matrix(NA, n, n.lag) for (i in 1:n.lag) X[i:n,i] <- x[i:n-i+1] X }

dat <- list(lag=matrix(0:6, nrow(df), 7, byrow=TRUE), deaths=df$deaths, doy=df$doy, year = df$year, time = df$time, heap=df$heap, heap_bin = df$heap_bin) dat$precip_hourly <- lagard(df$precip_hourly) dat$precip_daily_total <- lagard( df$precip_daily_total) dat$temp <- lagard(df$temp) dat$sh <- lagard(df$sh) dat$rh <- lagard(df$rh) dat$solar <- lagard(df$solar) dat$wind2m <- lagard(df$wind2m)

knots <- list(doy=c(0.5, 366.5)) # set knots for # cyclic spline for day of year

model currently being attempted to run

m3 <- gam(deaths~te(year, doy, bs = c('cr', 'cc'), k = c(7, 50)) + heap + te(temp, lag, k=c(8, 4), bs = c('cr', 'cr')) + te(sh, lag, k=c(8,4), bs = c('cr', 'cr')) + te(solar, lag, k=c(8,4), bs = c('cr', 'cr')) + te(wind2m, lag, k=c(8,4), bs = c('cr', 'cr')) + te(precip_daily_total, lag, k=c(8,4), bs = c('cr', 'cr')) + te(temp, sh, solar, wind2m, lag, bs = c('cr', 'cr', 'cr', 'cr', 'cr')), data = dat, family = poisson, method = 'REML', knots = knots)

Jade
  • 351
  • I have now read that, as I suspected, it is not a good idea to use te() interactions with fully nested models. This means I'll have to model with ti() interactions. I'm also thinking that a factor smooth with bs="fs" for heaping_bin is the best way to deal with the heaping problem. But can such a factor smooth be included in a ti() interaction between year and doy ? If so, is it possible to set a cyclic spline for doy and a thin plate regression spline for year? – Jade May 12 '22 at 19:47

0 Answers0