Motivation Question

As the title suggests, I'm looking for examples of offsets in regression that make sense in my realm of research (early child development of reading and writing). To be clear, I know in general terms what an offset is: a way to incorporate a known, fixed quantity into a regression. From what I gather, it adds a term to the linear predictor whose coefficient is fixed (conventionally at 1) rather than estimated.

However, the examples I have seen online don't make it clear to me how this is practically useful. I was curious if somebody could give an example of when it would be useful in my own field (even a loose heuristic example) to make the point clearer. I ask because I have pretty much never seen offsets used in regressions in my field, but they seem to appear more often in the "hard" sciences. I'm wondering if that is by design (offsets not being useful for literacy studies) or by accident (people in my field are simply unaware of this modeling technique).

Typical Predictors

To aid the discussion, some typical predictors of literacy in my field include:

  • Vocabulary ability
  • Listening comprehension
  • Morphological awareness (the ability to construct words with morphemes)
  • Phonological awareness (an understanding of sound rules in writing)
  • Age
  • Nonverbal IQ
  • Socio-economic status

I'm guessing that some information like this can be incorporated into the model a priori with some offset. One case I considered is if we know that some student's neighborhood will double the word reading rate of some population before conducting the study. Could this be a viable use of an offset?

  • What does your regression look like exactly, mathematically? For a log-linked model, such an offset converts a count into a rate (a word I see appear in your last paragraph), so that would be the use case where it probably appears most. In formula form, the offset $t$ converts your model $\log(\mu)=X\beta$ into $\log(\mu/t)=X\beta$, which is equivalent by the rules of logs to $\log(\mu)=X\beta + \log(t)$. For other link functions an offset may be less meaningful, or difficult if not impossible to interpret. – PBulls Dec 13 '23 at 09:10
  • I do not have a specific regression in mind. This is more just to find out when such techniques are practically useful for my domain. – Shawn Hemelstrand Dec 13 '23 at 09:17
  • When will ratios occur? For instance, if you are counting words but some denominator of interest is not constant (time, since children might have different attention spans, or texts of different lengths). Did you have a look at https://stats.stackexchange.com/questions/142338/goodness-of-fit-and-which-model-to-choose-linear-regression-or-poisson/142353#142353 ? – kjetil b halvorsen Dec 13 '23 at 22:23
  • Here is a paper close to your field that seems to use offsets: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7454245/ – kjetil b halvorsen Dec 14 '23 at 03:08
  • Thanks. That's actually exactly what I was looking for. – Shawn Hemelstrand Dec 14 '23 at 04:07

2 Answers

Offsets are mostly used for modeling ratios, so ask when ratios occur in your field. For instance, you might be counting words while some denominator of interest is not constant (observation time, since children might have different attention spans, or texts of different lengths). Have a look at "Goodness of fit and which model to choose linear regression or Poisson" (https://stats.stackexchange.com/questions/142338/goodness-of-fit-and-which-model-to-choose-linear-regression-or-poisson/142353#142353).

I searched and found a paper close to your field that seems to use offsets: "Letter Teaching in Parent–Child Conversations".
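To make the varying-denominator idea concrete, here is a minimal sketch in R on simulated data. Everything in it is hypothetical: the variable names (`vocab`, `minutes`, `words`), the sample size, and the "true" coefficients are assumptions chosen for illustration, not values from any study.

```r
# Hypothetical simulated data: none of these values come from a real study.
set.seed(1)
n <- 200
vocab   <- rnorm(n)                     # made-up standardized vocabulary score
minutes <- runif(n, min = 2, max = 10)  # observation time differs per child

# Assumed true reading rate (words per minute) depends on vocabulary
rate  <- exp(1 + 0.3 * vocab)
words <- rpois(n, lambda = rate * minutes)  # observed word counts

# offset(log(minutes)) fixes the coefficient of log(minutes) at 1,
# so the estimated coefficients describe words per minute, not raw counts
fit_rate <- glm(words ~ vocab + offset(log(minutes)), family = poisson)
coef(fit_rate)
```

Under these assumptions, `exp()` of the intercept estimates the words-per-minute rate at `vocab = 0`, and the `vocab` coefficient is the change in the log rate per unit of vocabulary.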

Some Open Data and Research Questions

Kjetil's answer was already sufficient for my question, but I thought it might be helpful to show a worked example on open data. The data comes from the book Statistics for Linguists and tests the Nettle Hypothesis, which is the idea that ecological diversity predicts the number of languages in a country. Importantly, larger countries tend to have more languages (with more groups of people speaking different languages), so it makes sense to model a rate. Writing $\mu$ for the expected number of languages and $\tau$ for the country's area, the rate can be formulated as:

$$ \text{languages per square mile} = \lambda = \frac{\mu}{\tau} $$

and the linear equation as:

$$ \log(\mu) = \beta_0 + \beta_1\text{MGS} + \log(\tau) $$
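
As a sketch of the step between the two equations above (assuming the usual log link, with $\mu$ the expected number of languages and $\tau$ the country's area), the regression is placed on the log of the rate and then rearranged:

$$ \log(\lambda) = \log\left(\frac{\mu}{\tau}\right) = \log(\mu) - \log(\tau) = \beta_0 + \beta_1\text{MGS} $$

Moving $\log(\tau)$ to the right-hand side gives the form with the offset, whose coefficient is fixed at 1 rather than estimated.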

Application in R

The data can be found in the dput below:

nettle <- structure(list(Country = c("Algeria", "Angola", "Australia", 
"Bangladesh", "Benin", "Bolivia", "Botswana", "Brazil", "Burkina Faso", 
"CAR", "Cambodia", "Cameroon", "Chad", "Colombia", "Congo", "Costa Rica", 
"Cote d'Ivoire", "Cuba", "Ecuador", "Egypt", "Ethiopia", "French Guiana", 
"Gabon", "Ghana", "Guatemala", "Guinea", "Guyana", "Honduras", 
"India", "Indonesia", "Kenya", "Laos", "Liberia", "Libya", "Madagascar", 
"Malawi", "Malaysia", "Mali", "Mauritania", "Mexico", "Mozambique", 
"Myanmar", "Namibia", "Nepal", "Nicaragua", "Niger", "Nigeria", 
"Oman", "Panama", "Papua New Guinea", "Paraguay", "Peru", "Philippines", 
"Saudi Arabia", "Senegal", "Sierra Leone", "Solomon Islands", 
"Somalia", "South Africa", "Sri Lanka", "Sudan", "Suriname", 
"Tanzania", "Thailand", "Togo", "UAE", "Uganda", "Vanuatu", "Venezuela", 
"Vietnam", "Yemen", "Zaire", "Zambia", "Zimbabwe"), Population = c(4.41, 
4.01, 4.24, 5.07, 3.69, 3.88, 3.13, 5.19, 3.97, 3.5, 3.93, 4.09, 
3.76, 4.53, 3.37, 3.49, 4.1, 4.03, 4.04, 4.74, 4.73, 2.01, 3.08, 
4.19, 3.98, 3.77, 2.9, 3.72, 5.93, 5.27, 4.41, 3.63, 3.43, 3.67, 
4.06, 3.93, 4.26, 3.98, 3.31, 4.94, 4.21, 4.63, 3.26, 4.29, 3.6, 
3.9, 5.05, 3.19, 3.39, 3.58, 3.64, 4.34, 4.8, 4.17, 3.88, 3.63, 
3.52, 3.89, 4.56, 4.24, 4.41, 2.63, 4.45, 4.75, 3.56, 3.21, 4.29, 
2.21, 4.31, 4.83, 4.09, 4.56, 3.94, 4), Area = c(6.38, 6.1, 6.89, 
5.16, 5.05, 6.04, 5.76, 6.93, 5.44, 5.79, 5.26, 5.68, 6.11, 6.06, 
5.53, 4.71, 5.51, 5.04, 5.45, 6, 6.09, 4.95, 5.43, 5.38, 5.04, 
5.39, 5.33, 5.05, 6.52, 6.28, 5.76, 5.37, 5.05, 6.25, 5.77, 5.07, 
5.52, 6.09, 6.01, 6.29, 5.9, 5.83, 5.92, 5.15, 5.11, 6.1, 5.97, 
5.33, 4.88, 5.67, 5.61, 6.11, 5.48, 6.33, 5.29, 4.86, 4.46, 5.8, 
6.09, 4.82, 6.4, 5.21, 5.98, 5.71, 4.75, 4.92, 5.37, 4.09, 5.96, 
5.52, 5.72, 6.37, 5.88, 5.59), MGS = c(6.6, 6.22, 6, 7.4, 7.14, 
6.92, 4.6, 9.71, 5.17, 8.08, 8.44, 9.17, 4, 11.37, 9.6, 8.92, 
8.67, 7.46, 8.14, 0.89, 7.28, 10.4, 8.79, 8.79, 9.31, 7.38, 12, 
8.54, 5.32, 10.67, 7.26, 7.14, 10.62, 2.43, 7.33, 5.8, 11.92, 
3.59, 0.75, 5.84, 6.07, 6.93, 2.5, 6.39, 8.13, 2.4, 7, 0, 9.2, 
10.88, 10.25, 2.65, 10.34, 0.4, 3.58, 8.22, 12, 3, 6.05, 9.59, 
4.02, 12, 7.02, 8.04, 7.91, 0.83, 10.14, 12, 7.98, 8.8, 0, 9.44, 
5.43, 5.29), Langs = c(18L, 42L, 234L, 37L, 52L, 38L, 27L, 209L, 
75L, 94L, 18L, 275L, 126L, 79L, 60L, 10L, 75L, 1L, 22L, 11L, 
112L, 11L, 40L, 73L, 52L, 29L, 14L, 9L, 405L, 701L, 58L, 93L, 
34L, 13L, 4L, 14L, 140L, 31L, 8L, 243L, 36L, 105L, 21L, 102L, 
7L, 21L, 427L, 8L, 13L, 862L, 21L, 91L, 168L, 8L, 42L, 23L, 66L, 
14L, 32L, 7L, 134L, 17L, 131L, 82L, 43L, 9L, 43L, 111L, 40L, 
88L, 6L, 219L, 38L, 18L)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -74L))

To test the Nettle Hypothesis, the model below estimates the effect of mean growing season (MGS) on the number of languages, with landmass ("Area") entering as an offset so that the outcome is effectively a rate per unit area:

#### Load Libraries ####
library(broom)
library(tidyverse)

#### Fit Exposure Model ####
# Area is passed to offset() as-is: its values appear to be already log-transformed
fit <- glm(Langs ~ MGS + offset(Area), data = nettle, family = 'poisson')
tidy(fit)

Giving the following output, where MGS has an overall positive association with prevalence of languages in a given country:

# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)   -2.82    0.0407      -69.3       0
2 MGS            0.209   0.00472      44.3       0

The plotted lines are shown below (one with an offset in green and one without in red):

#### Get Prediction Data ####
nd <- data.frame(
  MGS = seq(
    min(nettle$MGS),
    max(nettle$MGS),
    length.out = 200
  ),
  Area = mean(nettle$Area)
)

pd <- predict(fit, newdata=nd, type = "response")

pred <- data.frame(MGS = nd$MGS, Pred = pd)

#### Plot Data Comparisons ####
nettle %>%
  ggplot(aes(x = MGS, y = Langs)) +
  geom_point() +
  # Offset (exposure) model predictions, evaluated at mean Area
  geom_line(data = pred, mapping = aes(x = MGS, y = Pred),
            col = 'darkgreen', linewidth = 1) +
  # Standard Poisson fit without an offset
  stat_smooth(method = "glm",
              method.args = list(family = poisson()),
              se = FALSE, color = "darkred", formula = y ~ x) +
  theme_bw() +
  labs(x = "Mean Growing Season",
       y = "Languages Within Country",
       title = "The Nettle Hypothesis")

With the exposure model, the fitted curve is steeper than in the standard model, indicating that including an offset alters our predictions of language diversity in a given country.

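[Figure: scatterplot of Languages Within Country against Mean Growing Season, titled "The Nettle Hypothesis", with the offset model's fit in green and the no-offset Poisson fit in red]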

  • Trying to replicate your code, it seems there is an issue in your first block of code that defines the variable nettle, with the parameter problems = <pointer: 0x0000023944070c90>. R (v. 4.1.3) throws an error ("unexpected '<'"). – J-J-J Feb 19 '24 at 08:30
  • Sorry about that! I tried fixing the dput and edited the answer accordingly. Let me know if it still has problems. – Shawn Hemelstrand Feb 19 '24 at 08:37
  • Works like a charm, thanks! – J-J-J Feb 19 '24 at 08:42