Some Open Data and Research Questions
Kjetil's answer was already sufficient for my question, but in case a worked example on open data is helpful, I am providing one here. The data come from the book Statistics for Linguists and are used to test the Nettle Hypothesis, the idea that ecological diversity predicts the number of languages in a country. Importantly, larger countries tend to have more languages (more groups of people speaking different languages), so the outcome is better treated as a rate: the expected number of languages $\mu$ divided by the exposure $\tau$, here the country's area. The rate can be formulated as:
$$
\text{languages per square mile} = \lambda = \frac{\mu}{\tau}
$$
and the corresponding linear predictor, where MGS is the mean growing season and $\log(\tau)$ enters as an offset, as:
$$
\log(\mu) = \beta_0 + \beta_1\text{MGS} + \log(\tau)
$$
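For readers who want the intermediate steps, the linear predictor follows from putting the log link on the rate rather than on the count. A short derivation, using only the symbols already defined above:
$$
\begin{aligned}
\log(\lambda) &= \beta_0 + \beta_1\text{MGS} \\
\log\left(\frac{\mu}{\tau}\right) &= \beta_0 + \beta_1\text{MGS} \\
\log(\mu) - \log(\tau) &= \beta_0 + \beta_1\text{MGS} \\
\log(\mu) &= \beta_0 + \beta_1\text{MGS} + \log(\tau)
\end{aligned}
$$
Note that $\log(\tau)$ has no coefficient of its own. It is fixed at 1, which is what distinguishes an offset from an ordinary predictor.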
Application in R
The data can be found in the dput below:
nettle <- structure(list(Country = c("Algeria", "Angola", "Australia",
"Bangladesh", "Benin", "Bolivia", "Botswana", "Brazil", "Burkina Faso",
"CAR", "Cambodia", "Cameroon", "Chad", "Colombia", "Congo", "Costa Rica",
"Cote d'Ivoire", "Cuba", "Ecuador", "Egypt", "Ethiopia", "French Guiana",
"Gabon", "Ghana", "Guatemala", "Guinea", "Guyana", "Honduras",
"India", "Indonesia", "Kenya", "Laos", "Liberia", "Libya", "Madagascar",
"Malawi", "Malaysia", "Mali", "Mauritania", "Mexico", "Mozambique",
"Myanmar", "Namibia", "Nepal", "Nicaragua", "Niger", "Nigeria",
"Oman", "Panama", "Papua New Guinea", "Paraguay", "Peru", "Philippines",
"Saudi Arabia", "Senegal", "Sierra Leone", "Solomon Islands",
"Somalia", "South Africa", "Sri Lanka", "Sudan", "Suriname",
"Tanzania", "Thailand", "Togo", "UAE", "Uganda", "Vanuatu", "Venezuela",
"Vietnam", "Yemen", "Zaire", "Zambia", "Zimbabwe"), Population = c(4.41,
4.01, 4.24, 5.07, 3.69, 3.88, 3.13, 5.19, 3.97, 3.5, 3.93, 4.09,
3.76, 4.53, 3.37, 3.49, 4.1, 4.03, 4.04, 4.74, 4.73, 2.01, 3.08,
4.19, 3.98, 3.77, 2.9, 3.72, 5.93, 5.27, 4.41, 3.63, 3.43, 3.67,
4.06, 3.93, 4.26, 3.98, 3.31, 4.94, 4.21, 4.63, 3.26, 4.29, 3.6,
3.9, 5.05, 3.19, 3.39, 3.58, 3.64, 4.34, 4.8, 4.17, 3.88, 3.63,
3.52, 3.89, 4.56, 4.24, 4.41, 2.63, 4.45, 4.75, 3.56, 3.21, 4.29,
2.21, 4.31, 4.83, 4.09, 4.56, 3.94, 4), Area = c(6.38, 6.1, 6.89,
5.16, 5.05, 6.04, 5.76, 6.93, 5.44, 5.79, 5.26, 5.68, 6.11, 6.06,
5.53, 4.71, 5.51, 5.04, 5.45, 6, 6.09, 4.95, 5.43, 5.38, 5.04,
5.39, 5.33, 5.05, 6.52, 6.28, 5.76, 5.37, 5.05, 6.25, 5.77, 5.07,
5.52, 6.09, 6.01, 6.29, 5.9, 5.83, 5.92, 5.15, 5.11, 6.1, 5.97,
5.33, 4.88, 5.67, 5.61, 6.11, 5.48, 6.33, 5.29, 4.86, 4.46, 5.8,
6.09, 4.82, 6.4, 5.21, 5.98, 5.71, 4.75, 4.92, 5.37, 4.09, 5.96,
5.52, 5.72, 6.37, 5.88, 5.59), MGS = c(6.6, 6.22, 6, 7.4, 7.14,
6.92, 4.6, 9.71, 5.17, 8.08, 8.44, 9.17, 4, 11.37, 9.6, 8.92,
8.67, 7.46, 8.14, 0.89, 7.28, 10.4, 8.79, 8.79, 9.31, 7.38, 12,
8.54, 5.32, 10.67, 7.26, 7.14, 10.62, 2.43, 7.33, 5.8, 11.92,
3.59, 0.75, 5.84, 6.07, 6.93, 2.5, 6.39, 8.13, 2.4, 7, 0, 9.2,
10.88, 10.25, 2.65, 10.34, 0.4, 3.58, 8.22, 12, 3, 6.05, 9.59,
4.02, 12, 7.02, 8.04, 7.91, 0.83, 10.14, 12, 7.98, 8.8, 0, 9.44,
5.43, 5.29), Langs = c(18L, 42L, 234L, 37L, 52L, 38L, 27L, 209L,
75L, 94L, 18L, 275L, 126L, 79L, 60L, 10L, 75L, 1L, 22L, 11L,
112L, 11L, 40L, 73L, 52L, 29L, 14L, 9L, 405L, 701L, 58L, 93L,
34L, 13L, 4L, 14L, 140L, 31L, 8L, 243L, 36L, 105L, 21L, 102L,
7L, 21L, 427L, 8L, 13L, 862L, 21L, 91L, 168L, 8L, 42L, 23L, 66L,
14L, 32L, 7L, 134L, 17L, 131L, 82L, 43L, 9L, 43L, 111L, 40L,
88L, 6L, 219L, 38L, 18L)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -74L))
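To test the Nettle Hypothesis, the model is fit with mean growing season (MGS) as the predictor and country area as the exposure, entered through offset(). Area appears to be stored on a log scale already (its values run from about 4 to 7), so it is passed to offset() directly rather than as offset(log(Area)):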
#### Load Libraries ####
library(broom)
library(tidyverse)
#### Fit Exposure Model ####
fit <- glm(Langs ~ MGS + offset(Area),
           data = nettle,
           family = "poisson")
tidy(fit)
This gives the following output, where MGS has a positive association with the rate of languages per unit area in a given country:
# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)   -2.82    0.0407      -69.3       0
2 MGS            0.209   0.00472      44.3       0
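Because the model uses a log link, the coefficients are easier to read after exponentiating. The sketch below does this by hand from the rounded estimates and standard errors printed above, so the results are approximate; with the fitted object available, `exp(coef(fit))` should give the unrounded equivalent. The Wald-style 95% interval is an assumption of this sketch and is not part of the reported output.

```r
# Rounded values copied from the tidy(fit) output above
est <- c("(Intercept)" = -2.82, MGS = 0.209)
se  <- c("(Intercept)" = 0.0407, MGS = 0.00472)

# Exponentiate to move from the log scale to multiplicative rate ratios
rate_ratio <- exp(est)

# Approximate 95% Wald intervals, back-transformed to the same scale
z  <- qnorm(0.975)
ci <- exp(cbind(lower = est - z * se,
                upper = est + z * se))

round(cbind(rate_ratio, ci), 3)
```

The MGS rate ratio comes out at roughly 1.23. Under this model, each additional unit of MGS (months, judging by its 0 to 12 range) is associated with an expected rate of languages that is about 23% higher.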
The code below plots both fitted lines: the offset model in green (evaluated at the mean of Area) and a Poisson model without an offset in red:
#### Get Prediction Data ####
nd <- data.frame(
  MGS = seq(
    min(nettle$MGS),
    max(nettle$MGS),
    length.out = 200
  ),
  Area = mean(nettle$Area)
)
pd <- predict(fit,
              newdata = nd,
              type = "response")
pred <- data.frame(MGS = nd$MGS,
                   Pred = pd)
#### Plot Data Comparisons ####
nettle %>%
  ggplot(aes(x = MGS,
             y = Langs)) +
  geom_point() +
  geom_line(data = pred,
            mapping = aes(x = MGS,
                          y = Pred),
            col = "darkgreen",
            linewidth = 1) +
  stat_smooth(
    method = "glm",
    method.args = list(family = poisson()),
    se = FALSE,
    color = "darkred",
    formula = y ~ x
  ) +
  theme_bw() +
  labs(x = "Mean Growing Season",
       y = "Languages Within Country",
       title = "The Nettle Hypothesis")
One can see that the slope for MGS is steeper in the exposure model than in the standard model, indicating that including the offset changes the predicted number of languages in a given country.
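The difference between the two slopes can also be checked numerically rather than read off the plot. The sketch below is self-contained, so it uses only the first eight rows of the dput above (Algeria to Brazil); its numbers will therefore not match the full-data output, and the full `nettle` data frame can be swapped in for `sub`. The object names (`sub`, `fit_offset`, `fit_arg`, `fit_plain`, `rate`) are illustrative only.

```r
# First eight rows copied from the dput above (Algeria ... Brazil)
sub <- data.frame(
  Area  = c(6.38, 6.1, 6.89, 5.16, 5.05, 6.04, 5.76, 6.93),
  MGS   = c(6.6, 6.22, 6, 7.4, 7.14, 6.92, 4.6, 9.71),
  Langs = c(18L, 42L, 234L, 37L, 52L, 38L, 27L, 209L)
)

# Exposure model: Area is passed to offset() as-is, as in the fit above
fit_offset <- glm(Langs ~ MGS + offset(Area), data = sub, family = poisson)

# Same model written with glm()'s offset argument instead of the formula term
fit_arg <- glm(Langs ~ MGS, offset = Area, data = sub, family = poisson)

# Standard Poisson model with no offset
fit_plain <- glm(Langs ~ MGS, data = sub, family = poisson)

# MGS slopes side by side (log scale)
c(offset = coef(fit_offset)[["MGS"]],
  plain  = coef(fit_plain)[["MGS"]])

# Predicted rate per unit of exposure: setting Area = 0 zeroes the offset,
# leaving exp(beta0 + beta1 * MGS)
nd_rate <- data.frame(MGS = c(4, 8, 12), Area = 0)
rate <- predict(fit_offset, newdata = nd_rate, type = "response")
rate
```

Writing the offset in the formula or through the `offset` argument should give the same coefficients; the formula version is convenient here because `predict()` then picks up Area from `newdata` automatically.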
