I was reading this blog post:
the author describes a model to predict how many hours a day time a mammal sleeps, given its brain mass. Of course, since you cannot sleep more than 24 hrs/day, the author doesn't directly use standard linear regression, but divides the response by 24, so that it's bounded between 0 and 1 (and then he multiplies it back by 24 when making predictions, but I'm not interested in the prediction part: I'm concerned about the choice of statistical model). It also takes the log of brain mass, because the relationship between hours slept per day and brain mass seems linear on the semilog scale. Here's the relevant code, if you don't want to look at the link:
# required libraries
library(dplyr)
library(ggplot2)
library(rstanarm)
# filter data and transform some variables
msleep <- msleep %>%
filter(!is.na(brainwt)) %>%
mutate(log_brainwt = log10(brainwt),
log_bodywt = log10(bodywt),
log_sleep_total = log10(sleep_total),
logit_sleep_ratio = qlogis(sleep_total/24))
# fit model
m1 <- stan_glm(
logit_sleep_ratio ~ log_brainwt,
family = gaussian(),
data = msleep,
prior = normal(0, 3),
prior_intercept = normal(0, 3))
Unless I'm mistaken, the model used here is a linear regression of the logit of the normalized hours of sleeps, on the log of brain mass, considering Gaussian errors. I don't think this is the best approach for this problem (definitely it's not the only possible one). Which approach would be appropriate? Logistic regression is not appropriate, in my opinion, because the response isn't a binary variable, but it's a true percentage. I think Beta regression would be better, but I'm open to other suggestions. I would expecially be interested in pros and cons of different suitable approaches.
Of course, statistical modeling cannot be done in a vacuum, so here is a simple plot of the data:
library(ggrepel)
# simplify some names for plotting
ex_mammals <- c("Domestic cat", "Human", "Dog", "Cow", "Rabbit",
"Big brown bat", "House mouse", "Horse", "Golden hamster")
renaming_rules <- c(
"Domestic cat" = "Cat",
"Golden hamster" = "Hamster",
"House mouse" = "Mouse")
ex_points <- msleep %>%
filter(name %in% ex_mammals) %>%
mutate(name = stringr::str_replace_all(name, renaming_rules))
# create labels
lab_lines <- list(
brain_log = "Brain mass (kg., log-scaled)",
sleep_raw = "Sleep per day (hours)",
sleep_log = "Sleep per day (log-hours)"
)
# finally, make the actual plot
ggplot(msleep) +
aes(x = brainwt, y = sleep_total) +
geom_point(color = "grey40") +
# Circles around highlighted points + labels
geom_point(size = 3, shape = 1, color = "grey40", data = ex_points) +
geom_text_repel(aes(label = name), data = ex_points) +
# Use log scaling on x-axis
scale_x_log10(breaks = c(.001, .01, .1, 1)) +
labs(x = lab_lines$brain_log, y = lab_lines$sleep_raw)

posterior_predict()function to predict new observations in the range of the original data (see paragraph Option 3: Mean and 95% interval for model-generated data). For a log-brain mass of -3.853872 (which corresponds to the lowest non-NA value in the original data set: you can check) he gets an upper bound of the 95% prediction interval which is 1.577798, corresponding to $10^{1.577798} \approx 38$ hours. This means that about 5% of the 4000 new observations his model generates for this brain size... – DeltaIV Aug 29 '17 at 21:20If this question is an endeavor about a showcase of methods in combination with ggplot, then use different data.
– Sextus Empiricus Sep 01 '17 at 10:27betaregpackage, the stan_betareg function does not work well with me) gives me $\frac{sleep}{awake} = 0.088 , {w_{brain}}^{-0.42}$. It is all very close and before trying to get a "best" or 'appropriate' fit (where I do not see the problem with the 1st method), the data (and the goals of the fitting) should be described more closely. – Sextus Empiricus Sep 04 '17 at 04:16