Regression modelling: fitting a model to explain why variability of a group of subjects is higher than another

Question

i created a toy example data below to give readers a sense of the data structure. Y is the continuous, dependent variable. I know that the standard deviation of Y of the subjects in the "Before" Treatment group is much higher than that in the "After" Treatment group.

And my goal is to identify the factors that can explain the higher variability of Y in the "Before" treatment.

Which regression model is suitable for this ? And do I have to melt / aggregate the dataframe ?

Avoid adding code, data or output as images for these reasons. In R (not sure you use R), there are several ways to provide data, probably adding the output of dput(<yourData>) to your question is easiest. Also explain what situation your toy data is suppose to represent. For example, what does it mean that patient A has a before measurement and after measurement on days 1, ..., 5? — dipetkov, Feb 22 '24 at 10:40
Did you look at the answers to https://stats.stackexchange.com/questions/505490/regression-with-variance-as-outcome ? — David B, Feb 22 '24 at 15:17
There is a clear, single factor fully explaining the differences in variability of $Y:$ Before vs. After. Indeed, $Y$ appears to be expressed on two different scales that can be made commensurate simply by dividing all the "Before" values by 10. — whuber, Feb 25 '24 at 14:49
@dipetkov, my apologies. The data are randomly generated. So even if you plug the data into any model, nothing meaningful will be produced. — user1769197, Feb 28 '24 at 02:47

score 6 · Answer 1 · answered Feb 23 '24 at 13:33

Your situations combines multiple features that make this a bit trickier than usual, namely that you want to model differences in distribution parameters (also known as distributional regression) and that you have multiple measurements per subject (repeated measures). You should somehow reflect that observations from the same subject are likely somehow correlated (in the example below I allow for completely unstructured covariance matrix), for which you could choose a MMRM (mixed effects model for repeated measures). The default implementation thereof (e.g. in the mmrm R package or PROC MIXED in SAS) would let you estimate different covariance structures (and let you constrain them to have the same variance for each day, if you want) for the two groups, but would not let you model explanatory factors for the variance. That's mostly a lack of flexibility of existing software rather than a theoretical problem, but at least the brms R package lets you do this (the below assumes DAY and Treatment are factor variables):

brm(Y ~ 1 + Day+ Treatment + Day:Treatment + X1 + X2,
    autocor = ~unstr(time=Day, gr=Subject),
    sigma ~ 1 + Day + Treatment + Day:Treatment + X1 + X2)

This all works with the existing data structure you have and would deal gracefully with any missing days for some subjects (which some other formulations of the problem would not deal with this so easily).

Whether the particular models for the mean outcomes and the residual standard deviation (or rather the log-thereof, because brms will by default use a log-link for sigma parameter) is what you want to look into depends on what you want to investigate. E.g. you could say the SD is the same on all days and omit Day terms in the model for sigma, or you could investigate whether differences in X1 or X2 explain differences in sigma (in which case you might have them as main effects as above), or whether Day/X1/X2 influence the variance differently by group (in which case interactions with Treatment make sense).

Note that brms is a R package for flexible modeling in R using a Bayesian approach (building on top of Stan) that by default uses the NUTS MCMC sampler. The tidybayes, posterior, ggdist and various other packages are great for working with the results from such analyses.

I have no idea if that's what the OP intended but a patient is measured twice a day under both treatments. What does the model imply for the correlation between: two Y's on the same day; two Before Y's on consecutive days; one Before and After Y on two consecutive days (all for the same patient)? — dipetkov, Feb 23 '24 at 19:23
Yes a patient is measured twice a day and we can consider the two Y's on the same day are independent. 2 Before Y's on consecutive days can also be considered independent. And one Before and After Y on two consecutive days is also considered to be independent for all patients.
And it seems like you are suggesting using mixed effect model to study the variance /covariance of the parameters? Sorry I am not familiar with bayesian approach at all. — user1769197, Feb 28 '24 at 03:00
Can I ask which part of the regression output would indicate how much a particular input feature X is causing higher variability of Y ? — user1769197, Feb 28 '24 at 03:42

DrJerryTAO · Answer 2 · 2024-02-27T10:32:00.167

The data structure looks fine except Day needs recoding. I guess Day = 1 for Treatment = Before and Treatment = After are different days. Assuming Day = 5 when Treatment = Before is the last day before treatment, we can redefine Day = ifelse(Treatment == "Before", Day - 5, Day) so that Day ranges -4 to 5.

Björn gave the Bayesian approach. In frequentist mixed-effects modeling, there are also options to incorporate error-variance heterogeneity. Using nlme, a model could be

lme(Y ~ Treatment * Day + X1 + X2, random = ~ 1 | Subject, weights = varIdent(form = ~ 1 | Treatment))
gls(Y ~ Treatment * Day + X1 + X2, correlation = corAR1(form = ~ Day | Subject), weights = varIdent(form = ~ 1 | Treatment))

Random slopes of Day before and after treatment by Subject are worth testing. The random-intercept model is equivalent to GLS with a compound-symmetry correlation structure where correlation = corCompSymm(form = ~ 1 | Subject).

glmmTMB also estimate heteroscedasticity with random effects and correlated errors. See tutorials https://cran.r-project.org/web/packages/glmmTMB/vignettes/covstruct.html.

glmmTMB(Y ~ Treatment * Day + X1 + X2 + (1 | Subject), dispformula = ~ Treatment, family = gaussian)
glmmTMB(Y ~ Treatment * Day + X1 + X2 + ar1(0 + factor(Day) | Subject), dispformula = ~ Treatment, family = gaussian)

See a tutorial https://hbiostat.org/rmsc/long that includes both frequentist and Bayesian longitudinal data analysis.

Can I ask which part of the regression output would indicate how much a particular input feature X is causing higher variability of Y ? — user1769197, Feb 28 '24 at 03:42
In nlme prints, it is called "Variance function". In glmmTMB, it is the "Dispersion model". They use different parameterizations, so the reported coefficients are different. — DrJerryTAO, Feb 28 '24 at 05:03

Regression modelling: fitting a model to explain why variability of a group of subjects is higher than another

2 Answers2