
I have four regressions I performed on four disjoint subsets of my data (for background on why I did this, see this question). Happily, they all look quite similar. I'm glad to put all four sets of regression coefficients in a table, but I would also like to present one composite analysis with one graph and one set of confidence intervals. It seems to me that it should be okay to average the regression coefficients and the endpoints of the confidence intervals (here's someone else saying they did that in another circumstance).
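For concreteness, here is a minimal sketch of the averaging I have in mind, assuming four already-fitted model objects m1 through m4 with lme4-style accessors (the object names are placeholders, and I'm not sure confint works directly on rlmer fits, in which case Wald intervals would have to be built from the coefficient table):

    library(lme4)

    # The four fits on the disjoint subsets (hypothetical names)
    fits <- list(m1, m2, m3, m4)

    # Element-wise average of the fixed-effect coefficients
    coef_avg <- Reduce(`+`, lapply(fits, fixef)) / length(fits)

    # Element-wise average of the Wald confidence-interval endpoints
    # (parm = "beta_" restricts the intervals to the fixed effects)
    ci_avg <- Reduce(`+`, lapply(fits, function(m)
        confint(m, parm = "beta_", method = "Wald"))) / length(fits)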

Here's my whole model:

    RelevanceContrast_brain ~ TargetType * Taught *
        RelevanceContrast_accuracy + (1 | subject/cluster)
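
For reference, here is a sketch of how each subset's fit looks with robustlmm's rlmer, which I mention in the comments below (the data-frame name is hypothetical):

    library(robustlmm)

    # dat1 is one of the four disjoint subsets (hypothetical name)
    m1 <- rlmer(RelevanceContrast_brain ~ TargetType * Taught *
                    RelevanceContrast_accuracy + (1 | subject/cluster),
                data = dat1)
    summary(m1)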

RelevanceContrast_brain is a continuous brain measurement, TargetType and Taught are categorical variables, and RelevanceContrast_accuracy is a behavioral response on a Likert scale. Relevance was a third categorical variable in the experiment; "contrast" refers to subtracting values across the levels of Relevance.
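To illustrate what I mean by the contrast, here is a hypothetical sketch of its construction, assuming long-format data and a two-level Relevance factor (the data-frame, column, and level names are all placeholders):

    library(tidyr)
    library(dplyr)

    # Hypothetical long format: one brain value per Relevance level
    # within each subject / cluster / condition combination
    contrast_df <- raw %>%
        pivot_wider(names_from = Relevance, values_from = brain) %>%
        mutate(RelevanceContrast_brain = relevant - irrelevant)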

It seems important to me that my predictor and response variables be divided by their standard deviations so that the regression coefficients are comparable across datasets; there are probably small differences in their variances due to sampling error. My first question is: must they also be mean-centered? I ask because my predictor and response variables are both contrasts, and while I have been fully normalizing both variables up to now, I think that has been making the effects harder for human beings to interpret. The temptation is strong to look at a graph, see a slope at a negative value for one level of a categorical variable and a positive value for another, and interpret that as a sign flip in the contrast. I'm just not sure whether mean-centering is also a functional requirement for making the coefficients mean the same thing across the four regressions.
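To make the two options concrete (the variable and data-frame names are placeholders):

    # Scale by the SD only: zero still means "no contrast", so the sign
    # of a plotted slope or coefficient keeps its substantive meaning
    dat$acc_scaled <- dat$RelevanceContrast_accuracy /
        sd(dat$RelevanceContrast_accuracy)

    # Full z-scoring also subtracts the mean, so zero no longer marks
    # "no contrast" and apparent sign flips become harder to read
    dat$acc_z <- as.numeric(scale(dat$RelevanceContrast_accuracy))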

Katie
  • I'm not sure whether you can average the regression coefficients. Have you run a check, such as comparing the coefficients from training on the full dataset against the averaged ones? And the same for the confidence intervals? – Allan Jul 20 '22 at 13:47
  • Hi Allan, thanks for the engagement. I'm not 100% sure I understand what you mean (and keep in mind that "training" isn't what I'm doing), but if you're asking whether I ran the regression on all the data: no, I didn't, both for the reasons outlined in the linked post and, to be honest, because rlmer crashes with my full dataset. I'm lucky that it manages to crank out my 1/4 datasets. – Katie Jul 20 '22 at 21:27
  • Hello, Katie. By "training" I just mean running the regression. What I tried to highlight is that I don't know whether averaging the coefficients of sub-samples is a valid estimator of the global coefficients. If it is, you would need to demonstrate that first, but I don't know. And if computational resources are the problem, you could use more RAM. A reference: https://towardsdatascience.com/how-to-run-rstudio-on-aws-in-under-3-minutes-for-free-65f8d0b6ccda – Allan Jul 20 '22 at 23:51
  • Hi Allan, I actually can't. I don't know what's up with the robustlmm package, but last year I tried provisioning a machine on AWS (using that same link as a guide) with almost arbitrary RAM, and the same thing happened when I tried to run it on my full dataset. – Katie Jul 21 '22 at 10:07
  • One way would be to implement the algorithm from scratch, or to debug inside the package to assess where the code is breaking. – Allan Jul 21 '22 at 14:43
  • Relative to my goal of finishing this paper, that would definitely be yak shaving. But as it happens, I've reflected more and I think it will actually be more persuasive to present the results of the four regressions separately, side by side. They look so similar that it really increases one's confidence in the result. So maybe this is just a question I don't need to answer. – Katie Jul 23 '22 at 16:38

0 Answers