
I wrote a version of this question yesterday, and I think in my effort to be brief, I wasn't clear. So I'm trying again.

I have questions about a cross-validation scheme for ensuring that the data used to define clusters in a task fMRI study are independent of the data used to fit a regression predicting brain activation within those clusters from other experimental variables and a trial-by-trial behavioral response.

Data collection method: task fMRI + a behavioral measure (a Likert-scale response captured by button press).

Experimental design: a 3 x 2 x 2 factorial design, with variables we call TargetType, Relevance, and Taught, respectively. All of these variables are within-subject. Relevance and Taught vary within run; TargetType varies between runs. Each subject did 12 runs in the scanner (4 for each TargetType, assuming no missing data), with 30 trials per run. There is also a behavioral variable, Accuracy, collected on every trial (it ranges from 1 to 6).

Analysis: First- and second-level whole-brain analyses of the fMRI data have already been performed, modeling every experimental condition at the first level and testing contrasts of those conditions at the second level. I also fit a mixed model to the behavioral data (Accuracy ~ TargetType * Relevance * Taught + (1|Subject/Run)). Among other effects, there is a strong effect of Relevance in the fMRI data, yielding several significant clusters, and a strong effect of Relevance in the behavioral data.
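
This is a minimal sketch of how that behavioral model could be fit with lme4 in R (the file and data frame here are hypothetical stand-ins for my trial-level data, one row per trial):

    library(lme4)

    # Hypothetical trial-level data: one row per trial, with factor columns
    # Subject, Run, TargetType, Relevance, Taught, and the Accuracy rating (1-6).
    behav <- read.csv("behavioral_trials.csv")

    m_behav <- lmer(
      Accuracy ~ TargetType * Relevance * Taught + (1 | Subject/Run),
      data = behav
    )
    summary(m_behav)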

I am interested in the relationship between the relevance effect in the brain and the relevance effect in behavior. I constructed Relevance contrasts in both the brain data (averaged over a single cluster) and the behavioral data, averaged within run and Taught condition, to give the following model:

RelevanceContrast_brain ~ TargetType * Taught * RelevanceContrast_accuracy + (1|subject/run) + (1|subject:cluster)
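
In lme4 syntax, and assuming a (hypothetical) data frame with one row per subject x run x Taught level x cluster containing the two contrast variables, this would look roughly like:

    library(lme4)

    # Hypothetical cluster-averaged data: one row per
    # subject x run x Taught level x cluster.
    d <- read.csv("cluster_contrasts.csv")

    m_full <- lmer(
      RelevanceContrast_brain ~
        TargetType * Taught * RelevanceContrast_accuracy +
        (1 | subject/run) + (1 | subject:cluster),
      data = d
    )
    summary(m_full)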

I have several questions about this analysis. Kriegeskorte et al. (2009) alert me that I can't use the same data to select clusters that I then use to perform the regression onto the behavioral data. Following (with a variation) one of the recommendations in the supplement of that article, I split the runs ¾-¼, with ¾ used to generate the clusters and ¼ used to run the above regression, except that, since only one run per subject remains for the regression after this operation, the model becomes RelevanceContrast_brain ~ TargetType * Taught * RelevanceContrast_accuracy + (1|subject/cluster)
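
Here is a sketch of how I could make such a split while keeping every TargetType represented in the held-out quarter (continuing with the data frame d from the sketch above, and assuming 4 runs per subject per TargetType):

    library(dplyr)

    set.seed(123)
    # One row per subject x run, with its TargetType; hold out one of the
    # (assumed 4) runs in each subject x TargetType cell for the regression.
    run_split <- d %>%
      distinct(subject, run, TargetType) %>%
      group_by(subject, TargetType) %>%
      mutate(held_out = seq_along(run) == sample(seq_along(run), 1)) %>%
      ungroup()

    cluster_runs    <- filter(run_split, !held_out)  # 3/4: define clusters
    regression_runs <- filter(run_split, held_out)   # 1/4: fit the regression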

Upon running this model, I get several significant regression coefficients, including three-way interactions. These results are interpretable, and I could just stop here. The problem: I suspect this isn't a very stable estimate of these effects. If, as an experiment, I run the "illegitimate" regression using the original clusters and all the data, the regression coefficients change substantially.

In the supplemental to the article linked above, Kriegeskorte writes:

Crossvalidation is a form of data splitting. (It thus falls under “independent split-data analysis” in Fig. 4.) When we split the data into two independent sets, we may designate one set as the selection (or training) set and the other set as the test set. Obviously the opposite assignment of the two sets would be equally justified. Since the two assignments will not yield identical results, we are motivated to perform the analysis for each assignment and combine the results statistically, for greater power. This approach is the simplest form of crossvalidation: a 2-fold crossvalidation. An n-fold crossvalidation generalizes this idea and allows us to use most of the data for selection (or training) and all of the data for selective analysis, while maintaining independence of the sets. For n-fold crossvalidation, we divide the data into n independent subsets. For each fold i=1..n, we use set i for selective analysis after using all other sets for selection (or training). Finally, the n selective analyses are statistically combined. An n-fold crossvalidation for n>2 potentially confers greater power than a 2-fold crossvalidation, because the n-fold crossvalidation provides more data for selection (or training) on each fold. Crossvalidation is a very general and powerful method widely used in statistical learning and pattern classification. However, it is somewhat cumbersome and computationally costly. While it is standard practice in pattern classification, it is not widely used for ROI definition in systems neuroscience. Perhaps it should be.

I'd like to do this. I see two possibilities: 4-fold cross-validation (since I've already decided to divide the data using a ¾-¼ scheme), or perhaps a repeated k-fold strategy in which I take multiple overlapping random samples of the runs, perhaps 100, to generate clusters, and then use the excluded runs for the regression (this would be quite computationally expensive, among other concerns).
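
As a sketch, the regression half of the 4-fold version might look like this, assuming the imaging pipeline has already produced, for each fold, a cluster-averaged data frame from the held-out runs (fold_data below is a hypothetical placeholder for that output; producing it means rerunning the cluster-defining analysis once per fold):

    library(lme4)

    # fold_data: a hypothetical list of 4 data frames, one per fold, each holding
    # cluster-averaged contrasts for the held-out quarter of runs, with clusters
    # defined on the other three quarters.
    fold_fits <- lapply(fold_data, function(dk) {
      lmer(
        RelevanceContrast_brain ~
          TargetType * Taught * RelevanceContrast_accuracy +
          (1 | subject/cluster),
        data = dk
      )
    })

    # One row of fixed-effect estimates per fold
    fold_coefs <- t(sapply(fold_fits, fixef))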

Here I arrive at several questions that are quite far outside my statistical expertise.

"Combine the results statistically" glosses over important details. Is it valid to take the mean of regression coefficients over all of the folds? How do we generate confidence intervals for the regression coefficients? The regression coefficients for each k-fold are really the result of two stochastic processes, the process that generated the clusters and the process that generated the regression coefficients. I am unclear on a valid strategy for characterizing the uncertainty around the regression coefficients whether using a 4-fold or a repeated k-fold strategy.

Is the repeated k-fold strategy legitimate?

Thanks in advance for any input.

The full reference for the paper I cited is:

Kriegeskorte, N., Simmons, W. K., Bellgowan, P. S. F., & Baker, C. I. (2009). Circular analysis in systems neuroscience: the dangers of double dipping. Nature Neuroscience, 12, 535–540.

Katie
  • You split the data so that you use all of the subjects to come up with the clusters and all of the subjects to fit the regression. Isn't it more convincing to split the subjects instead? That being said, I don't understand what you mean by clusters and relevance contrasts. The first question to answer might be how to validate the clusters. – dipetkov Jul 08 '22 at 17:54
  • @dipetkov Thanks for the correction on the citation; I thought I had included a link -- I don't know what happened to it. I'll edit. As for your questions:

    Why I split runs rather than subjects: my intuition is that people are more likely to have meaningful systematic differences than runs within people (I often get singularity warnings related to no variance being attributable to run, but not with subject). I want to capture that variance for meaningful cluster definition and a meaningful regression. In any case, it's standard in the field, and it would be borrowing trouble to do it the other way.

    – Katie Jul 09 '22 at 12:24
  • Clusters: in task fMRI, you run two levels of analyses. One, per subject, regresses the BOLD signal on a time series of the events in the experiment. To simplify, these regressors are in one of two relevance categories, R1 or R2. You then generate contrasts for each subject with a subtraction of the resulting betas at each voxel in the brain, Beta_R1-Beta_R2. You then do t-tests over all subjects and all voxels to find voxels where the contrast differs from 0, with corrections for multiple comparisons that treat adjacent positives as more likely real: adjacent activations are clusters. – Katie Jul 09 '22 at 12:48
  • When I say RelevanceContrast_brain, I mean that I get values for the R1 and R2 betas within each subject, within each run, and within each level of the Taught variable, and I subtract R2-R1 (and normalize them). That's that variable.

    When I say RelevanceContrast_accuracy, I get values for the mean accuracy within each subject, within each run, and within each level of the Taught variable, and normalize them.

    – Katie Jul 09 '22 at 12:52
  • The first question to answer might be how to validate the clusters.

    I'm not sure what you mean, but to clarify: my problem is that I am forbidden from conducting my regression on the same data I used to generate the clusters. So any approach to "validating the clusters" that defines clusters using all the data doesn't help. I'm looking for a way to use most of the data in the cluster-defining step and conduct the regression on an independent subset of the data, but to do that multiple times, subsetting the data in different ways, to improve the accuracy of my regression coefficients.

    – Katie Jul 09 '22 at 12:59
  • I think this CV question is pretty relevant: https://stats.stackexchange.com/questions/69831/confidence-intervals-for-cross-validated-statistics – Katie Jul 09 '22 at 12:59
  • Thank you for the additional explanation; it's helpful. I understand the need to validate the analysis; I'm just not sure cross-validation is the way to do it; see here. What I mean by validating the clusters is: if you split the data k times and repeat the whole analysis k times, do you get the same k clusterings? Your question is about combining regression coefficient estimates, but actually we want to validate every step of the analysis. – dipetkov Jul 09 '22 at 13:17
  • The nested cross-validation approach mentioned in Confidence intervals for cross-validated statistics is computationally expensive, so perhaps not practical in your case. You also have a factorial design that you'll "break" if you split randomly. – dipetkov Jul 09 '22 at 13:50
  • I'm wondering if the following will be reasonably convincing: Split the data either 2 or 4 times, so that you have, for each subject, a run of each TargetType in both splits. Then replicate the entire analysis (clustering + regression) and simply report either 2 or 4 estimates of the regression coefficients. If these agree (are not very different), then you are done; otherwise, you at least know that there is a lot of variability, so 2 or 4 splits are not enough anyway and no method for combining the results of 4 replicates will give reliable results. – dipetkov Jul 09 '22 at 13:51
  • "if you split the data k times and repeat the whole analysis k times, do you get the same k clusterings?" No, not exactly. This is why I allocate 3/4 of the data to cluster definition -- because a minimal condition for meaningful results is getting similar clusters. When I allocate 3/4 of the data to cluster definition, they are similar to the clusters I get using all data. As for breaking my factorial design, I've been splitting pseudorandomly, in that in a 3/4-1/4 split, one run within subject and within Taught condition goes into the regression data (as you describe in your last comment) – Katie Jul 09 '22 at 15:21
  • If these agree (are not very different), then you are done

    This seems reasonable. The other thing that seemed like a possible strategy from comments in the link I posted is averaging the upper and lower limits of the four obtained confidence intervals.

    – Katie Jul 09 '22 at 15:28

1 Answer


I'm not sure this will answer your question, but to my understanding, here is what I would do:

If I want to find the relationship between the variables and the target/response, these are the methods I am aware of:

  1. Sequential Forward Selection - finds the top n variables/columns/features that give the best result, starting from 1 variable and building up to n variables.
  2. Sequential Backward Selection - finds the top n variables/columns/features that give the best result, starting from all variables and removing down to n variables (the number of variables you want to keep/use).
  3. Finding how certain you are that a given variable is consistently important (a rough R sketch follows this list):
    • Do the bootstrapping (say) 100 times.
    • I assume that you are using logistic regression for this.
    • Train the model, fit the data, find the coefficients (also called feature importances), and store them.
    • Do the above step for every bootstrap sample.
    • Now you have 100 coefficients for every variable/feature.
    • Find the confidence interval of these 100 coefficients.
    • The smaller the variance, the more certain this coefficient is in predicting the response.
    • Make a separate list of these features/variables.
    • Build a new model and use only these features to make predictions. Why? Because we observed that the coefficients of these features have very low variance.
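
For example, here is a rough sketch of step 3 in R; the data below are simulated only so the sketch runs, and you would substitute your own response and predictors:

    set.seed(1)
    n_boot <- 100

    # Simulated stand-in data: a binary response y and three predictors.
    df <- data.frame(y  = rbinom(200, 1, 0.5),
                     x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))

    # Refit a logistic regression on 100 bootstrap resamples and keep the
    # coefficients from each fit.
    boot_coefs <- t(replicate(n_boot, {
      idx <- sample(nrow(df), replace = TRUE)
      coef(glm(y ~ x1 + x2 + x3, family = binomial, data = df[idx, ]))
    }))

    # Percentile intervals across the bootstrap fits; the narrower the interval,
    # the more consistently that coefficient contributes to the prediction.
    apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975))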

Now, if you use a different model to train (suppose you decide on a DecisionTree next), then this might give you different significant features. Why? The DT will give you the features that the DT deems significant for prediction.

otaku
  • I really appreciate the engagement. I'm not sure I understand this as an answer to my question, and in fact, I think my question has changed since I asked it. I'm trying to decide whether to delete this question and ask a new one. In this question, I was trying to understand whether there was any problem with averaging several regression coefficients in a model across data splits. I think it probably is. Averaging is generally what you do to stabilize an estimate -- it's just the fact that it's embedded in a set of relationships that confuses me. – Katie Jul 06 '22 at 00:32
  • My new question is that I'm not sure of the right way to generate CI's. Basically I'm flummoxed because what I want to do sort of looks like a lot of standard techniques, but isn't exactly any of them. – Katie Jul 06 '22 at 00:34
  • Should you average the coefficients? My answer: no. Why? Because if a variable is not significant in prediction, averaging will be the same as doing linear regression. Should you average coefficients produced across the k folds? Answer: no. Why? Because the pattern found in the kth fold might be different from the pattern in the other folds. The whole purpose of k-fold validation is to find parameters that give consistently accurate predictions. – otaku Jul 06 '22 at 05:18
  • Thanks again! I think I haven't explained my situation well, and maybe I should delete this question and start again. I used the term "k-fold cross-validation" in part because of the paper whose recommendations I am trying to follow, but it's really an analogy. I'm not training, testing, or making predictions. I tried to clarify that, but obviously not well enough. I am using one segment of the data to generate brain regions, and another segment of the data to perform a regression on data that comes from those brain regions. – Katie Jul 06 '22 at 09:45
  • All I am trying to do is get a stable estimate of the effects that are in all of my data, as if I had run a linear regression with all of my data, without violating the requirement that, for any given regression, I must have selected the regions of the brain using data that is independent of what I use for the regression. Prediction or anything of the sort is far beyond me -- really, I'm sure this study is underpowered to investigate the complicated relationships I'm seeing, but that only gets worse when I use a fraction of the data. – Katie Jul 06 '22 at 09:56
  • Ok, I've totally rewritten my question. Hopefully it's clearer now. – Katie Jul 06 '22 at 14:31