I wrote a version of this question yesterday, and I think in my effort to be brief, I wasn't clear. So I'm trying again.
I have questions about using cross-validation to ensure the independence of the data used to define clusters in a task fMRI study from the data used to fit a regression predicting brain activation within those clusters from other experimental variables and a trial-by-trial behavioral response.
Data collection method: task fMRI + a behavioral measure (a Likert-scale response captured by button press).
Experimental design: a 3 x 2 x 2 factorial design with variables we call TargetType, Relevance, and Taught, respectively. All of these variables are within-subject. Relevance and Taught vary within run; TargetType varies between runs. Each subject did 12 runs in the scanner (4 for each TargetType, assuming no missing data), with 30 trials per run. A behavioral variable, Accuracy, is collected for each trial (values range from 1 to 6).
Analysis: First- and second-level whole-brain analyses of the fMRI data have already been performed, modeling every experimental condition at the first level and testing contrasts of those conditions at the second level. I also fit a mixed model for the behavioral data: Accuracy ~ TargetType * Relevance * Taught + (1|Subject/Run). Among other effects, there is a strong effect of Relevance in the fMRI data, yielding several significant clusters, and a strong effect of Relevance in the behavioral data.
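For reference, the behavioral model was fit with lme4, roughly like this (a minimal sketch; `behav` is just a stand-in name for my trial-level data frame):

```r
## Minimal lme4 sketch of the behavioral model; 'behav' is a stand-in for the
## trial-level data frame (one row per trial: Subject, Run, TargetType,
## Relevance, Taught, Accuracy).
library(lme4)

m_behav <- lmer(
  Accuracy ~ TargetType * Relevance * Taught + (1 | Subject/Run),
  data = behav
)
summary(m_behav)
```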
I am interested in the relationship between the Relevance effect in the brain and the Relevance effect in behavior. I constructed Relevance contrasts in both the brain and the behavioral data, averaged within each cluster and within run and Taught condition, to give the following model:
RelevanceContrast_brain ~ TargetType * Taught * RelevanceContrast_accuracy + (1|subject/run) + (1|subject:cluster)
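To make the unit of analysis concrete: the data for this model have one row per subject x run x Taught level x cluster, with the two Relevance contrasts as columns. A minimal lme4 sketch (`contrast_df` is a stand-in name for that table):

```r
## 'contrast_df': one row per subject x run x Taught level x cluster, with
## columns RelevanceContrast_brain and RelevanceContrast_accuracy.
library(lme4)

m_full <- lmer(
  RelevanceContrast_brain ~ TargetType * Taught * RelevanceContrast_accuracy +
    (1 | subject/run) + (1 | subject:cluster),
  data = contrast_df
)
summary(m_full)
```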
I have several questions about this analysis. Kriegeskorte et al. (2009) alert me that I can't use the same data to select clusters as I use to perform the regression onto the behavioral data. Following (with a variation) one of the recommendations in the supplement of that article, I split the runs ¾-¼, with ¾ used to generate the clusters and ¼ used to run the above regression. Since there is now only one run per subject per TargetType after this operation, the model becomes RelevanceContrast_brain ~ TargetType * Taught * RelevanceContrast_accuracy + (1|subject/cluster).
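In rough code, the split looks like this (`run_info` is a stand-in for a run-level table with subject, run, and TargetType columns; `define_clusters()` and the extraction of `contrast_heldout` are hypothetical stand-ins for the whole fMRI pipeline, which I'm not showing):

```r
library(dplyr)
library(lme4)
set.seed(1)

## Hold out one run per subject per TargetType (the 1/4); the remaining runs
## (the 3/4) go into cluster definition.
held_out <- run_info %>%
  group_by(subject, TargetType) %>%
  slice_sample(n = 1) %>%
  ungroup()
selection_runs <- anti_join(run_info, held_out, by = c("subject", "run"))

## Clusters are defined from the selection runs only (hypothetical wrapper
## around the first/second-level pipeline).
clusters <- define_clusters(selection_runs)

## Contrasts extracted from the held-out runs within those clusters give
## 'contrast_heldout'; the reduced model is then:
m_split <- lmer(
  RelevanceContrast_brain ~ TargetType * Taught * RelevanceContrast_accuracy +
    (1 | subject/cluster),
  data = contrast_heldout
)
```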
Upon running this model, I get several significant regression coefficients, including three-way interactions. These results are interpretable, and I could just stop here. The problem: I suspect this isn't a very stable estimate of these effects. If, as an experiment, I run the "illegitimate" regression using the original clusters and all of the data, the regression coefficients change substantially.
In the supplemental to the article linked above, Kriegeskorte writes:
Crossvalidation is a form of data splitting. (It thus falls under “independent split-data analysis” in Fig. 4.) When we split the data into two independent sets, we may designate one set as the selection (or training) set and the other set as the test set. Obviously the opposite assignment of the two sets would be equally justified. Since the two assignments will not yield identical results, we are motivated to perform the analysis for each assignment and combine the results statistically, for greater power. This approach is the simplest form of crossvalidation: a 2-fold crossvalidation. An n-fold crossvalidation generalizes this idea and allows us to use most of the data for selection (or training) and all of the data for selective analysis, while maintaining independence of the sets. For n-fold crossvalidation, we divide the data into n independent subsets. For each fold i=1..n, we use set i for selective analysis after using all other sets for selection (or training). Finally, the n selective analyses are statistically combined. An n-fold crossvalidation for n>2 potentially confers greater power than a 2-fold crossvalidation, because the n-fold crossvalidation provides more data for selection (or training) on each fold. Crossvalidation is a very general and powerful method widely used in statistical learning and pattern classification. However, it is somewhat cumbersome and computationally costly. While it is standard practice in pattern classification, it is not widely used for ROI definition in systems neuroscience. Perhaps it should be.
I'd like to do this. I see two possibilities: 4-fold cross-validation (since I've already decided to divide the data using a ¾-¼ scheme), or perhaps a repeated k-fold strategy (really repeated random subsampling) in which I take multiple overlapping random samples of our runs, perhaps 100, to generate clusters, and then use the excluded runs for the regression (this would be quite computationally expensive, among other concerns).
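For the 4-fold version, the skeleton I have in mind is something like this (again, `define_clusters()` and `extract_contrasts()` are hypothetical stand-ins for the fMRI pipeline; the fold assignment puts one run per subject per TargetType into each fold):

```r
library(dplyr)
library(lme4)
set.seed(1)

## Assign each run to one of 4 folds, balanced within subject and TargetType
## (assumes 4 runs per TargetType per subject, i.e. no missing runs).
run_info <- run_info %>%
  group_by(subject, TargetType) %>%
  mutate(fold = sample(seq_len(n()))) %>%
  ungroup()

fold_fits <- lapply(1:4, function(k) {
  selection_runs <- filter(run_info, fold != k)   # 3/4: cluster definition
  test_runs      <- filter(run_info, fold == k)   # 1/4: regression

  clusters    <- define_clusters(selection_runs)          # hypothetical
  contrasts_k <- extract_contrasts(test_runs, clusters)   # hypothetical

  lmer(
    RelevanceContrast_brain ~ TargetType * Taught * RelevanceContrast_accuracy +
      (1 | subject/cluster),
    data = contrasts_k
  )
})

## One column of fixed-effect estimates per fold.
coef_by_fold <- sapply(fold_fits, fixef)
```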
Here I arrive at several questions that are quite far outside my statistical expertise.
"Combine the results statistically" glosses over important details. Is it valid to take the mean of regression coefficients over all of the folds? How do we generate confidence intervals for the regression coefficients? The regression coefficients for each k-fold are really the result of two stochastic processes, the process that generated the clusters and the process that generated the regression coefficients. I am unclear on a valid strategy for characterizing the uncertainty around the regression coefficients whether using a 4-fold or a repeated k-fold strategy.
Is the repeated k-fold strategy legitimate?
Thanks in advance for any input.
The full reference for the paper I cited is:
Kriegeskorte, N., Simmons, W. K., Bellgowan, P. S. F., & Baker, C. I. (2009). Circular analysis in systems neuroscience: the dangers of double dipping. Nature Neuroscience, 12, 535–540.
Why subjects and not runs: my intuition is that people are more likely to have meaningful systematic differences than runs within people (I often get singularity warnings related to no variance being attributable to run, but not with subject). I want to capture that variance for a meaningful cluster definition and a meaningful regression. In any case, it's standard in the field, and it would be borrowing trouble to do it the other way. – Katie Jul 09 '22 at 12:24
When I say RelevanceContrast_accuracy: I take the mean accuracy within each subject, within each run, and within each level of the Taught variable, and normalize those values.
– Katie Jul 09 '22 at 12:52
"The first question to answer might be how to validate the clusters."
I'm not sure what you mean, but to clarify: my problem is that I am forbidden from conducting my regression on the same data I used to generate the clusters, so any approach to "validating the clusters" that defines clusters using all the data doesn't help. I'm looking for a way to use most of the data in the cluster-defining step and conduct the regression on an independent subset, but to do that multiple times, subsetting the data in different ways, to improve the accuracy of my regression coefficients.
– Katie Jul 09 '22 at 12:59
"If these agree (are not very different), then you are done"
This seems reasonable. The other thing that seemed like a possible strategy, from comments in the link I posted, is averaging the upper and lower limits of the four obtained confidence intervals.
– Katie Jul 09 '22 at 15:28