Exploratory data analysis on a large dataset with hundreds of variables

Question

I'm working on longitudinal data with repeated measures for each subject and hundreds of variables. I would like to use a linear mixed model (LMM) to look at the mean response of each dependent variable at each time point. However, as the number of variables is huge, I would like to reduce it to the most meaningful variables. My exploratory data analysis so far includes descriptive statistics (frequency tables) and spaghetti plots. I also looked at the correlations between variables, although the resulting correlation matrix is too large to be visualized properly. My idea was to cut the number of variables by dropping those that are highly correlated with others.
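The correlation-based cut described above could be sketched roughly as follows. This is only an illustration under assumptions: the variable names and toy values are hypothetical, the 0.9 threshold is arbitrary, and the correlations are computed on pooled rows, which ignores the repeated-measures structure (computing them on per-subject means or per time point may be more appropriate).

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def prune_correlated(columns, threshold=0.9):
    """Greedy filter: keep a variable only if its |r| with every
    already-kept variable is below the threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: two near-duplicate indices and one unrelated index.
data = {
    "hr_mean": [60, 62, 65, 70, 72, 75],
    "hr_median": [61, 62, 66, 69, 73, 75],
    "skin_temp": [33.1, 32.8, 33.5, 32.9, 33.0, 33.4],
}
kept = prune_correlated(data, threshold=0.9)
print(kept)
```

Note that a greedy filter like this depends on column order: of two highly correlated variables, whichever comes first is the one retained.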

What else should I do? Are there convenient ways to look at longitudinal data besides LMMs when you have lots of variables?

Chances are there are formal techniques for doing this, such as versions of the Lasso penalty (though I can't tell what exactly off the top of my head). You may want to change your title to include variable selection or dimension reduction if this is your main aim, because exploratory data analysis is not the main road for achieving this. — Christian Hennig, Oct 18 '23 at 10:08
@ChristianHennig Does LASSO also work for longitudinal data? — Ed9012, Oct 18 '23 at 10:28
Do you have any theoretical framework regarding the relationships between the variables (at either between- or within-subject level)? Are you planning to use all variables at the within level in the linear mixed model? In general, a multilevel factor analysis might be useful. — Sointu, Oct 18 '23 at 10:32
@Ed9012 As I said, I expect so (or at least a version of it that may be called differently) but can't tell without doing more research than I have time for right now. I hope somebody else can explain this. — Christian Hennig, Oct 18 '23 at 10:35
@Sointu I have physiological indices collected from different types of medical instruments, so measures from the same instrument are likely to be correlated. I would like to model the mean response of each index at each time point. What do you mean by within level? — Ed9012, Oct 18 '23 at 10:45
Christian Hennig's suggestion may be a better route, but I meant using a factor analysis for dimension reduction, and since you have multilevel data, you'd need to use multilevel factor analysis. If I understand correctly, your instruments would represent the between level and individual observations (physiological indices) would represent the within level. Here's a good intro to multilevel FA (it's a pdf). — Sointu, Oct 18 '23 at 11:09
@Sointu However, the physiological variables are on a continuous scale; is it still a good idea to use FA? — Ed9012, Oct 18 '23 at 11:32
Factor analysis is for continuous variables ("factor" in factor analysis refers to a different concept than "factor" in, say, ANOVA). — Sointu, Oct 18 '23 at 11:38
Some other relevant questions: https://stats.stackexchange.com/questions/532215/can-you-use-lasso-for-variable-selection-of-fixed-effects-then-run-a-mixed-mode, https://stats.stackexchange.com/questions/90055/can-should-regularization-techniques-be-used-in-a-random-effects-model, https://stats.stackexchange.com/questions/271407/regularization-models-with-random-effects, https://stats.stackexchange.com/questions/105699/elastic-net-package-for-mixed-effects-models — kjetil b halvorsen, Oct 26 '23 at 15:48

0 Answers