My assignment question is quoted: "2. Which set of variables best predicts handgrip strength in women? a. Reduce the number of continuous variables before doing the analysis."

I do not really know how to reduce the number of continuous variables. The variables I have are:

  • Vnr = subject number
  • Sex = sex (0 = female, 1 = male)
  • Lft = age (years)
  • Leo = waist circumference (cm)
  • Sbd1 = systolic blood pressure (mmHg)
  • Dbd1 = diastolic blood pressure (mmHg)
  • Gluc = glucose (mg/dl)
  • Trig = triglycerides (mg/dl)
  • Hdl = HDL cholesterol (mg/dl)
  • Eberoepcal = energy expenditure during occupation (cal/wk)
  • Esportcal = energy expenditure during sport (cal/wk)
  • Eavtcal = energy expenditure during leisure time (cal/wk)
  • PAL = physical activity level
  • Tkijk = TV time (hrs/wk)
  • BIAP = body fat percentage via bioelectrical impedance (%)
  • RUSTP = resting pulse heart rate (bpm)
  • VO2A = max oxygen consumption, absolute (L/min)
  • HGR = hand grip strength (kg)

So, first I probably need to drop the variables that are not sufficiently correlated with 'handgrip strength', and then run the actual analysis on the variables that remain.

Secondly, which analysis would be the best to obtain 'the best set of predictors'?

I would solve this question in the following way:

  1. Dimensionality reduction: Apply a dimensionality reduction technique to the continuous variables to reduce their number. -> Principal Component Analysis (PCA).
  2. Predictive modeling: Use a predictive modeling technique to determine which of the reduced set of variables best predicts handgrip strength (HGR). -> Multiple linear regression.
  3. Model evaluation: Evaluate the model using appropriate metrics (such as Mean Squared Error for regression tasks) and cross-validation to check its predictive performance.
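
For concreteness, the three steps above could be sketched roughly as follows. This is only an illustration in Python with NumPy on synthetic stand-in data (not my dataset); the function name `pcr_cv_mse`, the number of folds, and the simulated coefficients are all made up for the example. In R the same steps would use `prcomp` and `lm`.

```python
import numpy as np

def pcr_cv_mse(X, y, n_components, n_folds=5, seed=0):
    """Cross-validated MSE of principal component regression (PCA, then OLS)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    squared_errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        X_train, X_test = X[train_idx], X[test_idx]
        # Standardize with training-fold statistics only, to avoid leakage.
        mean, sd = X_train.mean(axis=0), X_train.std(axis=0, ddof=1)
        Z_train, Z_test = (X_train - mean) / sd, (X_test - mean) / sd
        # Step 1: PCA via SVD; the rows of Vt are the component loadings.
        _, _, Vt = np.linalg.svd(Z_train, full_matrices=False)
        loadings = Vt[:n_components].T
        # Step 2: ordinary least squares of y on the component scores.
        A_train = np.column_stack([np.ones(len(train_idx)), Z_train @ loadings])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Step 3: squared prediction error on the held-out fold.
        A_test = np.column_stack([np.ones(len(test_idx)), Z_test @ loadings])
        squared_errors.append((y[test_idx] - A_test @ coef) ** 2)
    return float(np.mean(np.concatenate(squared_errors)))

# Synthetic stand-in data (NOT the assignment dataset): 120 subjects, 6 predictors.
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=120)  # two strongly correlated predictors
y = 30 + 4 * X[:, 0] + 2 * X[:, 2] + rng.normal(size=120)

# Cross-validated MSE for each possible number of retained components.
cv_mse = {k: pcr_cv_mse(X, y, n_components=k) for k in range(1, 7)}
for k, mse in cv_mse.items():
    print(f"{k} components: CV MSE = {mse:.2f}")
```

Note that this only tells me how many components predict well; it does not by itself say which of the original variables are the best predictors.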

Can anyone confirm this, or optimize it if I am wrong?

  • I'm working in R, by the way... – Nathan Vermaerke Dec 22 '23 at 13:08
  • You write that you must reduce the number of variables. But why must you do this? You put it in quotes, so maybe someone told you this, but, unless there is more context, I don't see any reason you would need (or even want) to do this. PCA is a method for reducing the number of variables. That's what it does. Also, if you do PCA and then regression on hand grip strength, you will not be able to answer the question you started with, as the PCs will be combinations of all the variables. Eliminating some variables before starting will make it worse, as you can't find those. – Peter Flom Dec 22 '23 at 13:19
  • Yes, I know; that is what is confusing me. I will quote my assignment question: "2. Which set of variables best predicts handgrip strength in women? a. Reduce the number of continuous variables before doing the analysis." Maybe I need to do forward/backward selection first, but I still find it weird... – Nathan Vermaerke Dec 22 '23 at 15:39
  • Sounds like a bad assignment to me. Is the professor a statistician? – Peter Flom Dec 22 '23 at 15:59
  • But maybe I do not have to do PCA? – Nathan Vermaerke Dec 22 '23 at 20:11
  • Unfortunately, the instructions you mention may be a bit ambiguous. Are there other instructions than those you mention? Maybe the ambiguity is solved in some additional text or presentation that you do not mention. Anyway, you can still conduct some analysis. You said you plan to conduct PCA; so you will have a reduced number of variables from this PCA, as explained by Peter Flom. What do you think you can do with these new variables, in order to predict handgrip strength in women? Can you think of any model to apply to this situation? – J-J-J Dec 23 '23 at 14:23
  • I think PCA is a way to reduce the number of variables, but then it becomes unclear which set of variables 'best predicts handgrip strength in women'. Therefore, I think I need something else to first reduce the number of variables (since I have 18 in my dataset), and then apply a predictor selection method, such as 'stepwise selection'. For clarity, I am provided with a dataset with 18 variables and the first question of my assignment is: "Which set of variables best predicts handgrip strength in women? Reduce the number of continuous variables before doing the analysis." – Nathan Vermaerke Dec 23 '23 at 15:34
  • I don't think that stepwise is a good idea, but others might disagree. Besides that, do you know that you can find the contribution of each of your 18 variables to each of the reduced variables resulting from PCA? See https://stats.stackexchange.com/questions/495342/pca-and-variable-contributions-to-first-n-dimensions Once you have applied a model to predict the outcome (using the reduced variables as predictors), you could use this property for further analysis. – J-J-J Dec 23 '23 at 16:39
  • (BTW, I have to leave for a few days so I won't be able to follow this discussion, but if you edit your question to include the details you mention in the comments, and give a bit more information about your dataset (e.g. is the outcome continuous? ordinal?) and your progress, it will attract the attention of other people who might be able to give you an answer.) – J-J-J Dec 23 '23 at 16:41
  • @J-J-J, I did perform a PCA and this resulted in 8 principal components (8 dimensions). But now I do not know how to use this to determine "which set of variables best predicts handgrip strength in women?" – Nathan Vermaerke Dec 23 '23 at 19:02
  • On the comment about using supervised learning (forward/backward variable selection where Y is explicitly used), note that such methods are notoriously unreliable and lead to overinterpretation. A good way to expose that, besides bootstrapping the entire process, is to compute confidence intervals for a variable importance measure such as partial $R^2$. See an example here. For this problem I'd consider sparse principal components, demonstrated in 2 chapters in RMS. – Frank Harrell Dec 24 '23 at 08:46
  • @FrankHarrell, thanks for the comment! Would it be better to use criterion-based methods after performing the PCA? Such as: t-tests of parameter estimates, model comparison using the F-test, R-squared, Mallows's Cp, and information criteria (e.g., AIC, BIC)? – Nathan Vermaerke Dec 24 '23 at 08:57
  • Section 4.7 of Frank Harrell's Regression Modeling Strategies outlines principled ways to approach this problem, without using the outcomes to select predictors. – EdM Dec 24 '23 at 19:59
  • @NathanVermaerke, all of those measures are supervised learning methods and essentially just rescale highly problematic p-values. – Frank Harrell Dec 25 '23 at 10:30
