My data look like this:

> df
      ID gene     value   state milk meat veg pruits eggs water
1  ID001    1 0.3192414 healthy    2    3   1      1    2     1
2  ID001    2 0.2482127 healthy    2    3   1      1    2     1
3  ID001    3 0.7307897 healthy    2    3   1      1    2     1
4  ID001    4 0.4570438 healthy    2    3   1      1    2     1
5  ID001    5 0.3568216 healthy    2    3   1      1    2     1
6  ID001    6 0.7119822 healthy    2    3   1      1    2     1
7  ID002    1 0.2858343     ibd    4    2   5      3    4     2
8  ID002    2 0.7361333     ibd    4    2   5      3    4     2
9  ID002    3 0.4511978     ibd    4    2   5      3    4     2
10 ID002    4 0.4359156     ibd    4    2   5      3    4     2
11 ID002    5 0.2704967     ibd    4    2   5      3    4     2
12 ID002    6 0.8270154     ibd    4    2   5      3    4     2
13 ID003    1 0.2231976     ibd    4    2   5      3    4     2
14 ID003    2 0.9627712     ibd    4    2   5      3    4     2
15 ID003    3 0.1013859     ibd    4    2   5      3    4     2
16 ID003    4 0.3067185     ibd    4    2   5      3    4     2
17 ID003    5 0.1859476     ibd    4    2   5      3    4     2
18 ID003    6 0.5674584     ibd    4    2   5      3    4     2
> str(df)
'data.frame':   18 obs. of  10 variables:
 $ ID    : chr  "ID001" "ID001" "ID001" "ID001" ...
 $ gene  : num  1 2 3 4 5 6 1 2 3 4 ...
 $ value : num  0.319 0.248 0.731 0.457 0.357 ...
 $ state : chr  "healthy" "healthy" "healthy" "healthy" ...
 $ milk  : num  2 2 2 2 2 2 4 4 4 4 ...
 $ meat  : num  3 3 3 3 3 3 2 2 2 2 ...
 $ veg   : num  1 1 1 1 1 1 5 5 5 5 ...
 $ pruits: num  1 1 1 1 1 1 3 3 3 3 ...
 $ eggs  : num  2 2 2 2 2 2 4 4 4 4 ...
 $ water : num  1 1 1 1 1 1 2 2 2 2 ...

I want to determine which combination of diet variables (columns 5 to 10) has the most significant impact on the values of each gene, i.e. which model explains the most variance. The main goal is to identify the specific foods that influence the values of each gene (and the sample's state). I have tried R's prcomp, fviz_eig, fviz_pca_biplot, lmer, and lm functions, but I find it difficult to build so many models. The diet values are ordinal (1 = high consumption, 5 = low consumption).

The best result I can imagine is a strong correlation linking each gene's value and state to a combination of diets.

My real data contain over a million rows, and each ID has hundreds of genes, but this is basically what the data look like and what I want to do with them. Any help will be highly appreciated, thanks!
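One way to avoid writing each model by hand is to split the data by gene and fit the same formula to every subset. The following is only a minimal base R sketch under stated assumptions: `df_sim` is simulated stand-in data (the three IDs shown above are too few to fit anything), the diet scores are treated as numeric rather than ordinal, and a single prespecified formula with `state` and all six diet columns is used for every gene. The helper `fit_one_gene` is a hypothetical name introduced for illustration.

```r
# Simulated stand-in for the real data: same columns as df, random values.
set.seed(1)
n_id   <- 60
n_gene <- 4
diets  <- c("milk", "meat", "veg", "pruits", "eggs", "water")

# One row per subject: state and diet scores are constant within an ID.
subjects <- data.frame(
  ID    = sprintf("ID%03d", seq_len(n_id)),
  state = sample(c("healthy", "ibd"), n_id, replace = TRUE)
)
for (d in diets) subjects[[d]] <- sample(1:5, n_id, replace = TRUE)

# No shared column names, so merge() returns every ID x gene combination.
df_sim <- merge(subjects, data.frame(gene = seq_len(n_gene)))
df_sim$value <- runif(nrow(df_sim))

# Fit the same linear model to the rows of a single gene and
# return one row per model term.
fit_one_gene <- function(d) {
  fit <- lm(value ~ state + milk + meat + veg + pruits + eggs + water, data = d)
  co  <- summary(fit)$coefficients
  data.frame(
    gene     = d$gene[1],
    term     = rownames(co),
    estimate = co[, "Estimate"],
    p_value  = co[, "Pr(>|t|)"],
    row.names = NULL
  )
}

# One model per gene, stacked into a single table.
results <- do.call(rbind, lapply(split(df_sim, df_sim$gene), fit_one_gene))
results <- results[results$term != "(Intercept)", ]

# Many tests are run at once, so adjust the p-values (Benjamini-Hochberg here).
results$p_adj <- p.adjust(results$p_value, method = "BH")
head(results)
```

On the real data, `df_sim` would be replaced by `df`. Whether the diet scores should instead enter as factors or ordered factors, and whether `state` should interact with them, is a modelling decision this sketch does not settle.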

  • You "explain the most variance" by using all the variables. That's no basis for selecting variables. Please research our posts on model selection. Your stated goal of achieving a "strong" correlation isn't sufficiently quantitative to suggest objectively good answers. – whuber Aug 19 '23 at 16:15
  • This is generally not a good idea. This post https://stats.stackexchange.com/questions/8303/how-to-do-logistic-regression-subset-selection both says why it is not and shows a way to do it, if you insist. – Peter Flom Aug 19 '23 at 16:16
  • I'm struggling to understand how to perform model selection / variable selection. Should I use the LASSO from the "glmnet" library? Thanks! @whuber – Chemokine1 Aug 19 '23 at 19:18
  • I'm sorry, but not really. I read about the "glmnet" library and the LASSO in one of the top comments. Should I apply that? @PeterFlom – Chemokine1 Aug 19 '23 at 19:21
  • If you insist on an automated method, LASSO is perhaps best. – Peter Flom Aug 19 '23 at 19:31
  • I'm not insisting on an automated method, I just don't know how to solve it otherwise. Which statistical method would you recommend for approaching this problem? @PeterFlom – Chemokine1 Aug 20 '23 at 08:09
  • Not a statistical method, a substantive one of thinking about it and consulting subject matter experts. – Peter Flom Aug 20 '23 at 12:10
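For reference, the LASSO mentioned in the comments could be run for a single gene roughly as follows. This is a hedged sketch, not a recommendation: it assumes the glmnet package is installed, uses simulated data (`one_gene` is a hypothetical data frame with one row per ID, and the effect of `veg` is planted artificially), and treats the diet scores as numeric.

```r
library(glmnet)

# Simulated data for ONE gene: one row per ID (hypothetical values).
set.seed(1)
n     <- 200
diets <- c("milk", "meat", "veg", "pruits", "eggs", "water")
one_gene <- data.frame(state = sample(c("healthy", "ibd"), n, replace = TRUE))
for (d in diets) one_gene[[d]] <- sample(1:5, n, replace = TRUE)
one_gene$value <- 0.5 * one_gene$veg + rnorm(n)  # artificial signal on veg

# glmnet needs a numeric matrix; model.matrix() dummy-codes state.
# Drop the intercept column because glmnet adds its own.
x <- model.matrix(~ state + milk + meat + veg + pruits + eggs + water,
                  data = one_gene)[, -1]
y <- one_gene$value

# alpha = 1 is the LASSO penalty; lambda is chosen by cross-validation.
cvfit <- cv.glmnet(x, y, alpha = 1)

# Predictors with non-zero coefficients at the more conservative lambda.
cf       <- as.matrix(coef(cvfit, s = "lambda.1se"))
selected <- setdiff(rownames(cf)[cf[, 1] != 0], "(Intercept)")
print(selected)
```

The selected set depends on the random cross-validation folds and on the choice between `lambda.1se` and `lambda.min`, which is part of why the commenters above advise subject-matter reasoning over a purely automated selection.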
