My data looks like this:
    > df
          ID gene     value   state milk meat veg pruits eggs water
    1  ID001    1 0.3192414 healthy    2    3   1      1    2     1
    2  ID001    2 0.2482127 healthy    2    3   1      1    2     1
    3  ID001    3 0.7307897 healthy    2    3   1      1    2     1
    4  ID001    4 0.4570438 healthy    2    3   1      1    2     1
    5  ID001    5 0.3568216 healthy    2    3   1      1    2     1
    6  ID001    6 0.7119822 healthy    2    3   1      1    2     1
    7  ID002    1 0.2858343     ibd    4    2   5      3    4     2
    8  ID002    2 0.7361333     ibd    4    2   5      3    4     2
    9  ID002    3 0.4511978     ibd    4    2   5      3    4     2
    10 ID002    4 0.4359156     ibd    4    2   5      3    4     2
    11 ID002    5 0.2704967     ibd    4    2   5      3    4     2
    12 ID002    6 0.8270154     ibd    4    2   5      3    4     2
    13 ID003    1 0.2231976     ibd    4    2   5      3    4     2
    14 ID003    2 0.9627712     ibd    4    2   5      3    4     2
    15 ID003    3 0.1013859     ibd    4    2   5      3    4     2
    16 ID003    4 0.3067185     ibd    4    2   5      3    4     2
    17 ID003    5 0.1859476     ibd    4    2   5      3    4     2
    18 ID003    6 0.5674584     ibd    4    2   5      3    4     2

    > str(df)
    'data.frame': 18 obs. of 10 variables:
     $ ID    : chr "ID001" "ID001" "ID001" "ID001" ...
     $ gene  : num 1 2 3 4 5 6 1 2 3 4 ...
     $ value : num 0.319 0.248 0.731 0.457 0.357 ...
     $ state : chr "healthy" "healthy" "healthy" "healthy" ...
     $ milk  : num 2 2 2 2 2 2 4 4 4 4 ...
     $ meat  : num 3 3 3 3 3 3 2 2 2 2 ...
     $ veg   : num 1 1 1 1 1 1 5 5 5 5 ...
     $ pruits: num 1 1 1 1 1 1 3 3 3 3 ...
     $ eggs  : num 2 2 2 2 2 2 4 4 4 4 ...
     $ water : num 1 1 1 1 1 1 2 2 2 2 ...

I want to determine which combination of diet variables (columns 5 to 10) has the most significant impact on the values of each gene, that is, which model explains the most variance. The main goal is to identify the specific foods that influence the values of each gene (and the sample's state). I have tried R's prcomp, fviz_eig, fviz_pca_biplot, lmer, and lm functions, but I find it difficult to create so many models by hand. The diet values are ordinal (1 = high consumption, 5 = low consumption).
The ideal result would be a strong association between each gene's value and state on the one hand, and a combination of diets on the other.
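To make the "so many models" problem concrete, here is a minimal sketch of the kind of per-gene loop I have in mind. It is only an illustration under several assumptions: the data are simulated (30 hypothetical IDs, 3 genes, random diet scores and values), the ordinal diet scores are treated as plain numbers, and backward selection by AIC with step() is used as a stand-in for "finding the best combination". The names sim, subjects, fit_one_gene and results are made up for this example.

```r
# Simulated stand-in for the real data (assumption, not my actual values)
set.seed(42)
n_id      <- 30
n_gene    <- 3
diet_cols <- c("milk", "meat", "veg", "pruits", "eggs", "water")

# One state and one set of diet scores (1-5) per ID
subjects <- data.frame(
  ID    = sprintf("ID%03d", seq_len(n_id)),
  state = rep(c("healthy", "ibd"), length.out = n_id)
)
for (col in diet_cols) {
  subjects[[col]] <- sample(1:5, n_id, replace = TRUE)
}

# Long format like df above: one row per ID x gene (Cartesian product)
sim <- merge(subjects, data.frame(gene = seq_len(n_gene)), by = NULL)
sim$value <- runif(nrow(sim))

# Full model: value ~ state + milk + meat + veg + pruits + eggs + water
full_formula <- reformulate(c("state", diet_cols), response = "value")

# Fit the full model for one gene, then drop terms by AIC (backward selection)
fit_one_gene <- function(d) {
  full <- lm(full_formula, data = d)
  best <- step(full, trace = 0)
  data.frame(
    gene   = d$gene[1],
    terms  = paste(attr(terms(best), "term.labels"), collapse = " + "),
    adj_r2 = summary(best)$adj.r.squared
  )
}

# One row per gene: the retained terms and the adjusted R-squared
results <- do.call(rbind, lapply(split(sim, sim$gene), fit_one_gene))
print(results)
```

Since the simulated values are pure noise, the selected terms here mean nothing; the point is only the shape of the loop (split by gene, fit, select, collect one summary row per gene).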
My real data contains over a million rows, and each ID has hundreds of genes, but this is basically what it looks like and what I want to do with it. Any help will be highly appreciated, thanks!