13

I have figured out how to make a table in R with 4 variables, which I am using for multiple linear regressions. The dependent variable (Lung) for each regression is taken from one column of a csv table of 22,000 columns. One of the independent variables (Blood) is taken from a corresponding column of a similar table.

Each column represents the levels of a particular gene, which is why there are so many of them. There are also two additional variables (Age and Gender of each patient). When I enter in the linear regression equation, I use lm(Lung[,1] ~ Blood[,1] + Age + Gender), which works for one gene.

I am looking for a way to input this equation and have R calculate all of the remaining columns for Lung and Blood, and hopefully output the coefficients into a table.

Any help would be appreciated!

Nad Pat
  • 2,995
  • 3
  • 9
  • 19

4 Answers4

20

You want to run 22,000 linear regressions and extract the coefficients? That's simple to do from a coding standpoint.

set.seed(1)

# number of columns in the Lung and Blood data.frames. 22,000 for you?
n <- 5 

# dummy data
obs <- 50 # observations
Lung <- data.frame(matrix(rnorm(obs*n), ncol=n))
Blood <- data.frame(matrix(rnorm(obs*n), ncol=n))
Age <- sample(20:80, obs)
Gender  <- factor(rbinom(obs, 1, .5))

# run n regressions
my_lms <- lapply(1:n, function(x) lm(Lung[,x] ~ Blood[,x] + Age + Gender))

# extract just coefficients
sapply(my_lms, coef)

# if you need more info, get full summary call. now you can get whatever, like:
summaries <- lapply(my_lms, summary)
# ...coefficents with p values:
lapply(summaries, function(x) x$coefficients[, c(1,4)])
# ...or r-squared values
sapply(summaries, function(x) c(r_sq = x$r.squared, 
                                adj_r_sq = x$adj.r.squared))

The models are stored in a list, where model 3 (with DV Lung[, 3] and IVs Blood[,3] + Age + Gender) is in my_lms[[3]] and so on. You can use apply functions on the list to perform summaries, from which you can extract the numbers you want.

arvi1000
  • 8,833
  • 1
  • 36
  • 50
  • 1
    This is great! Can I ask how you're coding age and gender into the equation? The way I did it before was to make a table containing age/gender, and set AgeGender –  Jan 15 '15 at 14:32
  • Strike that, it was actually much less complicated than I was making it! It seems to be working, thank you so much. –  Jan 15 '15 at 15:14
  • Glad it was helpful. Feel free to 'accept' the answer! :) Remember you can get info on any function by running `?function_name` at the console (e.g. `?rbinom`) – arvi1000 Jan 15 '15 at 15:16
  • Just accepted it, thanks for the tip. You wouldn't happen to have any additional advice for extracting the p-values and R-squared using this function, would you? –  Jan 15 '15 at 16:43
  • take a look at ?summary.lm – arvi1000 Jan 15 '15 at 17:48
  • To try to get R squared, I've tried using summary(lm(Lung[,x] ~ Blood[,x] + Age + Gender))$r.squared, but it keeps saying "incorrect number of dimensions". Any recommendations? –  Jan 16 '15 at 14:44
  • @JHall1020 I know it's been ages, but have you figured a way to find p.values for coefs and r-squared? – Guu Mar 16 '17 at 19:38
  • Hi, browsing this because I have a similar problem to OP. Why is it that you need to set the seed here? What is random? – BonnieKlein Feb 23 '21 at 22:37
  • just so the dummy data section is reproducible (the `rnorm()` calls ) – arvi1000 Feb 24 '21 at 19:01
2

The question seems to be about how to call regression functions with formulas which are modified inside a loop.

Here is how you can do it in (using diamonds dataset):

attach(ggplot2::diamonds)
strCols = names(ggplot2::diamonds)

formula <- list(); model <- list()
for (i in 1:1) {
  formula[[i]] = paste0(strCols[7], " ~ ", strCols[7+i])
  model[[i]] = glm(formula[[i]]) 

  #then you can plot or do anything else with the result ...
  png(filename = sprintf("diamonds_price=glm(%s).png", strCols[7+i]))
  par(mfrow = c(2, 2))      
  plot(model[[i]])
  dev.off()
  }
IVIM
  • 1,703
  • 1
  • 12
  • 32
2

Sensible or not, to make the loop at least somehow work you need:

y<- c(1,5,6,2,5,10) # response 
x1<- c(2,12,8,1,16,17) # predictor 
x2<- c(2,14,5,1,17,17) 
predictorlist<- list("x1","x2") 
for (i in predictorlist){ 
  model <- lm(paste("y ~", i[[1]]), data=df) 
  print(summary(model)) 
} 

The paste function will solve the problem.

Sarwan Pasha
  • 91
  • 1
  • 1
  • 8
1

A tidyverse addition - with map()

Another way - using map2() from the purrr package:

library(purrr)

xs <- anscombe[,1:3] # Select variables of interest
ys <- anscombe[,5:7]

map2_df(ys, xs,
        function(i,j){
          m <- lm(i ~j + x4 , data = anscombe)
          coef(m)
        })

The output is a dataframe (tibble) of all coefficients:

  `(Intercept)`     j      x4
1          4.33 0.451 -0.0987
2          6.42 0.373 -0.253 
3          2.30 0.526  0.0518

If more variables are changing this can be done using the pmap() functions

Tfsnuff
  • 161
  • 6