
I'm totally new to R, and I hope you can help me. I have a dataset with 3 different outcomes, basic clinical details (age, gender, previous infections), 3 numerical variables, and 5218 other variables (factors) representing nucleotide substitutions. This is the regression I have to use:

glm(pl ~ sero + gender + age + pca1n + pca2n + pca3n + x##nt, data = dataset, family = binomial)

where:

  • pl: outcome of interest
  • sero: serology
  • age, gender
  • pca1n-pca3n: numerical values generated by principal components
  • x##nt: nucleotide at position ##. It captures mutations at position "##", from position 6 up to 10181. There are 5218 different x##nt variables, all factors with A, C, T, G as potential levels.

Here is a description of my dataset: dataset

I have read tons of similar posts and found two possible solutions:

  1. (How to make loop for one-at-a time logistic regression in R?), using this:
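library(broom) # provides tidy()

pred <- names(mydata)[grepl('rs', names(mydata))] # get all predictors that contain 'rs'

pred <- c(pred, paste0(pred, ' + age + sex')) # unadjusted and adjusted versions of each predictor

# fit one logistic regression per entry of pred and stack the tidy coefficient tables
purrr::map_dfr(seq_along(pred), function(i) data.frame(model = i, tidy(glm(as.formula(paste0('casecontrol ~ ', pred[i])), data = mydata, family = binomial))))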
  2. I can't find the link, but this was another solution:
fit <- lapply(log2905f[,13:5230], function(x) glm(pl ~ sero + gender + age + pca1n + pca2n + pca3n + x, data = log2905f, family = binomial))

However, both reported the same error:

> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels

I assumed the problem was in the dataset, so I have checked many times for factors with only 1 level, and there are none. I also read that NA values could be a source of the problem, but once all NAs are removed the same error is still reported.

I also used this: How to debug "contrasts can be applied only to factors with 2 or more levels" error?

Interestingly, when I run it, some variables are flagged as having only one level, but when I inspect them they definitely have more than 2 levels. It is also interesting that when I run glm on the variables one at a time, even on the ones flagged as problematic by the debug function, glm runs without problems. In summary:
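One possible explanation for that mismatch, sketched on toy data: glm() drops every row that has an NA in any variable of the model and then drops unused factor levels, so a factor with several declared levels can be left with a single observed level in the rows that are actually used. The data frame toy and the helper observed_levels below are hypothetical and only illustrate the check; they are not taken from my real dataset.

```r
# Toy data (hypothetical): x2nt has three declared levels, but only "A" is
# observed in the rows where age is not missing.
toy <- data.frame(
  pl   = c(0, 1, 0, 1, 1, 0),
  age  = c(30, 41, NA, 55, 38, NA),
  x1nt = factor(c("A", "C", "A", "G", "C", "A")),
  x2nt = factor(c("A", "A", "C", "A", "A", "G"))
)

# Hypothetical helper: number of levels actually observed for `snp` among
# the rows that are complete for `snp` and for all other model variables.
observed_levels <- function(data, snp, other_vars) {
  keep <- complete.cases(data[, c(other_vars, snp), drop = FALSE])
  nlevels(droplevels(data[keep, snp]))
}

declared <- nlevels(toy$x2nt)                           # levels stored in the factor
observed <- observed_levels(toy, "x2nt", c("pl", "age")) # levels left after NA removal
```

In this toy example declared is 3 while observed is 1, which is the situation that triggers the contrasts error even though levels() shows several levels.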

glm(pl ~ sero + gender + age + pca1n + pca2n + pca3n + x##nt, data = dataset, family = binomial)

Only x##nt will change for each glm. Any idea how to solve this? Please, I'm desperate: I need to submit my thesis in 2 months and this is seriously delaying my progress!
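To make the goal concrete, here is a minimal sketch of the kind of loop I am after, on toy data. It assumes the SNP columns can be selected by a name pattern, and it skips any model in which a factor has fewer than 2 observed levels after incomplete rows are removed. The helper fit_one_snp, the toy data frame, and the reduced covariate set (age and gender only) are hypothetical.

```r
# Hypothetical helper: fit one adjusted logistic regression for a single SNP
# column, or return NULL when the model cannot be fitted.
fit_one_snp <- function(snp, data, outcome, covariates) {
  vars <- c(outcome, covariates, snp)
  # Keep only rows complete for this model, then drop unused factor levels.
  rows <- complete.cases(data[, vars, drop = FALSE])
  used <- droplevels(data[rows, vars, drop = FALSE])
  # Skip the model if any factor is left with fewer than 2 observed levels.
  n_lev <- vapply(Filter(is.factor, used), nlevels, integer(1))
  if (any(n_lev < 2)) return(NULL)
  glm(reformulate(c(covariates, snp), response = outcome),
      data = used, family = binomial)
}

# Toy data (hypothetical): x8nt declares two levels but only "A" is observed.
set.seed(1)
n <- 200
toy <- data.frame(
  pl     = rbinom(n, 1, 0.5),
  age    = round(rnorm(n, 40, 10)),
  gender = factor(sample(c("F", "M"), n, replace = TRUE)),
  x6nt   = factor(sample(c("A", "C"), n, replace = TRUE)),
  x7nt   = factor(sample(c("A", "G", "T"), n, replace = TRUE)),
  x8nt   = factor(rep("A", n), levels = c("A", "C"))
)

# Select the SNP columns by name and fit one model per column.
snps <- grep("^x[0-9]+nt$", names(toy), value = TRUE)
fits <- lapply(setNames(snps, snps), fit_one_snp,
               data = toy, outcome = "pl", covariates = c("age", "gender"))

# Names of the SNP columns whose model was skipped.
skipped <- names(fits)[vapply(fits, is.null, logical(1))]
```

The fitted models can then be inspected one by one, for example with coef(summary(fits$x6nt)), and skipped lists the columns that would otherwise have raised the contrasts error.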

Jose
