I have a data set of patients who are positive or negative for a virus infection. A negative patient has an outcome (y) of 0; a positive patient has a positive value, up to 100. The input (x) values are numeric too, and I want to predict y from the x values alone. x consists of more than one variable. Among the negative patients, at least one x variable also contains a lot of zeros.

Is it possible to use methods like PCR (principal component regression), PLS, lasso, ridge, or glmnet (all of these work fine if I analyze only the positive group), or do they stop working when there are so many zeros in the data? Can I transform the data (log10), adding one to avoid the zeros, and get better results that way? Must I first classify the patients into two groups and then fit a regression for the positive group, or is there a one-step option too?

Sally
  • 21
  • Can you elaborate why the negative patients have Y = 0? Is their viral load reading truly 0? Or perhaps the current Y is the result of a two-step procedure that classifies patients as positive or negative and then measures the viral load for the positive patients only. – dipetkov Apr 11 '22 at 15:15
  • To the best of our knowledge, based on published research and our sample population, their viral load is truly 0, and always 0 as long as they have/had no infection. – Sally Apr 11 '22 at 15:21
  • On the other hand, a positive patient could theoretically have 0 too. But in my sample population that was not the case. – Sally Apr 11 '22 at 15:27

1 Answer

Since a zero is generated in only one way here (a patient who does not have the virus has y = 0), you should look into hurdle models (also called zero-altered models).

These two-stage models first fit a binomial model for whether a patient has the virus or not, and then fit a second model to the data without the zeros. Which model to use in the second stage depends on the type of your data. If it is positive and continuous, you can try a Gamma model; if the data are counts, you can try a zero-truncated Poisson or negative binomial model; and if the data are proportions (or percentages rescaled to lie strictly between 0 and 1), you can try a beta regression model on the non-zero data.
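As a rough illustration of the two stages, here is a minimal sketch in Python using only NumPy. Everything in it is an assumption made for the example: the data are simulated (not your data), the positive part is taken to be continuous and modelled with a Gamma GLM with a log link, and the helper names `fit_logistic` and `fit_gamma_log` are hypothetical, written here by hand rather than taken from any library. The combined prediction is P(y > 0 | x) multiplied by E[y | y > 0, x].

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical, for illustration only).
n = 2000
x = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), x])        # design matrix with intercept

true_bin = np.array([-0.5, 1.5, 0.0])       # drives infection status
true_pos = np.array([2.0, 0.0, 0.5])        # drives y among positives
infected = rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_bin))
mu = np.exp(X @ true_pos)
shape = 2.0                                 # assumed Gamma shape parameter
y = np.where(infected, rng.gamma(shape, mu / shape), 0.0)


def fit_logistic(X, z, n_iter=50, ridge=1e-8):
    """Logistic regression via Newton's method.

    The tiny ridge term is only there for numerical stability.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        hessian = X.T @ (X * w[:, None]) + ridge * np.eye(X.shape[1])
        gradient = X.T @ (z - p) - ridge * beta
        step = np.linalg.solve(hessian, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta


def fit_gamma_log(X, y, n_iter=100):
    """Gamma GLM with log link via IRLS.

    With a log link the IRLS working weights are constant, so each
    iteration is an ordinary least-squares fit to the working response.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())              # start at the overall mean
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        working = eta + (y - mu) / mu
        new_beta = np.linalg.lstsq(X, working, rcond=None)[0]
        converged = np.max(np.abs(new_beta - beta)) < 1e-10
        beta = new_beta
        if converged:
            break
    return beta


# Stage 1: binomial model for zero vs. non-zero, using all patients.
is_pos = y > 0
b_bin = fit_logistic(X, is_pos.astype(float))

# Stage 2: Gamma model fitted to the non-zero outcomes only.
b_pos = fit_gamma_log(X[is_pos], y[is_pos])

# Combined prediction: P(y > 0 | x) * E[y | y > 0, x].
p_hat = 1.0 / (1.0 + np.exp(-X @ b_bin))
mu_hat = np.exp(X @ b_pos)
y_hat = p_hat * mu_hat

print("stage 1 (logit) coefficients:", np.round(b_bin, 2))
print("stage 2 (log-Gamma) coefficients:", np.round(b_pos, 2))
```

The two stages are fitted separately, so either one can be swapped out: for example a penalized logistic model in stage 1, or a beta regression in stage 2 if the outcome is treated as a percentage. In practice you would use an established implementation rather than hand-written fitting loops.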

If you search here on CV, you will also find many threads on hurdle and zero-inflated models, e.g. "What is the difference between zero-inflated and hurdle models?"

Stefan
  • 6,431