4

I am trying to do a LASSO regression on some data. However, my dependent variable is between 0 and 1. How do I go about this? Do I just apply a sigmoid function to the regression output?

This will surely force the outcome to the 0-1 range, but I am not sure of the technical implications.

Minaj
  • Do you mean your dependent (output) variable? – Kevin Jul 28 '17 at 14:46
  • 2
    how would you model this data if you weren't interested in penalization during estimation? Answering this question should send you in the right direction. – user795305 Jul 28 '17 at 14:49
  • I am not sure either. So even if I was to do a straight linear regression, my question would still stand. – Minaj Jul 28 '17 at 14:58
  • 2
    If $x$ can take any value but $y$ is bounded between 0 and 1, then $y$ isn't a linear function of $x$. You've specified two properties of the function (that $y \in (0,1)$, and something about it being continuous). There are all kinds of crazy looking nonlinear functions that satisfy these properties. To get to the point of fitting a model, you'd have to be more explicit about the type of function you're looking for. – user20160 Jul 28 '17 at 15:35
  • Is it a continuous proportion or a count proportion you are modelling? – usεr11852 Jul 29 '17 at 08:04
  • It's a continuous proportion. – Minaj Jul 29 '17 at 19:22

3 Answers

5

Since the response variable lies between 0 and 1, you should consider a beta regression. The 'gamlss' package lets you do that and, in addition, fit the model with a lasso penalty.

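library(betareg)  # provides the GasolineYield data
library(gamlss)
data(GasolineYield)

# predictor matrix; batch is a factor, so cbind() stores its integer codes
X <- with(GasolineYield, cbind(gravity, pressure, temp10, temp, batch))
# standardise the predictors so the penalty treats them on a common scale
sX <- scale(X)
# ridge (family = BE requests the beta distribution; gamlss defaults to the normal)
m1 <- gamlss(yield ~ ri(sX), family = BE, data = GasolineYield)
# lasso
m2 <- gamlss(yield ~ ri(sX, Lp = 1), family = BE, data = GasolineYield)
# best subset
m3 <- gamlss(yield ~ ri(sX, Lp = 0), family = BE, data = GasolineYield)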

# summary
summary(m1)
summary(m2)
summary(m3)

# plotting the coefficients
plot(getSmo(m1))
plot(getSmo(m2))
plot(getSmo(m3))

There are some variations for beta regression. Take a look at the GAMLSS Manual.
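For comparison, an unpenalized beta regression on the same data can be sketched with the 'betareg' package. This is only a baseline under assumptions: the choice of predictors is illustrative (batch is left out), and the default logit link is used.

```r
library(betareg)
data("GasolineYield", package = "betareg")

# Unpenalized beta regression with the default logit link for the mean
# (the predictor set here is an illustrative assumption)
fit <- betareg(yield ~ gravity + pressure + temp10 + temp, data = GasolineYield)

# Fitted means are on the response scale, so they stay inside (0, 1)
mu_hat <- predict(fit, type = "response")
range(mu_hat)
```

Comparing these fitted means with those of the penalized models gives a rough sense of how much the penalty shrinks the fit.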

2

I am not sure, but I think we can do

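$$ \underset{\beta}{\text{minimize}}~ \left\| \frac{1}{1+e^{-X\beta}} - y \right\|_2^2 + \lambda\|\beta\|_1 $$

where $X$ is the data matrix, $y$ is the response, $\beta$ is the vector of coefficients, and the sigmoid is applied elementwise. Note that this objective is not convex in $\beta$ in general, because the squared error is composed with a sigmoid, so a numerical solver may only find a local minimum.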

The fitted values are then guaranteed to satisfy

$$ 0< \frac 1 {1+e^{-X\beta}} < 1$$
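One way to try this objective numerically is proximal gradient descent (ISTA), where each gradient step on the squared-error term is followed by soft-thresholding for the $\ell_1$ penalty. The sketch below uses base R on simulated data; the function names, step size, iteration count, and $\lambda$ are all assumptions chosen for illustration, and the method is only expected to reach a stationary point of the objective.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Soft-thresholding: the proximal operator of t * ||.||_1
soft_threshold <- function(b, t) sign(b) * pmax(abs(b) - t, 0)

# || sigmoid(X beta) - y ||_2^2 + lambda * ||beta||_1
objective <- function(beta, X, y, lambda) {
  sum((sigmoid(X %*% beta) - y)^2) + lambda * sum(abs(beta))
}

# Hypothetical helper: ISTA with a fixed step size (an assumption, not tuned)
fit_sigmoid_lasso <- function(X, y, lambda, step = 0.005, iters = 5000) {
  beta <- rep(0, ncol(X))
  for (i in seq_len(iters)) {
    p <- as.vector(sigmoid(X %*% beta))
    # Gradient of the squared-error term: 2 * X^T [(p - y) * p * (1 - p)]
    grad <- 2 * as.vector(crossprod(X, (p - y) * p * (1 - p)))
    beta <- soft_threshold(beta - step * grad, step * lambda)
  }
  beta
}

# Simulated data; for simplicity the intercept column is penalized too
set.seed(1)
n <- 200
X <- cbind(1, matrix(rnorm(n * 4), n, 4))
beta_true <- c(0.5, 1.5, 0, 0, -1)
y <- as.vector(sigmoid(X %*% beta_true + rnorm(n, sd = 0.3)))

beta_hat <- fit_sigmoid_lasso(X, y, lambda = 0.5)
fitted <- as.vector(sigmoid(X %*% beta_hat))
range(fitted)
```

In practice one would standardise the predictors, leave the intercept unpenalized, and choose $\lambda$ by cross-validation.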

Haitao Du
0

Let's say that the true relationship between predictors and response is (mostly) linear. In this case, you could do a regression and then truncate the outputs (i.e. anything below 0 counts as 0, anything above 1 counts as 1). This would be better than applying the sigmoid function.
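A minimal sketch of the truncation idea in base R is below. It uses plain `lm()` on simulated data as a stand-in for the lasso fit; the data and variable names are assumptions for illustration, and the same clipping step applies to predictions from any linear model.

```r
set.seed(42)
n <- 100
x <- runif(n, -1, 2)
# Roughly linear response, kept inside [0, 1]
y <- pmin(pmax(0.5 * x + 0.3 + rnorm(n, sd = 0.05), 0), 1)

fit <- lm(y ~ x)

# Raw linear predictions can leave [0, 1] when extrapolating
new_x <- data.frame(x = c(-3, 0, 0.5, 4))
raw <- predict(fit, newdata = new_x)

# Truncate: anything below 0 becomes 0, anything above 1 becomes 1
clipped <- pmin(pmax(raw, 0), 1)
cbind(raw, clipped)
```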

If you use a sigmoid function, you should do so while training the model, not by simply applying it to the output of a fitted linear regression. This approach works better if your problem is closer to classification (i.e., most of your outputs are near 0 or near 1). The betareg package manual mentions this idea too.
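As a sketch of what "using the sigmoid while training" can look like without a penalty, base R's `glm()` with a logit link fits the sigmoid as part of the model. The `quasibinomial` family is used here so that non-integer responses are accepted without warnings; the simulated data are an assumption for illustration.

```r
set.seed(7)
n <- 200
x <- rnorm(n)
# Continuous proportions in (0, 1), many of them close to 0 or 1
y <- plogis(1 + 2 * x + rnorm(n, sd = 0.5))

# The logit link means the sigmoid is part of the fitted model
fit <- glm(y ~ x, family = quasibinomial(link = "logit"))

# Predictions on the response scale stay inside (0, 1), even far from the data
pred <- predict(fit, newdata = data.frame(x = c(-5, 0, 5)), type = "response")
pred
```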

Ultimately, you'd want to use a plot of the data or some knowledge about its structure to make a final decision (per @user20160's comment).

Kevin