
It is well known that an L1 regularized linear regression is equivalent to a regression with a Laplace prior on the distribution of the coefficients. This is explained here:

https://bjlkeng.github.io/posts/probabilistic-interpretation-of-regularization/

I would love to make use of this fact and convert a Bayesian model I have with a Laplace prior to a simple Lasso regression using sklearn, which is much faster. However, when I try to follow the formula for converting the b of the Laplace prior to the alpha of L1 - I do not get the expected results. According to the article above, the conversion from the b scale parameter of Laplace to the alpha of Lasso should be 2*sig^2 / b.

Using pymc3 to implement the Bayesian model with a Laplace prior:

import numpy as np
import pymc3 as pm
from sklearn.linear_model import LinearRegression, Lasso

k = 2
n = 150

true_sigma = 1
true_alpha = 5

coefs = np.random.rand(k) * 3
X = np.random.rand(n, k)
y = (X * coefs + true_alpha + np.random.rand(n, 1) * true_sigma).sum(axis=1)

basic_model = pm.Model()
b = 3

with basic_model:
    alpha = pm.Normal("alpha", mu=0, sigma=10)
    beta = pm.Laplace("beta", mu=0, b=b, shape=k)
    sigma = pm.HalfNormal("sigma", sigma=3)

    mu = alpha + (beta * X).sum(axis=1)

    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=y)

map_estimate = pm.find_MAP(model=basic_model)

Now using Lasso:

reg_alpha = 2 * true_sigma**2 / b
lr = Lasso(alpha=reg_alpha)
lr.fit(X, y)

We can now compare lr.coef_ and map_estimate["beta"]. I played with the data drawn, sigma and b and found that the results are rarely the same. If I manually change the reg_alpha, I can find a value that will produce similar coefs, but I cannot find a consistent formula.

Even if this were solved - the theoretical formula involves knowing the true sigma (noise), which obviously we cannot know. Is there no way to convert the Bayesian model given some b to an equivalent Lasso with the correct alpha?

Edit:

We found a solution. First, there were a few inaccuracies in the pymc3 model, making it not quite equivalent to Lasso. They don't make much of a difference, but the correct model would be:

with basic_model:
    alpha = pm.Flat("alpha")
    beta = pm.Laplace("beta", mu=0, b=b, shape=k)
    sigma = pm.HalfFlat("sigma")

    mu = (beta * X).sum(axis=1) + alpha
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=y)

map_estimate = pm.find_MAP(model=basic_model)

We can then estimate the MAP using Lasso:

reg_alpha = map_estimate['sigma']**2 / (b * n)
lr = Lasso(alpha=reg_alpha, tol = 1e-6, fit_intercept=True)
lr.fit(X, y)

but it requires an estimate of the true error of the model, via map_estimate['sigma']. However, we can iteratively re-estimate the error using this code:

tolerance = 1e-6
max_iter = 10

est_sigma = 1
for i in range(max_iter):
    reg_alpha = est_sigma**2 / (b * n)
    lr = Lasso(alpha=reg_alpha, fit_intercept=True)
    lr.fit(X, y)
    new_est_sigma = (lr.predict(X) - y).std()
    if abs(new_est_sigma - est_sigma) < tolerance:
        break
    est_sigma = new_est_sigma
print(lr.coef_, est_sigma)

This runs much faster than pymc's code. So given a Bayesian regression with a Laplace prior with scale b, it is possible to use Lasso and get similar results, at least 100x faster.
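For reference, here is a rough timing sketch (not part of the original solution; it assumes the basic_model, X, y, b, n, tolerance and max_iter defined above are in scope, and the actual speed-up will depend on the data and solver settings):

import time

t0 = time.perf_counter()
map_estimate = pm.find_MAP(model=basic_model)
t_pymc = time.perf_counter() - t0

t0 = time.perf_counter()
est_sigma = 1
for i in range(max_iter):
    reg_alpha = est_sigma**2 / (b * n)
    lr = Lasso(alpha=reg_alpha, fit_intercept=True)
    lr.fit(X, y)
    new_est_sigma = (lr.predict(X) - y).std()
    if abs(new_est_sigma - est_sigma) < tolerance:
        break
    est_sigma = new_est_sigma
t_lasso = time.perf_counter() - t0

print("find_MAP: %.3f s, iterative Lasso: %.3f s" % (t_pymc, t_lasso))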

  • The answer you accepted points out an important difference. But, it's not the whole story. Even after accounting for the factor-of-n issue, there are important conceptual differences between lasso and your pymc3 model. – user20160 Mar 09 '22 at 15:12
  • like what? how can we fix the pymc3 model to be more like Lasso then? – Oren Matar Mar 09 '22 at 15:16
  • "So given a bayesian regression with a Laplace prior with scale b it is possible to use Lasso and get similar results, at least 100x faster." Effectively Lasso is some sort of coordinate gradient descent. But it can be fast because the path is in straight lines. I imagine that there are just as well good coordinate descent solvers for the maximum a posteriori probability (the reason for that MCMC stuff is when you wish for the entire curve and or when you have nested variables models that don't behave/converge so well). In that case the speed advantage may be less... – Sextus Empiricus Mar 10 '22 at 11:12
  • ... The downside of the Lasso equivalence with the Laplace prior is that it only works in the specific case with low informative priors on the other parameters. You cannot generalize it easily when you wish to change some of the priors on other parameters. – Sextus Empiricus Mar 10 '22 at 11:13

2 Answers


There are four points of improvement to make the relationship between the l1 regularization and the Bayesian MAP estimate equivalent.

1. Slightly different definitions of $\lambda$

The optimization function used in the article is

$$S(\alpha,\beta;x,y) = \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 + \lambda \sum_{p=1}^2 \vert \beta_p\vert$$

The optimization function used by Python's sklearn has an extra factor $\frac{1}{2n}$ in front of the sum of squares.

$$S(\alpha,\beta;x,y) = \frac{1}{2n}\sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 + \lambda \sum_{p=1}^2 \vert \beta_p\vert$$

So your code will work when you divide your reg_alpha by 2n.

See the manual

The optimization objective for Lasso is:

(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1
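As a quick sanity check of this parameterization difference (a sketch with made-up toy data, not part of the original answer): the two objectives differ only by the constant factor 2n, so they are minimized by the same coefficients when alpha = lambda / (2n).

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, k = 150, 2
X = rng.random((n, k))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

lam_article = 0.3                        # lambda in the article's objective
alpha_sklearn = lam_article / (2 * n)    # the same penalty in sklearn's parameterization

lr = Lasso(alpha=alpha_sklearn).fit(X, y)
w = lr.coef_

ss = ((y - lr.predict(X)) ** 2).sum()
obj_article = ss + lam_article * np.abs(w).sum()              # article's cost function
obj_sklearn = ss / (2 * n) + alpha_sklearn * np.abs(w).sum()  # sklearn's cost function

print(obj_article, 2 * n * obj_sklearn)  # identical up to floating point error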

2. Using the right $\sigma$

In addition, as Tim mentioned, you need to use the $\sigma$ that is the optimum in the Bayesian model.

With $f_\sigma(\sigma)$, $f_\alpha(\alpha)$, $f_\beta(\beta) = \frac{1}{2b} e^{-\frac{|\beta|}{b}}$ your prior functions, and $h(x,y)$ a normalization function, the logarithm of the posterior density is the following

$$\begin{aligned} S(\alpha,\beta,\sigma;x,y) =&\ \overbrace{-\frac{n}{2} \log 2\pi - n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^n \big(y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i})\big)^2}^{\text{log-likelihood}} + \overbrace{ \log f_\beta(\beta_1) + \log f_\beta(\beta_2)}^{\text{prior of beta}} + \overbrace{ \log f_\sigma(\sigma) + \log f_\alpha(\alpha) }^{\text{prior other parameters}} + \overbrace{\log h(x,y)}^{\text{normalization function}} \\ =&\ -\frac{n}{2} \log 2\pi - n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^n \big(y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i})\big)^2 + \log \tfrac{1}{2b} e^{-\frac{|\beta_1|}{b}} + \log \tfrac{1}{2b} e^{-\frac{|\beta_2|}{b}} + \log f_\sigma(\sigma) + \log f_\alpha(\alpha) + \log h(x,y) \\ =&\ -\frac{n}{2} \log 2\pi - n \log \sigma \underbrace{- \frac{1}{2\sigma^2} \sum_{i=1}^n \big(y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i})\big)^2 - \frac{1}{b}\sum_{p=1}^2 \vert\beta_p\vert}_{\text{part that depends on the }\beta_p} + 2\log \tfrac{1}{2b} + \log f_\sigma(\sigma) + \log f_\alpha(\alpha) + \log h(x,y) \end{aligned}$$

The part that depends on the $\beta_p$ is to be compared with the cost function. Maximizing it over the $\beta_p$ is the same as minimizing

$$\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 + \frac{1}{b} \sum_{p=1}^2 \vert \beta_p\vert $$

or, multiplied by $\frac{\sigma^2}{n}$,

$$\frac{1}{2n} \sum_{i=1}^n (y_i- (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}))^2 + \frac{\sigma^2}{nb} \sum_{p=1}^2 \vert \beta_p\vert $$

This is the same as sklearn's ($\ell_1$ regularized) cost function above if $$\lambda = \frac{\sigma^2}{nb}$$

Here $\sigma$ is the $\sigma$ that is part of the maximum a posteriori probability estimate.

3. Prior on $\alpha$

When you have a prior on $\alpha$ then you get a different estimate for this parameter. This will influence the estimates of the $\beta$, since the $X$ variables correlate with the intercept and they will correct for the different $\alpha$.
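One way to take the intercept out of the comparison entirely (a sketch, not part of the original answer; it assumes the X, y and reg_alpha from the other snippets are in scope) is to center the data and drop the intercept from both models:

X_centered = X - X.mean(axis=0)
y_centered = y - y.mean()

lr = Lasso(alpha=reg_alpha, fit_intercept=False)  # no intercept needed on centered data
lr.fit(X_centered, y_centered)

# In the pymc3 model the corresponding change would be to remove alpha and use
# mu = (beta * X_centered).sum(axis=1) with observed=y_centered.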

4. Accuracy of the computations

Another source of variation between the two methods is that the algorithms might not converge to full accuracy; due to early stopping you can get slightly different answers.

Code example

The code below combines the four points above.

import numpy as np
import pymc3 as pm
from sklearn.linear_model import LinearRegression, Lasso

np.random.seed(1)

k = 2
n = 150

true_sigma = 1
true_alpha = 5

coefs = np.random.rand(k) * 3
X = np.random.rand(n, k)
y = (X * coefs + true_alpha + np.random.rand(n, 1) * true_sigma).sum(axis=1)

basic_model = pm.Model()
b = 1

with basic_model:
    alpha = pm.Flat("alpha")        # different prior
    beta = pm.Laplace("beta", mu=0, b=b, shape=k)
    sigma = pm.HalfFlat("sigma")    # different prior

    mu = alpha + (beta * X).sum(axis=1)

    Y_obs = pm.Normal("Y_obs", mu=mu, sd=sigma, observed=y)

    # optional: full posterior sampling, not needed for the MAP comparison below
    step = pm.NUTS()
    pm.sample(2000, tune=1000, step=step)

map_estimate = pm.find_MAP(model=basic_model, maxeval=10**6)  # more evaluations to have the algorithm converge further
# print(map_estimate['sigma'])

Now using Lasso:

reg_alpha = map_estimate['sigma']**2 / b / n           # different alpha
lr = Lasso(alpha=reg_alpha, tol=1e-6, max_iter=10**5)  # different tolerance to have the algorithm converge further
lr.fit(X, y)
print("\n")
print(lr.coef_)
print(map_estimate['beta'])

output

[1.15586011 2.45047217]
[1.15586059 2.45047129]

Why is l1 regularization so much faster than find_MAP?

Your question seems to be motivated by the use of Lasso as a faster alternative.

So given a bayesian regression with a Laplace prior with scale b it is possible to use Lasso and get similar results, at least 100x faster.

But the PyMC3 library is overkill for this problem, and that is why it is relatively slow. PyMC3 is built around computing the posterior by means of (slow) sampling.

  • This is unnecessary for this problem where an expression for the posterior is already known.
  • PyMC3 and sampling with a Monte Carlo method are useful when the expression for the posterior is too complex to be computed by hand, with a numerical method, or with an algebraic software program.

For instance, the posterior could involve a step like integration to marginalize out nuisance parameters. This is already the case here when we look for the posterior of the $\beta$ only, e.g. when we look to maximize the marginal distribution of $\beta$,

$$p(\beta;x,y) = \iint \exp\big(S(\alpha,\beta,\sigma;x,y)\big) \, d\alpha \, d\sigma$$

But to be honest, I am not sure whether find_MAP is optimizing to find the maximum of the joint distribution or to find the individual maxima of the marginal distributions.
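To illustrate the point, here is a minimal sketch (not part of the original answer) of maximizing the joint log posterior directly with a generic optimizer instead of going through PyMC3. It assumes the X, y, n, k and b from the code above are in scope; the flat priors on alpha and sigma simply contribute no terms:

import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(params):
    a, log_sigma = params[0], params[1]
    w = params[2:]
    sigma = np.exp(log_sigma)                  # keeps sigma positive
    resid = y - a - X @ w
    neg_log_lik = n * np.log(sigma) + (resid ** 2).sum() / (2 * sigma ** 2)
    neg_log_prior = np.abs(w).sum() / b        # Laplace(0, b) prior on the betas
    return neg_log_lik + neg_log_prior

x0 = np.concatenate(([y.mean(), 0.0], np.zeros(k)))
res = minimize(neg_log_posterior, x0, method="Powell")  # Powell copes with the non-smooth |w| terms
print(res.x[2:])  # MAP estimate of the betas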

  • That's more or less where I started, but if you play with the values of b, and re-draw new Xs and ys randomly a few times, you'll find that this conversion is not consistent. Sometimes it works, and often it doesn't.. there's always some value of reg_alpha that will produce near identical results to the map_estimate, but it's not as simple as the suggested formula. If you get this to work consistently for b between 0.5 and 20 please send me the code.. – Oren Matar Mar 09 '22 at 14:10
  • I tried that version too.. does not produce consistent results... if this works for you with b=0.5, 1,3,5,10,20 and for 3-4 new draws of X,coefs,y - you can get the bounty. for me it doesn't.. – Oren Matar Mar 09 '22 at 14:14
  • Previously I got far larger differences, but following your example I changed the HalfNormal sd to 100, maxeval=10**6, and tol = 1e-6, and now the differences are far more reasonable. Although they are still a bit too large sometimes, especially for small values of b. Anyway, you get the bounty. Bonus question: is there any way at all we can estimate the sigma without training the pymc3 model? Just a ballpark estimate, so we can only use Lasso as it is so much faster... my thinking is maybe iteratively train a Lasso, find its sigma (errors) and train a new one based on the new value... – Oren Matar Mar 09 '22 at 15:05
  • And still getting large errors for b in the range of 0.1 - 0.5.. I can live with that maybe but would love to get a theoretical explanation for why it is so much worse on that range – Oren Matar Mar 09 '22 at 15:12
  • Which prior for the intercept would be the best for the pymc model to be as much as Lasso as possible? just increase it's std? – Oren Matar Mar 09 '22 at 15:14
  • Doesn't seem to make much of a difference in my experiments, but I guess it is more precise with pm.Flat – Oren Matar Mar 09 '22 at 15:21
  • about the same.. I don't think the problem lies there. Anyway if we want to remove the dependency on the pymc3 estimation of the sigma, the iterative approach works pretty well: est_sigma = 10; for i in range(10): reg_alpha = 2 * est_sigma**2 / b / n; lr = Lasso(alpha=reg_alpha, tol=1e-6); lr.fit(X, y); est_sigma = (lr.predict(X) - y).std(); print(lr.coef_, est_sigma)

    regardless of the initial est_sigma, we converge pretty quickly.. and it is still much faster than pymc...

    – Oren Matar Mar 09 '22 at 15:26
  • Great! Using this approach we can get Facebook's Prophet model for time series forecasting to run at least 100X faster (they're using stan, but I'm guessing it wouldn't make a huge difference) – Oren Matar Mar 09 '22 at 15:30
  • If you can get a good result without the intercept that's great, we can always center the data before fitting. but i'm not getting a smaller diff without it... if you get a measurable improvement pls send me the new code? – Oren Matar Mar 09 '22 at 16:16
  • to my knowledge lasso in sklearn does not allow separate penalties.. but if the intercept really is the issue then we can use the same method as sklearn - center the data before the pymc model and not give the model any intercept. I'm still not sure why this would cause the sklearn model to always over-regularize compared with the pymc one, and why the difference would be more significant for lower b values... But this is the most interesting cooperative investigation I've ever gotten from a stackexchange question, so thanks for that. Would be amazing to crack this one. – Oren Matar Mar 09 '22 at 20:59
  • use reg_alpha = map_estimate['sigma']**2 / b / n Which is the same as what we had but without another 2 in the numerator, and it works perfectly. now replace the MAP estimate for sigma with iteratively estimating it using Lasso, and using the new value for a new Lasso - and the model is easily solved without pymc at all. – Oren Matar Mar 09 '22 at 21:51
  • re-reading the documentation - it actually says the objective function has (2 * n_samples) in the denominator, so the 2 would cancel out with the 2 derived from the solution to MAP for Laplace in the numerator. The results are so highly correlated with the new formula, it puts all of the really good arguments for why we shouldn't expect a precise solution to shame. – Oren Matar Mar 10 '22 at 07:08
  • added a full solution in a new edit to the question – Oren Matar Mar 10 '22 at 07:21
  • @OrenMatar I have cleaned up the answer and deleted comments here to make it less bulky. – Sextus Empiricus Mar 16 '22 at 18:29
  • Hi, thanks for all of your help, it helped me create this faster fit method for facebook's Prophet for time series: https://medium.com/towards-data-science/need-for-speed-optimizing-facebook-prophet-fit-method-to-run-20x-faster-166cd258f456 – Oren Matar May 11 '22 at 06:21
  • @OrenMatar that's a lot of nice stuff to read through. Did you eventually implement the method where you iterate re-estimations of $\sigma$? One could also do this at the same time as estimating the lasso coefficients and use some sort of gradient descent. That might be another efficiency improvement because you don't have to iterate the estimation several times. In the end the LARS regression is a sort of gradient descent as well. – Sextus Empiricus May 11 '22 at 08:20
  • I did both - check out the post. Initially I iteratively created new Lassos, and then implemented my own regression in pytorch and re-estimated sigma every 50 iterations. Compared it on a few datasets to the Bayesian model and got exactly the same results. – Oren Matar May 11 '22 at 10:08

Lasso regression (using $\ell_1$ regularization) with regularization parameter $\lambda$ is equivalent to using Laplace priors with mean zero and scale $\tau = 1/\lambda$ (see Tibshirani, 1996).

There are however formal differences between the two models you mentioned:

  • The $\ell_1$ model does not assume any half-normal prior for the variance. For equivalence, it should use something like a flat prior $p(\sigma) \propto 1$ on the whole $0$ to $\infty$ range (but that is not a good choice).
  • In the $\ell_1$ scenario you are cheating a little bit because you use true_sigma to set the regularization strength, while the Bayesian model knows nothing about it.

Moreover, while using Laplace priors vs $\ell_1$ regularization are equivalent, this doesn't mean that you should expect exactly the same results. There are a ton of implementation details that could make a difference (scaling of the data, regularizing the intercept, initialization, etc). In both cases you are also likely using a different optimization algorithm, which could also give different results. In particular, PyMC's find_MAP is a toy implementation not meant for any serious use:

while PyMC3 provides the function find_MAP(), at this point mostly for historical reasons, this function is of little use in most scenarios.

Finally, as discussed by Sara Van Erp et al (2018), in practice those priors do not work as well as you would expect.

If you would like to do a valid comparison, the best approach would be to write down all the code yourself from scratch, so that all the details are the same, use exactly the same optimization algorithm, etc. Such code likely wouldn't be as good as any of the implementations you used, but you would be sure that there are no "technical details" that lead to different results.
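For example, a from-scratch proximal-gradient (ISTA) solver for the sklearn-style objective fits in a few lines; this is only a sketch of that approach, not a drop-in replacement for either library:

import numpy as np

def soft_threshold(z, t):
    # proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, alpha, n_iter=5000):
    # minimize (1 / (2n)) * ||y - X w - a||^2 + alpha * ||w||_1 by proximal gradient
    n, k = X.shape
    w = np.zeros(k)
    a = y.mean()
    L = np.linalg.norm(X, 2) ** 2 / n      # Lipschitz constant of the smooth part's gradient
    step = 1.0 / L
    for _ in range(n_iter):
        resid = y - X @ w - a
        grad = -(X.T @ resid) / n
        w = soft_threshold(w - step * grad, step * alpha)
        a = (y - X @ w).mean()             # unpenalized intercept
    return w, a

With the same alpha, lasso_ista(X, y, alpha) should return coefficients close to sklearn's Lasso(alpha=alpha).fit(X, y).coef_, while keeping every implementation detail under your control.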

– Tim
  • I'm ok not getting the exact same results, but what I'm getting is very far off, and far off even when using MCMC sampling and not just find_MAP. – Oren Matar Mar 08 '22 at 15:09
  • @OrenMatar see my edit; besides the things mentioned initially, there are formal differences between the two models. – Tim Mar 08 '22 at 15:09
  • Also, how did you compare the MCMC results to the second scenario? To compare it, you need to estimate the mode of the posterior distribution, so this is another layer of approximation & another algorithm that influences the results. – Tim Mar 08 '22 at 15:11
  • These are all great points. But I am still somewhat surprised that I'm getting results that are completely off.. sometimes the Lasso returns all coefs as 0s, and the Bayesian model returns a good prediction. Can this truly be attributed to slight differences in the optimization method, and the assumption of a half normal variance? for the MCMC results I used either the mode, the find MAP or predicted using the entire posterior distribution, which is easy enough in pymc3. – Oren Matar Mar 09 '22 at 08:24
  • I'm running the code with b=10 for the Laplace prior, which is a very uninformative prior, as the real coefs are +-1. I'm getting a fine result from pymc but for the Lasso all coefs are 0s.. so the regularization is way too strong. Am I missing anything else other than a few differences in optimization..? – Oren Matar Mar 09 '22 at 08:30
  • @OrenMatar as said you use different parametrization and different optimization algorithms, so basically, those are different models. "Very uninformative" prior vs flat prior from 0 to $\infty$ is not the same. The scikit-learn model uses a specific value for alpha that has nothing to do with PyMC model declaration. I see nothing surprising in the results being different. – Tim Mar 09 '22 at 09:23
  • See my edit to the question. There's definitely a strong correlation between the two models, with some alpha producing near identical results for each b prior (and the small diff could be a result of the reasons you mentioned). The only question is what is the transformation from b to alpha. I don't think I got an answer to this question yet, but you've been quite helpful and informative so you get the bounty. – Oren Matar Mar 09 '22 at 10:51
  • @OrenMatar see my edit: the alpha parameter in scikit-learn is the scale for the Laplace distribution, but the two models you use are different + technical details, so you should not expect the same results. – Tim Mar 09 '22 at 11:13
  • The alpha is not the scale for the Laplace - a lower alpha means less regularization, whereas a lower scale for Laplace means more regularization (stronger prior on 0). They are inversely correlated, but not sure how to convert one to the other. – Oren Matar Mar 09 '22 at 12:36
  • @OrenMatar edited: so it's 1/alpha if that's sklearn's implementation. – Tim Mar 09 '22 at 13:22