
I'm trying to build intuition about how individual coefficients change as the regularization penalty is increased (for both ridge and lasso). This is what I understand the l1 and l2 complexity penalty curves to look like:

[Figure: l1 and l2 penalty curves]

Given that the complexity penalty for lasso is the l1 norm, I would expect that increasing $\lambda$ would reduce all coefficients equally, and this is indeed what I see when using a modified version of this sklearn example:

[Figure: lasso coefficient paths from the modified sklearn example]
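
For concreteness, here is a minimal sketch of how such a path can be computed with sklearn's `lasso_path` (synthetic, nearly uncorrelated features; not the exact modified example above):

```python
import numpy as np
from sklearn.linear_model import lasso_path

# Synthetic data with (nearly) uncorrelated features -- the "idealized" case.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([4.0, 2.0, 1.0, 0.5, 0.0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

# lasso_path computes the coefficients along a grid of penalty values.
alphas, coefs, _ = lasso_path(X, y, n_alphas=100)
# coefs has shape (n_features, n_alphas); plotting each row against alphas
# (log-scaled x-axis) gives a coefficient-path chart like the one above.
```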

Given that the complexity penalty for ridge is the l2 norm, I would expect that increasing $\lambda$ would prioritize reducing the largest coefficients. I initially thought that they would start decreasing earlier (at smaller $\lambda$ values) than the smaller coefficients, but it seems like they all start around the same time, with the larger ones decreasing at a steeper slope, which I think also makes sense:

[Figure: ridge coefficient paths from the modified sklearn example]
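
For ridge, the analogous sketch (again with synthetic data) simply refits `Ridge` over a grid of penalties and stacks the coefficient vectors:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Same kind of synthetic, nearly uncorrelated data as in the lasso sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([4.0, 2.0, 1.0, 0.5, 0.0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

# Refit ridge for each penalty value and record the coefficient vector.
alphas = np.logspace(-3, 4, 100)
coef_path = np.array([Ridge(alpha=a).fit(X, y).coef_ for a in alphas])
# coef_path has shape (n_alphas, n_features); plotting each column against
# alphas (log-scaled x-axis) gives the ridge coefficient paths.
```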

However, when I go outside these idealized examples and look at some charts "in the wild", I see all kinds of regularization paths, for example this one for lasso (from here):

[Figure: lasso regularization path from the linked example]

Or this one for ridge (from here):

[Figure: ridge regularization path from the linked example]

I notice both use a different scale on the x-axis (and the last one is flipped), but it really seems like they follow different paths than the idealized examples I posted. What makes these paths different from the idealized scenario above? Is the model somehow trying to use the features differently as $\lambda$ increases? Or is my intuition about what should happen wrong, and is it only in the idealized cases that we can expect the coefficients to shrink in a well-behaved manner?

  • For ridge, see this thread. To avoid duplicating it, I suggest you narrow your question down to just the lasso. (And it is possible that the latter has also been covered in some other thread; take a careful look at search results using relevant tags and keywords.) – Richard Hardy Nov 28 '23 at 09:01
  • Thank you @RichardHardy! This passage from one of the answers suggests that it is due to correlation between parameters, which can lead to a tradeoff that still decreases the overall penalty: "Typically parameters decrease when we increase the penalty, but due to correlation it might be better for some parameters to decrease while another simultaneously increases. This happens in the image with parameter 1. Increasing parameter 1 means that decreasing parameters 2 and 3 coincides with a smaller increase in the squared-error part of the loss function." This makes sense to me, so I'll mark this as a duplicate. – another_student Dec 01 '23 at 22:12
  • I couldn't find a way to indicate a duplicate so I posted that comment as a community answer instead – another_student Dec 01 '23 at 22:15
  • The OP has indicated this is a duplicate: Intuition for nonmonotonicity of coefficient paths in ridge regression. I am voting accordingly. – Richard Hardy Dec 02 '23 at 08:49

1 Answer


It seems like this has already been answered in this thread:

Typically parameters decrease when we increase the penalty, but due to correlation it might be better for some parameters to decrease while another simultaneously increases. This happens in the image with parameter 1. Increasing parameter 1 means that decreasing parameters 2 and 3 coincides with a smaller increase in the squared-error part of the loss function.
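
If you want to probe this on your own data, a quick sketch (hypothetical, not from the linked thread) is to build strongly correlated predictors and check whether any coefficient's magnitude ever grows as the penalty increases; whether that actually happens depends on the correlation structure and the true coefficients:

```python
import numpy as np
from sklearn.linear_model import lasso_path

# Two strongly correlated predictors plus one independent one.
rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3 * x1 - 2 * x2 + x3 + rng.normal(scale=0.5, size=n)

alphas, coefs, _ = lasso_path(X, y, n_alphas=100)

# Reorder so the penalty increases, then flag any coefficient whose
# absolute value ever grows as alpha increases (a non-monotone path).
order = np.argsort(alphas)
grows_with_penalty = np.any(np.diff(np.abs(coefs[:, order]), axis=1) > 1e-12, axis=1)
print(grows_with_penalty)
```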