I have a regression problem and I am thinking of using ridge regression. One of the predictors is the subject's gender, which is a categorical variable. How should I handle this variable in a ridge regression model? Can it be encoded as $0$ and $1$? What should I do with a categorical variable that has more than $2$ categories?
2 Answers
You are correct to assume that a categorical variable is encoded as an indicator function/vector in your design matrix $X$; this is standard. In that respect, one of the levels is usually omitted and treated as the baseline (otherwise you would certainly have a rank-deficient design matrix when you include an intercept).
If you have a categorical variable with multiple categories, you once more treat it as an indicator function in your design matrix $X$. Only now you will not have a single vector but a small submatrix. Let's see an example in R:
set.seed(123)
N = 50; # Sample size
x = rep(1:5, each = 10) # Make a discrete variable with five levels
b = 2
a = 3 # Intercept
epsilon = rnorm(N, sd = 0.1)
y = a + x*b + epsilon; # Your dependent variable
xCat = as.factor(x) # Define a new categorical variable based on 'x'
lm0 = lm(y ~ xCat)
MM = model.matrix(lm0) # The model matrix you use
image(x = 1:ncol(MM), y = 1:N, z=t(MM)) # The matrix image used. / Red is zero
As you can see, levels 2, 3, 4 and 5 are encoded as separate indicator variables in columns 2 to 5. The column of ones in column 1 is your intercept; level 1 is automatically omitted as an individual column and is assumed active alongside the intercept.
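To see why one level has to be dropped when an intercept is present, here is a minimal sketch that compares matrix ranks. The names `x_cat`, `mm_baseline`, `mm_full` and `mm_redundant` are made up for illustration; only `model.matrix()` and `qr()` from base R are used.

```r
# Hypothetical five-level factor, mirroring xCat in the example above
x_cat <- factor(rep(1:5, each = 10))

# Default treatment coding: intercept + 4 indicator columns (level 1 is the baseline)
mm_baseline <- model.matrix(~ x_cat)

# All five indicator columns, no intercept
mm_full <- model.matrix(~ x_cat - 1)

# Intercept plus all five indicators: the indicators sum to the intercept column
mm_redundant <- cbind(Intercept = 1, mm_full)

rank_baseline  <- qr(mm_baseline)$rank   # equals the number of columns
rank_redundant <- qr(mm_redundant)$rank  # one less than the number of columns
```

The redundant matrix has six columns but the same rank as the five-column baseline version, which is exactly the rank deficiency mentioned above.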
OK, so what about the ridge parameter $\lambda$? Remember that ridge regression essentially uses a Tikhonov-regularized version of the covariance matrix of $X$, i.e. $\hat{\beta} = (X^TX + \lambda I)^{-1} (X^T y)$, to generate the estimates $\hat{\beta}$. It makes no difference whether you have discrete (categorical) or continuous variables in your $X$ matrix. The regularization takes place outside the actual variable definition and essentially "amps up" the variance along the diagonal of the matrix $X^TX$ (the matrix $X^TX$ can be thought of as a scaled version of the covariance matrix when the columns of $X$ are centred).
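As a rough sketch, the closed form can be applied directly to a model matrix that contains indicator columns. The helper `ridge_fit` and the choice `lambda = 5` are assumptions for illustration only. Note that this penalises the intercept too, exactly as the formula is written; many implementations leave the intercept unpenalised.

```r
set.seed(123)
N <- 50
x <- rep(1:5, each = 10)
y <- 3 + 2 * x + rnorm(N, sd = 0.1)
X <- model.matrix(~ factor(x))   # intercept + 4 indicator columns

# Hypothetical helper: ridge estimate from (X'X + lambda I)^{-1} X'y
ridge_fit <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

beta_ols   <- ridge_fit(X, y, lambda = 0)  # lambda = 0 recovers least squares
beta_ridge <- ridge_fit(X, y, lambda = 5)  # shrunken coefficients
```

With `lambda = 0` the result matches `coef(lm(y ~ factor(x)))`; any positive `lambda` shrinks the coefficient vector, and nothing in the computation cares that the columns are 0/1 indicators.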
Please note that, as seanv507 and amoeba correctly comment, when using ridge regression it might make sense to standardise all the variables beforehand. If you fail to do that, the effect of regularisation can vary substantially across variables. This is because inflating the observed variance of a particular variable $x$ by $\lambda$ can matter a great deal or hardly at all, depending on the original variance of $x$. This recent thread here shows such a case where regularization made a very observable difference.
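The scale dependence can be illustrated with a small simulated sketch. Everything here (the variables `sex` and `height_m`, the helper `ridge_fitted`, and `lambda = 10`) is hypothetical; the same predictor is simply expressed in two different units.

```r
set.seed(1)
n <- 100
sex      <- rbinom(n, 1, 0.5)        # hypothetical 0/1 indicator
height_m <- rnorm(n, 1.7, 0.1)       # hypothetical continuous predictor, in metres
y <- 1 + 0.5 * sex + 2 * height_m + rnorm(n, sd = 0.2)

# Hypothetical helper: fitted values from the closed-form ridge estimate
ridge_fitted <- function(X, y, lambda) {
  beta <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
  drop(X %*% beta)
}

X_m  <- cbind(1, sex, height_m)         # height in metres
X_cm <- cbind(1, sex, height_m * 100)   # same data, height in centimetres

# Least squares (lambda = 0) is invariant to the change of units; ridge is not
diff_ols   <- max(abs(ridge_fitted(X_m, y, 0)  - ridge_fitted(X_cm, y, 0)))
diff_ridge <- max(abs(ridge_fitted(X_m, y, 10) - ridge_fitted(X_cm, y, 10)))
```

Here `diff_ols` is zero up to rounding error, while `diff_ridge` is not: the same $\lambda$ shrinks the height effect much more strongly when height is measured in metres. Standardising the columns first removes this arbitrary dependence on units.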
Yes, you can. Your beta_ridge will be a number that will pop up when gender is 1 and won't have an effect when gender is 0. If you have more than two categories, in my experience, make them all binary: e.g. if you have apple, orange, pear, instead of coding them as 1, 2, 3, use is_apple = [0,1], is_orange = [0,1], is_pear = [0,1].
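In R, this kind of binary encoding can be produced with `model.matrix()`. The sketch below uses a made-up `fruit` vector; the column names are the ones R generates, not the `is_apple` style names used above.

```r
fruit <- factor(c("apple", "orange", "pear", "apple", "pear"))  # hypothetical data

# One 0/1 column per level (no intercept)
one_hot <- model.matrix(~ fruit - 1)
colnames(one_hot)        # "fruitapple" "fruitorange" "fruitpear"

# With an intercept, R drops the first level ("apple") as the baseline
with_baseline <- model.matrix(~ fruit)
colnames(with_baseline)  # "(Intercept)" "fruitorange" "fruitpear"
```

Each row of `one_hot` contains exactly one 1, so the three columns together add up to a column of ones.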
Thanks. So you mean to say that instead of using a single variable, make it to three variables which are orange (0/1), apple (0/1) and pear (0/1) – prashanth Dec 10 '15 at 11:59
@user3761534 If you want to use all the levels of your fruit variable, you will need to make sure you do not use an intercept. If you do have an intercept, you will need to omit one of them. Otherwise your design matrix will be rank-deficient. This is why sex is usually a single indicator and "pops up when gender is 1". – usεr11852 Dec 10 '15 at 12:08
@usεr11852 I'm not sure identifiability is an issue when we have 1 equation and n>p, as opposed to, say, a multinomial logit setting. If you could please provide a proof of that, you would save me a lot of time :). Another issue with using all the levels is that our betas will scale up and down based on the category value, in that pear might have a lower beta than apple even if their effects were equal. – maininformer Dec 10 '15 at 19:23
I do not comment on identifiability; I am talking about rank-deficiency of your design matrix in the case where you write out the full fruit variable and use an intercept. See the example in my answer. If you have "one equation", it means you already did what I describe, i.e. you did not use all the available levels. – usεr11852 Dec 10 '15 at 19:47

1) The model simply assumes that all the other variables are inactive. When level 2 is active, the $\beta$ coefficient is the effect of level 2 taking into account the effect of level 1. – usεr11852 Feb 25 '16 at 08:55