
I have a dataset with $10^3$ to $10^4$ observations, where each observation consists of a single (scalar) response variable $y$ and $N \approx 10^2$ explanatory variables $X_1, \ldots, X_N$. All the explanatory variables are categorical (say belonging to categories "A", "B", or "C"), so I represent each one of them with two dummy variables (i.e. category "A" is taken as the reference level, category "B" is represented by the vector $(1, 0)$, and "C" by $(0, 1)$). My objective is to fit and assess different linear models to predict $y$ given $(X_{i})_{1 \leq i \leq N}$.
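For concreteness, a minimal sketch of this encoding (the data and column names are hypothetical, assuming pandas):

```python
import pandas as pd

# Hypothetical data: each X_i takes values in {"A", "B", "C"}
df = pd.DataFrame({"X1": ["A", "B", "C", "B"],
                   "X2": ["C", "A", "A", "B"]})

# drop_first=True takes "A" as the reference level, so "B" -> (1, 0)
# and "C" -> (0, 1) for each variable, as described above.
X = pd.get_dummies(df, drop_first=True)
print(X.columns.tolist())  # ['X1_B', 'X1_C', 'X2_B', 'X2_C']
```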

Looking at the samples in my dataset, I can clearly see some significant group effects. For example, I do not see a strong correlation between $y$ and any of the explanatory variables $X_1$, $X_2$ and $X_3$ taken individually. However, when $X_1$, $X_2$ and $X_3$ vary together, their variation (as a group) seems to be strongly correlated with the variation of $y$. This suggests to me that $X_1$, $X_2$ and $X_3$ could be grouped into a new variable $Z_1$, and that the linear model could be built with respect to this new explanatory variable instead of the original $X_i$'s.

Now, my question is how to build the new grouped variables $Z_k$ in an automatic fashion. Ideally, I would like each group of variables to be as large as possible, and I would like the final linear model to be as sparse as possible. I have already looked at a few possibilities, but they are not completely suitable for my problem, so any suggestion would be very welcome.

1. PCA: PCA applied as a dimensionality reduction method to the predictors $(X_i)_{1 \leq i \leq N}$ does not appear suitable for my problem, for two reasons: (1) it does not take the response variable into account, and (2) it gives a linear combination of explanatory variables, whereas what I am looking for is a group of variables (roughly speaking, a linear combination with binary coefficients).

2. Canonical Correlation Analysis: CCA (Cross Decomposition in Scikit-Learn) does incorporate the response variable information, but similarly to PCA it builds a linear combination of explanatory variables, which is not what I am looking for.

3. Linear Discriminant Analysis: similarly to PCA and CCA, LDA produces a linear combination of explanatory variables, which is not exactly what I am looking for. In addition, to my understanding it is usually applied to classification problems.

4. Group Lasso: Group Lasso seems to match many of my requirements: it incorporates the response variable, and it builds a linear regression model based on groups of explanatory variables. The problem is that these groups need to be built manually, or known a priori, which is precisely what I am trying to automate. I could even reformulate the question as "How to automatically build the groups of variables to be used in Group Lasso?" (see the first sketch after this list).

5. Multi-factor Dimensionality Reduction: MDR seems promising, as it is able to build groups of explanatory variables while also incorporating the information of the response variable. However, this method does not seem as classical/well-known as the ones mentioned above, and it is not clear to me how the new grouped variable is built from the original explanatory variables (in particular, how to set an appropriate threshold and so on...). Any clarification or explanation regarding this method would be very welcome; my current reading of it is sketched below.
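To make point 4 concrete, here is a minimal proximal-gradient sketch of Group Lasso in plain NumPy (the data are hypothetical; the point is that `groups`, the partition of the predictors, must be supplied a priori, and that partition is exactly what I would like to discover automatically):

```python
import numpy as np

def group_lasso(X, y, groups, lam=1.0, n_iter=500):
    """ISTA for 0.5 * ||y - X b||^2 + lam * sum_g sqrt(|g|) * ||b_g||_2."""
    b = np.zeros(X.shape[1])
    t = 1.0 / np.linalg.norm(X, 2) ** 2          # step size = 1 / Lipschitz constant
    for _ in range(n_iter):
        b = b + t * X.T @ (y - X @ b)            # gradient step on the squared loss
        for g in groups:                         # block soft-thresholding per group
            norm = np.linalg.norm(b[g])
            if norm > 0:
                b[g] *= max(0.0, 1.0 - t * lam * np.sqrt(len(g)) / norm)
    return b

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6)).astype(float)   # toy dummy-encoded predictors
y = X[:, 0] + X[:, 1] + X[:, 2] + 0.1 * rng.standard_normal(200)
groups = [[0, 1, 2], [3, 4, 5]]                       # <-- must be known a priori
print(group_lasso(X, y, groups, lam=5.0).round(2))
```

And for point 5, a sketch of my current understanding of the MDR construction (this is my reading of the method, not an authoritative implementation; classic case-control MDR labels a cell "high risk" when its case/control ratio exceeds a threshold, and here I use the overall mean of $y$ as the analogous threshold for a continuous response):

```python
import pandas as pd

def mdr_variable(df, cols, y, threshold=None):
    """Pool all level combinations of df[cols] into a binary Z: a cell is
    "high" (Z = 1) if the mean response in that cell exceeds the threshold
    (by default the overall mean of y), and "low" (Z = 0) otherwise."""
    if threshold is None:
        threshold = y.mean()
    cell_means = y.groupby([df[c] for c in cols]).mean()
    high = cell_means[cell_means > threshold].index
    cells = pd.MultiIndex.from_frame(df[cols])
    return pd.Series(cells.isin(high).astype(int), index=df.index)

# Hypothetical usage: group X1 and X2 into a single binary Z
df = pd.DataFrame({"X1": list("ABCABCAB"), "X2": list("AABBCCAB")})
y = pd.Series([1.0, 0.2, 0.1, 1.2, 0.0, 0.3, 1.1, 1.3])
Z = mdr_variable(df, ["X1", "X2"], y)
```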

I am aware that the question is rather broad, but I am not looking for a clear and definitive solution. Instead, any suggestion or direction would be greatly appreciated.

  • what in the world is $O(10^3)$ – Alberto Oct 24 '22 at 21:50
  • @AlbertoSinigaglia sorry for the imprecision. I have a few datasets of interest with 2347, 4356, and 8971 samples, respectively. I hope that helps you answer the original question. – CharlelieLrt Oct 24 '22 at 23:07
  • @AlbertoSinigaglia https://en.wikipedia.org/wiki/Big_O_notation ... its usage in this context is inappropriate -- e.g. note - among other things - that it (i) relates to limiting behavior of functions (which we don't seem to have here) and (ii) abstracts away any scaling constants. So $O(10^4) = O(10^3)= O(1)$. ... nevertheless most people aware of the notation will probably make reasonable guesses at what the OP intended in place of what they wrote. – Glen_b Oct 25 '22 at 01:05
  • The OP was corrected to remove this inappropriate notation. – CharlelieLrt Oct 25 '22 at 01:35
  • I'm not quite sure what you mean by the "grouped variables" $Z_k$. You are not looking for ways to group the categories within each original $X$ variable, are you? (For instance, maybe $X_1$ has categories A, B, and C, but B and C are similarly predictive of $Y$ and fairly small, so you decide to merge them, ending up with a modified $X_1$ with only two categories: "A" and "BC".) I cannot quite tell, but if that is the case, take a look at "Delete or merge regressors for linear model selection," Maj-Kańska et al (2015) https://doi.org/10.1214/15-EJS1050 – civilstat Oct 25 '22 at 02:05
  • @civilstat that's a good point. What I am looking for is similar to what you described, except that it's not within each original $X_i$, but across multiple $X_i$. The MDR link gives a simple example of what I am trying to achieve. For example to group $X_1$ and $X_2$, one (non-unique) possibility would be to define a new categorical variable $Z$, which would be $Z=1$ iff $X_1=X_2=(0,1)$ and $Z=0$ in all other cases. There are many other ways to define a $Z$, and that's what I am looking for. – CharlelieLrt Oct 25 '22 at 03:30
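A minimal sketch of that particular construction (hypothetical arrays, with each $X_i$ stored as its two dummy columns):

```python
import numpy as np

# Hypothetical dummy-encoded variables: X1 and X2 are each stored as two
# binary columns (B, C), so category "C" corresponds to the row (0, 1).
X1 = np.array([[0, 1], [1, 0], [0, 1], [0, 0]])
X2 = np.array([[0, 1], [0, 1], [1, 0], [0, 1]])

# Z = 1 iff X1 = X2 = (0, 1), i.e. both variables are in category "C".
Z = ((X1 == [0, 1]).all(axis=1) & (X2 == [0, 1]).all(axis=1)).astype(int)
print(Z)  # [1 0 0 0]
```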
  • Actually I don't even need the grouped variable $Z$ to be binary. It could have more categories, as long as its value can be easily linked to the categorical values of the original variables $X_i$. What I am looking for here are grouped variables $Z_k$ that will allow me to build a "good" linear model. By "good" I mean:
    1. The model based on the $Z_k$'s is at least as good as the one based on the original $X_i$'s (which I can verify by cross-validation).
    2. Preferably, the model is parsimonious, in the sense that each $Z_k$ groups many variables together, and the model is sparse (like Lasso).
    – CharlelieLrt Oct 25 '22 at 03:47
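Criterion 1 is straightforward to check; a minimal sketch with scikit-learn (the design matrices `X_orig` and `X_grouped` and the data are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_orig = rng.integers(0, 2, size=(1000, 200)).astype(float)  # hypothetical dummies
y = X_orig[:, :3].sum(axis=1) + 0.1 * rng.standard_normal(1000)
# A candidate grouped variable Z built from the first three dummies
X_grouped = (X_orig[:, :3].sum(axis=1) > 1).astype(float)[:, None]

for X in (X_orig, X_grouped):
    print(cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean())
```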
  • @CharlelieLrt Is there any other reason why you want $Z$ to be categorical, besides tracking where it comes from initially? For example, if it were a continuous variable, with whatever name you want, would it be OK in your scenario? – J-J-J Oct 25 '22 at 07:22
  • @JJJ yes that would be OK in my scenario, as long as it can be easily related to the original observations $X_i$'s. – CharlelieLrt Oct 25 '22 at 16:13
  • What about just building a big regression tree? It's a greedy search for combinations of X's that are highly predictive of Y. If it turns out that some branches of the fitted tree only use a small number of X's, then group those X's together into a Z. And if you find a way to automate this, you could do it internally for each tree in a random forest to get a bigger set of possible Z's. – civilstat Oct 25 '22 at 20:37
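A minimal sketch of this suggestion with scikit-learn (hypothetical data; `export_text` makes it easy to read off which X's appear together in a branch):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 10)).astype(float)  # hypothetical dummy columns
y = 2.0 * X[:, 0] * X[:, 1] * (1 - X[:, 2]) + 0.1 * rng.standard_normal(2000)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
# Branches that reuse a small set of features are candidates for a grouped Z.
print(export_text(tree, feature_names=[f"X{i}" for i in range(10)]))
```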
  • @civilstat I like the idea, but I am not completely sure how to achieve that...

    Is there any guarantee that a regression tree will have such branches? I am not sure I understand how a tree could "group" a few interacting X's in a given branch.

    Following my reply above about the meaning of "grouped variable", how would you define $Z$ from the $X_i$'s present in a given branch of the tree? In particular, I am wondering how many categories $Z$ would have in this case.

    – CharlelieLrt Oct 25 '22 at 23:40
  • There's no guarantee it would have such branches. But let's say you did happen to get a branch with just (say) $X_1,X_2,X_3$ in it. Say the branch starts "If $X_1=0$, go left, else right." Then the right branch ends, but the left branch splits: "If $X_2 =1$, go left, else go right." Then the left branch of that node ends, but the right branch splits: "If $X_3=1$, go left, else go right." You could define an equivalent $Z$ which is "A" if $X_1\neq 0$; "B" if $X_1=0$ and $X_2=1$; "C" if $X_1=0$ and $X_2\neq 1$ and $X_3=1$; and "D" otherwise. This one $Z$ has 4 levels and replaces three $X_i$s. – civilstat Oct 26 '22 at 00:04
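A minimal sketch of that mapping (hypothetical binary arrays; `np.select` evaluates the conditions top to bottom, mirroring the walk down the tree):

```python
import numpy as np

rng = np.random.default_rng(0)
X1, X2, X3 = (rng.integers(0, 2, size=8) for _ in range(3))  # hypothetical

Z = np.select(
    [X1 != 0,                             # "A": right branch at the root
     (X1 == 0) & (X2 == 1),               # "B"
     (X1 == 0) & (X2 != 1) & (X3 == 1)],  # "C"
    ["A", "B", "C"],
    default="D")                          # "D": everything else
```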
  • That seems in the spirit of the MDR link you posted above. Again, no guarantee that it would work well; it's just one thing that came to mind. – civilstat Oct 26 '22 at 00:05
  • @civilstat okay I understand now. Yes, indeed that seems to be in the same vein as the MDR. I'll give it a try though, thanks for the suggestion! – CharlelieLrt Oct 26 '22 at 01:22
  • Another option to possibly look into could be denoising autoencoders. The idea is to train an autoencoder (i.e. the input is the feature data and the target output is also the feature data, but in the middle of the neural network there is usually a bottleneck: a lower-dimensional vector representation that forces the model to compress intelligently), while corrupting some of the inputs, e.g. by replacing some feature values with values taken from other records (the target that the model is meant to output is still the uncorrupted feature data). – Björn Oct 26 '22 at 08:07
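A minimal sketch of that training setup, using scikit-learn's `MLPRegressor` as a stand-in for a proper deep-learning framework (purely illustrative; the data, corruption rate, and bottleneck size are assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 20)).astype(float)  # hypothetical one-hot features

# Corrupt ~20% of the entries of each row with the corresponding entries
# of another, randomly chosen record ("swap noise").
X_noisy = X.copy()
mask = rng.random(X.shape) < 0.2
donors = X[rng.integers(0, len(X), size=len(X))]
X_noisy[mask] = donors[mask]

# Bottleneck of 5 units; the reconstruction target is the *clean* data.
ae = MLPRegressor(hidden_layer_sizes=(5,), max_iter=500).fit(X_noisy, X)
```

With a real framework one would read the bottleneck activations off as the compressed representation; `MLPRegressor` does not expose them directly, so this only illustrates the training scheme.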
  • @Björn thanks for the suggestion. However, I am mostly interested in linear models here. Also, I have seen auto-encoders applied to sets of continuous input features, but never to fully discrete/categorical features. If you have any reference for this, that would be welcome. – CharlelieLrt Oct 26 '22 at 16:11

1 Answer


I'm not sure I correctly understand the reasons why you ruled out PCA, but a possible approach would be to apply a dimension reduction technique to each group of variables, maybe including their interaction $Z$ if it makes sense, and then use the resulting dimensions of each group as features in your regression model.

As you mention categorical variables with more than two values, instead of PCA I'd suggest using multiple correspondence analysis (MCA) as the dimension reduction method, as it is designed specifically to work with categorical data.
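A minimal sketch, assuming the `prince` package for MCA (the data and component count are hypothetical, and exact API details may differ between versions):

```python
import pandas as pd
import prince  # pip install prince

# Hypothetical categorical predictors forming one group of variables
df = pd.DataFrame({"X1": list("ABCABC"), "X2": list("AABBCC"),
                   "X3": list("CCBBAA")})

mca = prince.MCA(n_components=2).fit(df)
coords = mca.transform(df)  # per-observation coordinates -> regression features
```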

Another approach would be to just model interactions, without using dimension reduction at all. However, modeling all possible interactions doesn't seem to be a good idea (I mention it because you say in a comment: "There are many other ways to define a $Z$, and that's what I am looking for"). In any case, with all possible interactions your dimensionality issue would still be there, and keeping only the interactions while omitting the main effects is probably not an appropriate way to solve this dimensionality problem.
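To see the blow-up concretely, a small sketch with scikit-learn (hypothetical sizes matching the question, i.e. roughly 200 dummy columns for $N \approx 10^2$ three-level variables):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.default_rng(0).integers(0, 2, size=(1000, 200))  # ~200 dummies

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_int = poly.fit_transform(X)
print(X_int.shape)  # (1000, 20100): 200 main effects + 19900 pairwise interactions
```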

Edit following your comment: including all possible interactions in your model and applying stepwise regression to it (in order to keep only the significant interactions) would run into the well-known problems associated with stepwise regression.

Unfortunately, as you mention, dimension reduction (MCA, PCA) could make your model lose a good part of its predictive power. In your place, I'd simply try to model interactions based on domain knowledge - but I guess this is a problem for you, as your initial question is precisely about not having to model these interactions "by hand". Of possible interest regarding automated model selection: Algorithms for automatic model selection.

J-J-J
  • The main reason why I ruled out PCA is that it does not directly incorporate the response variable $y$. If I apply PCA to the $X_i$'s (ignoring $y$), then the first component is the direction of greatest variation in the $X_i$'s, but there is no guarantee that this first component also corresponds to a significant variation in $y$, because PCA does not capture this "explanation-response" relationship. I am not familiar with MCA, but if I understand correctly it suffers from the same problem. I just wonder if there is a way to incorporate the response in these methods... – CharlelieLrt Oct 26 '22 at 01:38
  • @CharlelieLrt Indeed, including $y$ in the MCA would lead to data leakage. If you have an idea of the interaction $Z$ that has an effect on $y$ (e.g. $Z = X_1 \times X_2 \times X_3$, $Z = \max(X_1, X_2) \times X_3$, $Z = \min(X_1, X_2, X_3)$, etc.), then include $Z$ in the MCA along with $X_1, X_2, X_3$, to make the dimensions more likely to capture the effect. If you have no idea of how to model the interaction, it sounds like you're looking for something like including all possible interactions in your model and applying stepwise regression to it, which is probably problematic. – J-J-J Oct 26 '22 at 04:24
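A minimal sketch of constructing such candidate $Z$'s before the MCA (hypothetical data; the categories "A" < "B" < "C" are mapped to the integer codes 0, 1, 2 so that min/max and products are defined):

```python
import pandas as pd

df = pd.DataFrame({"X1": list("ABCABC"), "X2": list("AABBCC"),
                   "X3": list("CCBBAA")})
codes = df.apply(lambda c: c.astype("category").cat.codes)  # A->0, B->1, C->2

df["Z1"] = codes["X1"] * codes["X2"] * codes["X3"]
df["Z2"] = codes[["X1", "X2"]].max(axis=1) * codes["X3"]
df["Z3"] = codes[["X1", "X2", "X3"]].min(axis=1)
# df, with the candidate Z columns treated as categorical, then goes into the MCA
```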
  • Indeed, I do not have any idea of the possible interactions, and I agree that a brute-force approach including all possible interactions and using stepwise regression is probably not a good option. – CharlelieLrt Oct 26 '22 at 16:13