Finding a latent representation of a high-cardinality one-hot encoded variable

Question

I am working on a clustering project on a dataset that has some numerical variables, and one categorical variable with very high cardinality (~200 values). I was thinking if it is possible to create an embedding for that feature exclusively, after one-hot encoding (ohe) it. I was initially thinking of running an autoencoder on the 200 dummy features that result from the ohe, but then I thought that it may not make sense as they are all uncorrelated (mutually exclusive). What do you think about this?

On the same line, I think that applying PCA is likely wrong. What would you suggest to find a latent representation of that variable?

One other idea was: I may use the 200 dummy ohe columns to train a neural network for some downstream classification task, including an embedding layer, and then use that layer as low-dimensional representation... does it make any sense?

Thank you in advance!

You can use tf.keras.layers.Embedding: https://keras.io/api/layers/core_layers/embedding/ — Amin Shn, Jan 31 '23 at 22:08
@Sycorax I want to run a clustering algorithm, so I need to reduce the dimensionality of that ohe'ed categorical variable. — ockham_blade, Feb 01 '23 at 08:31
@AminShn you mean using the embedding layer in a surrogate classification model? — ockham_blade, Feb 01 '23 at 08:31
You asked how to embed a categorical variable without ohat, and I said that's possible using the Embedding layer which maps each category to a higher dimensional vector in the latent space. — Amin Shn, Feb 01 '23 at 08:41
This sounds like an XY problem. You want to do clustering but you’re concerned that a high cardinality category will harm the clustering method. — Sycorax, Feb 01 '23 at 12:15
@Sycorax I may haven't expressed my question clearly, but it is not an XY problem. I want to do clustering, but my question is how to embed the categorical variable on its own. — ockham_blade, Feb 01 '23 at 21:04
Your question suggests training a classifier, so I don't think it's accurate to say that the question is solely asking "how to embed the categorical variable on its own." A classifier requires labeled data to train the network, including the values of the embedding. If your question is actually "How do I represent a categorical variable with a large number of values?" then you should [edit] to ask that -- but there are already many, many duplicates about that topic, starting with the ones listed here. If you feel these aren't a good fit, the next step is to do a Search. — Sycorax, Feb 01 '23 at 22:54

Finding a latent representation of a high-cardinality one-hot encoded variable

0 Answers0