1

I am working on a clustering project on a dataset that has some numerical variables, and one categorical variable with very high cardinality (~200 values). I was thinking if it is possible to create an embedding for that feature exclusively, after one-hot encoding (ohe) it. I was initially thinking of running an autoencoder on the 200 dummy features that result from the ohe, but then I thought that it may not make sense as they are all uncorrelated (mutually exclusive). What do you think about this?

On the same line, I think that applying PCA is likely wrong. What would you suggest to find a latent representation of that variable?

One other idea was: I may use the 200 dummy ohe columns to train a neural network for some downstream classification task, including an embedding layer, and then use that layer as low-dimensional representation... does it make any sense?

Thank you in advance!

  • What is the purpose of obtaining the latent representation? – Sycorax Jan 31 '23 at 21:01
  • You can use tf.keras.layers.Embedding: https://keras.io/api/layers/core_layers/embedding/ – Amin Shn Jan 31 '23 at 22:08
  • @Sycorax I want to run a clustering algorithm, so I need to reduce the dimensionality of that ohe'ed categorical variable. – ockham_blade Feb 01 '23 at 08:31
  • @AminShn you mean using the embedding layer in a surrogate classification model? – ockham_blade Feb 01 '23 at 08:31
  • You asked how to embed a categorical variable without ohat, and I said that's possible using the Embedding layer which maps each category to a higher dimensional vector in the latent space. – Amin Shn Feb 01 '23 at 08:41
  • This sounds like an XY problem. You want to do clustering but you’re concerned that a high cardinality category will harm the clustering method. – Sycorax Feb 01 '23 at 12:15
  • @Sycorax I may haven't expressed my question clearly, but it is not an XY problem. I want to do clustering, but my question is how to embed the categorical variable on its own. – ockham_blade Feb 01 '23 at 21:04
  • Your question suggests training a classifier, so I don't think it's accurate to say that the question is solely asking "how to embed the categorical variable on its own." A classifier requires labeled data to train the network, including the values of the embedding. If your question is actually "How do I represent a categorical variable with a large number of values?" then you should [edit] to ask that -- but there are already many, many duplicates about that topic, starting with the ones listed here. If you feel these aren't a good fit, the next step is to do a Search. – Sycorax Feb 01 '23 at 22:54

0 Answers0