
I have a survey dataset with 200 columns (encoded as numbers) and am trying to reduce the number of dimensions. After applying PCA I can reduce the dimensionality, but each principal component (PC) explains only a small fraction of the variance: it takes 150 PCs to explain 85% of the variance, which doesn't really do me any favours in reducing the number of dimensions.
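
For context, here's roughly how I'm measuring this (a minimal sketch with scikit-learn; the file name and the loading step are placeholders for my actual data):

```python
# Sketch: cumulative explained variance of a PCA fit (scikit-learn).
# "survey.csv" stands in for the actual 200-column dataset.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = pd.read_csv("survey.csv")                 # 200 numeric columns
X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

pca = PCA().fit(X_scaled)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# How many PCs are needed for a given variance target?
print("PCs needed for 85% variance:", np.searchsorted(cumvar, 0.85) + 1)
print("Variance explained at 50 PCs:", cumvar[49])
```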

I can use fewer PCs and explain less of the variance, but I'd like to get the best trade-off.

I'm aiming for a maximum of 50 PCs, but those only explain 50% of the variance.

Are there any techniques commonly used to get around this problem? For example, could certain columns be making it harder for PCA to explain the variance, and if so, how do I find them? Or are there alternative dimensionality reduction techniques I could use?
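
To make the first part of that concrete: I imagine something like the following could flag columns that the first k PCs capture poorly (a sketch continuing from the snippet above; k and the 0.3 cutoff are arbitrary choices for illustration):

```python
# Sketch: flag columns with low "communality", i.e. columns that a
# rank-k PCA reconstruction captures poorly. Assumes X, X_scaled and
# pca from the previous snippet.
import numpy as np

k = 50
scores = pca.transform(X_scaled)[:, :k]
X_hat = scores @ pca.components_[:k, :]  # rank-k reconstruction

# Per-column fraction of variance captured by the first k PCs
# (columns are standardized, so each has unit variance).
per_col_r2 = 1 - ((X_scaled - X_hat) ** 2).mean(axis=0)

poorly_captured = np.where(per_col_r2 < 0.3)[0]
print("Columns with communality < 0.3:", X.columns[poorly_captured].tolist())
```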

Edit: The reason for the dimensionality reduction is to cluster the data, so I would like to have as much variance explained with as few PCs as possible.
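
Concretely, the downstream step I have in mind looks something like this (again continuing the sketch above; KMeans and the cluster count are just illustrative choices, not a recommendation):

```python
# Sketch of the downstream step: cluster on the first k PC scores.
from sklearn.cluster import KMeans

k = 50
scores = pca.transform(X_scaled)[:, :k]
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(scores)
```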

fx-85
  • Why do you need to do dimensionality reduction? – dipetkov Jun 30 '22 at 22:44
  • There are many techniques, most of them heuristic and only loosely justified by theory. Many of them begin by examining a "scree plot." What you might mean by "best trade-off," though, must depend on how you will be using the PCA results. Being more specific about your situation would help you get appropriate answers. – whuber Jun 30 '22 at 22:47
  • How did you approach your problem in the end? I am encountering the same problem (or at least, I think so, based on your description) so I am curious as to know if you came any further. – Camille Dec 05 '22 at 09:08
  • @Camille I ended up reducing the number of dimensions and just accepting the low explained variance. There's no real way of getting around it. There are various studies showing that a low explained variance can be justifiable depending on your problem. – fx-85 Dec 13 '22 at 14:31

0 Answers