Does dimensionality reduction using PCA or UMAP or others try to preserve the most important features of the data so you can see it in a 2D or 3D space?
-
Please don't use all caps in title of the question – Tim May 10 '20 at 14:30
-
Let us know if you have further questions or need more explanation. If this answer or any other one solved your issue, please mark it as accepted :) – Camille Gontier Dec 18 '20 at 10:05
3 Answers
Preserving the most important features of the data is indeed the point of dimensionality reduction. Simplifying data allows you to plot them nicely in a 2D or 3D space, but you also have other possible applications:
- PCA allows you to identify strongly correlated observations, and hence to reduce redundancy in the data;
- it saves computation time when running a learning algorithm on your data;
- you can also see it as a noise reduction technique.
For more details, I strongly recommend Andrew Ng's lecture notes on PCA, which are very concise and simple.
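As a minimal sketch of the 2D-visualization use case mentioned above (assuming scikit-learn and its bundled iris dataset, which are not part of the question itself):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the 4-dimensional iris data and project it down to 2 dimensions
X = load_iris().data          # shape (150, 4)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)   # shape (150, 2), ready for a scatter plot

# The explained variance ratio shows how much of the original
# variance each principal component preserves
print(X_2d.shape)             # (150, 2)
print(pca.explained_variance_ratio_)
```

For iris, the first two components preserve well over 90% of the variance, which is why the 2D scatter plot is still informative.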
Visualizing data in a 2D or 3D space is a good application of dimensionality reduction algorithms, but the algorithms have other advantages as well.
Intuitively, much real-world data contains "redundant" information, and people want to remove it to get a cleaner view of the data and build a simpler model. For example, a dataset may record a person's height in two different units, but both columns describe the same thing.
Using feature selection algorithms can greatly reduce computation and system complexity in later stages. A simpler system is easier and cheaper to build and maintain.
In general, people prefer simple systems and simple explanations that do similar work. For example, suppose I have two systems: one is a simple linear model, the other is a complicated neural network that needs a GPU for computation, and both reach ~80% "accuracy". Most people will choose the linear model.
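The redundancy idea above can be sketched with a toy example: the same quantity recorded in two units produces two perfectly correlated columns, and PCA collapses them onto a single direction (the data here is synthetic, generated just for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=200)
# The same quantity recorded in a different unit: purely redundant
height_in = height_cm / 2.54
X = np.column_stack([height_cm, height_in])

pca = PCA(n_components=2).fit(X)
# Essentially all of the variance lies along a single direction,
# so one component is enough to represent both columns
print(pca.explained_variance_ratio_)
```

The first ratio printed is (up to floating-point error) 1.0, confirming that one dimension carries all the information the two columns held together.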
Dimensionality reduction is used to reduce the number of dimensions of your data. This is achieved by transforming your data into a form that has a smaller dimension (fewer columns) but preserves some of the main characteristics of the data. This is different from feature selection, i.e. selecting some features (columns) while dropping the other columns from your data. In dimensionality reduction we aim to lose as little information as possible, so all the original features take part in creating the new features of reduced dimensionality.

You just need to remember that each algorithm extracts the "most important" information given some specific definition of what it means by most important. So different algorithms focus on different things, and some of the solutions may be more useful in some cases than others. In short, they don't need to "work" in all cases.
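To illustrate the contrast drawn above between feature selection and dimensionality reduction, here is a small sketch assuming scikit-learn (the particular selector, `SelectKBest` with an ANOVA F-test, is just one possible choice):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # X has 4 original features

# Feature selection: keeps 2 of the original columns unchanged
selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Dimensionality reduction: creates 2 *new* columns, each a
# weighted combination of all 4 original features
pca = PCA(n_components=2).fit(X)
reduced = pca.transform(X)

# Each row of components_ holds the weights of all 4 original
# features in one new feature
print(pca.components_.shape)  # (2, 4)
```

Both outputs have two columns, but the selected columns are a subset of the originals, while the PCA columns mix information from every original feature.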