
I have a rough understanding of the outline of PCA. Given $n$ samples of $m$-dimensional data: $\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n$, PCA aims to find an appropriate orthonormal basis, called the principal components $\vec{p}_1, \vec{p}_2, \dots, \vec{p}_m$, to re-express the original data.

I don't get how it is useful to express data in terms of principal components $\vec{p}_1, \vec{p}_2, \dots, \vec{p}_m$. What can we do with these newly projected data? I want to know the details of how data analysis would be carried out once principal components are found.

Here is one concrete example: Imagine we are given a large box of various kinds of apples and we need to find the best way to divide the apples into groups. Suppose we have $n=30$ apples. For each apple, we measure $m=5$ numerical quantities: how red it is, diameter, height, bitterness, and sweetness. All of these data will be encoded into $m$-dimensional vectors: $\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n$. Then we find the principal components of this data set, which will be $\vec{p}_1, \vec{p}_2, \dots, \vec{p}_m$. After rewriting the original data in terms of these principal components, how are we going to start classifying the $n$ apples? One possible way I can think of is to put apples whose $\vec{p}_1$ coefficient dominates the coefficients of the remaining components $\vec{p}_2, \dots, \vec{p}_m$ into one category, apples whose $\vec{p}_2$ coefficient dominates into another, and so on...?
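To make the question concrete, here is a minimal sketch of the setup being described, using NumPy and random numbers as a stand-in for real apple measurements (the data, the SVD-based computation of the components, and the "dominant coefficient" grouping rule are all illustrative assumptions, not an endorsed method):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the apple data: n=30 apples, m=5 measurements
# (redness, diameter, height, bitterness, sweetness) -- values are made up.
X = rng.normal(size=(30, 5))

# Center the data, then obtain the principal components from the SVD.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
# Rows of Vt are the principal components p_1, ..., p_m.

# Scores: each apple re-expressed as coefficients on the components.
scores = Xc @ Vt.T          # shape (30, 5)

# The grouping idea from the question: assign each apple to the
# component whose coefficient has the largest magnitude.
labels = np.argmax(np.abs(scores), axis=1)
```

Here `scores[i]` is $\vec{x}_i$ written in the new basis, so the question amounts to: what should be done with `scores` next?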

Jimmy Yang

1 Answer


When the number of variables is large relative to the number of observations, many problems can arise (overfitting, multicollinearity, unstable estimates). PCA reduces the number of variables while losing as little information as possible.

Why do this? One example is principal components regression. If you have (say) one dependent variable and 50 independent variables on 100 observations, attempts to fit a regression will overfit. You can take the PCs of the 50 variables and use the first few.
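A minimal sketch of principal components regression under the 100-observation, 50-predictor setup mentioned above (the data, the choice of keeping $k=3$ components, and the synthetic response are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup from the answer: 100 observations, 50 predictors.
n, p, k = 100, 50, 3        # keep only the first k principal components
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Principal component scores of the centered predictors.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:k].T           # n x k score matrix

# Ordinary least squares on the k scores instead of all 50 predictors.
beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), Z]), y, rcond=None)
# beta has k + 1 entries (intercept + k score coefficients) instead of 51.
```

The regression now estimates 4 coefficients instead of 51, which is what tames the overfitting.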

Another is for graphing. If we want to locate the observations in some sort of space, then visualizing a 2 or 3 dimensional space is much easier than one with many dimensions.
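A sketch of how that projection is computed, assuming made-up 10-dimensional data; the first two PC scores become the x/y coordinates of a scatter plot:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))      # 100 observations in 10 dimensions

Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords2d = Xc @ Vt[:2].T            # 2-D coordinates for plotting

# coords2d[:, 0] vs coords2d[:, 1] can be passed to e.g. matplotlib's scatter.
```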

Yet another might be finding multivariate outliers. This is a notoriously hard problem when there are a lot of dimensions. PCA might let you locate some potential outliers to look at more closely. See this thread.
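One simple (and deliberately crude) way to use PCA this way is to standardize the PC scores and flag observations that sit far from the bulk on any component; the planted outlier, the cutoff of 5, and the data below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))
X[0] += 10                          # plant one gross multivariate outlier

Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T

# Standardize each score column and flag rows far from the bulk.
z = scores / scores.std(axis=0)
flagged = np.where(np.abs(z).max(axis=1) > 5.0)[0]
```

The flagged rows are candidates to inspect more closely, not automatic deletions.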

There are surely other uses as well.

Peter Flom