
Given a non-linearly separable dataset $X,$ I want to prove that after performing PCA on it, the resulting dataset is guaranteed to still be non-linearly separable. I think we could argue that we still use the eigenvectors of the original feature space to describe the transformed feature space, and that if it wasn't possible to find a linear combination for a hyperplane, it is also not possible to do so in the truncated space, but I'm not exactly sure whether that's the right approach. A second approach of mine was to model PCA as a linear transformation altogether, using a matrix $J$ for the transformation, $$J(XW+b),$$ and thinking about the hyperplane in terms of $$XW+b>0$$ or $$XW+b<0,$$ like in SVM. Unfortunately I didn't get ahead with this approach either (I didn't know how to show that the sign for each datapoint stays the same after applying the $J$ matrix), so I would be interested in how to solve this.
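To make the second idea concrete, here is a minimal numerical sketch of what I mean by the signs being preserved, assuming the full PCA with all components kept (the variable names `W`, `w_orig`, etc. are just my own): any hyperplane in the PCA coordinates corresponds to a hyperplane in the original coordinates that puts every point on the same side, up to floating-point noise. So if no hyperplane separates the original data, none can separate the PCA scores either.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # any dataset of shape (N, D)
mu = X.mean(axis=0)

# Full PCA via the eigendecomposition of the covariance matrix
_, W = np.linalg.eigh(np.cov(X - mu, rowvar=False))   # columns = orthonormal principal directions
X_pca = (X - mu) @ W                                   # PCA scores

# An arbitrary hyperplane in the PCA coordinates ...
w, b = rng.normal(size=5), rng.normal()
side_after = np.sign(X_pca @ w + b)

# ... pulled back to the original coordinates: (X - mu) W w + b = X (W w) + (b - mu . W w)
w_orig = W @ w
b_orig = b - mu @ w_orig
side_before = np.sign(X @ w_orig + b_orig)

print(np.array_equal(side_before, side_after))         # True
```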

Proof by ChatGPT (not sure if correct):

[screenshot of a ChatGPT-generated proof]

  • You don't need a proof: all you need is the fact that PCA changes how the data are represented (as linear combinations of a basis), but it does not change their geometrical relationships and linear separability is a geometrical relationship. An analogy might help. Logic says that from the assumptions "all men are mortal" and "Socrates is a man" you can conclude "Socrates is mortal." What is the result, you ask, if we rename Socrates "Bert"? The answer doesn't change. Likewise, if there exists no separating hyperplane, there still doesn't exist one when you use a different coordinate system. – whuber Jan 21 '23 at 21:50
  • For me it also makes intuitive sense, but our professor demands a proof in the exam. That's why I was asking. I don't know if it's enough to say that linear transformations preserve relative properties :/ I attached a proof provided by ChatGPT which is basically what I wanted to do, saying that PCA is a linear transformation with XJ = X_tilde. Don't know if that is a valid way to describe PCA though, since usually it's described by its eigendecomposition. – Saltuk Kezer Jan 21 '23 at 22:05
  • Please explain to us how your chatbot can multiply a "datapoint with dimensionality D" by a "matrix with shape 1 x N"!! You can be absolutely sure that junk is not correct (and quite confident that it's pure BS). You will be better off reviewing the relevant definitions and constructing a valid, understandable argument that reflects the basic intuition. – whuber Jan 21 '23 at 22:12
  • ChatGPT is not even close to reliable: its "proofs" are parts of conversations in a style of proofs, but there is no reason to believe any of its steps are actually correct – Henry Jan 21 '23 at 22:33
  • If you use the full PCA, this is just a reversible linear transformation and so any hyperplane after the transformation corresponds to a hyperplane before the transformation. If you use a PCA truncated to the leading principal components, then the transformation is not reversible and the assertion is incorrect. – Henry Jan 21 '23 at 22:39
  • Because the X matrix is of shape (NxD). The equation WX+b with X being of shape (NxD) and W with shape (1xN) holds. We get a scalar value which is > 0 if on one side of the hyperplane and < 0 if on the other side, like in SVM. I mean I just used ChatGPT to provide me the proof; I already had the intuition that I want to construct a proof based on showing that if we perform the linear transformation, it must hold that points that were initially on different sides of an arbitrary hyperplane remain on their respective sides after the transformation. – Saltuk Kezer Jan 21 '23 at 22:58
  • @Henry Makes sense. If I choose J to be of shape (NxK), though, shouldn't that proof be valid? I'm just relying on the properties of linear transformations, basically, to formalize the intuition of preserving relative positions in the new coordinate system. – Saltuk Kezer Jan 21 '23 at 22:59
  • What makes you think ChatGPT provided you a proof as opposed to a wall of text derived by pattern matching? – jbowman Jan 21 '23 at 23:45
  • I’m voting to close this question because I don't think we should be vetting ChatGPT "proofs", although I do realize there's some public service in attempting to explain to posters that ChatGPT does not actually construct proofs except by accident. – jbowman Jan 21 '23 at 23:55
  • Perhaps we should create a canonical question "Are ChatGPT "proofs" really proofs?" that we can use to refer posters to? – jbowman Jan 21 '23 at 23:58
  • @jbowman, while I hold with your opinion, I think the present meta CV discussion caters to the case. – User1865345 Jan 22 '23 at 08:59

2 Answers


This formulation of the question is presented as a way to expose the relevant concepts of separability and PCA.

Separability

Suppose $S$ and $X$ are sets and that $X = Y\cup Z$ partitions $X;$ that is, $Y\cap Z = \emptyset;$ and let $\mathscr F$ be a set of $X$-valued functions defined on $S.$

Let us say that a pair $\{A,B\}$ where $A\subset S$ and $B\subset S$ is $\mathscr F$-separable when there exists $f\in\mathscr F$ for which either $f(A)\subset Y$ and $f(B)\subset Z$ or $f(A)\subset Z$ and $f(B)\subset Y.$

(This language assumes $X=Y\cup Z$ is understood in the context. Otherwise, we would use a more specific expression like "$(\mathscr F, Y\cup Z)$-separable.")

Linear separability

Let $S=\mathbb W$ be a real vector space. Take $\mathscr F$ to be the set of translates of linear forms. That is, $f\in\mathscr F$ means $f$ is a real-valued function on $\mathbb W$ and there exists a scalar $b$ and a linear form (aka covector) $\phi$ for which $f(w)=\phi(w)+b$ for all $w\in\mathbb W.$ Thus $X=\mathbb R.$ Taking $Y$ to be the positive numbers and $Z$ to be the non-positive numbers is how linear separability is often defined. If your definition differs, you will want to pause here to verify that it's equivalent to this one.

PCA

Given a probability distribution $\Pr$ on a vector space $\mathbb W$ (together with other information: namely, an inner product), PCA finds a basis of $\mathbb W$ determined by that distribution. That's all we need to know about PCA for this question.
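If it helps to see that "basis-finding" view concretely, here is a minimal numerical sketch (using the standard inner product on $\mathbb R^3$; the variable names are arbitrary): the PCA basis is an orthonormal basis of $\mathbb W$ obtained from the eigendecomposition of the empirical covariance, and "applying PCA" merely re-expresses each vector in that basis.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(size=(100, 3))                  # a sample from a distribution on W = R^3
centered = sample - sample.mean(axis=0)

# PCA basis: eigenvectors of the empirical covariance (w.r.t. the usual inner product)
_, basis = np.linalg.eigh(np.cov(centered, rowvar=False))

print(np.allclose(basis.T @ basis, np.eye(3)))      # True: the columns form an orthonormal basis
scores = centered @ basis                           # coordinates of each point in that basis
```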

The question

Given samples $\mathscr A$ and $\mathscr B$ of a real vector space $\mathbb W,$ let $\Pr$ be the empirical distribution of the combined sample. (That is, it assigns equal weight to every vector in the combined sample. Such a distribution always exists provided the combined sample is not empty.) Let $A\subset \mathbb W$ be the support of $\mathscr A$ (the set of distinct values in the sample) and $B\subset \mathbb W$ be the support of $\mathscr B.$

Prove that if $\{A,B\}$ is not linearly separable and you apply PCA to the combined sample $(\mathscr A,\mathscr B),$ then $\{A,B\}$ is still not linearly separable.

There is nothing to show.

The reason is that the linearity of a form $\phi$ does not depend on choosing a basis for a vector space. How is this standard fact proven? By showing that (1) a change of basis effects a linear transformation of coordinates and (2) composing such a linear transformation with a linear function is still a linear function. If you feel you need to demonstrate this to satisfy your professor, that would be fine.
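Spelled out (writing $T$ for the change-of-coordinates map and $\psi$ for a covector in the new coordinates; these symbols are introduced only for this display), the computation is one line:

$$f'(v)=\psi(v)+b\ \text{ separates }\ T(A)\ \text{and}\ T(B)\quad\Longrightarrow\quad f(w):=f'(Tw)=(\psi\circ T)(w)+b\ \text{ separates }\ A\ \text{and}\ B,$$

and $f$ belongs to $\mathscr F$ because $\psi\circ T$ is linear (a composition of linear maps). The contrapositive is the claim: if no member of $\mathscr F$ separates $\{A,B\}$ in the original coordinates, then none separates it in the PCA coordinates either.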


You might be working with a slightly different concept of PCA in which only the first few principal vectors are retained. By using only a few of the principal vectors you are limiting the set of linear forms in $\mathscr F$ you are willing to use after applying PCA (you have thrown away all those defined on the unused principal vectors and vanishing on the selected principal vectors). That only makes it harder to achieve separability after applying PCA.
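A numerical illustration of that last point, as a sketch only (truncated PCA computed by hand with numpy; `W_k`, `w_orig`, etc. are made-up names): every hyperplane available in the truncated coordinates is already an affine function of the original data, so it corresponds to a hyperplane that was available before PCA and induces exactly the same partition of the points. Truncation can therefore only shrink the set of separating hyperplanes on offer; it cannot create one.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
mu = X.mean(axis=0)
_, W = np.linalg.eigh(np.cov(X - mu, rowvar=False))
W_k = W[:, -2:]                          # keep only the two leading principal directions
X_trunc = (X - mu) @ W_k                 # truncated PCA scores, shape (200, 2)

# A hyperplane in the truncated coordinates ...
w, b = rng.normal(size=2), rng.normal()
side_after = np.sign(X_trunc @ w + b)

# ... is already an affine function of the original data:
w_orig = W_k @ w                         # shape (5,)
b_orig = b - mu @ w_orig
side_before = np.sign(X @ w_orig + b_orig)

print(np.array_equal(side_before, side_after))   # True
```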

whuber
  • I remember you once argued that we can have principal components beyond the dimensions of the original matrix. In that case PCA might make the data separable? – Sextus Empiricus Jan 29 '23 at 14:20
  • @Sextus It would depend on how you mapped the data to the additional PCs. As I recall, that argument was merely that we can always extend a partial basis to a basis. In that case, the loadings on the additional PCs would all be zero and nothing would change about separability. The argument here shows you would need a nonlinear map from the data to the new components to achieve separability -- and that is exactly the idea behind the kernel trick etc. – whuber Jan 29 '23 at 14:46

PCA is a linear transformation (Linearity of PCA), and a composition of linear transformations (effectively the same as a product of matrices representing those linear transformations) is itself a linear transformation.

Let

  • $x$ be your original data
  • and $y = Ax$ be the data after some PCA transformation.

If

  • $z = By$ is a linear transformation of that data $y$ which separates the data,

Then

  • $z = BAx,$
  • and thus there is also a linear transformation, the composition $B \circ A,$ that separates the data $x.$

The above lines prove that if the PCA-transformed data is linearly separable, then the original (non-PCA-transformed) data is also linearly separable.

By contraposition we also have that

if the original (non-PCA-transformed) data is not linearly separable, then the PCA-transformed data will also not be linearly separable.
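A quick numerical check of the composition step (just a sketch; centering is omitted here, exactly as in the argument above, and all names are arbitrary): applying $B$ after the PCA map $A$ gives the same values, and hence the same signs, as applying the single linear map $BA$ directly to the original data.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(5, 100))             # data as columns, to match y = A x above
A = np.linalg.eigh(np.cov(x))[1].T        # a full PCA rotation (rows = principal directions)
y = A @ x                                 # the PCA-transformed data

B = rng.normal(size=(1, 5))               # any linear map whose sign is meant to separate y
z_via_y = B @ y                           # z = B y
z_direct = (B @ A) @ x                    # z = (B A) x

# Identical values, hence identical signs: whatever B separates in PCA space,
# the composed linear map B A separates in the original space.
print(np.allclose(z_via_y, z_direct))     # True
```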