Toy example dataset for testing PCA implementation

Question

I want to check my implementation of dimensionality reduction with PCA, so I'm looking for a test case. I have found other implementations on the web as well, so I will be comparing with those too.

Can anyone give me a test case? Suppose I have an $N \times D_1$ data matrix and want to keep $D_2$ components after PCA, say $N = 4$, $D_1 = 5$ and $D_2 = 3$. In other words, I have 4 samples with 5 features each, and I want to run PCA and keep 3 components. Given that PCA does not have any randomness, a dataset should give the same output in different implementations of PCA, right?

If anyone is interested, what I'm doing (in MATLAB) is:

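[COEFF, SCORE, latent] = princomp(traindata);   % fit PCA on the training data
D2 = find(cumsum(latent) ./ sum(latent) > 0.9, 1);   % fewest components explaining > 90% of the variance; or simply 3 for this case
reduced_testdata = bsxfun(@minus, testdata, mean(traindata)) * COEFF(:, 1:D2);   % center with the training mean, keep the first D2 components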

P.S. Examples for other dimensionality reduction methods (like LDA and CCA) are also very welcome, as they can help me and other users check our code as well.

You could take iris data and follow this example. Search this site for iris data to see if anybody has shown LDA, CCA with it. — ttnphns, Jun 25 '16 at 20:53
Ha... I just recalled one of those was me, here (LDA): http://stats.stackexchange.com/q/82497/3277. Other guys probably also created examples. — ttnphns, Jun 25 '16 at 20:56
@ttnphns it must feel good to cite yourself :) and these will definitely be useful, thanks! — jeff, Jun 25 '16 at 21:01

Franck Dernoncourt · Accepted Answer · 2016-06-25T21:42:29.860

Since you are using MATLAB, you can use the hald sample dataset:

% From http://www.mathworks.com/help/stats/princomp.html
load hald;   % provides the 13-by-4 matrix "ingredients"
[pc, score, latent, tsquare] = princomp(ingredients);   % pc: coefficients, latent: component variances

or cities, amongst others.

"Given that PCA does not have any randomness, a dataset should give the same output in different implementations of PCA, right?"

Right, but there could be differences in rounding and in some corner cases, and some functions might not return the principal components sorted in descending order of variance.
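To make this concrete, here is a minimal sketch that runs PCA by two routes on a small made-up matrix with the sizes from the question (N = 4, D1 = 5, D2 = 3) and compares the results. The data values and all variable names are hypothetical, chosen only for illustration, and the code uses only base functions (svd, eig, cov) rather than princomp.

```matlab
% Hypothetical toy data: N = 4 samples (rows), D1 = 5 features (columns).
X = [2 0 1 3 5;
     1 1 0 2 4;
     4 2 3 1 0;
     3 5 2 0 1];
D2 = 3;                                        % number of components to keep

Xc = bsxfun(@minus, X, mean(X, 1));            % center each feature

% Route 1: SVD of the centered data (singular values come out descending).
[~, S, V] = svd(Xc, 'econ');
var_svd = diag(S).^2 / (size(X, 1) - 1);       % component variances

% Route 2: eigendecomposition of the covariance matrix.
C = cov(X);
[W, L] = eig((C + C') / 2);                    % symmetrize against round-off
[var_eig, order] = sort(diag(L), 'descend');   % eig does not sort descending
W = W(:, order);

% Compare the variances and the directions of the kept components.
k = numel(var_svd);
var_diff = max(abs(var_svd(:) - var_eig(1:k)));     % should be round-off only
dir_match = abs(diag(V(:, 1:D2)' * W(:, 1:D2)));    % 1 means same axis up to sign

scores = Xc * V(:, 1:D2);                      % N x D2 reduced data
```

With only 4 samples, centering leaves at most 3 directions with nonzero variance, so the fourth entry of var_svd is zero up to rounding and D2 = 3 already retains all of the variance. When comparing two implementations, compare var_diff against a small tolerance and check dir_match rather than testing the coefficient matrices for exact equality.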

edited Jun 25 '16 at 21:42

answered Jun 25 '16 at 20:54

The sign ("+" or "-") of a factor's loadings, taken as a whole, is arbitrary and sometimes differs between implementations. For instance, for my own purposes I normalized it so that "all factor loadings for the first item are positive". But other conventions exist, for instance "the loading with the highest absolute value in each factor gets the '+' sign, and the others are adapted accordingly", or the direction of each factor is simply left arbitrary, dependent on the numerical output of some part of the program. – Gottfried Helms Jul 02 '16 at 09:00
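One way to compare outputs despite this sign ambiguity is to apply the same sign convention to both results before comparing them. The sketch below uses the "largest absolute loading gets the '+' sign" rule; the two matrices A and B are hypothetical stand-ins for the coefficient matrices returned by two implementations, and the variable names are made up for illustration.

```matlab
% Hypothetical loading matrices describing the same axes; B has its second
% column flipped, as a different implementation might return it.
A = [0.6  0.8;
     0.8 -0.6];
B = [0.6 -0.8;
     0.8  0.6];

mats = {A, B};
for i = 1:numel(mats)
    V = mats{i};
    [~, idx] = max(abs(V), [], 1);               % row of the largest |loading| per column
    lin = sub2ind(size(V), idx, 1:size(V, 2));   % linear indices of those entries
    s = sign(V(lin));                            % one +1 or -1 per column
    mats{i} = bsxfun(@times, V, s);              % flip whole columns where needed
end

% After normalization the two results can be compared entry by entry.
same_axes = max(max(abs(mats{1} - mats{2}))) < 1e-12;
```

If the scores are compared as well, each score column has to be multiplied by the same sign as its coefficient column. The rule is not well defined when two loadings in a column tie for the largest absolute value, so a different convention may be needed in that case.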
