
I have read probably too many things and maybe confused myself along the way :)

Anyway, let's say I have a data set that I have performed PCA on.
Let's say it is a spectral dataset of 100 observations x 3000 variables. After PCA I keep maybe 3-4 components, so I now have a 100 x 4 matrix of scores (I guess there is also the spectral residuals matrix, which could be used to further enhance outlier detection, but I'm not looking to do that now).

Okay, so let's say I now want to plot those scores, PC1 vs PC2, and draw a Hotelling's 95% confidence ellipse around the data. Since the data is already rotated, I don't need to calculate the eigenvectors (i.e. I don't need to rotate an ellipse to account for some correlation in the scores plot); I have already calculated them.

Okay, so I start with this equation $T^2 = (\bar x - x_i)S^{-1}(\bar x - x_i)^t$

When I implement and test this, it seems to calculate something useful: a value that looks like a Mahalanobis-distance-type value. When I add outliers to my dataset, their values stand out clearly. When I vectorize the above (i.e. calculate $T^2$ for the entire data set), it appears I need to take the diagonal of the resulting matrix to get the relevant values.
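For what it's worth, the "take the diagonal" step can be sketched as below. This is a minimal example on made-up data: the `scores` array and the helper name `t2_per_observation` are hypothetical, and it assumes $S$ is the sample covariance of the scores. `np.einsum` evaluates only the diagonal terms, so the full n x n matrix never has to be formed.

```python
import numpy as np

def t2_per_observation(scores):
    """Return T^2 for each row of `scores` (n observations x k components)."""
    scores = np.asarray(scores, dtype=float)
    centered = scores - scores.mean(axis=0)
    # Sample covariance of the columns (k x k); rowvar=False treats columns as variables
    cov = np.atleast_2d(np.cov(scores, rowvar=False))
    cov_inv = np.linalg.inv(cov)
    # Row-wise quadratic form d_i S^-1 d_i^T, i.e. the diagonal of D S^-1 D^T
    return np.einsum('ij,jk,ik->i', centered, cov_inv, centered)

rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 2)) * [3.0, 1.0]   # made-up PC1/PC2 scores
t2 = t2_per_observation(scores)

# Compare against the diagonal of the full n x n matrix product
centered = scores - scores.mean(axis=0)
full = centered @ np.linalg.inv(np.cov(scores, rowvar=False)) @ centered.T
print(np.allclose(t2, np.diag(full)))
```

With the sample covariance, the $T^2$ values sum to $(n-1)k$, which is a handy sanity check on any implementation.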

Alright, but from here I'm not quite sure where to go. I have read several treatments and I'm not strong in statistics. There are references to doing something like this:

$F = \frac {n-k}{k(n-1)}T^2 \approx F_{k,n-k,1-\alpha}$

So I think in this case n is the number of observations (100) and k is the number of variables (the number of components, or 2 for the plot?). From here, I'm not sure how to evaluate $F_{k,n-k,1-\alpha}$ for the 95% confidence ellipse, i.e. $1-\alpha = 0.95$. A lot of places say to look it up in a table, but that is not an option: I need to be able to calculate this for any n and k.
Also, I'm not sure what I would use for $T^2$ in this case, as that looks like a relative distance between two points.
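A table is not needed if an inverse CDF is available. As a rough sketch, `scipy.stats.f.ppf` returns the F quantile for any degrees of freedom, and the equation above can be rearranged to give a limit on $T^2$. The helper name `t2_limit` is made up for illustration, and the sketch assumes k is the number of components entering $T^2$ (2 for a PC1 vs PC2 plot). Note that `ppf` takes the cumulative probability, so 0.95 is passed directly.

```python
from scipy import stats

def t2_limit(n, k, confidence=0.95):
    """T^2 limit from F = (n - k) / (k (n - 1)) * T^2, solved for T^2."""
    f_crit = stats.f.ppf(confidence, k, n - k)   # upper F quantile, dof (k, n - k)
    return k * (n - 1) / (n - k) * f_crit

for n in (20, 100, 1000):
    print(n, round(t2_limit(n, k=2), 3))
```

Under this scaling, the limit for large n settles toward the chi-squared quantile with k degrees of freedom.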

When I try to look more into the F values, there seem to be two, one of them called Fcrit, and they are typically calculated using built-in functions in MATLAB or Excel, but I'm trying to understand how to actually calculate them. I get confused when I read about the F distribution, as it seems to be the ratio of variance/DoF for two data sets (each following a chi-squared distribution).
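On how the value is actually calculated: an F quantile can be obtained from the inverse of the regularized incomplete beta function, because if $X \sim F(d_1, d_2)$ then $d_1 X / (d_1 X + d_2)$ follows a Beta$(d_1/2, d_2/2)$ distribution. The sketch below uses `scipy.special.betaincinv` for that inverse and compares the result with `scipy.stats.f.ppf`; both helper names are made up here. When the numerator degrees of freedom equal 2 (the two-component plot), the inverse has a closed form and no special functions are needed.

```python
from scipy import special, stats

def f_quantile_via_beta(prob, d1, d2):
    """F quantile from the inverse regularized incomplete beta function."""
    b = special.betaincinv(d1 / 2.0, d2 / 2.0, prob)   # Beta(d1/2, d2/2) quantile
    return d2 * b / (d1 * (1.0 - b))

def f_quantile_two_numerator_dof(prob, d2):
    """Closed form, valid only when the numerator dof is 2."""
    return d2 / 2.0 * ((1.0 - prob) ** (-2.0 / d2) - 1.0)

n, k = 100, 2
print(f_quantile_via_beta(0.95, k, n - k))
print(f_quantile_two_numerator_dof(0.95, n - k))
print(stats.f.ppf(0.95, k, n - k))
```

All three lines should agree to within floating-point error. Writing the inverse beta function from scratch is the hard part, which is why a library routine is usually the safer choice.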

I also found this information elsewhere:

$T^2 = \sum_{i=1}^{k} \frac {t^2_i}{s_i} $

Except in this case, I'm not sure what $t_i^2$ is. I've seen it described as the value along each axis, so my ellipse equation is then simply:

$T^2 = \frac {x^2}{s_{pc1}} + \frac{y^2}{s_{pc2}}$ and then $T^2$ might be solvable based on my earlier efforts?

Does anyone know how to take this information and create an ellipse like the one below, where the lengths of the axes are based on the Hotelling's 95% confidence limit?

Thanks in advance!

I hope this isn't confusing or entirely off base.

EDIT: Following the comment and the linked answer, I decided against calculating the F value myself and am using NMath FDistribution for the project (testing with scipy.stats.f.ppf in Python).

If I understand it correctly, which is unlikely, my solution was this in python.

import scipy.stats as ss
import numpy as np
import matplotlib.pyplot as plt

def hotellings(scores):
    a = 0.95
    n = len(scores)      # number of observations
    p = len(scores[0])   # number of components (columns of the scores matrix)

    # scipy.stats.f.ppf
    # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f.html
    f = ss.f.ppf(a, p, n - p)

    sx = np.std(scores[:, 0])
    sy = np.std(scores[:, 1])

    # Points on the ellipse, axes scaled by standard deviation times F
    m = [np.pi * x / 100.0 for x in range(200)]
    cx = np.cos(m) * sx * f
    cy = np.sin(m) * sy * f

    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.scatter(scores[:-2, 0], scores[:-2, 1])
    plt.scatter(scores[-1, 0], scores[-1, 1], label='Strange Outlier')
    plt.scatter(scores[-2, 0], scores[-2, 1], label='Extrapolated Outlier')
    plt.plot(cx, cy, '--', color='#a8005c', label='T95')
    plt.legend()
    plt.show()

    return cx, cy

My solution?

Not sure if I understand it correctly, but I basically just get the F value with the appropriate DoF based on observations/components. From here, I think I am just solving the inequality

$F_{\alpha,p,n-p} < (\frac{pc_1}{s_1})^2 + (\frac{pc_2}{s_2})^2 $

Thus, my parameters for the ellipse become the product of the axis standard deviation and the F critical value (F @ 95% confidence for the particular scores matrix DoF)...

$a = F*s_1$

$b = F*s_2$
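As a cross-check on the axis lengths, here is a hedged sketch on simulated scores (every name in it is hypothetical). It reads the boundary as $(pc_1/s_1)^2 + (pc_2/s_2)^2 = c$, an ellipse whose semi-axes are $\sqrt{c}\,s_1$ and $\sqrt{c}\,s_2$, and it takes $c$ to be the $T^2$ limit obtained from the $F$ relation quoted earlier with k = 2. Other sources scale $c$ differently, so treat that constant as an assumption rather than the definitive choice.

```python
import numpy as np
from scipy import stats

def t2_ellipse(scores_2d, confidence=0.95, n_points=200):
    """Boundary of an axis-aligned T^2 ellipse for two score columns."""
    scores_2d = np.asarray(scores_2d, dtype=float)
    n, k = scores_2d.shape                      # k is 2 for a PC1 vs PC2 plot
    # Assumed limit: T^2 = k (n - 1) / (n - k) * F quantile
    c = k * (n - 1) / (n - k) * stats.f.ppf(confidence, k, n - k)
    center = scores_2d.mean(axis=0)
    s = scores_2d.std(axis=0, ddof=1)           # sample standard deviations
    semi_axes = np.sqrt(c) * s
    theta = np.linspace(0.0, 2.0 * np.pi, n_points)
    boundary = center + semi_axes * np.column_stack([np.cos(theta), np.sin(theta)])
    return boundary, center, semi_axes

rng = np.random.default_rng(1)
scores = rng.normal(size=(100, 2)) * [5.0, 2.0]     # simulated, uncorrelated scores
boundary, center, semi_axes = t2_ellipse(scores)

# Fraction of points with ((x - cx)/a)^2 + ((y - cy)/b)^2 <= 1
inside = (((scores - center) / semi_axes) ** 2).sum(axis=1) <= 1.0
print(semi_axes, inside.mean())
```

For roughly normal scores, the fraction inside should land near the chosen confidence level, which gives a quick way to tell whether an axis scaling is plausible.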

[Figure: example Hotelling's T^2 ellipse]

Chemistpp
  • If you need to compute percentage points of the F ratio distribution, then you need an accurate implementation of the inverse Beta integral or its equivalent. See https://stats.stackexchange.com/questions/52341 for some remarks and references. – whuber Jun 03 '22 at 15:39
  • so basically, I should probably avoid trying to calculate it if I can xD Curious if I can just use a chi distribution and have a lot less cases to cover.. I'm not entirely sure why the F distribution is used in this case – Chemistpp Jun 03 '22 at 16:14
