Confusion on calculating Mahalanobis distance between a point and a cluster

Question

I am slightly confused about how to calculate the Mahalanobis distance for a set of data. I have tried asking my tutor for help, but he does not seem interested in helping whatsoever and I am continuously insulted. I thought I would turn to the community for help.

I have a set of data here and I have performed distance calculation once using Euclidean distance to group the data. Now I am looking to calculate distance using Mahalanobis distance. I have calculated the means and also calculated a Pooled covariance matrix. I am unsure as to what I need to do from here to begin calculating distances for each point.

I think what I need to do is take a point and subtract the mean values. I then calculate a pooled covariance matrix for each group and use this to calculate the distance between the point and the cluster's data distribution. Whichever cluster yields the smallest distance is the cluster the point belongs to.

Data clustered into 3 clusters after performing Euclidean distance to place points into initial groups

Pooled covariance matrix: $$\begin{bmatrix}1.394&1.702\\1.702&6.62\end{bmatrix}$$

Inverse pooled covariance matrix: $$\begin{bmatrix}1.046&-0.269\\-0.269&0.221\end{bmatrix}$$

Mahalanobis Formula
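
Assuming the formula referred to here is the standard definition, the squared Mahalanobis distance from a point $x$ to a cluster with mean $\mu_k$ is

$$D^2(x, \mu_k) = (x - \mu_k)^\top S^{-1} (x - \mu_k)$$

where $S$ is whichever covariance matrix is being used (a pooled estimate shared by all clusters, or a per-cluster estimate), and $S^{-1}$ is its inverse. The symbols $x$, $\mu_k$ and $S$ are notation chosen for this sketch.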

Pooled covariance matrix for each cluster

Cluster 1: $$\begin{bmatrix}0.873&-0.234\\-0.234&0.158\end{bmatrix}$$

Cluster 2: $$\begin{bmatrix}6.060&-3.030\\-3.030&6.060\end{bmatrix}$$

Cluster 3: $$\begin{bmatrix}1.189&-0.573\\-0.573&0.722\end{bmatrix}$$

Calculating distance for point (1,1) and Cluster 1 distribution

Since cluster1 distribution has a smaller distance compared to cluster2, this point will belong to cluster1.

Welcome to Cross Validated! You have the equation for Mahalanobis distance, right? Where are you getting stuck with it? – Dave Mar 22 '22 at 21:11
Most of the answers at https://stats.stackexchange.com/questions/62092 give formulas. – whuber Mar 22 '22 at 21:34
@Dave Thank you Dave! I have added the formula to the post (sorry for not including this before). I'm getting confused about whether I need to calculate an inverse covariance matrix for each cluster and then use this to determine the distance for a point. Whichever distance among the clusters is the smallest, I assign the point to that cluster. Is this correct? – ASH Mar 23 '22 at 06:40
How did you calculate that 1.394, 1.702, 1.702, 6.62 matrix? – Dave Mar 23 '22 at 08:51
@Dave I think I might have done it wrong. I followed an incredibly confusing example by my tutor where he calculated one pooled covariance matrix across all groups rather than one per group. I have updated the post to show the pooled covariance matrices I have calculated for each group. – ASH Mar 23 '22 at 09:00
How, and why, do you combine those three matrices to get the 1.394 matrix? – Dave Mar 23 '22 at 09:05
@Dave Calculating the combined matrix which produces 1.394 is what my tutor demonstrated. I am not sure why. This is where most of my confusion came from, as it did not make sense. We need to find the distance between a point and a cluster's distribution, which is why I think we would need 3 separate inverse covariance matrices, which is what I am now doing. Sorry, this might be confusing, but I don't blame you since this is how my tutor horribly explained it. He calculated a combined matrix for all groups, which would then describe the distribution of all the data and not the individual groups of data. – ASH Mar 23 '22 at 09:27

score 1 · Accepted Answer · answered Mar 23 '22 at 09:36

The idea of a pooled covariance matrix comes from the following argument.

Each group can have its sample covariance matrix calculated.
However, we believe that the groups all have the same population covariance and only differ in their means.
In order to get the tightest estimate that we can about the one covariance matrix shared by all three groups, we pool the sample covariance matrices for each individual group.
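
The pooling step can be sketched in code. This is a minimal illustration, assuming the usual degrees-of-freedom weighting (each group's sample covariance weighted by n_i - 1, divided by N - k); the function name `pooled_covariance` and the data are made up for the example and are not the data from the question.

```python
import numpy as np

def pooled_covariance(groups):
    """Pool per-group sample covariances, weighting each by (n_i - 1)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    # rowvar=False: rows are observations, columns are variables.
    # np.cov uses the unbiased (n - 1) denominator by default.
    weighted = sum((len(g) - 1) * np.cov(g, rowvar=False) for g in groups)
    return weighted / (n_total - k)

# Made-up 2-D data for three groups (NOT the data from the question)
groups = [
    np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]]),
    np.array([[6.0, 5.0], [7.0, 7.0], [8.0, 6.0]]),
    np.array([[3.0, 9.0], [4.0, 10.0], [5.0, 9.5], [4.0, 8.5]]),
]

S_pooled = pooled_covariance(groups)
print(S_pooled)
```

Each group is centered on its own mean before its spread is measured, so the pooled matrix describes within-group scatter only; differences between the group means do not inflate it.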

If you’re thinking that the groups might not have the same population covariance matrix, you’re right. However, your assignment seems to be assuming one population covariance matrix that is estimated using pooling of the sample covariance matrix from each group.

It’s possible that your calculation of the 1.394 matrix is incorrect, though the idea of having one population covariance matrix for all three groups is the key. Then it makes sense why you would use just the one covariance matrix in determining the Mahalanobis distance from each group, since you believe that to be the best estimate of the covariance matrix for all three groups (and, therefore, each individual group).
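
As a rough illustration of using one pooled matrix with three different means: the sketch below takes the pooled matrix quoted in the question (which, as noted, may itself be miscalculated) and hypothetical group means, since the question does not list them. The helper name `mahalanobis_sq` is also just a label chosen for this example.

```python
import numpy as np

def mahalanobis_sq(x, mean, cov_inv):
    """Squared Mahalanobis distance (x - mean)^T S^{-1} (x - mean)."""
    d = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    return float(d @ cov_inv @ d)

# Pooled covariance as quoted in the question (possibly miscalculated there)
S_pooled = np.array([[1.394, 1.702],
                     [1.702, 6.62]])
S_inv = np.linalg.inv(S_pooled)

# HYPOTHETICAL group means: the question does not give the real ones
means = {
    "cluster1": [1.5, 1.5],
    "cluster2": [7.0, 6.0],
    "cluster3": [4.0, 9.0],
}

point = [1.0, 1.0]

# Same inverse covariance every time; only the mean changes per group
d2 = {name: mahalanobis_sq(point, m, S_inv) for name, m in means.items()}
assigned = min(d2, key=d2.get)

print(d2)
print(assigned)
```

The point is that pooling still yields one distance per group: the covariance is shared, but each distance is measured from that group's own mean, and the point goes to the group with the smallest value.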

Dave

Thanks for the reply! If I use a pooled covariance matrix for all groups and use that matrix to calculate the distance between a point and the data distribution, how would I know which group the point belongs to, since we are now just calculating a distance from the entire distribution of data? I have updated the post to show how I am currently doing my distance calculations. – ASH Mar 23 '22 at 09:50
You calculate three Mahalanobis distances. Pooling makes the assumption that the three groups have the same covariance matrix but not the same mean. – Dave Mar 23 '22 at 09:53
Instead of Pooling, I would calculate the inverse covariance matrix for each group and use this to determine the distance between the point and each cluster? – ASH Mar 23 '22 at 09:54
No, you calculate one covariance matrix (pooled) that all three groups share. // What you propose about three separate covariance matrices also makes sense, but that does not appear to be the assignment. – Dave Mar 23 '22 at 10:00
Ah. The task is essentially to cluster the points into the relevant groups using Mahalanobis distance. In this case, using a pooled matrix would not give me a distance to each cluster to determine which cluster the point belongs to. Instead, it would be a distance between the point and the entire distribution, which would not allow me to assign the point to cluster 1, 2 or 3. Since we want to cluster, a per-cluster inverse covariance matrix is what I would use to determine the distance between the point and the cluster, allowing me to group based on the smallest distance. Is this correct? – ASH Mar 23 '22 at 10:05
You calculate the Mahalanobis distance from the point to each of the three groups. However, you assume the three groups to have the same pooled covariance matrix, rather than having different covariance matrices. – Dave Mar 23 '22 at 10:14
Sorry, I am still quite confused. If you have a pooled covariance matrix and perform a distance calculation, would this then be the distance of the point from the entire distribution of data? – ASH Mar 23 '22 at 10:20
Let us continue this discussion in chat. – Dave Mar 23 '22 at 10:26

score 0 · Answer 2 · answered Jul 23 '22 at 14:58

Why not pool all the raw data into one main sample and calculate a common covariance $C$ from that? (I'm assuming this differs from pooling the samples' covariances together.) Would that not always ensure that all the data are measured on the same scale? From there you could proceed by inverting $C$ and using it in your formula for $D^2$ shown above. Juan F.
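
The parenthetical assumption can be checked numerically. The following is a minimal sketch with synthetic data (the group centers, the shared covariance `cov`, and the sample sizes are all made up): the covariance of all raw data stacked together also absorbs the spread between the group means, so it generally differs from the pooled within-group covariance whenever the means differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: three groups with the SAME covariance, different means
cov = np.array([[1.0, 0.3],
                [0.3, 1.0]])
centers = [[0, 0], [8, 0], [0, 8]]
groups = [rng.multivariate_normal(c, cov, size=200) for c in centers]

n_total = sum(len(g) for g in groups)
k = len(groups)

# Pooled within-group covariance: each group centered on its own mean
pooled_within = sum(
    (len(g) - 1) * np.cov(g, rowvar=False) for g in groups
) / (n_total - k)

# Covariance of all raw data stacked: centered on the grand mean,
# so the separation between group means is counted as variance too
total = np.cov(np.vstack(groups), rowvar=False)

print(pooled_within)
print(total)
```

In this setup the diagonal of `total` comes out much larger than that of `pooled_within`, because well-separated means contribute between-group variance. Distances computed with the stacked covariance would therefore shrink along the directions that separate the groups.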

juan fernandez

Why would that be preferable to what the OP asked for which was advice on pooling? – mdewey Jul 24 '22 at 15:39
