Say I have a data set with an x and y column such that I can plot all the points on a 2D plot. Now from this post I know how to calculate the standard deviation, but then is it possible to get the z-score? I can't follow the normal formula because I won't get a single score, but rather a pair of values...
Viewed 3,523 times
2
-
Yes, a Z-score is related to a single observation. You'd use the mean and standard deviation to calculate the Z-score of a new observation by subtracting the mean from the new observation and then dividing by the standard deviation. I think what you are trying to do (but we don't know because you didn't tell us) is calculate a standardized value for each x-value. That will give you new z and y columns, where z holds the standardized x values. – StatsStudent Nov 05 '16 at 04:42
-
Hi @3209Cigs, forgive me if I get the terminology wrong, my goal is to find points that are "significantly" further from most of the points. So if I only had one column of data, I would calculate the z score for every point and select for points with a score greater than 2 (ie. points that are two standard deviations from the mean). Now how can I apply the same sort of selection process to points on a 2D plot? Thanks! – user107777 Nov 05 '16 at 05:01
-
A 2D plot has nothing to do with what you are trying to do as you describe it. You simply calculate the Z-score (standardize) each X point and then subtract 2 from it. If the absolute value of that result is > 2, then you'd consider it significantly further from the rest of the points. That's all there is to it. – StatsStudent Nov 05 '16 at 09:13
-
Can you help me understand why you would only standardize the x value but not the y value? Because then am I not only considering points that are significantly different along one dimension, and ignoring differences along the y-axis? And do you mean that if the absolute value of the standardized x, minus 2, is greater than 0 (not greater than 2), then it is considered significantly different? – user107777 Nov 05 '16 at 15:05
-
OK. So if you want to check both X and Y, and they are independent, then you do the same calculation for x and then again, but separately, for y. Yes, sorry for the confusion. You want to take the absolute value of the standardized x, subtract 2, and check whether the result is greater than 0, or equivalently, but more simply, you could just test if abs(x) > 2 for the standardized x. I think the problem here is that your question is quite poor. We are all left guessing at what you are actually trying to do. What analytical problem are you trying to solve? Is this a homework problem by any chance? – StatsStudent Nov 05 '16 at 17:51
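A minimal sketch of the per-axis check described in the comment above, assuming x and y can be treated independently. The function names and the sample data are hypothetical, and the sample standard deviation (n - 1 denominator) is an assumption, since the thread does not say which one to use.

```python
import statistics

def z_scores(values):
    """Standardize a list of numbers: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample (n - 1) standard deviation
    return [(v - mean) / sd for v in values]

def flag_per_axis(xs, ys, threshold=2.0):
    """Return indices of points whose standardized x OR y exceeds the threshold."""
    zx = z_scores(xs)
    zy = z_scores(ys)
    return [
        i for i, (a, b) in enumerate(zip(zx, zy))
        if abs(a) > threshold or abs(b) > threshold
    ]

# Hypothetical data: the last point is far out along x only.
xs = [1.0, 2.0, 3.0, 2.0, 1.5, 2.5, 2.0, 10.0]
ys = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.1, 1.0]
flagged = flag_per_axis(xs, ys)
print(flagged)
```

Treating the axes separately ignores any correlation between x and y, which is the limitation raised in the next comment.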
-
BTW, if your x and y are not independent, you'll need to take the covariances into account. Try using something like Hotelling's T-squared if your goal is to find 2D differences. – StatsStudent Nov 05 '16 at 17:55
-
Not a homework question, I'm a grad student (with obviously very little stats knowledge) trying to pick out data points that are statistically different (or I guess unexpected) according to the distribution of most data points, then I can continue with further experimental studies according to the data points I picked out. I will definitely try your method, but I was wondering what you think about the following approach: can I just calculate the distance between all the points and the centroid (like in the post that I linked), find the z score for all the distances, and then select those >2? – user107777 Nov 05 '16 at 19:40
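The centroid-distance idea proposed in the comment above could be sketched roughly as follows. This is only an illustration of that proposal, not an endorsed method: the function name and data are hypothetical, and it uses plain Euclidean distance, so it assumes both axes are on comparable scales and ignores correlation.

```python
import math
import statistics

def centroid_distance_z(points):
    """Z-score each point's Euclidean distance to the centroid of all points."""
    cx = statistics.mean([p[0] for p in points])
    cy = statistics.mean([p[1] for p in points])
    dists = [math.hypot(x - cx, y - cy) for x, y in points]
    mean_d = statistics.mean(dists)
    sd_d = statistics.stdev(dists)  # assumption: sample standard deviation
    return [(d - mean_d) / sd_d for d in dists]

# Hypothetical data: the last point sits far from the rest.
points = [(1.0, 1.0), (2.0, 1.2), (3.0, 0.9), (2.0, 1.1),
          (1.5, 1.0), (2.5, 0.8), (2.0, 1.1), (10.0, 1.0)]
z = centroid_distance_z(points)

# Only the upper tail matters here: a small distance means "close to the centroid".
far_points = [i for i, score in enumerate(z) if score > 2.0]
print(far_points)
```

Distances are non-negative and typically right-skewed, so a cutoff of 2 on their z-scores is a heuristic rather than a statement about two standard deviations of a normal variable.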
-
A natural multivariate generalization of a z-score is the Mahalanobis distance. In one dimension, this distance is the absolute value of the z-score. – whuber Nov 05 '16 at 21:21
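For the two-column case in the question, the Mahalanobis distance can be sketched with the standard library alone by inverting the 2x2 sample covariance matrix by hand. The function name and data below are hypothetical, and the sample covariance (n - 1 denominator) is an assumption.

```python
import math
import statistics

def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample mean."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    n = len(points)
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]] (assumption: n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # must be positive, i.e. points not all on one line
    dists = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse covariance [[syy, -sxy], [-sxy, sxx]] / det.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        dists.append(math.sqrt(d2))
    return dists

# Hypothetical data: five points near the line y = x, plus one point off that trend.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1), (3.0, 0.5)]
d = mahalanobis_2d(points)
most_unusual = max(range(len(d)), key=d.__getitem__)
print(most_unusual, [round(v, 2) for v in d])
```

In this made-up data the off-trend point (3.0, 0.5) has the largest Mahalanobis distance even though (5.0, 5.1) is farther from the centroid in Euclidean terms, because the distance accounts for the correlation between x and y.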
-
Hi @whuber, if I were to use the Mahalanobis distance, can I make the same generalizations as the z-score, ie. if a point has a Mahalanobis distance greater than 2, it is at least two standard deviations from the mean? – user107777 Nov 06 '16 at 00:18
-
It doesn't work quite that way, because additional dimensions change the distributions of distances. The proper generalization is the Hotelling's T-squared to which @3209Cigs referred. – whuber Nov 06 '16 at 15:42
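A rough numerical illustration of why the cutoff changes with dimension, under a simplifying assumption that is not in the thread: the data are bivariate normal and the mean and covariance are treated as known. In that idealized case the squared Mahalanobis distance follows a chi-squared distribution with 2 degrees of freedom, whose CDF is 1 - exp(-d²/2). With parameters estimated from a small sample, the Hotelling's T-squared approach mentioned above is the appropriate tool. The function names are hypothetical.

```python
import math

def coverage_1d(k):
    """P(|z| <= k) for a standard normal variable."""
    return math.erf(k / math.sqrt(2.0))

def coverage_2d(d):
    """P(Mahalanobis distance <= d) for a bivariate normal with known parameters."""
    return 1.0 - math.exp(-d * d / 2.0)

def radius_2d(prob):
    """Mahalanobis distance that encloses the given probability in 2D."""
    return math.sqrt(-2.0 * math.log(1.0 - prob))

p_two_sigma = coverage_1d(2.0)           # about 0.9545 in one dimension
p_within_2_in_2d = coverage_2d(2.0)      # about 0.8647 in two dimensions
matching_radius = radius_2d(p_two_sigma) # about 2.486

print(p_two_sigma, p_within_2_in_2d, matching_radius)
```

Under these assumptions, a Mahalanobis distance of 2 in two dimensions encloses roughly 86% of the points rather than roughly 95%, so a cutoff near 2.49 would be needed to match the one-dimensional "two standard deviations" coverage.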
-
If the linked example is representative of the type of problem you are trying to solve, I wonder if multiple linear regression would take you where you want to go. – Tavrock Nov 23 '16 at 19:42