12

I am having a problem computing the Pearson correlation coefficient of data sets whose standard deviation may be zero (i.e., all values are the same).

Suppose that I have the following two data sets:

float x[] = {2, 2, 2, 3, 2};
float y[] = {2, 2, 2, 2, 2};

The correlation coefficient "r" would be computed using the following equation:

float r = covariance(x, y) / (std_dev(x) * std_dev(y));

However, because all values in data set "y" are the same, the standard deviation std_dev(y) would be zero and "r" would be undefined.
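For reference, here is a minimal self-contained version of that computation (the covariance and std_dev helpers are naive divide-by-n implementations, just for illustration). On the data above it prints nan, because the division works out to 0/0:

#include <math.h>
#include <stdio.h>

/* Naive helpers for illustration only: population (divide-by-n) formulas. */
static float mean(const float *a, int n) {
    float s = 0;
    for (int i = 0; i < n; ++i) s += a[i];
    return s / n;
}

static float covariance(const float *x, const float *y, int n) {
    float mx = mean(x, n), my = mean(y, n), s = 0;
    for (int i = 0; i < n; ++i) s += (x[i] - mx) * (y[i] - my);
    return s / n;
}

static float std_dev(const float *a, int n) {
    return sqrtf(covariance(a, a, n));
}

int main(void) {
    float x[] = {2, 2, 2, 3, 2};
    float y[] = {2, 2, 2, 2, 2};
    float r = covariance(x, y, 5) / (std_dev(x, 5) * std_dev(y, 5));
    printf("r = %f\n", r);   /* prints nan: covariance is 0 and std_dev(y) is 0 */
    return 0;
}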

Is there any solution to this problem? Or should I use another method to measure the relationship between the data in this case?

GaBorgulya
  • 3,363
Andree
  • 235
  • There is no "data relationship" in this example because y does not vary. Assigning any numerical value to r would be a mistake. – whuber Apr 02 '11 at 04:42
  • 1
    @whuber - it is true that the $r$ is undefined, but not necessarily that the "true" unknown correlation $\rho$ cannot be estimated. Just have to use something different to estimate it. – probabilityislogic Apr 02 '11 at 07:26
  • @probability You presuppose this is a problem of estimation and not simply one of characterization. But accepting that, what estimator would you propose in the example? No answer can be universally correct because it depends on how the estimator will be used (a loss function, in effect). In many applications, such as PCA, it seems likely that using any procedure that imputes a value to $\rho$ may be worse than other procedures that recognize $\rho$ cannot be identified. – whuber Apr 02 '11 at 15:57
  • 1
    @whuber - estimate is a bad choice of words for me (you may have noticed I'm not the best wordsmith), what I meant was that although $\rho$ may not be uniquely identified, this does not mean that the data are useless in telling us about $\rho$. My answer gives an (ugly) demonstration of this from an algebraic point of view. – probabilityislogic Apr 02 '11 at 16:59
  • @Probability It seems your analysis is contradictory: if indeed y is modeled with a normal distribution, then a sample of five 2's shows this model is inappropriate. Ultimately, you don't get something for nothing: your results depend strongly on the assumptions made about the priors. The original problems in identifying $\rho$ are still there but have been hidden by all these additional assumptions. That seems IMHO just to obscure the issues rather than clarify them. – whuber Apr 02 '11 at 17:30
  • @whuber - I'd say it actually "brings out" all the assumptions that one actually makes when using Pearson correlation (normality). It's not that you get something for nothing; it's that it highlights the "something" that was already there, left unspecified or thrown out with the passage to infinite limits in the parameter space and/or sample space (which is not a reflection of "reality" but a mathematical convenience which enables closed form solutions to many problems). – probabilityislogic Apr 02 '11 at 17:51
  • @whuber - the five 2's are highly implausible, but so are any five numbers from a continuous distribution. All the five 2's mean is that the data do not rule out arbitrarily small variances, so the results do depend on what is known about small variances apart from the data. The fact that @andree wants a correlation suggests that he has information other than the data which indicates non-zero variation. So why not use that information in the most conservative way, by specifying a lower bound? – probabilityislogic Apr 02 '11 at 17:58
  • @probability I agree your analysis makes some assumptions explicit, but I don't think they are ones that most people "usually" make. For instance, one does not normally assume all correlations are equally likely (your uniform prior on $\rho$.) There are similar unsupported assumptions in the other priors. Some unstated assumptions are that it even makes sense to use priors and that "uninformative" priors are not appreciably influencing the results. Only the OP can settle the former question, but the latter question bothers me. – whuber Apr 02 '11 at 17:58
  • @prob An alternative model for five 2's is that measurements are discrete. That leads to a potentially fruitful investigation of measurement process and measurement error and possibly incorporating it in the model. Then, if you use Bayesian techniques, that's fine, because the priors can be justified. What bothers me is proposing priors without any such analysis. One ill effect of that approach is that it can inhibit people from thinking constructively about the data. – whuber Apr 02 '11 at 18:01
  • @whuber - if the "true" standard deviation of these measurements was, say, $0.1$ with "true" mean $2$, this is consistent with the observed data. A discrete model would also be useful here. I think we are debating over certain ambiguities in the question, though. I took Pearson correlation to mean that a normal distribution was implied - you feel that this is not justified, but do not give an alternative solution and (implicitly) declare this an "ill-posed" problem. And the uniform prior for $p(\rho)$ has nothing to do with the point of my answer. – probabilityislogic Apr 02 '11 at 23:56
  • (cont'd) This is why I left $p(\rho)$ arbitrary, with a possible recommendation for uniform. The point of my answer was to indicate that even the most non-informative priors for the other parameters can give some information about the true correlation. The problem just needs to be specified in a little more detail, to extract what is actually known about the data $x$ and $y$. The priors I use for the means and standard deviations can be justified on conservative grounds. If more is known, then the analysis will be improved. – probabilityislogic Apr 03 '11 at 00:03

4 Answers

10

The "sampling theory" people will tell you that no such estimate exists. But you can get one, you just need to be reasonable about your prior information, and do a lot harder mathematical work.

If you specify a Bayesian method of estimation and the posterior turns out to be the same as the prior, then you can say the data say nothing about the parameter. Because things may get "singular" on us, we cannot use infinite parameter spaces. I am assuming that, because you use the Pearson correlation, you have a bivariate normal likelihood:

$$p(D|\mu_x,\mu_y,\sigma_x,\sigma_y,\rho)=\left(\sigma_x\sigma_y\sqrt{2\pi(1-\rho^2)}\right)^{-N}\exp\left(-\frac{\sum_{i}Q_i}{2(1-\rho^2)}\right)$$ where $$Q_i=\frac{(x_i-\mu_x)^2}{\sigma_x^2}+\frac{(y_i-\mu_y)^2}{\sigma_y^2}-2\rho\frac{(x_i-\mu_x)(y_i-\mu_y)}{\sigma_x\sigma_y}$$

Now, to indicate that one data set takes a single repeated value, write $y_i=y$ for all $i$, and then we get:

$$\sum_{i}Q_i=N\left[\frac{(y-\mu_y)^2}{\sigma_y^2}+\frac{s_x^2 + (\overline{x}-\mu_x)^2}{\sigma_x^2}-2\rho\frac{(\overline{x}-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}\right]$$ where $$s_x^2=\frac{1}{N}\sum_{i}(x_i-\overline{x})^2$$

And so your likelihood depends on four numbers: $s_x^2,y,\overline{x},N$. You want an estimate of $\rho$, so you need to multiply by a prior and integrate out the nuisance parameters $\mu_x,\mu_y,\sigma_x,\sigma_y$. To prepare for integration, we "complete the square": $$\frac{\sum_{i}Q_i}{1-\rho^2}=N\left[\frac{\left(\mu_y-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]\right)^2}{\sigma_y^2(1-\rho^{2})}+\frac{s_x^2}{\sigma_{x}^{2}(1-\rho^{2})} + \frac{(\overline{x}-\mu_x)^2}{\sigma_x^2}\right]$$

Now we should err on the side of caution and ensure a properly normalised prior; that way we can't get into trouble. One such option is a weakly informative prior, which simply restricts the range of each parameter. So we have $L_{\mu}<\mu_x,\mu_y<U_{\mu}$ for the means with a flat prior, and $L_{\sigma}<\sigma_x,\sigma_y<U_{\sigma}$ for the standard deviations with a Jeffreys prior. These limits are easy to set with a bit of "common sense" thinking about the problem. I will leave the prior for $\rho$ unspecified, and so we get (uniform should work OK; if not, truncate the singularity at $\pm 1$):

$$p(\rho,\mu_x,\mu_y,\sigma_x,\sigma_y)=\frac{p(\rho)}{A\sigma_x\sigma_y}$$

where $A=2(U_{\mu}-L_{\mu})^{2}[\log(U_{\sigma})-\log(L_{\sigma})]^{2}$. This gives a posterior of:

$$p(\rho|D)=\int p(\rho,\mu_x,\mu_y,\sigma_x,\sigma_y)p(D|\mu_x,\mu_y,\sigma_x,\sigma_y,\rho)d\mu_y d\mu_x d\sigma_x d\sigma_y$$

$$=\frac{p(\rho)}{A[2\pi(1-\rho^2)]^{\frac{N}{2}}}\int_{L_{\sigma}}^{U_{\sigma}}\int_{L_{\sigma}}^{U_{\sigma}}\left(\sigma_x\sigma_y\right)^{-N-1}\exp\left(-\frac{N s_x^2}{2\sigma_{x}^{2}(1-\rho^{2})}\right) \times$$ $$\int_{L_{\mu}}^{U_{\mu}}\exp\left(-\frac{N(\overline{x}-\mu_x)^2}{2\sigma_x^2}\right)\int_{L_{\mu}}^{U_{\mu}}\exp\left(-\frac{N\left(\mu_y-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]\right)^2}{2\sigma_y^2(1-\rho^{2})}\right)d\mu_y d\mu_x d\sigma_x d\sigma_y$$

Now the integration over $\mu_y$ can be done by making the change of variables $z=\sqrt{N}\frac{\mu_y-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]}{\sigma_y\sqrt{1-\rho^{2}}}\implies dz=\frac{\sqrt{N}}{\sigma_y\sqrt{1-\rho^{2}}}d\mu_y$, and the integral over $\mu_y$ becomes:

$$\frac{\sigma_y\sqrt{2\pi(1-\rho^{2})}}{\sqrt{N}}\left[\Phi\left( \frac{U_{\mu}-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]}{\frac{\sigma_y}{\sqrt{N}}\sqrt{1-\rho^{2}}} \right)-\Phi\left( \frac{L_{\mu}-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]}{\frac{\sigma_y}{\sqrt{N}}\sqrt{1-\rho^{2}}} \right)\right]$$

And you can see from here that no analytic solution is possible. However, it is also worth noting that $\rho$ has not dropped out of the equations: the data and prior information still have something to say about the true correlation. If the data said nothing about the correlation, we would simply be left with $p(\rho)$ as the only function of $\rho$ in these equations.

It also shows how passing to the limit of infinite bounds for $\mu_y$ "throws away" some of the information about $\rho$, which is contained in the complicated-looking normal CDF $\Phi(\cdot)$. If you have a lot of data, then passing to the limit is fine, you don't lose much; but if you have very scarce information, as in your case, it is important to keep every scrap you have. It means ugly maths, but this example is not too hard to do numerically. We can evaluate the integrated likelihood for $\rho$ at values of, say, $-0.99,-0.98,\dots,0.98,0.99$ fairly easily: just replace the integrals by summations over small enough intervals, so you have a triple summation.
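For concreteness, here is a rough C sketch of that triple summation (sums over $\sigma_x,\sigma_y,\mu_x$, with the $\mu_y$ integral done analytically via $\Phi$ as above). The prior bounds, grid resolution, and the coarser $\rho$ grid are arbitrary choices of mine, not part of the answer; the data summaries are computed from the question's x and y, $p(\rho)$ is taken uniform, and the constant $A$ is dropped:

#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Standard normal CDF via erf (C99). */
static double Phi(double z) { return 0.5 * (1.0 + erf(z / sqrt(2.0))); }

int main(void) {
    /* Sufficient statistics from the question: x = {2,2,2,3,2}, y_i = 2. */
    const int    N    = 5;
    const double xbar = 2.2, y = 2.0;
    const double sx2  = 0.16;               /* (1/N) * sum (x_i - xbar)^2 */

    /* Prior bounds: arbitrary "common sense" choices, adjust to taste. */
    const double Lmu = 0.0, Umu = 4.0;
    const double Lsig = 0.01, Usig = 2.0;

    const int    G   = 40;                          /* grid points per axis */
    const double dmu = (Umu - Lmu) / G;
    const double dls = (log(Usig) - log(Lsig)) / G; /* log-spaced sigma grid */

    /* Unnormalised integrated likelihood on a coarse grid of rho values. */
    for (double rho = -0.95; rho <= 0.951; rho += 0.05) {
        double r2 = 1.0 - rho * rho, post = 0.0;
        for (int i = 0; i < G; ++i) {
            double sx = Lsig * exp((i + 0.5) * dls);
            for (int j = 0; j < G; ++j) {
                double sy = Lsig * exp((j + 0.5) * dls);
                for (int k = 0; k < G; ++k) {
                    double mux = Lmu + (k + 0.5) * dmu;
                    double m   = y - (xbar - mux) * rho * sy / sx;
                    double sd  = sy * sqrt(r2) / sqrt((double)N);
                    post += pow(sx * sy, -(double)N - 1.0)
                          * exp(-N * sx2 / (2.0 * sx * sx * r2))
                          * exp(-N * (xbar - mux) * (xbar - mux)
                                / (2.0 * sx * sx))
                          * sy * sqrt(2.0 * M_PI * r2) / sqrt((double)N)
                          * (Phi((Umu - m) / sd) - Phi((Lmu - m) / sd))
                          * sx * sy * dls * dls * dmu; /* log-grid Jacobian */
                }
            }
        }
        post /= pow(2.0 * M_PI * r2, N / 2.0);
        printf("rho = %+5.2f   integrated likelihood ~ %g\n", rho, post);
    }
    return 0;
}

The values are only meaningful up to a constant, so what matters is their shape as a function of $\rho$; plotting them against $\rho$ for different prior ranges shows how much the data and prior bounds actually say about the correlation.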

  • @probabilityislogic: Wow. Simply wow. After seeing some of your answers I really wonder: what should a doofus like me do to reach such a flexible Bayesian state of mind? – steffen Apr 07 '11 at 15:02
  • 1
    @steffen - lol. It's not that difficult, you just need to practice. And always, always, always remember that the product and sum rules of probability are the only rules you will ever need. They will extract whatever information is there - whether you see it or not. So you apply the product and sum rules, then just do the maths. That is all I have done here. – probabilityislogic Apr 07 '11 at 15:07
  • @steffen - and the other rule - more a mathematical one than stats one - don't pass to an infinite limit too early in your calculations, your results may become arbitrary, or little details may get thrown out. Measurement error models are a perfect example of this (as is this question). – probabilityislogic Apr 07 '11 at 15:11
  • @probabilityislogic: Thank you, I'll keep this in mind... as soon as I am done working through my "Bayesian Analysis"-copy ;). – steffen Apr 07 '11 at 15:33
  • @probabilityislogic: If you could humor a nonmathematical statistician/researcher...would it be possible to summarize or translate your answer to a group of dentists or high school principals or introductory statistics students? – rolando2 Apr 30 '11 at 16:53
  • @rolando2 - the first thing I wouldn't do is show them the above formulas - enough to scare most people away, or at least make their eyes glaze over and the "smile and nod" would happen. This would be best done using a graph of the integrated likelihood for various prior ranges $U_{\mu},L_{\mu},U_{\sigma},L_{\sigma}$ - one, because this is what matters, and two, because graphs are easy to understand and to explain. This will answer exactly how much the data and prior information say about the correlation. My point was that it's not nothing. – probabilityislogic May 01 '11 at 02:06
  • @prob - I appreciate your giving it a shot. I'm still unable to follow. – rolando2 May 03 '11 at 03:27
6

I agree with sesqu that the correlation is undefined in this case. Depending on your type of application, you could, e.g., calculate the Gower similarity between the two vectors, which is $\mathrm{gower}(v_1,v_2)=\frac{\sum_{i=1}^{n}\delta(v_{1i},v_{2i})}{n}$, where $\delta$ represents the Kronecker delta applied elementwise to $v_1,v_2$.

So, for instance, if all values are equal, gower(.,.) = 1. If they differ in only one of, say, ten dimensions, gower(.,.) = 0.9. If they differ in every dimension, gower(.,.) = 0, and so on.

Of course this is not a measure of correlation, but it allows you to calculate how close the vector with $s>0$ is to the one with $s=0$. You can apply other metrics, too, if they serve your purpose better.
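A minimal sketch of this similarity in C, using exact equality for the delta (with floating-point data you may want to compare within a tolerance):

#include <stdio.h>

/* Gower-style similarity as described above: the fraction of positions
   at which the two vectors agree exactly (Kronecker delta). */
static float gower(const float *v1, const float *v2, int n) {
    int matches = 0;
    for (int i = 0; i < n; ++i)
        if (v1[i] == v2[i]) matches++;
    return (float)matches / n;
}

int main(void) {
    float x[] = {2, 2, 2, 3, 2};
    float y[] = {2, 2, 2, 2, 2};
    printf("gower(x, y) = %.2f\n", gower(x, y, 5));   /* 0.80: 4 of 5 match */
    return 0;
}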

steffen
  • 10,367
0

The correlation is undefined in that case. If you must define it, I would define it as 0, but consider a simple mean absolute difference instead.
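A minimal sketch of that fallback (the function name and divide-by-n convention are illustrative, not standard):

#include <math.h>
#include <stdio.h>

/* Mean absolute difference between paired values: a simple alternative
   summary of how the two series relate when correlation is undefined. */
static float mean_abs_diff(const float *x, const float *y, int n) {
    float s = 0;
    for (int i = 0; i < n; ++i) s += fabsf(x[i] - y[i]);
    return s / n;
}

int main(void) {
    float x[] = {2, 2, 2, 3, 2};
    float y[] = {2, 2, 2, 2, 2};
    printf("mean |x - y| = %.2f\n", mean_abs_diff(x, y, 5));   /* 0.20 */
    return 0;
}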

sesqu
  • 691
0

This question is coming from programmers, so I'd suggest plugging in zero. There's no evidence of a correlation, and the null hypothesis would be zero (no correlation). There might be other contextual knowledge that would suggest a "typical" correlation in one setting, but the code might be re-used in another.
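One way to implement that convention (a sketch only; the function name is mine, and whether 0, NaN, or an error is the right behaviour is exactly what the comments below debate):

#include <math.h>
#include <stdio.h>

/* Pearson correlation that falls back to 0 when either series has zero
   variance, so downstream code never sees NaN. */
static float pearson_or_zero(const float *x, const float *y, int n) {
    float mx = 0, my = 0;
    for (int i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;
    float sxy = 0, sxx = 0, syy = 0;
    for (int i = 0; i < n; ++i) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
    }
    if (sxx == 0.0f || syy == 0.0f)
        return 0.0f;                     /* the "plug in zero" convention */
    return sxy / sqrtf(sxx * syy);
}

int main(void) {
    float x[] = {2, 2, 2, 3, 2};
    float y[] = {2, 2, 2, 2, 2};
    printf("r = %f\n", pearson_or_zero(x, y, 5));   /* 0.000000 */
    return 0;
}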

zbicyclist
  • 3,423
  • 2
    There's no evidence of lack of correlation either, so why not plug in 1? Or -1? Or anything in between? They all lead to re-usable code! – whuber Apr 02 '11 at 18:02
  • @whuber - you plug in zero because the data is "less constrained" when it is independent - this is why maxent distributions are independent unless you explicitly specify correlations in the constraints. Independence can be viewed as a conservative assumption when you know of no such correlations - effectively you are averaging over all possible correlations. – probabilityislogic Apr 03 '11 at 00:07
  • 1
    @prob I question why it makes sense as a generic procedure to average over all correlations. In effect this procedure substitutes the definite and possibly quite wrong answer "zero!" for the correct answer "the data don't tell us." That difference can be important for decision making. – whuber Apr 28 '11 at 14:32
  • Just because the question might be from a programmer does not mean you should convert an undefined value to zero. Zero means something specific in a correlation calculation. Throw an exception. Let the caller decide what should happen. Your function should calculate a correlation, not decide what to do if one cannot be computed. – Jared Becksfort Sep 14 '19 at 22:28