I have two Boolean vectors, which contain only 0s and 1s. If I calculate the Pearson or Spearman correlation between them, is it meaningful or reasonable?
7 Answers
The Pearson and Spearman correlations are defined as long as you have some $0$s and some $1$s for both of two binary variables, say $y$ and $x$. It is easy to get a good qualitative idea of what they mean by thinking of a scatter plot of the two variables. Clearly, there are only four possibilities $(0,0), (0,1), (1,0), (1,1)$ (so that jittering to shake identical points apart for visualization is a good idea). For example, in any situation where the two vectors are identical, subject to having some $0$s and some $1$s in each, by definition $y = x$ and the correlation is necessarily $1$. Similarly, it is possible that $y = 1 - x$, and then the correlation is $-1$.
For this set-up, there is no scope for monotonic relations that are not linear. When taking ranks of $0$s and $1$s under the usual midrank convention the ranks are just a linear transformation of the original $0$s and $1$s and the Spearman correlation is necessarily identical to the Pearson correlation. Hence there is no reason to consider Spearman correlation separately here, or indeed at all.
Correlations arise naturally for some problems involving $0$s and $1$s, e.g. in the study of binary processes in time or space. On the whole, however, there will be better ways of thinking about such data, depending largely on the main motive for such a study. For example, the fact that the correlation makes sense here does not mean that linear regression is a good way to model a binary response. If one of the binary variables is a response, most statistical people would start by considering a logit model.
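A quick numerical check illustrates these points; the following is a minimal sketch in Python (NumPy and SciPy assumed; the particular vectors are arbitrary, so long as each mixes 0s and 1s):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=100)               # binary vector with a mix of 0s and 1s
y = np.where(rng.random(100) < 0.3, 1 - x, x)  # equal to x most of the time, occasionally flipped

# For 0/1 data the midranks are a linear transformation of the values,
# so the Spearman and Pearson coefficients coincide.
print(stats.pearsonr(x, y)[0])
print(stats.spearmanr(x, y)[0])

# Identical and complementary vectors give +1 and -1 respectively.
print(stats.pearsonr(x, x)[0], stats.pearsonr(x, 1 - x)[0])

# If either vector is constant (all 0s or all 1s), its standard deviation is 0 and the
# correlation is undefined; recent SciPy versions warn about the constant input and return nan.
print(stats.pearsonr(np.ones(5), np.array([0, 1, 0, 1, 1]))[0])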
- Does that mean in this situation, the Pearson or Spearman correlation coefficient is not a good similarity metric for these two binary vectors? – Zhilong Jia Jun 23 '14 at 11:33
- Yes, in the sense that it doesn't measure similarity and is undefined for all 0s or all 1s for either vector. – Nick Cox Jun 23 '14 at 12:26
- The two identical or 'opposite' vectors case is not clear to me. If x=c(1,1,1,1,1) and y=c(0,0,0,0,0) then y=1-x, and it sounds like you're saying this must be the case by definition, implying a correlation of -1. Equally y=x-1, implying a correlation of +1. There is only 1 point (5 replicates) on a scatterplot, so any straight line could be drawn through it. It feels like the correlation is undefined in this instance. Sorry if I misunderstood what you meant. @NickCox – PM. Oct 01 '19 at 16:51
- No; I am not saying that, as I do point out in my first sentence that you must have a mix of 0s and 1s for the correlation to be defined. Otherwise, if the SD of either variable is 0, then the correlation is undefined. But I have edited my answer to mention that twice. – Nick Cox Oct 02 '19 at 07:40
There are specialised similarity metrics for binary vectors, such as:
- Jaccard-Needham
- Dice
- Yule
- Russell-Rao
- Sokal-Michener
- Rogers-Tanimoto
- Kulzinsky
etc.
For details, see here.
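Several of these are implemented in scipy.spatial.distance as dissimilarities (1 minus the similarity, in the Jaccard and Dice cases). A minimal sketch, with two arbitrary boolean vectors:

import numpy as np
from scipy.spatial.distance import jaccard, dice

u = np.array([1, 1, 0, 1, 0, 1], dtype=bool)
v = np.array([1, 0, 0, 1, 1, 1], dtype=bool)

# SciPy returns dissimilarities; subtract from 1 to get the similarity.
print(1 - jaccard(u, v))   # Jaccard similarity: |u & v| / |u | v|
print(1 - dice(u, v))      # Dice similarity

# The Jaccard index by hand from the 2x2 counts (a11 = both 1, a10/a01 = mismatches):
a11 = np.sum(u & v)
a10 = np.sum(u & ~v)
a01 = np.sum(~u & v)
print(a11 / (a11 + a10 + a01))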
- Surely there are many more reliable and comprehensive references. Even on the level of getting authors' names right, note Kulczyński and Tanimoto. See e.g. Hubálek, Z. 1982. Coefficients of association and similarity, based on binary (presence-absence) data: An evaluation. Biological Reviews 57: 669–689. – Nick Cox Jul 27 '15 at 15:43
- They have obviously misspelled 'Tanimoto', but 'Kulzinsky' has been purposely simplified. Your reference is more credible without a doubt, but it's not accessible to everybody. – Digio Jul 28 '15 at 08:38
- scipy contains many of these; see this page, under "Distance functions between two boolean vectors (representing sets) u and v." – Alex Moore-Niemi Dec 23 '19 at 02:31
- These are NOT correlation metrics! Indeed, notice the example below: independent draws leading to high scores (while Pearson correctly shows 0 correlation!). Almost everyone on this thread is confusing marginal distributions with correlations. – semola Mar 09 '23 at 15:32
- @semola - that's why I used the term "similarity" rather than "correlation" in my answer. – Digio Jun 25 '23 at 16:38
I would not advise using Pearson's correlation coefficient for binary data; see the following counter-example:
set.seed(10)
a = rbinom(n=100, size=1, prob=0.9)
b = rbinom(n=100, size=1, prob=0.9)
In most cases both give a 1:
> table(a,b)
   b
a    0  1
  0  0  3
  1  9 88
But the correlation does not show this:
> cor(a, b, method="pearson")
[1] -0.05530639
A binary similarity measure such as the Jaccard index, however, shows a much higher association:
install.packages("clusteval")
library(clusteval)
> cluster_similarity(a, b, similarity="jaccard", method="independence")
[1] 0.7854966
Why is this? See the simple bivariate regression below:
plot(jitter(a, factor = .25), jitter(b, factor = .25), xlab="a", ylab="b", pch=15, col="blue", ylim=c(-0.05,1.05), xlim=c(-0.05,1.05))
abline(lm(b ~ a), lwd=2, col="blue")
text(.5,.9,expression(paste(rho, " = -0.055")))
Plot below (small noise added to make the number of points clearer):

- This should be upvoted more. Pearson for binary data can be misleading. – abudis Apr 09 '20 at 06:26
- The Pearson correlation is doing its job here and the scatter plot makes plain why the correlation is weak. Whether other measures or analyses make more sense for other goals is an important but different question. – Nick Cox Jun 16 '20 at 12:10
- @NickCox, I disagree. The question is whether it is meaningful or reasonable to use Pearson's correlation coefficient (not whether it can be applied to this data in general). I would advise against it because there are more reasonable measures for this kind of data. You can also use OLS for binary data (and often it makes sense), but it is not always reasonable, since you can get fitted values outside of [0, 1], among other things. – Arne Jonas Warnke Feb 20 '21 at 09:24
- Once again: correlation here has a meaning so long as both variables are genuinely variable; the same reservation applies in general, as correlations are not defined if even one variable is constant. Despite the root of reason, what is said to be reasonable or not is usually a matter of rhetoric, not logic. Many economists and some others have an odd habit of using OLS (an estimation method) as a short-hand for the simplest kind of linear regression. The objection that fits aren't confined to $[0, 1]$ is one I have made many times myself in discussion, and there are other objections too. – Nick Cox Feb 20 '21 at 09:51
- So we disagree here on the meaning of reasonable; that is fine, I guess. And by the way, I agree that there are other objections too, which I tried to imply by adding "among others". – Arne Jonas Warnke Feb 20 '21 at 09:57
- But linear regression amounts to using means, and the mean proportion is as natural and as useful a quantity as you can hope for. And the linear probability model has many defenders, as in practice it is simple to apply, simple to explain, and often works as well as anything else with the data in hand. All this said, nothing in my answer or comments denies that there are often better measures (e.g. for measuring agreement between binary vectors) or models (I'd reach for logit regression usually myself). In practice, I rarely correlate two zero-one indicators myself. – Nick Cox Feb 20 '21 at 09:58
- I have a serious if amateur interest in English usage and etymology and even in philosophy. Reasonable is usually just a debating term: it is no surprise that one's own position is reasonable and the other party's position is not. I don't think there is a deeper meaning struggling to get out. – Nick Cox Feb 20 '21 at 09:59
- I consider this answer simply wrong. The two variables were generated through completely independent mechanisms, so it's very intuitive that the correlation is ~ 0. Analogously, if one used a similar approach to generate two independent Gaussian variables, and set the mean of both to be 40 with sd of 1, just because the values are 'similar' doesn't mean you would expect a correlation. – Charlie Apr 12 '21 at 19:32
- @Charlie, which statement is wrong? You can generate the same results based on a highly correlated bivariate distribution. Or suppose you did not know the data-generating process. The idea is just the following: you observe both variables occurring together often. A binary similarity measure indicates exactly that, while the correlation coefficient, usually used for continuous data, does not. – Arne Jonas Warnke Apr 13 '21 at 08:01
- Here is a small script generating dependent binary data; the correlation is high as expected. Sorry for formatting... invlogit <- function(x) { exp(x)/(1+exp(x)) }; corbase <- matrix(c(1,2,0,1),2,2); mcor <- cov2cor(corbase %*% t(corbase)); print(mcor); corchol <- t(chol(mcor)); r <- matrix(rnorm(2000000),ncol=2); cdat <- t(corchol %*% t(r)); cor(cdat) # Pearson correlation in the Gaussian data; bdat <- matrix(round(invlogit(cdat),0),ncol=2) # turn continuous into binary data; cor(bdat) # Pearson correlation in the binarised data – Charlie Apr 13 '21 at 12:40
- Sorry, comment editing locked me from fixing. Here is a gist with the script: https://gist.github.com/cdriveraus/0cd88396a3112ef0140e77607fd528eb – Charlie Apr 13 '21 at 12:47
- I have one word for everyone: assumptions. It all depends on your assumptions. Even if you don't think you have them, you do. Don't just run every method possible; first identify your assumptions and then choose the method based on them. – raygozag Nov 04 '21 at 03:17
- Sigh... though swearing never to be involved again, this pseudo-answer is pulling me back into my SE account to leave this comment: this answer is incorrect. The data generation currently demonstrated (the software is R, by the way) shows independent random variables, which we wouldn't expect to be correlated. I won't debate the logic one way or the other, but beware what this demonstrates: it is misleading (i.e., doing this with rnorm (continuous normal variable generation, meeting all assumptions) would, on average, give 0 correlation as well). Charlie (above) is right on the money. – mflo-ByeSE Dec 26 '22 at 18:00
Arne's response above isn't quite right. Correlation is a measure of dependence between variables. The samples a and b are both independent draws, albeit from the same distribution, so we should expect a correlation of about 0.
Running a similar simulation and creating a new variable c that is dependent on the value of a:
from scipy import stats

a = stats.bernoulli(p=.9).rvs(10000)
b = stats.bernoulli(p=.9).rvs(10000)
dep = .9
c = []
for i in a:
    if i == 0:
        # note this would be quicker with an np.random.choice()
        c.append(stats.bernoulli(p=1 - dep).rvs(1)[0])
    else:
        c.append(stats.bernoulli(p=dep).rvs(1)[0])
We can then see that:
stats.pearsonr(a, b) ~= 0
stats.pearsonr(a, c) ~= 0.6
stats.spearmanr(a, c) ~= 0.6
stats.kendalltau(a, c) ~= 0.6
A possible issue with using the Pearson correlation for two dichotomous variables is that the correlation may be sensitive to the "levels" of the variables, i.e. the rates at which the variables are 1. Specifically, suppose that you think the two dichotomous variables (X,Y) are generated by underlying latent continuous variables (X*,Y*). Then it is possible to construct a sequence of examples where the underlying variables (X*,Y*) have the same Pearson correlation in each case, but the Pearson correlation between (X,Y) changes. The example below in R shows such a sequence. The example shifts the continuous latent (X*,Y*) distribution to the right along the x-axis (not changing the shape of the latent distribution at all), and finds that the Pearson correlation between (X,Y) decreases as we do so.
For this reason, you might consider using the tetrachoric correlation for dichotomous data, if it is feasible to estimate. This question has more details on the polychoric correlation, which is a generalization of the tetrachoric.
# consider two dichotomous variables x and y that are each generated by an
# underlying common standard normal factor and a unique standard normal factor,
# plus x has a shift u that makes it more common than 50:50
set.seed(12345)
library(polycor)
N <- 10000
U <- seq(0,1.2,0.1)
dout <- list()
for(u in U) {
  print(u)  # u is the shift
  common <- rnorm(N)  # common factor
  xunderlying <- common*0.7 + rnorm(N)*0.3 + u
  yunderlying <- common*0.7 + rnorm(N)*0.3
  plot(xunderlying, yunderlying)
  abline(v = mean(xunderlying), col='red')
  abline(h = mean(yunderlying), col='red')
  x <- xunderlying > 0  # would be 50:50 chance if u = 0
  y <- yunderlying > 0
  print(table(x, y))
  # obtain tetrachoric correlation using polycor package
  p <- polycor::polychor(x, y, ML=TRUE, std.err=TRUE)
  dout <- rbind(dout,
                data.frame(U = u,
                           pctx = mean(x),  # percent of x that is TRUE, used below
                           pcty = mean(y),
                           cor = cor(x, y),  # pearson correlation, used below
                           polychor_rho = p$rho,  # tetrachoric correlation, used below
                           underlying_cor = cor(xunderlying, yunderlying),  # underlying correlation, used below
                           polychor_xthresh = p$row.cuts,
                           polychor_ythresh = p$col.cuts))
}
# plot underlying cor as a function of pctx.
# does not depend on pctx
plot(dout$pctx,dout$underlying_cor,ylim = c(0,1))
# plot pearson correlation as a function of pctx (which is determined by u).
# decreasing in pctx!
plot(dout$pctx,dout$cor,ylim = c(0,1))
# plot estimated tetrachoric correlation as a function of pctx.
# does not depend on pctx
plot(dout$pctx,dout$polychor_rho,ylim = c(0,1))
- This is a red herring. Sure, if the (0, 1) values are based on degrading other scales, then the Pearson correlation between indicator variables isn't capturing information it was never shown. That is not a fault of the method. The implication is to use other methods. – Nick Cox Jun 16 '20 at 12:15
- Good point. Economists (at least) often view indicator variables as degraded versions of continuous variables, but there may be cases where viewing them in another way motivates Pearson rather than tetrachoric. – Richard DiSalvo Jun 17 '20 at 15:00
Simple answer (building on @Carlos M's comment): yes. In particular, computing Pearson's correlation is meaningful; when you run Pearson's on binary data it gives you the equivalent of the phi coefficient (https://en.wikipedia.org/wiki/Phi_coefficient). The awesome thing here is that R, Pandas, and SQL engines already have this as their default correlation. You can either split out the binary fields separately or just pile them in and compute it all at once :)
import pandas as pd

# Sample DataFrame with continuous and binary data
df = pd.DataFrame({
    'cont1': [1.2, 3.4, 2.2, 4.5, 5.2],
    'cont2': [2.4, 3.1, 4.5, 6.1, 5.3],
    'binary1': [0, 1, 0, 1, 1],
    'binary2': [1, 0, 1, 0, 0],
    'binary3': [0, 1, 0, 0, 1]
})

# Compute correlation for all columns
df.corr(method='pearson')
cont1 cont2 binary1 binary2 binary3
cont1 1.000000 0.772457 0.893869 -0.893869 0.558668
cont2 0.772457 1.000000 0.496162 -0.496162 -0.047823
binary1 0.893869 0.496162 1.000000 -1.000000 0.666667
binary2 -0.893869 -0.496162 -1.000000 1.000000 -0.666667
binary3 0.558668 -0.047823 0.666667 -0.666667 1.000000
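For a single pair of binary columns, the equivalence can be checked directly against the phi formula $\phi = (n_{11}n_{00} - n_{10}n_{01}) / \sqrt{n_{1\cdot}\, n_{0\cdot}\, n_{\cdot 1}\, n_{\cdot 0}}$; a minimal sketch with made-up vectors:

import numpy as np

x = np.array([0, 1, 0, 1, 1, 1, 0, 1])
y = np.array([0, 1, 1, 1, 0, 1, 0, 1])

# Cell counts of the 2x2 contingency table
n11 = np.sum((x == 1) & (y == 1))
n10 = np.sum((x == 1) & (y == 0))
n01 = np.sum((x == 0) & (y == 1))
n00 = np.sum((x == 0) & (y == 0))

# Marginal totals
n1_, n0_ = n11 + n10, n01 + n00   # row totals (x = 1, x = 0)
n_1, n_0 = n11 + n01, n10 + n00   # column totals (y = 1, y = 0)

phi = (n11 * n00 - n10 * n01) / np.sqrt(n1_ * n0_ * n_1 * n_0)
print(phi, np.corrcoef(x, y)[0, 1])   # the two values agree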
I'd like to add that if the goal of this correlation analysis is feature selection in a machine learning context, I'd suggest looking at mutual information.
From scikit-learn's mutual_info_classif documentation:
Mutual information (MI) between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.
For two random variables, you should use mutual_info_score.
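A minimal sketch (the vectors and the 80% agreement rate are made up for illustration; mutual_info_score reports MI in nats):

import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=1000)
b = np.where(rng.random(1000) < 0.8, a, 1 - a)   # b agrees with a 80% of the time
c = rng.integers(0, 2, size=1000)                # independent of a

print(mutual_info_score(a, b))   # clearly positive for dependent vectors
print(mutual_info_score(a, c))   # close to 0 for independent draws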
- cor() is defined for non-uniform vectors and has a name. It also has equivalent interpretability to Pearson's correlation: https://en.wikipedia.org/wiki/Phi_coefficient – Carlos M. Dec 09 '20 at 17:04