
I'm writing code to calculate if the correlation between two random variables is significant.

I've recently come across Fisher's z transformation as a method for finding significance. But from reading around:

  1. SAS on Fisher's z
  2. Wikipedia on Fisher's z

it seems this transform only applies to normal variables. A lot of the variables I'm working with aren't normal. Is there a corresponding transform for non-normal random variables?

Background

The variables I'm dealing with

  1. Most of my variables have some amount of skew and so are not perfectly normally distributed.
  2. My dataset also has binary indicator variables, with Bernoulli distributions.

The excerpt from Wikipedia I'm concerned about

If $(X, Y)$ has a bivariate normal distribution with correlation $\rho$ and the pairs $(X_i, Y_i)$ are independent and identically distributed, then $z$ is approximately normally distributed with mean $${1 \over 2}\ln \left({{1+\rho } \over {1-\rho }}\right)$$ and standard error $${1 \over {\sqrt {N-3}}}.$$
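For concreteness, here is a minimal sketch of the calculation I have in mind, assuming plain Python with NumPy/SciPy (the helper name `fisher_z_ci` is just my own, for illustration):

```python
import numpy as np
from scipy import stats

def fisher_z_ci(r, n, alpha=0.05):
    """Confidence interval for a correlation via Fisher's z.

    Hypothetical helper; assumes (X, Y) is bivariate normal,
    per the Wikipedia excerpt above.
    """
    z = np.arctanh(r)                # z = (1/2) ln((1 + r) / (1 - r))
    se = 1.0 / np.sqrt(n - 3)        # standard error from the excerpt
    z_crit = stats.norm.ppf(1 - alpha / 2)
    lo, hi = z - z_crit * se, z + z_crit * se
    return np.tanh(lo), np.tanh(hi)  # back-transform to the r scale

# Example: a sample correlation of 0.6 from n = 50 pairs
print(fisher_z_ci(0.6, 50))
```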

Connor
  • Why do you want to apply this transform? What kind of correlation are you interested in testing? – user2974951 May 19 '23 at 08:26
  • Fisher's (not Fischer's) z transform is applied to correlations, not the original data. If you're worried that the original data are a long way from normal, then consider bootstrapping your correlations (a bootstrap sketch appears after this thread) or transforming the variables. – Nick Cox May 19 '23 at 09:05
  • I know that, but if you take a look at the definition on Wikipedia: https://en.m.wikipedia.org/wiki/Fisher_transformation#Definition, I'm concerned about this statement: "If (X, Y) has a bivariate normal distribution with correlation ρ and the pairs (Xi, Yi) are independent and identically distributed, then z is approximately normally distributed with mean"

    Wouldn't this mean the underlying variables (X, Y) have to be normal?

    – Connor May 19 '23 at 09:23
  • As the sentence you quote there makes explicit (it has the form "If A then B"), the derivation of the transformation relies on X and Y being bivariate normal; however, there are a number of issues here, of which I'll mention two. 1. As just alluded to, it's not just the marginal distributions, but the joint distribution of the pair (X,Y) that determines the distribution of the Pearson correlation coefficient (and hence what r-distribution needs to be transformed to be "more nearly normal"). ... ctd – Glen_b May 19 '23 at 09:34
  • ctd ... 2. For most joint distributions, linear correlation doesn't fully describe the dependence (including that "uncorrelated" no longer necessarily implies "independent"), so the more basic question arises of whether linear correlation is even a particularly useful description of the dependence. If you are confident that a linear correlation is what you want, then I tend to agree with Nick Cox that a permutation test might be a good approach, perhaps specifically using a studentized correlation for the statistic, as recommended by DiCiccio and Romano (JASA, 2017); a permutation sketch appears after this thread. – Glen_b May 19 '23 at 09:35
  • I think, from your edit, that we're dealing with an XY problem (see also Wikipedia). Specifically, your problem is not really about doing something like a z-transform at all. At first glance it now sounds like your problem is that you want to test for independence (vs. some form of association) for some specific variables (many of which are binary). However, I think that this, too, may potentially be an XY problem of its own. Why are you looking at testing a large collection of bivariate correlations? – Glen_b May 19 '23 at 09:48
  • Note that (for the bivariate normal) the Fisher transform is really only needed when the population correlation you're testing is non-zero; the usual correlation t-test on $t=r\sqrt{\frac{n-2}{1-r^2}}$ works perfectly well for testing a null correlation (a sketch of this test appears after this thread). The same test works "as is" in a much wider class of cases than bivariate normality (per regression, it's derived assuming conditional normality of one variable given the other) and is fairly robust to that assumption. – Glen_b May 19 '23 at 09:56
  • Although I have written on the z transform elsewhere -- as a way of getting confidence intervals for correlation, although a test can be linked -- I agree with @Glen_b that the expansion of the question, and the extra comments, make this seem a case of the X-Y problem. It seems that you're expecting correlations to do the job of finding relationships and worrying about how to do it best. Whether it will work well at all would be my concern. In particular, correlation has a meaning if either or both variables are (0, 1), so long as neither is constant, but that is a long way from bivariate normal. – Nick Cox May 19 '23 at 11:29
  • It could be X-Y in the sense that I'm trying to find a confidence interval for the amount of information shared between two variables, which is not explicitly what I'm asking for here. Correlation may not be a reasonable measure at all for some of these pairs. But I have stated, at the top of the question, that I'm interested in the significance of the correlation between the two variables. Which I believe Fisher's z test gives me, but only in a specific circumstance, hence the question of other possible transformations! – Connor May 19 '23 at 12:28
  • @Glen_b Thank you for such a detailed comment chain. So does that mean that Fisher's z statistic is slightly broader than I originally thought, and that as long as the bivariate distribution is normal, it's still reasonable to use Fisher's z test? I'd also be interested to know your opinion on using it as a heuristic: it may not be exact, but is it close to the ideal method? – Connor May 19 '23 at 12:29
  • Repeating myself: 1. "Calculate if the correlation between two random variables is significant" means you're testing a null of $\rho=0$ vs $\rho\neq 0$. You don't need an asymptotic adjustment for skewness of $r$ under the bivariate normal when the null distribution is already symmetric; use the t-test in that case, which is exact in small samples. 2. This t-test doesn't rely on bivariate normality, but on the weaker regression assumptions. It's also fairly robust. 3. If you want protection against non-normality, Nick's mention of permutation tests is relevant. (I still think this is an XY problem.) – Glen_b May 20 '23 at 04:32
  • @Glen_b I don't understand what you mean. Why would my null be an assumed correlation of $\rho = 0$? I'm looking at variables with correlations closer to $\rho = 0.6$, I'd like to know how significant that correlation is. In that scenario, won't my distribution be skewed? I would be happy to try the t-test, what's the best way to find a resource on how to do the t-test for correlation? Is it XY though, or just a simplified statement of what the issue is? There's almost always a deeper reason behind every question, but is the extra information useful? Past experience tells me no. – Connor May 20 '23 at 05:18
  • @Glen_b Is this a good example of using the t-test to calculate the significance of a correlation? https://www.statology.org/t-test-for-correlation/ – Connor May 20 '23 at 05:37
  • 1. If your null is not $\rho=0$, what do you mean by "significant"? Where is your null coming from? 2. Consequently, if the population correlation (not the sample correlation) were 0.6, the sampling distribution of $r$ would be skewed, but this is not relevant, because the distribution you need to compute is the null distribution of the test statistic. That is, in order to show that $r$ is not compatible with $\rho=0$, you need to use the distribution of $r$ when $\rho=0$ and show that your $r_\text{obs}$ is too extreme for that to be plausible. – Glen_b May 20 '23 at 06:04
  • (Sorry, this is responding to your earlier comment.) Well, sure, that link is using the same formula I gave above (though I think there are better sources, it seems to be okay in this case). 4. I continue to believe it's an XY problem because you have not addressed why the significance of a large collection of bivariate correlations would be relevant for anything; it's usually not, albeit it's a common practice (somewhat misguided, but common). If you have some specific need to obtain p-values for a lot of bivariate correlations, what would that be? – Glen_b May 20 '23 at 06:09
  • @Glen_b Thank you for engaging so much, this is really helpful! I'm trying to remove all the correlated columns in a dataset to reduce the noise in it before onward input into a machine learning model. I've set a reasonably high limit of 0.6 because I have a secondary feature selection step which can deal with unimportant columns. As the dataset is large I'd like to check correlation using a sample of the rows, but I'd also like some statistical test for significance of that correlation too. – Connor May 21 '23 at 06:07
  • I don't think you need testing for that purpose at all. I also don't think bivariate correlation is necessarily a great way to approach the problem of dependence among features. For example, it's perfectly easy to have large collections of variables that have very low pairwise correlation but which are collectively singular. On the other hand, it's possible to have many middling pairwise correlations (like r=0.6) which don't necessarily cause any problem at all. – Glen_b May 21 '23 at 10:40
  • @Glen_b why don't I need testing? What would you recommend then for getting a numerical representation of the correlation between columns? – Connor May 22 '23 at 11:46
  • @Glen_b I've posted another question that is related to what we've discussed here, that asks if correlation method is problem dependent. If you have the time to have a look I'd be very grateful! https://stats.stackexchange.com/questions/616589/is-correlation-method-problem-dependent – Connor May 22 '23 at 15:18
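
For reference, here is a minimal sketch of the correlation t-test Glen_b describes above, again assuming Python with NumPy/SciPy (the helper name `corr_t_test` is hypothetical):

```python
import numpy as np
from scipy import stats

def corr_t_test(x, y):
    """Two-sided t-test of H0: rho = 0 (hypothetical helper).

    Uses t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of
    freedom, the statistic quoted in the comments.
    """
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    t = r * np.sqrt((n - 2) / (1 - r**2))
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return r, t, p

# Example with simulated data; the p-value matches scipy.stats.pearsonr
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.6 * x + rng.normal(size=50)
print(corr_t_test(x, y))
```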
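
Likewise, a minimal sketch of the permutation test suggested in the comments (the helper name `corr_permutation_test` is hypothetical; the studentized statistic per DiCiccio and Romano would be a refinement, not implemented here):

```python
import numpy as np

def corr_permutation_test(x, y, n_perm=10_000, seed=0):
    """Permutation p-value for H0: no association (hypothetical helper).

    Shuffles y relative to x and compares |r_obs| against the
    permutation distribution of |r|.
    """
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    r_obs = np.corrcoef(x, y)[0, 1]
    perm_r = np.array([np.corrcoef(x, rng.permutation(y))[0, 1]
                       for _ in range(n_perm)])
    # Add 1 to numerator and denominator so the p-value is never exactly 0
    p = (1 + np.sum(np.abs(perm_r) >= abs(r_obs))) / (n_perm + 1)
    return r_obs, p
```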
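
And a minimal sketch of Nick Cox's bootstrap suggestion (the helper name `corr_bootstrap_ci` is hypothetical):

```python
import numpy as np

def corr_bootstrap_ci(x, y, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a correlation (hypothetical helper).

    Resamples (x, y) pairs with replacement; no normality assumption.
    """
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample row indices
        boot[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    return np.quantile(boot, [alpha / 2, 1 - alpha / 2])
```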