6

I've been working with an industry-related dataset where I need to analyze the correlation between a specific output (y) and several inputs (x1, x2, x3, etc.). During my search, I came across various correlation analysis methods such as Pearson, Kendall, Spearman, and Mutual Information. However, I'm facing a challenge: different methods yield completely different correlation rankings. In some cases, the variations are quite significant. For instance, one input might be the most correlated with y according to the Mutual Information method, but it ranks as one of the least correlated in other methods.

This inconsistency poses a dilemma for my analysis, which is crucial for an industry-related dataset. They need a clear understanding of how inputs correlate with y. I'm considering combining results from all methods but I'm worried that some might not be well-suited to my dataset, leading to inaccurate conclusions.

Could you advise on the most suitable correlation method for a general analysis? Can Mutual Information be considered the best method as it quantifies the "amount of information" obtained about one random variable by observing the other random variable? Are there any methods I should avoid because they are designed just for specific situations?

Also, I would appreciate any tips on how to effectively conclude and integrate the findings from different methods.

J-J-J
  • 2
    Your use of the word "correlate" will tend to be interpreted in terms of some kind of formal correlation coefficient here on CV, but is that what you really mean? "How inputs correlate with y" sounds like you are looking for some form of multiple regression. Compared to looking at individual "correlations" (however those might be defined), multiple regression goes much more deeply into the mutual relationships among the x's and y by accounting for the relationships among the x's, too. – whuber Jan 31 '24 at 16:31
  • 1
    Minimal comment to allow future readers to make full sense of this thread. OP posted the question, got answers as below, and also made some comments in reply to answers, giving more context to their question, and asking for more advice. Then for whatever reasons OP deleted their comments, and indeed their question, and deleted the account. The deletion of the question was reversed, but the OP's comments are no longer visible. Several comments by various people seeking to engage with the OP positively have also been deleted. – Nick Cox Feb 03 '24 at 11:08

3 Answers

13

I have to say that this question is, to me, posed backwards, for several quite different reasons. Here are some.

  1. Only exceptionally can different correlations be expected to agree, as they are based on quite different criteria. Indeed, mutual information isn't even a correlation in the traditional but useful sense of being bounded by $-1$ and $1$ with interpretable limits. This isn't resolved by discussions in principle about which is best, which are no more fruitful than discussing which pen or bag or car is best.

  2. When the evidence from correlations is puzzling or ambiguous, the only real way to resolve that is to look more closely at the data, preferably with some graphics, and indeed this is a lot of work for any non-trivial dataset. This could show up all sorts of quite different set-ups, including

(a) there is no appreciable relationship, so all correlations are more or less futile or unhelpful

(b) there is a relationship but not one described even roughly by a straight line or monotonic curve

(c) something is going on but you may need something like a transformation (e.g. if outliers are dominating Pearson correlation) to think about it appropriately.

  3. There is more recent work purporting to give more versatile measures of correlation, such as energy correlation or the work of Chatterjee. The optimists think that these are exciting and clever ideas. The pessimists think that naive users will confuse answers to "Is there a strong correlation?" with "Is there a relationship and what is it?" which in my experience is always closer to the real scientific or practical problem.

As a very simple example: What is the correlation between time of day and temperature? It's about zero, presumably, because rising limbs as the air heats up are balanced by falling limbs as it cools down. But naturally there is a relationship there, best described in other terms.
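The time-of-day example can be sketched numerically. The snippet below is only an illustration under stated assumptions: the temperature curve is a made-up, perfectly symmetric daily cycle (coolest at midnight, warmest at noon), and the helper names `pearson`, `average_ranks`, and `spearman` are hypothetical, written from scratch with the standard library rather than taken from any package.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Assumed, idealized data: one reading per hour at the half hour,
# with temperature following a symmetric cosine cycle (not real data).
hours = [h + 0.5 for h in range(24)]
temps = [round(15 - 8 * math.cos(2 * math.pi * h / 24), 6) for h in hours]

r_linear = pearson(hours, temps)   # straight-line association with clock time
rho = spearman(hours, temps)       # monotonic association with clock time

# Describing time "in other terms": as a position in the daily cycle.
cycle = [math.cos(2 * math.pi * h / 24) for h in hours]
r_cycle = pearson(cycle, temps)

print(f"Pearson(hour, temp)     = {r_linear:.3f}")
print(f"Spearman(hour, temp)    = {rho:.3f}")
print(f"Pearson(cos term, temp) = {r_cycle:.3f}")
```

In this constructed case both coefficients against raw clock time come out close to zero, while the correlation with the cosine term is close to $-1$: the relationship is strong, just not linear or monotonic in the hour.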

Nick Cox
  • 1
    You're asking several good extra questions in your comments, but that makes a single focused reply even harder. The question is getting closer to what do I need to know to build a model using machine learning on a large dataset. – Nick Cox Feb 01 '24 at 00:36
12

I agree with the answers by Frank and Nick. Adding to them:

Could you advise on the most suitable correlation method for a general analysis?

No, and I'd be leery of anyone who did so. Frank's answer goes about as far as I think you could go in this direction, but it is also, essentially, a "no".

Can Mutual Information be considered the best method as it quantifies the "amount of information" obtained about one random variable by observing the other random variable?

I don't think mutual information is a correlation. As to when you should use which, and what they are for, see this thread. But often you want both.

Are there any methods I should avoid because they are designed just for specific situations?

The methods you mention are all of fairly general application. But the method you should avoid is thinking that there will be one method that works in all cases. One of my favorite statistical quotations is

There are no routine statistical questions, only questionable statistical routines. --- Sir David Cox.

I also wonder why you are ranking the correlation methods. This obscures things. The gap between the highest and lowest measure may be small or large. If it is quite small, then the rankings are pretty much arbitrary; if it is large, then the very fact that there is a big difference indicates that you need to look more carefully.

Also, why correlation at all? You call y an "output" and the various x's "inputs". Correlation doesn't distinguish. If you want to look at outputs and inputs, then some sort of regression is the usual way.
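To illustrate why regression can tell a different story from one-at-a-time correlations, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: `x2` is built to be strongly related to `x1`, `y` depends on `x1` only, and `two_predictor_ols` is a hypothetical helper that solves the 2x2 normal equations by hand.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def two_predictor_ols(x1, x2, y):
    """Least-squares slopes for y ~ x1 + x2 with an intercept.

    Centers the data, then solves the 2x2 normal equations directly.
    """
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [a - m1 for a in x1]
    c2 = [a - m2 for a in x2]
    cy = [a - my for a in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

# Simulated data (assumed for illustration, not from the question):
# x2 tracks x1 closely, but y is generated from x1 alone.
rng = random.Random(0)
x1 = [rng.gauss(0, 1) for _ in range(2000)]
x2 = [a + 0.5 * rng.gauss(0, 1) for a in x1]
y = [2 * a + rng.gauss(0, 1) for a in x1]

r_marginal = pearson(x2, y)          # x2 looks strongly "correlated" with y
b1, b2 = two_predictor_ols(x1, x2, y)

print(f"Pearson(x2, y) = {r_marginal:.2f}")
print(f"slope for x1 = {b1:.2f}, slope for x2 = {b2:.2f}")
```

In this setup `x2` shows a high correlation with `y` purely through its link with `x1`, while its regression slope is near zero once `x1` is in the model. Real data will rarely be this clean, but it is the kind of structure that individual correlations cannot reveal.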

Finally, it would help if you told us what industry, what variables, how they are measured, what your sample size is, and, most crucially, what you want to find out, not in statistical terms, but in substantive ones.

Peter Flom
  • #1 Good to do EDA. I would use graphs. 150 graphs is not so many. For the large number of samples, you can make the plotting symbol a very small dot, or transparent, in a scatter plot. For choice of correlation, see Frank's answer.

    #2 There's a whole literature on model building. I wouldn't use correlation in this way as part of that. You probably have some substantive knowledge that can guide things.

    – Peter Flom Jan 31 '24 at 22:23
11

When the variables are almost continuous and relationships are not expected to be very non-monotonic, my default choice is Spearman's $\rho$. If you expect non-monotonic relationships that do not change direction more than once, then a 2-degree-of-freedom generalization of $\rho$ is handy. See https://hbiostat.org/rmsc/multivar

If examining more than a few variables, it may be important to quantify the uncertainty in rankings of coefficients. https://hbiostat.org/bbr/hdata gives some simulation and bootstrap approaches.
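One way to get a feel for the uncertainty in rankings is a simple percentile bootstrap of the ranks themselves. The sketch below is not the procedure from the linked notes; it is a minimal stand-in under stated assumptions: the data are simulated, inputs are ranked by $|\rho|$, and all function names (`rank_by_strength`, `bootstrap_rank_intervals`, and the helpers) are hypothetical.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def rank_by_strength(columns, y):
    """Rank inputs by |Spearman rho| with y; rank 1 is the strongest."""
    strengths = [abs(spearman(col, y)) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: -strengths[j])
    ranks = [0] * len(columns)
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks

def bootstrap_rank_intervals(columns, y, n_boot=200, level=0.95, seed=0):
    """Percentile intervals for each input's rank, resampling rows."""
    rng = random.Random(seed)
    n = len(y)
    draws = [[] for _ in columns]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        cols_b = [[col[i] for i in idx] for col in columns]
        y_b = [y[i] for i in idx]
        for j, r in enumerate(rank_by_strength(cols_b, y_b)):
            draws[j].append(r)
    tail = (1 - level) / 2
    intervals = []
    for d in draws:
        d.sort()
        lo = d[int(tail * n_boot)]
        hi = d[min(n_boot - 1, int((1 - tail) * n_boot))]
        intervals.append((lo, hi))
    return intervals

# Simulated data (assumed): x1 matters a lot, x2 and x3 about equally little.
rng = random.Random(1)
n = 150
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [rng.gauss(0, 1) for _ in range(n)]
y = [a + 0.3 * b + 0.25 * c + rng.gauss(0, 1) for a, b, c in zip(x1, x2, x3)]

intervals = bootstrap_rank_intervals([x1, x2, x3], y)
for name, (lo, hi) in zip(["x1", "x2", "x3"], intervals):
    print(f"{name}: rank between {lo} and {hi}")
```

When two inputs have similar strength, their rank intervals will typically overlap, which is a direct sign that the observed ordering between them should not be taken literally.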

Note that the minimum sample size needed to estimate a single correlation coefficient well is 400.
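The answer does not say how the figure of 400 is derived. One plausible reading, and it is only an assumption here, is that it corresponds to a 95% margin of error of roughly $\pm 0.1$. The sketch below checks that reading with the Fisher $z$ approximation, which strictly applies to Pearson's $r$ under bivariate normality; `fisher_half_width` is a hypothetical helper name.

```python
import math

def fisher_half_width(r, n, z_crit=1.96):
    """Approximate half-width of a 95% interval for a correlation r.

    Uses the Fisher z transform: atanh(r) is treated as roughly normal
    with standard error 1 / sqrt(n - 3).
    """
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    lo = math.tanh(z - z_crit * se)
    hi = math.tanh(z + z_crit * se)
    return (hi - lo) / 2

# Half-width near r = 0, where the interval is widest.
for n in (50, 100, 400, 1000):
    print(n, round(fisher_half_width(0.0, n), 3))

hw_400 = fisher_half_width(0.0, 400)
```

Under this approximation the half-width at $n = 400$ is close to 0.1, and it is noticeably wider for the smaller sample sizes shown.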

Frank Harrell