Let’s say you are trying to find out whether there is a correlation between two stock prices, where both are likely non-stationary series. You have no concern as it relates to a potentially causal relationship...

Against all the rules, you run a simple correlation analysis even though both series are autocorrelated and non-stationary. You find there is a 98% correlation, so you conclude they depend on each other.

This is the conversation I just had with a colleague... but I think they are 100% wrong and I’d like some confirmation.

If you find two autocorrelated and non-stationary series to be 98% correlated, then the correlation is likely spurious. What this means to me is that the correlation we observe is likely due to complete chance (and their correlation is likely a result of their mutual dependence on something else outside of the two series themselves). So if our goal is to identify the extent to which these two series “depend” on each other, finding a valid correlation coefficient is necessary. Is this correct?

Richard Hardy
  • Looking at correlation between non-stationary series does not necessarily imply that the result is spurious; they could be cointegrated. In the context of stock prices, this is the basis of so-called "statistical arbitrage" and there's nothing statistically wrong with it. – Chris Haug Dec 09 '21 at 23:44
  • What is the goal of your correlation calculation? Is it simply to describe what happened in the past? Or would you like to use this correlation to make decisions about how to trade in the future? – Adrian Dec 09 '21 at 23:45
  • What is the formula for your 98% correlation? –  Dec 10 '21 at 00:03
  • Adrian, just to describe what’s happened in the past. – user10136297 Dec 10 '21 at 03:06
  • If I am purely interested in just understanding what’s happened in the past, then I can run my correlation analysis and not care if it’s “spurious”, right? – user10136297 Dec 10 '21 at 05:09
  • You're correct in the sense that the analyses you've performed do not raise evidence against the hypothesis that this correlation is spurious and therefore there could indeed be a third variable, for example, that causes these two, making them actually independent when you're aware of this third one. – mribeirodantas Dec 19 '21 at 13:21
  • This question and its answers seem like déjà vu. – Sextus Empiricus Dec 26 '21 at 22:56
  • https://stats.stackexchange.com/questions/263951/misunderstandings-of-spurious-correlation – Sextus Empiricus Dec 26 '21 at 22:57
  • Your question starts with "You have no concern as it relates to a potentially causal relationship..." but towards the end you start expressing such concerns "and their correlation is likely a result of their mutual dependence on something else outside of the two series themselves". – Sextus Empiricus Dec 26 '21 at 23:11

4 Answers

Here's a simulated example of two prices that are very highly correlated ($\rho = 0.9875$). When you attempt to predict the price change in one using the lagged value of the other, very little of the variation in the price change is explainable:

. clear

. set seed 12092021

. set obs 102
Number of observations (_N) was 0, now 102.

. gen t = _n

. tsset t

Time variable: t, 1 to 102
        Delta: 1 unit

. gen p1 = 1 + 3*t + rnormal(0,5)

. gen p2 = 3 + 2*t + rnormal(0,10)

. corr p1 p2
(obs=102)

             |       p1       p2
-------------+------------------
          p1 |   1.0000
          p2 |   0.9875   1.0000

. reg FD.p2 p1

      Source |       SS           df       MS      Number of obs   =       101
-------------+----------------------------------   F(1, 99)        =      0.01
       Model |  .727541841         1  .727541841   Prob > F        =    0.9436
    Residual |  14322.4337        99  144.671048   R-squared       =    0.0001
-------------+----------------------------------   Adj R-squared   =   -0.0100
       Total |  14323.1613       100  143.231613   Root MSE        =    12.028

------------------------------------------------------------------------------
       FD.p2 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          p1 |   .0009672   .0136392     0.07   0.944    -.0260959    .0280303
       _cons |   1.665843   2.420693     0.69   0.493    -3.137338    6.469024
------------------------------------------------------------------------------

. reg FD.p1 p2

      Source |       SS           df       MS      Number of obs   =       101
-------------+----------------------------------   F(1, 99)        =      0.01
       Model |  .683934381         1  .683934381   Prob > F        =    0.9171
    Residual |  6210.52068        99  62.7325321   R-squared       =    0.0001
-------------+----------------------------------   Adj R-squared   =   -0.0100
       Total |  6211.20461       100  62.1120461   Root MSE        =    7.9204

------------------------------------------------------------------------------
       FD.p1 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          p2 |  -.0013704   .0131245    -0.10   0.917    -.0274123    .0246715
       _cons |   3.260085   1.574913     2.07   0.041     .1351165    6.385054
------------------------------------------------------------------------------

Here FD is the forward first difference, i.e., the next period's value minus the current one (so $FD.p_t = p_{t+1} - p_t$).

The $R^2$ (aka R-squared) of both models is around zero, so very little of the variation in price changes tomorrow can be explained by the price today. This illustrates the intuition that knowing what you know today, you cannot act on this correlation to make money tomorrow.

You can play around with variations on this approach (using the lagged price change as a predictor, non-linear models, adding more data, more noise, or adding trends), and the conclusion stays qualitatively the same.
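
For anyone without Stata who wants to try such variations, here is a rough Python sketch of the same data-generating process. It is not a port of the log above: Python's random number generator differs from Stata's, so the numbers will not match, and the helper names (`pearson`, `simulate`, `forward_diff`) are made up for this illustration. It relies on the fact that the $R^2$ of a one-regressor OLS fit with an intercept equals the squared Pearson correlation.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=102, seed=12092021):
    """Two trending prices with independent noise (same DGP as the Stata code)."""
    rng = random.Random(seed)
    p1 = [1 + 3 * t + rng.gauss(0, 5) for t in range(1, n + 1)]
    p2 = [3 + 2 * t + rng.gauss(0, 10) for t in range(1, n + 1)]
    return p1, p2


def forward_diff(p):
    """FD.p_t = p_{t+1} - p_t; the result is one element shorter than p."""
    return [b - a for a, b in zip(p, p[1:])]


p1, p2 = simulate()

# Correlation of the price levels: driven almost entirely by the shared time trend.
level_corr = pearson(p1, p2)

# R^2 from regressing tomorrow's price change on today's level of the other price.
r2_fd_p2_on_p1 = pearson(forward_diff(p2), p1[:-1]) ** 2
r2_fd_p1_on_p2 = pearson(forward_diff(p1), p2[:-1]) ** 2

print(f"corr(p1, p2)        = {level_corr:.4f}")
print(f"R^2 of FD.p2 on p1  = {r2_fd_p2_on_p1:.4f}")
print(f"R^2 of FD.p1 on p2  = {r2_fd_p1_on_p2:.4f}")
```

Changing the noise standard deviations, the trend slopes, or `n` in `simulate` is an easy way to see how robust the pattern is.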

You might object that my toy example is flawed because the high correlation is contemporaneous, so if you knew p1 today, you could predict p2 today. I think that is wrong for the following reason. Suppose the DGP is as above, but unknown to you. You are an executive at company 1, and you learn that your CEO had been falsifying earnings and pinching bottoms. The news will become public shortly and lower p1. You can’t short your own stock without a vacation at Club Fed. Should you short the stock of company 2 if you know the correlation between p1 and p2 is ~1? I think that would be a terrible idea. This is what makes the correlation spurious and why that matters.

You could also have a causal relationship, but no correlation. When a house has air-conditioning with a preset desired temperature, there will be a strong positive non-spurious correlation between the amount of electricity used by the AC and the temperature outside. But there will be no correlation between the amount of electricity consumed and the inside temperature. The outside temperature and the inside temperature will also be uncorrelated. The last two are spurious non-correlations in my mind. But all three correlations are valid (though that has no formal definition in statistics) since a correlation is just a transformation of the data.
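
A toy simulation of the thermostat story, as a sketch only: the setpoint, the temperature range, the linear heat-gain and electricity formulas, and the size of the control error are all assumptions invented for illustration, not a model of a real AC unit. Outside heat pushes the inside temperature up, the AC pushes it back down by nearly the same amount, and the two causal effects cancel.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
n = 5000
setpoint = 22.0  # hypothetical thermostat setting, degrees C

outside = [rng.uniform(25.0, 40.0) for _ in range(n)]

# Degrees by which the house would warm above the setpoint without any cooling.
heat_gain = [o - setpoint for o in outside]

# The thermostat cancels the heat gain, up to a small control error.
control_error = [rng.gauss(0, 0.1) for _ in range(n)]
cooling = [h + e for h, e in zip(heat_gain, control_error)]

# Inside temperature is the setpoint plus whatever heat the AC failed to remove.
inside = [setpoint + h - c for h, c in zip(heat_gain, cooling)]

# Electricity use is assumed proportional to the cooling delivered.
electricity = [0.5 * c for c in cooling]

corr_elec_out = pearson(electricity, outside)
corr_elec_in = pearson(electricity, inside)
corr_out_in = pearson(outside, inside)

print(f"corr(electricity, outside) = {corr_elec_out:.3f}")
print(f"corr(electricity, inside)  = {corr_elec_in:.3f}")
print(f"corr(outside, inside)      = {corr_out_in:.3f}")
```

Both the outside temperature and the electricity used causally affect the inside temperature in this setup, yet neither correlates with it to any meaningful degree.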

This is all to say that a strong correlation is not necessary for a causal dependence to exist. And it is certainly not sufficient. Even the sign on the causal relationship could be different from the sign of the correlation. This matters for using correlations to do things out in the real world (i.e., interventions). This is not just an issue with time series data, but can happen with observational data.

dimitriy
  • Makes sense I think… so if two levels are highly correlated that still means we can have a horrible lagged relationship between the differences (a difference now correlated or not with a difference in the next period). So, is your point to instead of looking at correlation in this instance, maybe run a regression in differences with the independent variable lagged? – user10136297 Dec 10 '21 at 07:17
  • My point is that you can’t use the contemporaneous correlation when you make investment decisions, so it’s effectively useless even if it’s close to 1. – dimitriy Dec 10 '21 at 07:34
  • p2 | = -.0013704, This indicates r= .035 approximately. You conclude that the R2 (aka R-squared) of both models is around zero, so very little of the variation in price changes tomorrow can be explained by the price today. –  Dec 23 '21 at 01:24
  • "When you attempt to predict the price change in one using the lagged value of the other, very little of the variation in the price change is explainable." Rho squared = 1 indicates a perfect relationship, contrary to what you say/conclude? It implies a very, very high relationship. The lagged value of the other variable does have an effect! –  Dec 23 '21 at 01:30
  • I am not really following your argument. If you disagree with my answer, you should simulate your own data, perform the analysis that you deem appropriate, and post that as an answer here. – dimitriy Dec 23 '21 at 02:30
  • @dimitriy, I follow your argument/example, but you do not actually give a clear reply to the question. It seems that your argument can be summarized as follows: correlation that does not imply predictability is spurious. Is that so? – markowitz Dec 26 '21 at 18:04
  • Correlation is neither sufficient nor necessary to establish a causal dependence or its direction. In my example, correlation is an artifact created by an omitted variable, time. But there is certainly predictability, in the sense that p1 predicts p2 well. What makes it spurious is there is no causal relationship. This is consistent with Wikipedia. The spuriousness matters because there is no way to use that correlation. – dimitriy Dec 26 '21 at 20:53
  • @markowitz I made an edit that clarifies my response above. I meant to write “you can’t even use the contemporaneous correlation when making investment decisions” in my second comment. – dimitriy Dec 26 '21 at 21:22
  • @markowitz: I think dimitriy's argument is focussing on future prediction of the stock price, as would concern an investor making investment decisions. He is giving a range of scenarios which illustrate where correlation (of the prices, or their changes, or lagged-changes, etc.) may or may not be useful in making these decisions. His main simulation shows that you can have stock prices that are highly correlated (~1) but where there is insufficient correlation in the lagged changes to be useful for using one to predict the future value of the other. – Ben Dec 26 '21 at 23:50
  • @Ben, I understood dimitriy's example, but the question seems more general to me and does not boil down to trading strategies only. In the first draft, dimitriy did not say anything about causality, and I wanted to clarify whether he is considering a kind of spuriousness that is entirely separate from the concept of causality. From dimitriy's edit I understand that causality is his key argument. – markowitz Dec 27 '21 at 09:00

The whole notion of "spurious" correlation is easy to misinterpret. Correlation is correlation --- if it is estimated well (i.e., via a good estimator and with a reasonable amount of data) then we can confidently say that the correlation is such-and-such. Correlation is a statistical measure with an extremely weak interpretation --- it just measures the tendency for things to vary together (usually measured linearly), irrespective of the cause of this tendency. The only thing that can be spurious is if we go further than this and interpret the correlation in a way that is not justified. This can occur if a person uses correlation to infer a causal relationship between variables, or it can occur if a person uses marginal correlation to infer conditional correlation. In either case the larger inference can be "spurious" insofar as it does not follow from the correlation. As I've noted in another answer, I've always hated the term "spurious correlation" because it is not the correlation that is spurious, but the inference to some stronger result. If it were up to me we would never use this term, and would instead just state what we actually mean --- e.g., "spurious inference of cause", "spurious inference to conditional correlation", etc.

Now, with that little rant out of the way, let me address your specific concern. Since you are only interested in describing the past statistical relationships between the stock prices (as you say in your comments), you can report the correlation, but it should come with a number of important caveats on interpretation. Firstly, you should note that strong correlation between time-series can occur even for purely deterministic series with no statistical variation, so it often does not reflect any stochastic dependency between the series. This is something that has been recognised in the statistical community for over a century (see e.g., Yule 1926 and see this related answer). Secondly, even if changes in the stock prices are correlated, the ability to predict one stock from the other will depend on the cross-correlation in changes in the stock prices at sufficient lag values to allow use of one series to predict changes in the other. In large part, analysis of stock prices is best done by looking at lagged cross-correlation of changes in price, rather than correlation of the price series themselves.
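
To make that last suggestion concrete, here is a minimal Python sketch of a lagged cross-correlation of price changes. The data-generating process is invented for the illustration (changes in `y` follow changes in `x` with a one-period delay), and the helper names `changes`, `cumulate`, and `cross_corr` are hypothetical, not taken from any library.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def changes(prices):
    """Period-to-period price changes."""
    return [b - a for a, b in zip(prices, prices[1:])]


def cumulate(start, steps):
    """Turn a sequence of changes back into a price path."""
    levels = [start]
    for s in steps:
        levels.append(levels[-1] + s)
    return levels


def cross_corr(dx, dy, lag):
    """Correlation between dx at time t and dy at time t + lag (lag >= 0)."""
    if lag == 0:
        return pearson(dx, dy)
    return pearson(dx[:-lag], dy[lag:])


rng = random.Random(1)
n = 500

# Assumed DGP: y's change today is 0.8 times x's change yesterday, plus noise.
dx = [rng.gauss(0, 1) for _ in range(n)]
dy = [rng.gauss(0, 0.5)] + [0.8 * dx[t - 1] + rng.gauss(0, 0.5) for t in range(1, n)]

x = cumulate(100.0, dx)
y = cumulate(100.0, dy)

cc = {lag: cross_corr(changes(x), changes(y), lag) for lag in range(4)}
for lag, value in cc.items():
    print(f"lag {lag}: cross-correlation of changes = {value:+.3f}")
```

In this constructed case only lag 1 stands out, which is the kind of pattern that would make `x` useful for anticipating `y`. The correlation of the price levels by itself would not reveal at which lag, if any, such a relationship exists.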

Ben
  • +1 for the rant! – Alexis Dec 26 '21 at 02:40
  • @Ben, you said that 'correlation is correlation - if it is estimated well'. Spurious or not is a matter of interpretation, causal or conditional, etc. Later you speak about strong correlation between two deterministic time series, but this seems to me precisely an example of improper use of correlation. This is not a correlation. This is a bit different from 'spurious correlation' in the sense you suggest. Agree? – markowitz Dec 26 '21 at 14:17
  • @Ben, crystal clear explanation. –  Dec 26 '21 at 23:01
  • @markowitz: Well, to take an example, consider the time-series pairs $(x_t,y_t) = (1,2), (2,4),(3,6),...,(T,2T)$, which are deterministic. If you compute the Pearson correlation of these data points you get a correlation of +1. Similarly, if you compute the true (second moment) correlation of the underlying distribution that puts probability mass $1/T$ on each pair, you also get +1. So in that sense, these series are perfectly positively correlated. There are many improper uses you could make of this, but those would be improper uses ---i.e., interpretations attributed to correlation. – Ben Dec 26 '21 at 23:54
  • ... So in that example, I would say that this is correlation, but there are certainly many spurious interpretations/uses you could follow up with. – Ben Dec 26 '21 at 23:56
  • From this example I understand that correlation can be spurious (improperly used) even if the concept of causality is entirely absent. Is this what you mean? – markowitz Dec 27 '21 at 09:16
  • @markowitz: Yes, I think that's how I would view it (though others might find that odd). – Ben Aug 11 '22 at 11:49

There are two concerns with correlations of time series:

  • Correlation when causal relationship is absent. Correlation does not imply causation. An example is the correlation between ice cream sales and the death rate due to drowning. These two are both high in summer and low in winter and they correlate in time, but this is not due to a direct causal relationship between the two. In such a case, if a causal relationship between two variables is inferred based on a correlation between them, then people use the term 'spurious relationship' (the inference is not correct).

  • Correlation in a sample when statistical relationship for the population is absent. Another concern is that the correlation might be likely found in data, even in the absence of an underlying statistical relationship. Time series with autocorrelation have a tendency to go up/down for short periods of time and so they tend to correlate with each other within short windows of time. But this correlation is not significant. Yes, if you were to compute the significance assuming that the data points are independently distributed according to a bivariate normal distribution (for which you can compute the exact sampling distribution of the correlation coefficient), then it would turn out to be significant, but that assumption of independence is not correct when the time series follow trends or are autocorrelated.
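
The second point can be checked by simulation. The sketch below, with invented settings, draws pairs of independent Gaussian random walks (an extreme case of autocorrelation, chosen here as an assumption) and pairs of independent white-noise series, then counts how often the naive test that assumes independent observations would call the sample correlation significant at the 5% level. The critical value 0.197 is the approximate two-sided 5% cutoff for $|r|$ with $n = 100$ independent bivariate normal observations.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def random_walk(n, rng):
    """Cumulative sum of independent standard normal steps."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        path.append(level)
    return path


def white_noise(n, rng):
    """Independent standard normal draws (no autocorrelation)."""
    return [rng.gauss(0, 1) for _ in range(n)]


rng = random.Random(42)
n, n_pairs = 100, 1000
r_crit = 0.197  # approx. 5% two-sided critical |r| for n = 100 under independence

walk_hits = 0
noise_hits = 0
for _ in range(n_pairs):
    # Each pair is independent by construction, so every "hit" is a false positive.
    if abs(pearson(random_walk(n, rng), random_walk(n, rng))) > r_crit:
        walk_hits += 1
    if abs(pearson(white_noise(n, rng), white_noise(n, rng))) > r_crit:
        noise_hits += 1

walk_rate = walk_hits / n_pairs
noise_rate = noise_hits / n_pairs
print(f"naive 5% test rejects for independent random walks: {walk_rate:.1%}")
print(f"naive 5% test rejects for independent white noise:  {noise_rate:.1%}")
```

For white noise the rejection rate stays near the nominal 5%, while for the random walks the same test rejects far more often, even though the series are unrelated in both cases.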

  • The spurious correlations that come from integrated series seem to fall under your second argument. Indeed, it seems to me that the causality argument has no place there, at least not necessarily. Is this what you mean? – markowitz Dec 27 '21 at 09:22
  • @markowitz the question is a bit unclear to me (I am not sure whether the problem is about the first or second argument in my answer). I also agree with Ben that the term 'spurious correlation' is a bit unclear (it is not a common term, one might say that spurious correlation relates to my second argument, but often it is a misuse of the term spurious relation which is about the first argument). So what I mean to point out with my answer is that correlation of time series has two aspects, and if one is not concerned with causality, there can still be problems due to the second argument. – Sextus Empiricus Dec 27 '21 at 10:55
  • You lost me here: "Another concern is that the correlation might be likely found in data, even in the absence of a statistical relationship." A correlation is a statistical relationship. – Alexis Dec 28 '21 at 17:12
  • @Alexis, an observed correlation is only a relationship for the sample taken from a population. It need not be the case that the variables are correlated in the population distribution. So this sentence needs to be understood from the perspective of inference about a population from a sample. – Sextus Empiricus Dec 28 '21 at 17:27
  • Ah! Yes, that makes sense. Of course, the usual inferential caveats apply. – Alexis Dec 28 '21 at 18:15

The problem with spurious relationships - in the narrow context of pair trading - is not even with causality. The problem is that the relationship doesn't hold out of sample. This means that when you actually start trading on the developed algorithm, you won't make any money. And that can be a little bit of an issue, right?
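
A quick way to see the out-of-sample problem is to estimate the correlation on one window and check it on the next. The sketch below uses independent random walks as a stand-in for unrelated prices (an assumption made for illustration, not a model of real stocks), and compares the first-half and second-half correlations across many simulated pairs.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def random_walk(n, rng):
    """Cumulative sum of independent standard normal steps."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        path.append(level)
    return path


rng = random.Random(7)
n_pairs, half = 500, 100

gaps = []
flips = 0
for _ in range(n_pairs):
    a = random_walk(2 * half, rng)
    b = random_walk(2 * half, rng)
    r_in = pearson(a[:half], b[:half])    # "development" window
    r_out = pearson(a[half:], b[half:])   # "trading" window
    gaps.append(abs(r_in - r_out))
    if r_in * r_out < 0:
        flips += 1

mean_gap = sum(gaps) / n_pairs
flip_rate = flips / n_pairs
print(f"mean |in-sample r - out-of-sample r| = {mean_gap:.2f}")
print(f"share of pairs whose correlation changes sign = {flip_rate:.1%}")
```

When nothing links the two series, the in-sample correlation says very little about the next window, and its sign flips about as often as a coin toss.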

Aksakal