3

Let’s say that $X$, $Y$ and $Z$ have some fat-tailed distributions. $X$ and $Y$ are stock returns, i.e., their samples can take both positive and negative values, while $Z$ is some random variable that has positive skew but always assumes positive values. Would it then be fair to assume $$\operatorname{Cor}(X/Z, Y) = \operatorname{Cor}(X, YZ)?$$

Using actual data, one of them is positive while the other is negative. How is that possible?
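For concreteness, here is a minimal simulated sketch of the kind of setup I have in mind (the $t$ and log-normal distributions are just illustrative stand-ins, not my actual data):

set.seed(123)
n <- 5000
X <- rt(n, df = 3)   # fat-tailed, takes both signs
Y <- rt(n, df = 3)   # fat-tailed, takes both signs
Z <- rlnorm(n)       # strictly positive, positively skewed

cor(X / Z, Y)
cor(X, Y * Z)
# the two sample correlations generally differ and need not even share a sign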

Edit / Extension

Both the answers, by @Tim and @SextusEmpiricus, are good validations of my empirical observation above. However, I do not fully appreciate it yet.

Say I fit the following two linear regression models:

$$X = \beta_1 Y Z$$

$$X / Z = \beta_2 Y$$

Let’s say the second model, i.e., the one with $\beta_2$, is the better model with a higher $R^2$. Can I then get a better estimate $\hat{X}$ from $Z \beta_2 Y$ than from $\beta_1 Y Z$?
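One way I could check this empirically, continuing the illustrative simulation above (a sketch only; the idea is to compare both fits on the scale of $X$ rather than via the two $R^2$ values):

fit1 <- lm(X ~ I(Y * Z) + 0)          # X   = beta_1 * Y * Z + error
fit2 <- lm(X / Z ~ Y + 0)             # X/Z = beta_2 * Y     + error
Xhat1 <- fitted(fit1)                 # beta_1 * Y * Z
Xhat2 <- Z * coef(fit2) * Y           # Z * beta_2 * Y
c(rmse1 = sqrt(mean((X - Xhat1)^2)),  # compare both predictions on the
  rmse2 = sqrt(mean((X - Xhat2)^2)))  # scale of X, not via the two R^2 values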

Gerry
  • One problem is that you haven't fully specified the models, because you forgot the error terms. If the first one is $$X = \beta_1 Y Z + \epsilon,$$ (with the usual assumptions on $\epsilon:$ namely, zero-mean, independent, and of common variance) then the second one must be $$X/Z = \beta_2 Y + \epsilon/Z.$$ The latter won't even be defined when $Z=0,$ but even worse, the variances of the errors now vary with $Z.$ – whuber Jul 28 '23 at 20:39

2 Answers

6

Why would that hold? You state nothing about the relation between $X$ and $Z$ or between $Y$ and $Z$, and those correlations would influence the result. To convince yourself, create a synthetic dataset where the variables are positively or negatively correlated and check what happens. TL;DR: the equality does not hold, and the sign of the correlation can change.

> set.seed(42)
> N <- 100
> X <- rnorm(N)
> Y <- rnorm(N)
> Z <- rnorm(N)
> cor(cbind(X,Y,Z))
##             X          Y           Z
## X  1.00000000 0.03127984 -0.14477734
## Y  0.03127984 1.00000000  0.07122699
## Z -0.14477734 0.07122699  1.00000000

> cor(X/Z, Y)
## [1] 0.05534386
> cor(X, Y*Z)
## [1] -0.202128

Notice that even if you increase $N$ to something large (like 100k samples), the correlations between the variables will be close to zero (they are independent, so the correlation in the smaller sample was just by chance), but the equality still won't hold.
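For instance, such a check could look like this (a sketch; the exact values will vary):

> set.seed(42)
> N <- 10^5
> X <- rnorm(N); Y <- rnorm(N); Z <- rnorm(N)
> round(cor(cbind(X, Y, Z)), 3)  # pairwise correlations now close to zero
> cor(X/Z, Y)
> cor(X, Y*Z)                    # still two different numbers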

Tim
  • Since theoretically both correlations are equal, this is not a true counter-example. In a general case, both correlations are equal when $Z$ is constant. – Xi'an Jul 28 '23 at 10:06
  • @Xi'an agree, and I even mention in my answer that in the example the empirical correlation is by pure chance. The example is just to show that it is rather trivial to produce a counterexample. With true non-zero correlation, this would also hold, but I feel like a counterexample is enough here. – Tim Jul 28 '23 at 10:12
  • Thanks @Tim. $Z$ is very close to $Stdev(Y)$ – Gerry Jul 28 '23 at 13:43
  • @Gerry I'm not sure what you mean by that. The standard deviation of $Y$ is a constant and it tells us nothing about how they are correlated. – Tim Jul 28 '23 at 13:46
  • It's a rolling time series of the standard deviation of $Y$. If we think of $Y_i$ as the return of one of multiple stocks, then $Z_i$ is the standard deviation of a rolling time series of $Y_i$. Here, $Y$ is the collection of returns of $n$ stocks, $0 < i \le n$. – Gerry Jul 28 '23 at 14:07
  • @Tim the simulation validates the empirical outcome using normal RVs. However, I am not quite sure I fully appreciate it. Say I run a linear regression to estimate $X$ using $Y$ and $Z$. Does this mean $b_1 \neq b_2$ in the following two models: $X = b_1 * Y * Z$ and $X / Z = b_2 * Y$? – Gerry Jul 28 '23 at 16:45
  • Let's say $b_2$ is the better model, with a higher $R^2$; can I then get a better estimate of $\hat{X}$ using $Z * b_2 * Y$ than say $b_1 * Y * Z$? – Gerry Jul 28 '23 at 17:05
  • @Gerry it's a different question, but the two regression models are different: you omitted the error term from the models; if you add it, you will see they're not equivalent. – Tim Jul 28 '23 at 17:50
  • Thanks @Tim . Agreed about error term. My point is that it is possible to improve model for $\hat{X}$ by just moving variables around and/or changing the order of regression. It’s a separate question but that is how I got this observation. – Gerry Jul 28 '23 at 18:29
  • @Gerry it may be possible but those would be different models that may not be equivalent. – Tim Jul 28 '23 at 18:35
4

Using actual data, one of them is positive while the other is negative. How is that possible?

This is not an example with fat-tailed distributions, but it will nevertheless provide an intuition for how this can happen.

I start by generating variables $U = X/\sqrt{Z}$ and $V = Y\sqrt{Z}$ according to a joint distribution that has both an upward and a downward trend, and I define a variable $Z$ that is correlated with $U$. I constructed this such that the upward trend occurs for small values of $Z$ and the downward trend occurs for large values of $Z$. Then, depending on whether we multiply or divide by $\sqrt{Z}$, we up-weight or down-weight the two different regions.

(Figure: scatter plots of $(X/\sqrt{Z},\, Y\sqrt{Z})$, $(X/Z,\, Y)$ and $(X,\, YZ)$ from the simulation below, each titled with the corresponding correlation.)

R-code:

set.seed(1)
n = 10^4

u = runif(n)                       # x/sqrt(z)
v = u*(1-u) + runif(n, 0, 0.1)     # y*sqrt(z)

z = 0.5 + u + runif(n, -0.1, 0.1)
x = u*sqrt(z)
y = v/sqrt(z)

plot(x/sqrt(z), y*sqrt(z), main = paste0("correlation = ", round(cor(u, v), 2)),
     ylim = c(0, 0.4), pch = 20, cex = 0.5, col = rgb(0, 0, 0, 0.02))

plot(x/z, y, main = paste0("correlation = ", round(cor(x/z, y), 2)),
     ylim = c(0, 0.4), pch = 20, cex = 0.5, col = rgb(0, 0, 0, 0.02))

plot(x, y*z, main = paste0("correlation = ", round(cor(x, y*z), 2)),
     ylim = c(0, 0.4), pch = 20, cex = 0.5, col = rgb(0, 0, 0, 0.02))


Say I fit the following two linear regression models:

$$X = \beta_1 Y Z$$

$$X / Z = \beta_2 Y$$

Does this mean $b_1 \neq b_2$?

With that view you might think that the regression model should not change if you divide both sides by some variable, and that $b_1$ and $b_2$ should be the same. But what you are changing with the division/multiplication is not just the model for the mean; the residuals change as well.
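As a rough illustration (a sketch reusing the simulated x, y, z from the code above), fitting the two forms directly shows that the estimated coefficients need not agree:

b1 <- coef(lm(x ~ I(y*z) + 0))  # X   = b_1 * Y * Z + error
b2 <- coef(lm(x/z ~ y + 0))     # X/Z = b_2 * Y     + error/Z
c(b1, b2)                       # generally two different numbers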

Similar concepts come up in other questions, for example https://stats.stackexchange.com/questions/20553/ and https://stats.stackexchange.com/a/550757/164061.

The multiplication/division in the above example is effectively changing the weights of the data points in the regression.

With the above example, the following two regressions give the same result, with a slope of 0.03989:

intercept = rep(1, n)        # explicit intercept column
intercept2 = intercept/z     # the intercept column divided by z as well
y2 = y*z
x2 = x/z

lm(y2 ~ intercept + x + 0)                   # regress y*z on 1 and x
lm(y ~ intercept2 + x2 + 0, weights = z^2)   # equivalent weighted regression

Note that in this case we also need to change the intercept and not just the x variable.
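For example (a sketch of what goes wrong otherwise), keeping the original untransformed intercept column changes the criterion that the weighted regression minimizes, so the slope generally no longer matches:

lm(y ~ intercept + x2 + 0, weights = z^2)   # untransformed intercept: generally a different slope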

  • Thanks @SextusEmpiricus, the above validates my empirical observation using simulated RVs. However, I am not quite sure I fully appreciate it. Say I run a linear regression to estimate $X$ using $Y$ and $Z$. Does this mean $b_1 \neq b_2$ in the following two models: $X = b_1 * Y * Z$ and $X / Z = b_2 * Y$? – Gerry Jul 28 '23 at 16:50
  • Let's say $b_2$ is the better model, with a higher $R^2$; can I then get a better estimate of $\hat{X}$ using $Z * b_2 * Y$ than say $b_1 * Y * Z$? – Gerry Jul 28 '23 at 17:04
  • @Gerry with that view you might think that the regression model should not change if you divide both sides by some variable and that $b_1$ and $b_2$ should be the same, but what you are changing with division/multiplication is not just the model for the mean; it is also the residuals that change. Similar questions are https://stats.stackexchange.com/questions/20553/ and https://stats.stackexchange.com/a/550757/164061 (example: the regressions $Y = a \cdot \exp(bX) + \epsilon$ and $\log(Y) = \log(a) + bX + \epsilon$ are not the same) – Sextus Empiricus Jul 28 '23 at 17:21
  • thanks @SextusEmpiricus for sharing your thoughts. I re-learned the value of the error term. Basically, if my objective is getting the “best model” for explaining $X$ using $Y$ and $Z$, then I need to try all permutations and transformations to get the best in-sample (and hopefully out-of-sample) $R^2$ – Gerry Jul 28 '23 at 17:56
  • What is iteratively weighted least squares? – Gerry Jul 28 '23 at 18:00
  • @Gerry I meant reweighted, but forget that comment anyway. – Sextus Empiricus Jul 28 '23 at 18:08