According to this post, the expected correlation between the sampling distributions for the slope and intercept in OLS regression is given by E(Corr) = -E(X) / sqrt(E(X^2)).
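For reference, the formula can be recovered from the standard OLS covariance matrix. This is only a sketch under the usual assumptions (fixed x values, independent errors with common variance); the symbols n, x_i, and \sigma^2 are notation introduced here rather than taken from the linked post, and the expectations are read as averages over the observed x values.

```latex
\[
\operatorname{Cov}\begin{pmatrix}\hat\beta_0 \\ \hat\beta_1\end{pmatrix}
= \sigma^2 (X^\top X)^{-1}
= \frac{\sigma^2}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}
\begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{pmatrix}
\]
\[
\operatorname{Corr}(\hat\beta_0, \hat\beta_1)
= \frac{-\sum_i x_i}{\sqrt{n \sum_i x_i^2}}
= -\frac{\bar{x}}{\sqrt{\overline{x^2}}}
\]
```

The error variance cancels, so the correlation depends only on the x values.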

Now consider an experiment with N trials in which I measure Y at essentially the same X values in every trial, so that E(Corr) is essentially constant across trials. I then fit a line to the results of each trial, which gives a population of N slope-intercept pairs.

My initial assumption is that if X is constant, then the population distribution of slopes and intercepts should be correlated to the same degree as E(Corr); however, my results suggest otherwise.

My question is: Does the correlation between the slopes and intercepts of the population distribution have a statistical explanation similar to E(Corr), or could this result be a unique property of the dataset I am working with? In other words, are the slopes and intercepts of the population distribution naturally correlated or is this only true if certain conditions are met?

Edit: here is some R code to demonstrate what I mean.

set.seed(89112)
#Number of trials
n=10000
#x is constant across trials
x=c(1,10,30,60,100,200)
#pre-allocate matrix to save results
res=matrix(NA,nrow=n,ncol=2)
#run loop
for (i in 1:n) {
  #generate sample 
  y=1+rnorm(1,0,0.01)*x+rnorm(length(x))
  #fit linear model
  mod=lm(y~x)
  #save intercept
  res[i,1]=summary(mod)$coefficients[1,1]
  #save slope
  res[i,2]=summary(mod)$coefficients[2,1]
}
#Correlation between populations of slopes and intercepts
cor.test(res[,1],res[,2])
#> estimated cor: -0.3677728

#Correlation between sampling distributions of slopes and intercepts
-mean(x)/sqrt(mean(x^2))
#> -0.7005973
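
One way to see where both numbers could come from is to compute the two covariance matrices directly from the design. This is a sketch, not a definitive analysis: it assumes the per-trial slope drawn by rnorm(1,0,0.01) is independent of the noise, so it only adds its variance to the variance of the fitted slopes. The names sigma, tau, V_sampling, and V_population are introduced here for illustration.

```r
# Design points from the question (same x in every trial)
x <- c(1, 10, 30, 60, 100, 200)
X <- cbind(1, x)   # design matrix with an intercept column

sigma <- 1         # sd of the residual noise, as in rnorm(length(x))
tau   <- 0.01      # sd of the per-trial slope, as in rnorm(1, 0, 0.01)

# Sampling covariance of (intercept, slope) for a single fixed model
V_sampling <- sigma^2 * solve(crossprod(X))

# Assumption: the random slope is independent of the noise, so the
# population of fitted slopes gains an extra variance of tau^2
V_population <- V_sampling + diag(c(0, tau^2))

cor_sampling   <- cov2cor(V_sampling)[1, 2]    # about -0.70
cor_population <- cov2cor(V_population)[1, 2]  # about -0.36

print(c(sampling = cor_sampling, population = cor_population))
```

Under these assumptions, the extra slope variance leaves the covariance unchanged but inflates the slope variance, which pulls the correlation toward zero. The simulated value above is close to the second number.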

  • Could you provide the evidence in your results that "suggest otherwise"? And could you clarify what you mean by "E(Corr)" in the sense of a "statistical explanation"? As far as theoretical results go, the basic one is that under the usual OLS assumptions, the slope and intercept estimates are uncorrelated if and only if you have centered the explanatory variable (to make its mean zero). – whuber Dec 13 '23 at 18:17
  • Re the code: you are not creating "sampling distributions." You are creating different models in each iteration. Delete rnorm(1,0,0.01)* and try it again. – whuber Dec 13 '23 at 19:01
  • @whuber Sure, I've appended R code to my initial question, illustrating the distinction between the correlation in the populations of slopes and intercepts and the correlation in their respective sampling distributions. The statistical explanation for E(Corr) is given in the post I linked, which explains why the correlation between the sampling distributions of the slope and intercept is equal to -E(X)/sqrt(E(X^2)). I am basically wondering if the same rationale applies in my case when interpreting the populations of slopes and intercepts. – Applesauce26 Dec 13 '23 at 19:05
  • Clearly not, because the post you reference concerns a single, fixed regression model. – whuber Dec 13 '23 at 19:08
  • @whuber I can clearly see that they are different from the results, but what I am trying to understand is why they are different if X is constant. – Applesauce26 Dec 13 '23 at 19:12
  • Because the models are not constant! You are introducing considerably more variation than exists in the sampling distribution from a single model. – whuber Dec 13 '23 at 21:29
  • @whuber thank you, I now see the difference between the correlation of the sampling distribution of a single model vs. the population of all models. Just so I am clear, if there is no statistical reason that the population of slopes and intercepts would be correlated if X is constant, what would it mean if they were correlated to some degree? Could a non-statistical mechanism cause such correlation, say if the intercept and slope both had physical meaning? – Applesauce26 Dec 13 '23 at 21:53
  • I don't follow that question. "Population of slopes and intercepts would be correlated" sounds like an assumption about hyperparameters in a Bayesian or hierarchical model, but that's not a statistical determination: it's a matter of what model you deem appropriate for your data and analysis. – whuber Dec 13 '23 at 22:14
  • @whuber Here's an example. Say I have measurements of light intensity at the same underwater depths at N locations. I fit a linear model at each location and end up with N sets of slope-intercept pairs that I find are correlated. These N sets are what I would refer to as "the population of slopes and intercepts." I would like to determine if this correlation can be explained by a physical mechanism. But before I go further, I want to make sure I am not missing some statistical explanation for the correlation (i.e., like how the sampling distributions of slopes and intercepts are correlated) – Applesauce26 Dec 13 '23 at 22:44
  • That helps: consider including the context in your question. I wonder, though, why you are studying some obscure, indirect statistic like this correlation rather than something more direct and interpretable. (The correlation depends strongly on the origin of your coordinate system and therefore might not even have a physical meaning.) Is there some theory that suggests studying this correlation? – whuber Dec 13 '23 at 22:47
  • The origin in this case is the surface, so in some sense it does have a physical meaning. I am interested in this correlation because if it is real, then it suggests that values at the surface contain some information about the values at depth. My original goal was not to study this correlation, I simply stumbled upon it and now I am trying to retroactively understand what it might mean and if it has a statistical or physical explanation. – Applesauce26 Dec 13 '23 at 23:09
  • It's hard to see what kind of information that would be. The slopes and intercepts are guaranteed to be negatively correlated from the mere fact that all depths will have the same sign. If you want to study how surface values are associated with values at depth, why not do that directly? – whuber Dec 13 '23 at 23:16
  • @whuber Would it be reasonable to test this by centering the data, finding the slope then comparing that slope to the measured value at the surface? Surely if there is a negative relationship between these two values it would suggest that the change with depth is dependent on the initial value, correct? – Applesauce26 Dec 14 '23 at 15:30
  • I don't see how that follows at all. By observing how some statistic varies with depth it's unclear how one could conclude that the statistic "is dependent on the initial value" in any meaningful sense of that phrase (which is ambiguous: does that mean the initial value determines all other values or not?). – whuber Dec 14 '23 at 15:39
  • @whuber what if the mean of the statistic (in this example, light intensity) monotonically decreased with depth and its variance also decreased with depth, such that depth profiles of light intensity starting at large initial values tended to have larger negative slopes. Forget whether or not this is physically possible, how would one tell statistically if this sort of phenomenon were observed? – Applesauce26 Dec 14 '23 at 15:59
  • It's hard to tell precisely what you're describing, but it sounds like you posit a regression model in which both the conditional expectation of the response (light intensity) and its conditional variance are modeled. It's unclear how this might be related to "large initial values tended to have larger negative slopes," because one does not imply the other. I suspect that a fuller description of your data and (if possible) a mathematical statement of your model might clarify your situation. – whuber Dec 14 '23 at 18:24
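
The suggestion in the comments to delete rnorm(1,0,0.01)* can be sketched as follows. This is a minimal check, assuming the same x values and noise as in the question; the smaller number of trials (2000) and the names sim_cor and theory_cor are choices made here to keep the example quick.

```r
set.seed(89112)

n <- 2000                          # fewer trials than the question, for speed
x <- c(1, 10, 30, 60, 100, 200)    # x is constant across trials
res <- matrix(NA_real_, nrow = n, ncol = 2)

for (i in seq_len(n)) {
  # Same model in every trial: the slope is fixed, only the noise changes
  y <- 1 + x + rnorm(length(x))
  # coef() returns c(intercept, slope)
  res[i, ] <- coef(lm(y ~ x))
}

sim_cor    <- cor(res[, 1], res[, 2])    # correlation across simulated fits
theory_cor <- -mean(x) / sqrt(mean(x^2)) # value from the formula

print(c(simulated = sim_cor, theory = theory_cor))
```

With the model held fixed, the fitted slope-intercept pairs are draws from a single sampling distribution, so the simulated correlation should land near the value from the formula, up to simulation error.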
