
I understand that homoscedasticity, constant variance of the error terms at every $X$ value, is a key assumption of linear regression. Assume we collected a single data sample $(X,Y)$. The scatterplot could look like either the figure on the left or the figure on the right. The left figure shows that the spread of the error term is the same at $X=40$ and $X=80$, two arbitrarily selected $X$ values. That is homoscedasticity, and it appears to hold for all other $X$ values.

[Figure: two scatterplots of $Y$ against $X$, left and right panels]

  • Here we are checking homoscedasticity (or the lack of it) using a single sample from the population. In theory, if we collected 10 different samples from the same population, would homoscedasticity also hold at $X=40$ and $X=80$ within each of the other 9 samples?

  • At the same $X$ value, for example $X=40$, would the error spread (variance) be approximately the same across the 10 different samples? I think so...

  • What if we took just a single huge sample containing the entire population? Would the population scatterplot show constant variance across all $X$ values? I think so...

Thank you for any correction and validation!

Nick Cox

1 Answer


It depends on how you approach this issue. Imagine we have a well-defined population and two variables ($X$ and $Y$).

library(tidyverse)

x <- sample(0:100, 100000, replace=TRUE)
e <- rnorm(100000, mean = 0, sd = 5)
y <- x + e
pop <- data.frame(y,x)

If you are not familiar with R, that is not a problem. You can think of `e` as the random error (normally distributed with mean 0 and standard deviation 5). There is a linear relationship between $X$ and $Y$, namely $Y = X + \varepsilon$ with $\varepsilon \sim N(0, 5^2)$, so we have homoscedasticity by construction. Now, we sample 1000 respondents from this population and plot $X$ against $Y$:

pop %>% 
  slice_sample(n=1000) %>%
  ggplot(aes(x,y)) + geom_point() + theme_minimal() 

This is similar to the plot on the left in your post but with more observations ($n=1000$). We can think of this as an example of simple random sampling. So, if we draw ten random samples from this population (assuming the population is large enough to supply ten different samples), we will get different values but similar distributions (again, assuming that our samples are large enough). You can run the above code and see for yourself. The answer to your first question is yes, if the stated assumptions hold. Quite frequently, however, they do not. One way they can fail is through unequal selection probabilities, for example:

pop %>% 
  mutate(prop = x^3/max(x^3)) %>%
  slice_sample(n=1000, weight_by = prop) %>%
  ggplot(aes(x,y)) + geom_point() + theme_minimal() 

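The population is the same, but we sample from it with unequal selection probabilities (here, higher $X$ values have a higher probability of selection). Hence, we end up with a funnel shape, not as clear as in your post but still visible. Note that the selection here depends only on $X$, so the error variance at any given $X$ is unchanged; the funnel appears because many more points are drawn at high $X$ values, and more points cover a wider observed range.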
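One way to check whether a funnel like this reflects unequal error variance or only unequal numbers of points is to compare the residual spread within bands of $X$. The following is a minimal sketch in base R rather than the tidyverse, so that it runs without extra packages. The seed, the band boundaries, and the names `w`, `idx`, `band`, and `spread` are assumptions made for illustration; `sample(..., prob = w)` stands in for `slice_sample(weight_by = prop)`.

```r
# Rebuild a population like the one above (seed chosen arbitrarily).
set.seed(1)
n_pop <- 100000
x <- sample(0:100, n_pop, replace = TRUE)
y <- x + rnorm(n_pop, mean = 0, sd = 5)

# Unequal selection probabilities that grow with x.
w <- x^3 / max(x^3)
idx <- sample(n_pop, 1000, prob = w)  # weighted draw without replacement
xs <- x[idx]
ys <- y[idx]

# Residuals from a straight-line fit to the weighted sample.
res <- resid(lm(ys ~ xs))

# Compare the number of points and the residual SD in three bands of x.
band <- cut(xs, breaks = c(0, 50, 75, 100), include.lowest = TRUE)
spread <- data.frame(
  band = levels(band),
  n    = as.vector(table(band)),
  sd   = as.vector(tapply(res, band, sd))
)
print(spread)
```

If the residual standard deviations are all close to 5 while the counts differ sharply between bands, the visual funnel comes from the number of points per band rather than from a change in the error variance.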

Turning to your second question, imagine we take ten samples from our hypothetical population, and calculate the variance for $Y$ values at $X=40$ for each sample:

# This function draws one random sample of 1000 rows from pop and
# returns the variance of y among the sampled rows where x equals val.
pop_func <- function(val){
  df_hm <- pop %>% 
    slice_sample(n=1000) %>% 
    filter(x == val)

  var(df_hm$y)
}

We can calculate variances for ten random samples from our population.

# Pass the x value of interest (40); pop_func() has no default for val.
pop_list <- replicate(10, pop_func(40), simplify=FALSE)
samp_vardf <- do.call(rbind.data.frame, pop_list)
names(samp_vardf)[1] <- "var"
samp_vardf

        var
1  29.05200
2  58.93006
3  64.31047
4  29.64913
5  30.56311
6  33.86109
7  32.22681
8  14.07650
9  16.93287
10 19.24295

As you can see, the variances vary considerably across these ten samples. This is expected with just ten sample variances, each based on few observations: with $n=1000$ and 101 equally likely $X$ values, only about 10 $Y$ values fall at $X=40$ in each sample. So, the answer to your second question is, well, not necessarily. But there is an underlying distribution. If we take 1000 samples instead of 10 from this population and plot the distribution of the sample variances, we can see that it approximates a (scaled) chi-squared distribution (also see this question).
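The 1000-sample experiment can be sketched as follows, again in base R so that it runs without extra packages. The seed and the names `var_at` and `vars` are assumptions for illustration, not part of the code above. Because the number of points at $X=40$ differs from sample to sample, the result is only approximately a scaled chi-squared distribution.

```r
# Rebuild a population like the one above (seed chosen arbitrarily).
set.seed(42)
n_pop <- 100000
x <- sample(0:100, n_pop, replace = TRUE)
y <- x + rnorm(n_pop, mean = 0, sd = 5)

# Draw one simple random sample of size n and return the variance of y
# among sampled rows with x == val (NA if fewer than two such rows).
var_at <- function(val, n = 1000) {
  idx <- sample(n_pop, n)
  var(y[idx][x[idx] == val])
}

# Repeat for 1000 independent samples.
vars <- replicate(1000, var_at(40))

# The sample variances scatter around the true error variance, 5^2 = 25.
print(summary(vars))
hist(vars, breaks = 40,
     main = "Sample variances of y at x = 40",
     xlab = "Sample variance")
```

Individual values range widely, as in the ten-sample table, but their average should sit near 25, and the histogram should show the right skew typical of a chi-squared shape.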

Now imagine you have access to population data, and it turns out to be quite similar to the one we used in our examples (highly unlikely). Then, the answer to your last question is yes. But more realistically, we will have a single sample, and in that case we need to take into account assumptions, for example, about how the sample was drawn. Please keep in mind that these examples are quite simple. Usually, we model more than two variables, and the relationships may be non-linear.
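For the simulated population, this can be checked directly by computing the variance of $Y$ at every $X$ value. This is a minimal base R sketch; the seed and the name `pop_var` are assumptions for illustration.

```r
# Rebuild a population like the one above (seed chosen arbitrarily).
set.seed(123)
x <- sample(0:100, 100000, replace = TRUE)
y <- x + rnorm(100000, mean = 0, sd = 5)

# Variance of y within each distinct x value (101 groups of roughly 990 rows).
pop_var <- tapply(y, x, var)

# All group variances should lie close to the error variance, 5^2 = 25.
print(round(range(pop_var), 1))
print(round(pop_var[c("40", "80")], 1))
```

With roughly 990 observations per $X$ value, the group variances differ only slightly from 25, which is what constant variance across $X$ looks like at the population level.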

T.E.G.