
I am reading Bain/Engelhardt's "Introduction to Probability and Mathematical Statistics" on maximum likelihood estimation (MLE) for linear regression (pp. 519-522).

I first summarize 3 key points from the textbook that I am interested in.

(1) The MLEs are $ \hat{\beta} = (X^TX)^{-1}X^TY $ and $ \hat{\sigma}^2 = \frac{(Y-X\hat{\beta})^T(Y-X\hat{\beta})}{n} $. For this part, I totally get it.

(2) Further, the book mentions that $\tilde{\sigma}^2 = \frac{(Y-X\hat{\beta})^T(Y-X\hat{\beta})}{n-p-1} $ is the UMVUE of $\sigma^2$. Okay, I understand that this is an unbiased estimator, so no problem.

(3) Further, it says the following: $T=\frac{\hat{\beta}_j-\beta_j}{\sqrt{\tilde{\sigma}^2 a_{jj}}} \sim t(n-p-1)$, where $a_{jj}$ is the diagonal entry of $(X^TX)^{-1}$ corresponding to $\beta_j$. Basically, this can be used to do hypothesis tests. For $H_0: \beta_j = \beta_{j0}$, we reject $H_0$ if $|t| > t_{1-\alpha/2}(n-p-1)$, where $t = \frac{\hat{\beta}_j-\beta_{j0}}{\sqrt{\tilde{\sigma}^2 a_{jj}}}$.

If I understand correctly, typically we take $\beta_{j0} = 0$, and thus we can calculate the t-statistic as $t = \frac{\hat{\beta}_j}{\sqrt{\tilde{\sigma}^2 a_{jj}}}$, which is compared against the $t(n-p-1)$ distribution. (A small numerical check of these formulas is sketched below.)
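To make the formulas concrete, here is a minimal R sketch, using the same `mtcars` example (mpg on hp) that comes up later in this post. It computes $\hat{\beta}$, $\hat{\sigma}^2$, $\tilde{\sigma}^2$, and the slope's t-statistic by hand; the last line can be checked against the `hp` row of `summary(lm(mpg ~ hp, data = mtcars))`.

```r
# Minimal sketch of the textbook formulas, using mtcars (mpg on hp)
# purely as an illustrative example.
y <- mtcars$mpg
X <- cbind(1, mtcars$hp)           # design matrix with an intercept column
n <- nrow(X)
p <- ncol(X) - 1                   # p = number of slopes, not counting the intercept

XtX_inv  <- solve(t(X) %*% X)
beta_hat <- XtX_inv %*% t(X) %*% y           # MLE of beta
res      <- y - X %*% beta_hat
sigma2_hat   <- sum(res^2) / n               # MLE of sigma^2 (biased)
sigma2_tilde <- sum(res^2) / (n - p - 1)     # UMVUE of sigma^2

# t-statistic for H0: beta_j = 0, here for the slope on hp (index 2 in R)
a_jj   <- diag(XtX_inv)[2]
t_stat <- beta_hat[2] / sqrt(sigma2_tilde * a_jj)
p_val  <- 2 * pt(-abs(t_stat), df = n - p - 1)
c(t = t_stat, p = p_val)   # should match summary(lm(mpg ~ hp, data = mtcars))
```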

OK, my two questions are as follows.

(1) If we use MLE to estimate the regression coefficients of a linear regression, it seems we can use a t-test to evaluate the significance level, right? It seems yes, based on this textbook. If so, what are the degrees of freedom for this t-test? It seems they should be $n-p-1$ ($p$ not including the intercept). Thus, for simple linear regression, it will be $n-1-1=n-2$. Correct?

(2) We know that MLE typically also estimates the variance of the noise term, $ \hat{\sigma}^2$. I know it counts as one parameter. Does it use up one more degree of freedom? If so, why is there no need to use $n-3$ in the t-test for the regression coefficients? Is it because $\beta$ and $\sigma^2$ are independent, and thus the estimate of $\sigma^2$ does not affect the df of the t-test for the regression coefficients?

Thank you so much! I look forward to any feedback and help.

Added Content:

Even though I added this in the comment section of Rachel's answer, I think it is better to put it here, since the question body allows code formatting.

In particular, there is another post (linked below) about the degrees of freedom in MLE for linear regression.

What does the degree of freedom (df) mean in the results of log-likelihood `logLik`

```r
m <- lm(mpg ~ hp, data = mtcars)
logLik(m)
#> 'log Lik.' -87.61931 (df=3)
```

Regarding the R output shown above, the following is my additional question, namely question (3):

(3) I understand the t-test's df for the regression coefficients is $n-p-1$. Thus, for simple linear regression, it will be $n-2$. If so, why does `logLik` in R return df=3? Is it because 3 here means 3 parameters, and not necessarily 3 df per se? Thank you.

To answer question (3) myself, based on the discussion with Rachel and others (see all the comments under this main question and under Rachel's answer):

`logLik(m)` returning df=3 means that it estimates 3 parameters (intercept, slope, and variance). However, since the estimate of $\sigma^2$ is computed from a formula involving the estimated intercept and slope, it does not cost one more df. Thus, the actual df in the t-test for the regression coefficients is still $n-2$ (the 2 representing one df for the intercept and one for the slope, in the context of simple linear regression).
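As a quick sanity check in R, the df=3 reported by `logLik` can be contrasted with the residual df actually used by the coefficient t-tests:

```r
m <- lm(mpg ~ hp, data = mtcars)
logLik(m)       # 'log Lik.' -87.61931 (df=3): three estimated parameters
df.residual(m)  # 30 = n - 2: the df used by the coefficient t-tests
```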

  • Hi: The answer to the first question is yes. The answer to the second question is that the estimation of $\sigma^2$ doesn't take away a degree of freedom because, once the $\hat{\beta}$ are known, $\hat{\sigma}^2$ is known, since it's only a function of the estimated residuals. Note that $\bar{X}$ and $s^2$ are independent when normality is assumed, but $\beta$ and $\sigma^2$ are not independent. – mlofton Jul 17 '23 at 01:55
  • @mlofton thanks for your answer. I got your first answer. But, for the second part, I am not sure. In the textbook (p. 521), it mentions $ \hat{\beta}$ and $\hat{\sigma}^2$ are independent. Are you suggesting the relationship between estimated $ \hat{\beta}$ and $\hat{\sigma}^2$ is different from the relationship between $ \beta $ and $\sigma^2$? Again, thank you! – Will Jul 17 '23 at 03:27
  • Re. your point (3): You should correct your SE in the denominator. The term $(X'X)^{-1}$ is a matrix; you need to extract the diagonal entry corresponding to $\beta_j$. – Rachel Altman Jul 17 '23 at 06:02
  • @RachelAltman You are right. I just corrected it. Thank you! – Will Jul 17 '23 at 11:58
  • Hi Will: Someone can correct me but, yes, $\hat{\beta}$ and $\hat{\sigma}^2$ are independent, which allows one to use the t-test (a normal rv divided by the square root of an independent chi-squared rv over its df). I'm not sure one can talk about the independence of $\beta$ and $\sigma^2$, because they aren't random variables but rather parameters. So the notion of statistical independence doesn't apply. – mlofton Jul 17 '23 at 16:10
  • @mlofton thanks for your answer. Very helpful that you point out the difference between parameters and estimates from a sample. Can you look at my comment at the answer section for the meaning of "logLik" in R, if you have any insights? Again, thank you! – Will Jul 17 '23 at 17:53
  • You are mixing up models and tests. logLik isn't testing a specific coefficient in your application of it. It's performing a likelihood ratio test of the overall regression, which indeed involves three parameters. Moreover, it is using an asymptotic chi-squared approximation to the log likelihood: those df refer to the chi-squared distribution and not to the Student t distribution used in Ordinary Least Squares to test individual coefficients. – whuber Jul 17 '23 at 18:13
  • @whuber: I disagree. logLik simply provides the maximum value of the log likelihood function (as per the documentation). It's not performing a test. – Rachel Altman Jul 17 '23 at 18:21
  • @Rachel Thank you for the correction. The documentation, though, agrees with my interpretation of the df value as "the number of (estimated) parameters in the model." That is provided at least in part to perform the likelihood ratio test I described but it's not relevant for the tests of individual coefficients. – whuber Jul 17 '23 at 18:36
  • @whuber: I agree that what logLik calls df is the number of estimated parameters. But the "overall regression" test would just be a test of $H_0: \beta_1=\cdots=\beta_p=0$. The distribution of the LRT statistic under $H_0$ would be approximately $\chi^2_1$ (not $\chi^2_{p+2}$). – Rachel Altman Jul 17 '23 at 18:57
  • @Rachel It depends on your null hypothesis, but wouldn't the df for the test you give have $p$ df? The 1 df case concerns only tests of individual coefficients. – whuber Jul 17 '23 at 19:10
  • @whuber: Yes -- thank you. I was writing to Will in the context of simple linear regression below and was still thinking of that setting when I wrote to you! You are correct. – Rachel Altman Jul 17 '23 at 19:16
  • Thank you, both whuber and Rachel. The reason I mentioned logLik was to understand what it meant by df = 3. To be honest, I do not really follow why the likelihood ratio test is being discussed here. When thinking about chi-square (or, especially, -2 log likelihood), I typically think of it as a way to compare models, especially in the context of linear mixed model comparisons, as a way to determine whether we need to add more random effects. – Will Jul 17 '23 at 19:17
  • @whuber thank you! Just a quick question: you mentioned "Student t distribution used in Ordinary Least Squares to test individual coefficients." OLS and MLE for linear regression give exactly the same estimated coefficients. Further, OLS does not actually assume anything about the error terms when estimating coefficients. However, when people talk about p-values in OLS, they assume the error terms are normally distributed, the same as MLE. Thus, OLS and MLE lead not only to the same coefficient estimates, but also to the same t-test values (same t-test formulas). Am I correct? – Will Jul 17 '23 at 19:30
  • Almost, but not quite. (1) The MLE estimate of $\sigma$ differs from the OLS estimate. (2) The standard MLE tests are asymptotic ones relying on the Normality of the sample distribution of the parameter estimates, and therefore don't ordinarily use the Student t distribution for computing p-values. (3) The error terms in OLS don't have to be Normally distributed, but they shouldn't depart too far from that. Many simulations and studies indicate that skewness of the error distribution is problematic, but many other apparent departures from Normality are anodyne. – whuber Jul 17 '23 at 20:06
  • @whuber: Thanks for the comments. I guess I need to go down this rabbit hole; bear with me. Just a quick question first, and I might come back with other questions: you said MLE tests "don't ordinarily use the Student t distribution for computing p-values." How do MLE tests calculate p-values then? If you look at Rachel's answer and the 3rd point mentioned in my original question (which is from Bain's book, p. 522), a t-test is used to provide p-values for the estimated regression coefficients. – Will Jul 17 '23 at 23:02
  • There are several ways. A common one is a Wald test, where your question is discussed at https://stats.stackexchange.com/questions/115360. Other tests are the likelihood ratio test and a "score test." All rely on Normal approximations to the distributions of the estimates or of the log likelihood when the number of observations is large. Using a Student t distribution to test MLEs can be done, but it's generally an ad hoc procedure that is foreign to the spirit and theory of maximum likelihood estimation. – whuber Jul 17 '23 at 23:10
  • @whuber thank you for the input. I will go read some books and posts and will come back later. – Will Jul 18 '23 at 12:33

1 Answer

  1. No. The significance level is always chosen by the practitioner in advance of conducting the hypothesis test. It is the chosen probability of making a Type I error. But yes, you have specified the df correctly.

  2. When $\sigma^2$ is known, $T\sim N(0,1)$. Otherwise, letting $a_j^2$ be the $j^{th}$ diagonal entry of $(X'X)^{-1}$, $$\begin{eqnarray*} T &=& \frac{\hat{\beta}_j-\beta_{j0}}{\tilde{\sigma}a_j} \\ &=& \frac{\frac{\hat{\beta}_j-\beta_{j0}}{\sigma a_j}}{\frac{\tilde{\sigma}a_j}{\sigma a_j}} \\ &=& \frac{\frac{\hat{\beta}_j-\beta_{j0}}{\sigma a_j}}{\sqrt{\frac{\tilde{\sigma}^2}{\sigma^2}}} \end{eqnarray*} $$ The numerator is distributed as $N(0,1)$ and is independent of the denominator. The square of the denominator, $\tilde{\sigma}^2/\sigma^2$, is distributed as a $\chi^2$ random variable on $n-p-1$ df divided by its df. By definition, this ratio is therefore distributed as $t$ on $n-p-1$ df.
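A small simulation sketch checking this result (the design and parameter values below are arbitrary choices for illustration): under the normal linear model, the empirical quantiles of $T$ should closely match those of $t(n-p-1)$.

```r
# Simulate the pivot T under a normal linear model with a fixed design.
set.seed(1)
n <- 20; p <- 1
X <- cbind(1, rnorm(n))                 # fixed design: intercept + one covariate
beta <- c(2, 0.5); sigma <- 1.5
XtX_inv <- solve(t(X) %*% X)
a_jj <- diag(XtX_inv)[2]

T_vals <- replicate(10000, {
  y  <- X %*% beta + rnorm(n, sd = sigma)
  b  <- XtX_inv %*% t(X) %*% y
  s2 <- sum((y - X %*% b)^2) / (n - p - 1)   # sigma^2_tilde
  (b[2] - beta[2]) / sqrt(s2 * a_jj)
})

# Empirical quantiles vs. t(n - p - 1) quantiles: these should agree closely
round(quantile(T_vals, c(.025, .5, .975)), 3)
round(qt(c(.025, .5, .975), df = n - p - 1), 3)
```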

  • Thank you so much Rachel! For the first part of your answer, I understand. However, for part 2, what you said makes sense; but are you suggesting that the estimation of $\sigma^2$ does not use one more degree of freedom? – Will Jul 17 '23 at 13:11
  • No. When $\sigma$ is known, $T$ has a normal distribution, i.e., the notion of df doesn't apply. When we estimate $\sigma$ with $\tilde \sigma$ (note my correction in my answer...originally I wrote $\hat\sigma$), the distribution of $T$ is $t$ with df determined (by definition) by the df of the $\chi^2$ random variable in the denominator ($n-p-1$, in this case). – Rachel Altman Jul 17 '23 at 15:44
  • Hi Rachel: Thank you for your clarification. Yes, now I understand that the estimation of $\sigma^2$ does not cost 1 df, since it is a function of $\hat{\beta}$. However, in R, if you run `m <- lm(mpg ~ hp, data = mtcars); logLik(m)`, it will return `'log Lik.' -87.61931 (df=3)`. Is it because 3 means 3 parameters (i.e., intercept, slope, and variance), and not necessarily 3 df, per se? – Will Jul 17 '23 at 17:58
  • logLik calls the number of estimated parameters "df" (see documentation). But I don't see why it uses that label. The relevant df (i.e., those associated with the denominator of the expression I wrote above) are correctly listed as $n-2=30$ in the summary output (summary(m)): "Residual standard error: 3.863 on 30 degrees of freedom". – Rachel Altman Jul 17 '23 at 18:13
  • Thank you, Rachel. I got it now. – Will Jul 17 '23 at 19:37
  • Will: As Rachel pointed out, I would ignore the 3 df in the output of logLik. df must have a different meaning there than the one we have been discussing. Thanks to Rachel and whuber for their contributions which were way better than my initial one. – mlofton Jul 18 '23 at 02:10
  • Will: One other thing that I meant to say but forgot. I wouldn't concern yourself with the case where $\sigma^2$ is known, so that the test statistic is normally distributed. There are two reasons for this. 1) $\sigma^2$ is never known; if someone claims that they know it, they're lying. 2) For reasonably large $n$ (and hence df), the t-distribution approaches the normal, so using the t distribution and assuming $\sigma^2$ is unknown never hurts. – mlofton Jul 18 '23 at 04:15
  • Hi mlofton, thank you for the great input! @rachel: I read some posts saying that lm() uses OLS. Assuming that both OLS and MLE use a t-test to test the significance of regression coefficients, will they get the same t-statistic value? (Maybe I am missing something, but I could not find an existing function in R to directly specify using MLE for linear regression.) – Will Jul 18 '23 at 12:47
  • Will, because MLE estimates $\sigma$ differently than OLS, it does produce a (slightly) different value of $t$ for the tests of the coefficients. Moreover, for the same reason, $t$ no longer has a Student $t$ distribution (a multiple of $t$ might have a Student $t$ distribution). Thus, the correct way to conduct a t-test with such a model is to use the OLS estimates, not the MLEs; and when you are using the MLEs, you would conduct a so-called Z-test (that is, compute the p-values using the standard Normal distribution). – whuber Jul 18 '23 at 13:08
  • Hi @whuber: I am going to accept Rachel's answer first, since we basically agree that (1) estimating $\sigma^2$ does not use 1 more df and (2) the df=3 from logLik means 3 parameters. I will probably open another thread/question to compare OLS and MLE on linear regression, as it is a slightly different question from this one. Again, thank you! – Will Jul 18 '23 at 16:28
  • @whuber I asked it as a new question. If you can provide some feedback there, I would really appreciate it. The link: https://stats.stackexchange.com/questions/621763/are-t-statistic-formulas-and-values-for-linear-regression-coefficients-the-same – Will Jul 19 '23 at 14:22
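Following up on the OLS-versus-MLE discussion in the comments above, here is a minimal sketch (again using `mtcars` purely as an illustrative example) contrasting the OLS t-statistic, which uses $\tilde{\sigma}^2 = \text{RSS}/(n-2)$ with a $t(n-2)$ reference, against a Wald-style z-statistic that uses the MLE $\hat{\sigma}^2 = \text{RSS}/n$ with a standard-normal reference, per whuber's comments:

```r
# Contrast: OLS t-test (sigma^2_tilde = RSS/(n-2), t reference) versus a
# Wald-style z (MLE sigma^2_hat = RSS/n, standard-normal reference).
m    <- lm(mpg ~ hp, data = mtcars)
n    <- nrow(mtcars)
rss  <- sum(residuals(m)^2)
Xm   <- model.matrix(m)
a_jj <- diag(solve(t(Xm) %*% Xm))[2]
b1   <- unname(coef(m)[2])

t_ols  <- b1 / sqrt(rss / (n - 2) * a_jj)   # matches the hp row of summary(m)
z_wald <- b1 / sqrt(rss / n * a_jj)         # slightly larger in magnitude

c(t_ols = t_ols, z_wald = z_wald,
  p_t = 2 * pt(-abs(t_ols), df = n - 2),
  p_z = 2 * pnorm(-abs(z_wald)))
```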