
$y = f(x,...)$

$y$ represents the total effort in hours to complete a mix of tasks (each $x$ is a different mix) of varying complexity. It might be a non-linear relationship, but for now, I only have the count of tasks. No complexity data. I'd like to understand the correlation between total effort and task count.

A simple regression throws this up:

> summary(lm(data=dat, y ~ x))

Call:
lm(formula = y ~ x, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
-2912.84  -189.26    12.88   148.09  3138.23

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -148.09     102.96  -1.438    0.155
x             146.76      13.89  10.568 1.69e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 724.2 on 62 degrees of freedom
Multiple R-squared:  0.643,  Adjusted R-squared:  0.6373
F-statistic: 111.7 on 1 and 62 DF,  p-value: 1.687e-15

I re-ran it with a forced zero intercept.

It now says:

> summary(lm(data=dat, y ~ 0 + x))

Call:
lm(formula = y ~ 0 + x, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max
-2775.5  -287.4  -118.5     0.0  3389.7

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
x   137.24      12.31   11.15   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 730.3 on 63 degrees of freedom
Multiple R-squared:  0.6635, Adjusted R-squared:  0.6582
F-statistic: 124.2 on 1 and 63 DF,  p-value: < 2.2e-16

Question: Which of the two above is a better fit? Should the Residual Standard Error* (RSE) be the guide? Or should I simply go with the second model because a zero-intercept makes sense in this case?

Note: The post shared by Nick Cox states that $R^2$ must not be relied upon for zero-intercept cases. I don't fully understand the math described there but I'm hoping that it is okay to rely on the Residual Standard Error in a zero-intercept scenario. Could someone please confirm this as well?

*Reference: This article says RSE is also a goodness-of-fit measure.
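
For concreteness, here is a minimal sketch (assuming the data frame dat above) of how the RSE can be read off each fit directly; sigma() is the base-R accessor for it:

m1 <- lm(y ~ x, data = dat)      ### with intercept
m2 <- lm(y ~ 0 + x, data = dat)  ### zero intercept forced

sigma(m1)   ### 724.2, as in the first summary above
sigma(m2)   ### 730.3, as in the second summary above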

  • FAQ, as above. Searching the site for "r-square intercept" pulls up many such questions. – Nick Cox Oct 18 '22 at 09:01
  • It's a fair point of view that statistical software just shouldn't do this, but it does -- which has puzzled many smart people, but which is also widely explained. – Nick Cox Oct 18 '22 at 09:15
  • @NickCox, Thank you. It's closely related to my question which, however, is about the removal of an insignificant intercept. – ottodidakt Oct 18 '22 at 10:02
  • I still don't see a different question here, and in any case there are many such threads. But I think the general statistical reaction is simple and unanimous. Never, ever remove an insignificant intercept unless there is an overwhelming rationale why a model $y = bx$ makes better scientific (economic, medical, engineering, whatever) sense than $y = a + bx$. But advice also depends on knowing what the data are (instead of anonymous $y$ and $x$) and on seeing a scatter plot. – Nick Cox Oct 18 '22 at 12:25
  • @NickCox, Thanks. I agree that a proper rationale is necessary to remove the intercept. I do have it in my case. I just want to spare the community the details of my domain. BTW, I edited my post after going through the link you shared. I now have a follow-up question at the end (which might lack context as a separate question, hence tacked on). – ottodidakt Oct 18 '22 at 13:00
  • I read the edit too. I don't regard RSE as helping the decision at all. I would need to know how it's calculated in R and wouldn't trust it without knowing that. I don't use R routinely or expertly, so I would want to look at the data in other software first. If you want to omit domain knowledge, that's your choice but it doesn't make advice easier. Can you show a scatter plot? Sometimes there is a good case for $y = ax^b$, for example. – Nick Cox Oct 18 '22 at 13:09
  • @NickCox, Thanks again. It's kind of you to continue to respond. 'y' in this case represents the total effort in hours to complete a mix of tasks (each 'x' is a different mix) of varying complexity. So yes, it might be a non-linear relationship as you suggest, but for now, I only have the count of tasks. No complexity data. – ottodidakt Oct 18 '22 at 13:53
  • I agree that (0, 0) is a natural origin here, but linearity might be improved on. For example, regardless of whether it's people, organisms or machines there might be aging or tiredness or wear effects, or gains from experience, etc. – Nick Cox Oct 18 '22 at 14:09
  • "Never, ever remove an insignificant intercept unless there is an overwhelming rationale" Nick Cox, in principle the intercept has not much more value than any other predictor variable. It is just in practice that the intercept is very useful and part of a model and that is why we put it as a standard in models. However, there are cases where a model might have a (small) non-zero intercept but for practical reasons we do not fit the intercept (in a similar way as we do not always add all coefficients). I argue for that more thoroughly here: https://stats.stackexchange.com/a/588739/164061 – Sextus Empiricus Oct 28 '22 at 10:57
  • @SextusEmpiricus So, OK, you think there can be an overwhelming rationale, namely simplicity. I will just add that I have seen many people removing intercepts without really thinking through the advantages and limitations of that or even understanding quite what they are doing, but that can be true of just about any other statistical decision. – Nick Cox Oct 28 '22 at 12:07
  • @NickCox my argument is that we do not only need a rationale for 'y = bx' making more sense than 'y = a+bx'. Even when 'y = a+bx' makes sense, we can still decide not to use it because the value of 'a' is very small. This happens for instance in chemometrics, when an experimenter decides not to fit a baseline even though a (small) baseline can theoretically be present. – Sextus Empiricus Oct 28 '22 at 15:58
  • If you want an r-squared-like measure of fit for a model without an intercept, you might use a version of Efron's pseudo r-squared. It's defined here ( stats.oarc.ucla.edu ). With the caveat that I wrote the former, there's an implementation in R in the rcompanion package, and in the performance package. A = c(1,2,3,4,5,6,7,8,9); B = c(2,4,5,3,6,5,7,9,8); model = lm(B ~ 0 + A); summary(model); library(rcompanion); accuracy(list(model)); library(performance); r2_efron(model) . – Sal Mangiafico Oct 28 '22 at 16:09
  • For an OLS model with intercept, this Efron's pseudo r-squared will equal r-squared. – Sal Mangiafico Oct 28 '22 at 16:10
  • @SextusEmpiricus Your distinction seems minute to me. I use "makes sense" in an informal way to capture anything and everything from science to statistics to context about what functional form is best or good to use, and preference for one rather than another can arise in all kinds of ways. – Nick Cox Oct 28 '22 at 16:24
  • @NickCox you may have meant your 'never, ever...' statement in a relatively weaker sense. But it might not be so clear. All around we have much more strict statements; for instance, the accepted answer to the question that I linked to states the following "The shortest answer: never, unless you are sure that your linear approximation of the data generating process (linear regression model) either by some theoretical or any other reasons is forced to go through the origin" and speaks specifically about the data generating process or about 'forced' and not about statistical arguments. – Sextus Empiricus Oct 28 '22 at 16:37

3 Answers


While this remains open (full disclosure: I voted to close) I will offer another answer, which is (to start negatively) that the question is hard and unclear, as it mixes together various general issues with distinctly limited specific details arising from the analysis of one dataset. I will also try to bring together various comments already made.

So far as I can see, nothing in the question really illuminates quite what is best or better as an analysis for the dataset being alluded to or for other datasets on the relation between (if I understand this correctly)

number of tasks completed as a predictor

and

total effort as an outcome.

In abstraction, it does seem that zero effort would imply zero achievement as a limiting condition, but that said

  1. The relevance of this as a constraint is an open question, but even if it seems compelling, it still allows many different functional forms, starting with $y = bx$, $y = bx + cx^2$, $y = ax^b$, and so on (see the sketch after this list). A broader possibility yet is that no simple algebraic specification will work well.

  2. To advise well on what to do with the data, we need to see the data, which is in principle easy for a small dataset. (I asked in comments for a scatter plot, but a listing would be even better.)
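
As a sketch of point 1 (my illustration, not part of the original analysis; the nls() starting values are arbitrary guesses), the candidate forms could be fitted and compared in R, assuming the data frame dat from the question:

plot(y ~ x, data = dat)                            ### look at the data first

m_origin <- lm(y ~ 0 + x, data = dat)              ### y = bx
m_quad   <- lm(y ~ 0 + x + I(x^2), data = dat)     ### y = bx + cx^2
m_power  <- nls(y ~ a * x^b, data = dat,
                start = list(a = 150, b = 1))      ### y = a x^b

AIC(m_origin, m_quad, m_power)   ### same n, so the AIC values are comparable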

The characteristic terseness of lm output from R is defensible on various grounds but it is not sufficient to advise here on which model makes most sense (substantively, scientifically) and fits the data well, which in my view should be the main questions here.

The idea that a single summary criterion should be sought to make a decision for you seems utterly misguided here, and in general.

Nick Cox

Answer from my comments

For a model without an intercept, if you want an r-squared-like measure of fit, you might use a version of Efron's pseudo r-squared. It's defined at stats.oarc.ucla.edu.

With the caveat that I wrote the former, there are implementations in R in the rcompanion package and in the performance package.

As an example:

A = c(1,2,3,4,5,6,7,8,9)
B = c(2,4,5,3,6,5,7,9,8)

model = lm(B ~ 0 + A)

summary(model)

### summary() reports an r-squared of 0.955.
###  This is probably not a desirable statistic to use
###   for a linear model without an intercept.

library(rcompanion)

accuracy(list(model))

### Efron.r.squared
###            0.67

library(performance)

r2_efron(model)

### 0.6704986

For an OLS model with an intercept, this Efron's pseudo r-squared will be equal to r-squared.

Sal Mangiafico

Your question is a bit of a mixture of issues, but the key one sounds like 'In model selection, does the intercept have a special role?'

  • The intercept is in principle not different from other regressor variables.

    It has no special role that makes it any less removable than other regressor variables in comparable situations.

  • However, in practice we often keep the intercept or get an implicit intercept.

    Either an intercept occurs in the model anyway because we add a categorical variable (which implicitly adds an intercept even when it wasn't there before),

    or an intercept is simply so common in models that it is included as standard, as a sort of prior knowledge.

Regarding the R² value: that measure typically compares variances by decomposing the total variance into parts. This makes particular sense when the models are nested. But a model without an intercept is no longer nested in this comparison, because the baseline, the total variance, is computed relative to the mean of the data, and the mean-only model by definition contains an intercept term.
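
As a small illustration of that difference (my own sketch with toy data; summary.lm() uses the uncentered total sum of squares when the intercept is suppressed):

y <- c(2, 4, 5, 3, 6, 5, 7, 9, 8)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

fit <- lm(y ~ 0 + x)              ### no intercept
rss <- sum(residuals(fit)^2)

1 - rss / sum((y - mean(y))^2)    ### 0.67: baseline is the mean-only model
1 - rss / sum(y^2)                ### 0.955: baseline is zero, what summary() reports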


Another part of your question is:

Which of the two above is a better fit?

But that is a very broad topic (even without thinking about the intercept), and the question as asked is not very precise.

How do you determine what is better? It depends on the situation: what is your goal, and how do you define 'better'?

AIC

If you have a reasonable assumption about the error distribution, then one way to decide on the better model could be the Akaike information criterion (AIC). It is a way to compare likelihoods while taking the number of fitted parameters into account.

In the specific case of Gaussian distributed errors with equal variance (note that this need not be your case; it is just an example), a simplified expression is $AIC = 2k + n \log(RSS) - n \log(n) + C$ (see "Calculate AIC for both linear and non-linear models"). Here $RSS$ is the residual sum of squares, not the residual standard error; from the outputs above, $RSS = RSE^2 \times df$, so $RSS \approx 724.2^2 \times 62 \approx 3.25 \times 10^7$ with the intercept versus $730.3^2 \times 63 \approx 3.36 \times 10^7$ without. The $n\log(RSS)$ term is then $64 \log(3.25 \times 10^7) \approx 1107.0$ versus $64 \log(3.36 \times 10^7) \approx 1109.1$, a difference of about 2.1 in favour of the intercept model, while dropping the intercept reduces the $2k$ term by 2.

From this perspective the two models are almost exactly tied, with a marginal edge for the model with the intercept: its better fit almost exactly offsets the penalty for the extra parameter.
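
The same comparison can be done directly in R (a sketch, assuming the data frame dat from the question); AIC() uses the full Gaussian log-likelihood, so the constants are handled consistently:

m1 <- lm(y ~ x, data = dat)       ### with intercept, k one larger
m2 <- lm(y ~ 0 + x, data = dat)   ### without intercept
AIC(m1, m2)                       ### lower value = preferred model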

An advantage of R² is that it is intuitive. However, it has some limitations. For instance, unlike the AIC mentioned above, R² does not take into account the number of parameters used in fitting a model. The issue with the intercept occurs because there are different methods to define and compute R². For instance, it could be the squared correlation between observations and estimated values, and in simple linear regression, $y = a + bx$, this correlation does not change when we remove the intercept (a quick check follows below).
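
A quick check of that last claim (my sketch, with toy data):

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
y <- c(2, 4, 5, 3, 6, 5, 7, 9, 8)

cor(y, fitted(lm(y ~ x)))^2       ### squared correlation, with intercept
cor(y, fitted(lm(y ~ 0 + x)))^2   ### identical value, without intercept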

  • Thanks. What about the Residual Standard Error? – ottodidakt Oct 28 '22 at 13:31
  • @ottodidakt I see now that I made an error in my original answer: I used the residual standard error in place of the residual sum of squares. The residual standard error $\hat\sigma_\epsilon$ is the estimate of the standard deviation of the error terms; it is $\hat\sigma_\epsilon = \sqrt{\frac{RSS}{n-p}}$. So possibly the model with intercept is better; the calculation above has been corrected accordingly. – Sextus Empiricus Oct 28 '22 at 16:08