tail -c +43 uYayd.gif > TROW.tsv
tail -c +43 bAEMc.gif > AABB.tsv

Using the two files above, I can fit linear models.

The following seems to indicate that either ema21diff or ema89diff works very well as a predictor.

R> summary(lm(futrdiff ~ ema21diff, data=TROW))

Call:
lm(formula = futrdiff ~ ema21diff, data = TROW)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9238 -1.4405  0.0598  1.8670  8.0834 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.32199    0.38956   3.394 0.000899 ***
ema21diff   -0.66179    0.08244  -8.027 3.77e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.479 on 139 degrees of freedom
Multiple R-squared:  0.3167,	Adjusted R-squared:  0.3118 
F-statistic: 64.44 on 1 and 139 DF,  p-value: 3.774e-13

R> summary(lm(futrdiff ~ ema89diff, data=TROW))

Call:
lm(formula = futrdiff ~ ema89diff, data = TROW)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5066 -1.7942 -0.0663  1.6676  7.6233 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.72792    0.93537   7.193 3.58e-11 ***
ema89diff   -0.52376    0.05945  -8.811 4.52e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.402 on 139 degrees of freedom
Multiple R-squared:  0.3583,	Adjusted R-squared:  0.3537 
F-statistic: 77.63 on 1 and 139 DF,  p-value: 4.515e-15

R> summary(lm(futrdiff ~ ema21diff + ema89diff, data=TROW))

Call:
lm(formula = futrdiff ~ ema21diff + ema89diff, data = TROW)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.7963 -1.7125  0.0304  1.7103  7.6391 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.4699     1.3091   4.178 5.18e-05 ***
ema21diff    -0.2148     0.1569  -1.369   0.1732    
ema89diff    -0.3861     0.1167  -3.308   0.0012 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.395 on 138 degrees of freedom
Multiple R-squared:  0.3669,	Adjusted R-squared:  0.3578 
F-statistic: 39.99 on 2 and 138 DF,  p-value: 1.993e-14

The following seems to indicate that only ema89diff matters, while ema21diff does not.

R> summary(lm(futrdiff ~ ema21diff, data=AABB))

Call:
lm(formula = futrdiff ~ ema21diff, data = AABB)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.6453 -1.0660  0.1424  1.5878  3.7737 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.82788    0.18510  -4.473 1.59e-05 ***
ema21diff   -0.29036    0.08208  -3.537  0.00055 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.021 on 139 degrees of freedom
Multiple R-squared:  0.08258,	Adjusted R-squared:  0.07598 
F-statistic: 12.51 on 1 and 139 DF,  p-value: 0.00055

R> summary(lm(futrdiff ~ ema89diff, data=AABB))

Call:
lm(formula = futrdiff ~ ema89diff, data = AABB)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6130 -1.0894  0.1935  1.4290  4.4952 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.32680    0.32840   0.995    0.321    
ema89diff   -0.29094    0.05865  -4.961 2.02e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.945 on 139 degrees of freedom
Multiple R-squared:  0.1504,	Adjusted R-squared:  0.1443 
F-statistic: 24.61 on 1 and 139 DF,  p-value: 2.018e-06

R> summary(lm(futrdiff ~ ema21diff+ema89diff, data=AABB))

Call:
lm(formula = futrdiff ~ ema21diff + ema89diff, data = AABB)

Residuals:
   Min     1Q Median     3Q    Max 
-5.578 -1.140  0.206  1.361  4.593 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.6154     0.4542   1.355 0.177682    
ema21diff     0.1345     0.1462   0.920 0.359045    
ema89diff    -0.3750     0.1086  -3.454 0.000733 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.946 on 138 degrees of freedom
Multiple R-squared:  0.1556,	Adjusted R-squared:  0.1434 
F-statistic: 12.71 on 2 and 138 DF,  p-value: 8.547e-06

It is straightforward but tedious to examine model selection by hand like this. Could anybody show me an automated, commonly used way to detect the best linear model (or a set of nearly equivalent best models) for a fit?

  • Perhaps you want the step function. Or this question might help: https://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection – Sam Rogers Aug 03 '21 at 06:33
  • 4
    What is your goal? Is it about predicting? Is it about something else? Looking at multiple possible models and then doing something with a single selected model afterwards has issues (that are less or more bad depending on what you are trying to do, and how you do it e.g. stepwise regression is known to be particularly bad). Do you need to decide on a single model (i.e. could averaging - with data-determined-weights - over the candidate models be an option)? – Björn Aug 03 '21 at 08:33
  • 2
    Automatic model selection = analyst turning over thinking tasks to the computer. The rumor that "unimportant" variables should be dropped from models should have been squashed in the 1960s. – Frank Harrell Aug 03 '21 at 11:47
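The `step` function mentioned in the comments can be sketched as follows. This is only an illustration using the built-in `mtcars` data, since the TROW/AABB files are not reproduced here; `step` greedily adds or drops terms to minimize the AIC, with all the caveats about stepwise selection raised above.

```r
# Illustrative sketch with mtcars (the question's TROW/AABB data
# are not available here); step() performs AIC-based stepwise search.
full <- lm(mpg ~ disp + wt + cyl, data = mtcars)

# direction = "both" allows adding and dropping terms at each step;
# trace = 1 prints the AIC at every step of the search.
best <- step(full, direction = "both", trace = 1)
summary(best)
```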

1 Answer


What you have done by hand is automatically done by the R function anova when given a single fitted model:

> anova(lm(mpg ~ disp + wt + cyl, data=mtcars))
Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq F value    Pr(>F)    
disp       1 808.89  808.89 120.158 1.221e-11 ***
wt         1  70.48   70.48  10.469  0.003111 ** 
cyl        1  58.19   58.19   8.644  0.006512 ** 
Residuals 28 188.49    6.73                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Variables are added in the order in which they are listed in the model formula.
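Because the sums of squares are sequential, reordering the terms in the formula changes how much variance is attributed to each predictor, which is worth checking when predictors are correlated (as ema21diff and ema89diff presumably are). A quick illustration with `mtcars`:

```r
# Sequential (Type I) sums of squares depend on the order of terms.
# The residual row is identical, but each predictor's Sum Sq differs
# because it is adjusted only for the terms listed before it.
anova(lm(mpg ~ disp + wt, data = mtcars))
anova(lm(mpg ~ wt + disp, data = mtcars))
```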

Beware, however, of the comment by @Björn on your question: this is an example of a stepwise method. Even more problematic, the evaluation criterion is internal (the data used for training the model are reused for evaluation), which can result in overfitting to peculiar properties of your training data. If you care about prediction accuracy, you might instead consider a selection criterion based on cross-validation, e.g. the leave-one-out mean squared prediction error.
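For ordinary least squares, the leave-one-out errors can be computed without refitting the model n times, via the hat values (the PRESS shortcut). A minimal sketch, again using `mtcars` as a stand-in for the question's data:

```r
# Leave-one-out mean squared prediction error for an lm fit,
# using the identity e_(i) = e_i / (1 - h_ii) for linear models.
loocv_mse <- function(fit) {
  mean((residuals(fit) / (1 - hatvalues(fit)))^2)
}

m1 <- lm(mpg ~ disp,      data = mtcars)
m2 <- lm(mpg ~ disp + wt, data = mtcars)
c(m1 = loocv_mse(m1), m2 = loocv_mse(m2))  # smaller is better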

– cdalitz