
As far as I know:

  1. y ~ x + z means y depends on the variables x and z
  2. y ~ x : z means y depends on the interaction between x and z
  3. y ~ x * z means y ~ x + z + (x : z)

It looks like * already covers + as well as :. So why do we usually write y ~ x + z rather than y ~ x * z in functions like lm()? Wouldn't we get more accurate results if we used * all the time?
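
The expansion can be verified directly in R:

> attr(terms.formula(y ~ x + z), "term.labels")
[1] "x" "z"
> attr(terms.formula(y ~ x : z), "term.labels")
[1] "x:z"
> attr(terms.formula(y ~ x * z), "term.labels")
[1] "x"   "z"   "x:z"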

COOLSerdash
Xfce4
  • Why not include all interactions then? Surely the model will work better, right? – user2974951 Sep 03 '21 at 06:52
  • @user2974951 Are you trying to say "to reduce computational costs"? What if a very unlikely interaction actually affects the dependent variable greatly? After all, one purpose of using deep learning is to find patterns that we humans cannot readily perceive. – Xfce4 Sep 03 '21 at 06:57
  • It seems like this question can be restated as "why don't we always fit fully-interacted models?" – shadowtalker Sep 03 '21 at 07:33
  • Someone voted to close this as a programming request. I am with @shadowtalker that the R syntax is just superficial here and this is really about when to include all interaction effects and whether that naturally improves accuracy. It is thus on topic for Cross Validated. – Bernhard Sep 03 '21 at 08:59
  • @Bernhard I agree that this question is not about R, but I am not sure whether objections to "off-topic" assessments should be made in comments. Where do you see this vote, and what is the right place to object to it? – cdalitz Sep 03 '21 at 09:04
  • @cdalitz Comments are where (some) people explain their close votes, so I figured this was the right place to oppose. If that is against the rules, then I apologize. – Bernhard Sep 03 '21 at 09:11
  • @Bernhard I do not know whether it is against the rules. This was an honest question: where do you see the votes for closing a question? – cdalitz Sep 03 '21 at 10:27
  • @cdalitz I believe you need 3000 reputation to cast close votes ( https://stats.stackexchange.com/help/closed-questions ), and then it shows next to the link for casting close votes. Now I am quite sure these comments no longer count as comments on the question, so let's end this here. – Bernhard Sep 03 '21 at 11:01
  • @Bernhard: Nobody has broken any rules in this comment thread! And for the record, I agree this Q is on-topic here, and will vote accordingly in the review queue (accessible for users with 3000+) – kjetil b halvorsen Sep 03 '21 at 12:10

2 Answers

3

So let's assume that Y depends on A, B, C and D and we have 30 observations.

The appropriate formula is not Y ~ A * B * C * D, because then lm would have to estimate 16 coefficients (the intercept plus these 15 terms):

> attr(terms.formula(Y ~ A * B * C * D), "term.labels")
 [1] "A"       "B"       "C"       "D"       "A:B"     "A:C"    
 [7] "B:C"     "A:D"     "B:D"     "C:D"     "A:B:C"   "A:B:D"  
[13] "A:C:D"   "B:C:D"   "A:B:C:D"

When would estimating 16 coefficients from 30 observations be sensible? How do you explain the intuition behind A:B:C:D to anyone?
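
To make this concrete, a quick sketch with simulated data (all names and numbers made up): fitting the fully interacted model to 30 observations leaves only 14 residual degrees of freedom.

> set.seed(42)
> d <- data.frame(Y = rnorm(30), A = rnorm(30), B = rnorm(30),
+                 C = rnorm(30), D = rnorm(30))
> fit <- lm(Y ~ A * B * C * D, data = d)  # 16 coefficients incl. intercept
> df.residual(fit)                        # 30 observations - 16 coefficients
[1] 14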

I assume this question comes from a machine learning perspective, where data is often assumed to be plentiful and complex models just need to be guarded against overfitting. Simple models are useful in statistics when data is not plentiful, when you are guarding against overfitting, when you have a good idea about the data-generating process and want to investigate that specific question, and in many other situations.

And do not get me started on the number of coefficients to be computed once we take dummy coding into consideration. Suppose you asked 200 people who all have a gender (1 dummy or more), all have some education on an 8-level scale (7 dummies), all come from one of 5 countries (4 dummies), all are more or less religious on a 5-point scale (4 dummies), and all have Big Five personality scores (5 numeric predictors). Are we really interested in finding an interaction term for women with a university degree from Spain who are not religious, times all five of their personality values? I think everyone can see how this quickly gets ridiculous.
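
A rough sketch of this blow-up (factor levels invented to match the counts above): even with only one of the five personality scores included, the fully interacted model matrix has twice as many columns as this data frame has rows.

> d <- expand.grid(gender = factor(1:2), edu = factor(1:8),
+                  country = factor(1:5), relig = factor(1:5))  # 400 rows
> d$openness <- rnorm(nrow(d))  # just one of the five personality scores
> ncol(model.matrix(~ gender * edu * country * relig * openness, d))
[1] 800

With all five personality scores it would be 2 * 8 * 5 * 5 * 2^5 = 12800 columns.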

On the other hand, sometimes my income is just some constant times the hours I work at my first job plus some constant times the hours I work at my side job. There is really no point in multiplying my work hours, and if you did, you would estimate a coefficient with the unit "dollars per hour squared". You do not want to make that the default.
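
A simulation sketch of that additive situation (pay rates and hours invented): since there is no true interaction, the fitted hours1:hours2 coefficient should come out close to zero.

> set.seed(1)
> hours1 <- runif(30, 0, 40)  # hours in the first job
> hours2 <- runif(30, 0, 20)  # hours in the side job
> income <- 25 * hours1 + 15 * hours2 + rnorm(30, sd = 50)
> # the interaction coefficient would have the unit dollars per hour squared
> coef(summary(lm(income ~ hours1 * hours2)))["hours1:hours2", ]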

Bernhard
  • Kernel models and Gaussian processes work very well in infinite-dimensional spaces with small samples of data. Regularisation can be very effective. For quadratic kernels, you can perform an analysis afterwards to find out which quadratic terms are "important" (in some, but not all, applications). But +1, because I agree with the basic point: a great deal of skepticism is required. – Dikran Marsupial Sep 03 '21 at 10:04
  • Thank you. But one reason we use machines for data analysis is to find patterns and relations that we would not have considered possible. So if we humans conclude at the very beginning that an interaction cannot be significant, do we not lose objectivity and contradict our purpose of using machines? – Xfce4 Sep 04 '21 at 22:29
  • Even if you investigate all interactions, you still have not investigated every quadratic term, nor every logarithmic term, nor... (and their interactions). In those cases where you approach a regression so blindfolded, it will often be more sensible to use a generally more flexible non-linear approach such as a random forest or a neural net. – Bernhard Sep 05 '21 at 06:19
  • I am not saying that you should not use '*' when appropriate. I use it regularly. But not as a standard. As for interactions in random forests, you may be interested in https://stats.stackexchange.com/a/504293/117812 – Bernhard Sep 05 '21 at 06:29
  • @Bernhard I received no notification about your reply. Anyway, it was a good point about polynomial and logarithmic relationships: taking all interactions into account does not mean we have included everything. But still, I feel we should not restrict machines according to our intuition. I understand the concern about computational costs and lack of resources, but once we restrict the machines, we start to use them as simple calculators rather than artificially 'intelligent' devices. – Xfce4 Sep 05 '21 at 14:13
1

This is quite a deep question, and it is a matter of much debate (here on CV, too). It depends on your goals:

  • If your goal is prediction, it is best to use as much information as is available. The main risk is overfitting, which can be checked with cross-validation or separate hold-out data.
  • If your goal is explanation, a simpler model will be easier to understand as long as the most influential effects are included.

Thus, for predictive purposes, it is generally even better not to assume a linear relationship at all, but to use spline regression, LOESS, or Random Forests (if you have tons of data).
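
As a sketch of the cross-validation check mentioned above, assuming a data frame dat with columns y, x, and z (hypothetical names): compare the additive and the fully interacted fit, and keep * only if it actually predicts better.

> library(boot)  # for cv.glm
> fit_add <- glm(y ~ x + z, data = dat)  # gaussian family, i.e. a linear model
> fit_int <- glm(y ~ x * z, data = dat)
> # 10-fold cross-validated squared prediction error for each model
> c(additive    = cv.glm(dat, fit_add, K = 10)$delta[1],
+   interaction = cv.glm(dat, fit_int, K = 10)$delta[1])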

For a polemical discussion of this topic, see

Leo Breiman: "Statistical Modeling: The Two Cultures." Statistical Science 16.3, pp. 199-231 (2001)

And for a more sober discussion:

Galit Shmueli: "To Explain or to Predict?" Statistical Science 25.3, pp. 289-310 (2010)

Nick Cox
cdalitz
  • I would say even if your goal is prediction, I would try to limit the number of variables, because of that whole "curse of dimensionality" thing. – user2974951 Sep 03 '21 at 08:39
  • @user2974951 The traditional approach to limiting the number of variables is to compute a small set of "more informative" features and use these for prediction. Methods include PCA, LDA, or more problem-specific methods like shape descriptors or SIFT. It turned out that if (that's a big if!) you have large amounts of training data, not doing so and letting a neural net ("deep features") or random forest ("Boruta") compute/select the features yields better prediction than a priori dimension reduction. There seems to be a "blessing of dimensionality". – cdalitz Sep 03 '21 at 08:54
  • If you are interested in prediction, regularisation is generally better than feature selection (advice from the appendix of Miller's monograph on feature subset selection). So try interactions, but regularise to control overfitting. – Dikran Marsupial Sep 03 '21 at 09:27
  • @cdalitz Even if we had large data (big if!): let us assume we investigate the performance of some machine that we understand quite well. We have a solid theoretical idea about which parameters enter the question linearly or squared, and which interaction effects make sense and which do not. Is not the risk of overfitting better controlled by stating this a priori knowledge in the form of a well-defined model instead of relying on the black box to find it out by itself? I cannot imagine a situation in which the OP (ctd.) – Bernhard Sep 03 '21 at 09:30
  • (continuation) I cannot imagine a situation in which the OP a) has really large data and b) has a situation in which a black box model fits well because he has no good idea which predictors might interact and c) makes an informed decision to opt for lm() for prediction. – Bernhard Sep 03 '21 at 09:32
  • @Bernhard I helped organise a machine learning challenge for a conference a few years ago to see if using expert knowledge would improve performance over machine learning methods applied without knowledge of what the learning task was about. Generally expert knowledge made the results worse rather than better (the task at the first conference was "agnostic learning" and the info about the data was available for the same challenge at the conference the following year IIRC). – Dikran Marsupial Sep 03 '21 at 09:53
  • ... which was a surprise, it really wasn't what we expected to happen! – Dikran Marsupial Sep 03 '21 at 09:54
  • @dikran-marsupial Quite interesting. Would you mind sharing the publication of these observations? – cdalitz Sep 03 '21 at 10:28
  • @cdalitz The paper is here: https://doi.org/10.1016/j.neunet.2007.12.024 . My memory may be a bit vague about the findings; it was a while back! I was a fairly minor contributor; I think I mainly did the baseline methods. – Dikran Marsupial Sep 03 '21 at 11:15