So let's assume that Y depends on A, B, C and D and we have 30 observations.
The appropriate formula is not Y ~ A * B * C * D, because then lm would have to estimate 16 coefficients (the 15 terms below plus the intercept):
> attr(terms.formula(Y ~ A * B * C * D), "term.labels")
 [1] "A"       "B"       "C"       "D"       "A:B"     "A:C"
 [7] "B:C"     "A:D"     "B:D"     "C:D"     "A:B:C"   "A:B:D"
[13] "A:C:D"   "B:C:D"   "A:B:C:D"
When would estimating 16 coefficients from 30 observations be sensible? How do you explain the intuition behind A:B:C:D to anyone?
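For comparison, the additive formula keeps the term count linear in the number of predictors: 4 terms plus the intercept, that is 5 coefficients instead of 16. You can count the terms the same way as above:

> length(attr(terms.formula(Y ~ A + B + C + D), "term.labels"))
[1] 4
> length(attr(terms.formula(Y ~ A * B * C * D), "term.labels"))
[1] 15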
I assume this question comes from a machine learning perspective, where data is often assumed to be plentiful and complex models merely need to be guarded against overfitting. In statistics, simple models are useful when data is scarce, when you need to guard against overfitting, when you have a good idea about the data-generating process and want to investigate that specific question, and in many other situations.
And do not get me started on the number of coefficients once dummy coding enters the picture. Suppose you asked 200 people who all have a gender (1 dummy or more), all have some education on an 8-level scale (7 dummies), all come from one of 5 countries (4 dummies), all are more or less religious on a 5-point scale (4 dummies), and all have Big Five personality scores (5 numeric predictors). Are we really interested in an interaction term for women with a university degree from Spain who are not religious, multiplied by all five of their personality values? I think everyone can see how this quickly becomes ridiculous.
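To put numbers on it, here is a rough sketch (made-up data, invented variable names, Big Five scores left out for brevity) that counts the design-matrix columns with model.matrix:

# hypothetical data mirroring the survey example above
set.seed(1)
n <- 200
d <- data.frame(
  gender    = factor(sample(c("f", "m"), n, replace = TRUE)),
  education = factor(sample(1:8, n, replace = TRUE)),
  country   = factor(sample(1:5, n, replace = TRUE)),
  religion  = factor(sample(1:5, n, replace = TRUE))
)

# additive: 1 intercept + 1 + 7 + 4 + 4 dummies = 17 columns
ncol(model.matrix(~ gender + education + country + religion, data = d))

# full factorial: 2 * 8 * 5 * 5 = 400 columns, for only 200 respondents
ncol(model.matrix(~ gender * education * country * religion, data = d))

Crossing in the five personality scores as well would multiply those 400 columns by another factor of 2^5 = 32.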
On the other hand, sometimes my income is just some constant times the hours I work in my first job plus some constant times the hours I work in my side job. There is really no point in multiplying my work hours together, and if you did, the resulting coefficient would have the unit "dollars per squared hour". You do not want to make that the default.
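As a small sketch of that last point (made-up hourly rates and hours), the additive model gives two interpretable dollars-per-hour coefficients, while the interaction term carries that odd dollars-per-squared-hour unit:

set.seed(42)
hours_first <- runif(30, 20, 40)   # hours in the first job
hours_side  <- runif(30, 0, 15)    # hours in the side job
income <- 25 * hours_first + 15 * hours_side + rnorm(30, sd = 50)

# additive model: both coefficients are in dollars per hour
coef(lm(income ~ hours_first + hours_side))

# with the interaction: the hours_first:hours_side coefficient is in
# dollars per squared hour and has no natural interpretation here
coef(lm(income ~ hours_first * hours_side))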