11

In supervised machine learning, and specifically on Kaggle, tree-based models often seem to outperform linear models. Even among tree-based models, XGBoost usually outperforms RandomForest, and RandomForest outperforms single DecisionTrees.

If this is not true, then please feel free to correct this assumption.

These are just my observations and somehow, a bunch of people share this opinion.

Why should we even use linear models such as linear regression or logistic regression, especially when they supposedly do not perform as well as tree-based models and have more requirements than tree-based models do?

A similar question can be raised within tree-based models: why use DecisionTrees instead of RandomForest or XGBoost?

Are there some cases where linear models should be preferred?

Nick Cox
  • 56,404
  • 8
  • 127
  • 185
letdatado
  • 325
  • 14
    Different tools for different questions. All of the examples you give are 'find whatever pattern fits the data best' (because you, the analyst, won't or can't specify it). There are many fields where it is equally if not more important for a model to tell you why (or at least how) it arrived at a specific prediction, i.e. for you to be able to infer feature importance or a data-generating mechanism. And in several cases you don't want the model to be driven by what is in your current dataset at all, this is how you get to 'confirmatory'. – PBulls Dec 26 '23 at 07:41

7 Answers

21

Many excellent answers already. I would add a few more aspects.

You do not say what you mean by "outperform", as in "tree models often outperform linear models". You presumably mean something like "tree models yield higher out-of-sample prediction accuracy". However:

The other answers allude to an issue that should be considered explicitly: accuracy is not the only performance dimension. There are at the very least aspects like:

  • Interpretability - where linear models give you parameter estimates you can interpret
  • Debuggability - which is easier for linear models than for more complex ones, related to interpretability as above - your users may ask you to improve "bad" predictions in specific cases, and it's much easier to do so for linear models
  • Robustness - which may or may not be better for trees than for linear models, especially if you regularize - you may prefer lower accuracy on average as long as you get fewer "really bad" predictions
  • Data requirements - as others noted, Kaggle usually has large datasets, but a real use case may have much less clean data, and simple methods may be surprisingly hard to beat especially in low data situations
  • Other resource requirements - one key aspect here being expertise; a complex model requiring a lot of different data may require a data engineer and a highly qualified data scientist to get to work (but other requirements may be computing resources or something as prosaic as electricity; Petropoulos et al., 2023)
  • Business value - everybody likes higher accuracy, but higher accuracy does not necessarily directly translate into higher business value (e.g., Kolassa, 2023, Foresight), and this needs to be compared to the higher resource costs per above

Data scientists naturally have a predilection for higher accuracy, and rightly so. But we do not work in a vacuum, and our models always need to be justified in the larger context, which includes other users and budgets.

Stephan Kolassa
  • 123,354
  • In point 5, which starts with "Other resource requirements", you mentioned that complex models require more resources. Don't you think that tree-based models have fewer assumptions and lower pre-processing requirements than linear models? This is what I have seen. I'd be glad to be corrected. – letdatado Dec 27 '23 at 11:58
  • 1
    On the one hand, I was also thinking of other methods than trees, like various flavors of neural networks and deep learning, and there, computing resource requirements are indeed often higher than for linear models (plus parameterizing the net requires a lot of expertise). For tree based methods, it all depends: growing a single tree is simple and fast - and likely not very good. Boosting is still rather fast, and better in terms of accuracy, but for large datasets, even training a boosted model will take a while. Per my comment to Sextus' question, it all depends on your situation. – Stephan Kolassa Dec 27 '23 at 12:50
16

For most problems it is easy to beat a tree model with a regression model. That's because tree models allow for all possible interactions among predictors, and these are seldom needed. Interactions are notoriously hard to estimate, being double or triple differences, etc., and much larger sample sizes are needed to estimate double differences (e.g., 4x the sample size needed to estimate simple differences, i.e., main effects, in an ideal case). Deficiencies in tree performance are often invisible to the user who does not assess absolute predictive accuracy of the tree or tree ensemble method. Once you get used to plotting unbiased estimates of smooth calibration curves you'll see what I mean.

An enlightening paper is this, which showed that tree methods require more than 10x the sample size of regression in order to be reliable. For a single tree it's much higher than that. Trees make poor use of available data, and single trees can't handle continuous predictors correctly. More here.

In my experience random forests set the record for horrendous calibration accuracy (and overfitting).
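
As a rough, hypothetical illustration of the kind of calibration comparison described above (this is not the answerer's own workflow: the answer advocates smooth calibration curves, whereas scikit-learn's display below uses simple bins; data and settings are invented for illustration):

    # Illustrative sketch only: compare the calibration of a logistic regression
    # and a random forest on held-out data (binned calibration curves).
    import matplotlib.pyplot as plt
    from sklearn.calibration import CalibrationDisplay
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    fig, ax = plt.subplots()
    for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                        ("random forest", RandomForestClassifier(random_state=0))]:
        model.fit(X_train, y_train)
        # A well-calibrated model tracks the diagonal: predicted probabilities
        # match observed frequencies.
        CalibrationDisplay.from_estimator(model, X_test, y_test, n_bins=10,
                                          name=name, ax=ax)
    plt.show()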

Frank Harrell
  • 91,879
  • 6
  • 178
  • 397
  • 4
    Anecdotally, this is not my experience working with moderate-sized data sets (~10k examples, ~20 features). Random forests have consistently outperformed regression models in terms of out-of-sample error in my experience with the type of data I work with. – Cliff AB Dec 27 '23 at 02:52
  • 1
    Cliff I am sure that only applies when using "default" regression, e.g., assuming all effects are linear. And make sure that "sample error" is a proper accuracy score. I have seen horrendous calibration from RFs with larger datasets than yours. – Frank Harrell Dec 27 '23 at 10:45
  • 1
    Related to this, anecdotally, I see most comparisons involving basic OLS with ML methods. This is misleading and does not level the playing field. When one can incorporate a rich internal structure (e.g. splines or even GAMs, sensible interactions, perhaps some regularization), regression models become far more competitive and often beat ML models. – Thomas Speidel Jan 03 '24 at 18:41
13

Linear regression also exists outside machine learning, where it requires much less data. Why does Machine Learning need a lot of data while one can do statistical inference with a small set of data?

It is largely a matter of the bias-variance trade-off. Kaggle problems often have large datasets that help to reduce variance and allow less biased, more flexible models like tree models. On the other hand, too much variation/flexibility is not good either (it leads to overfitting). If ensemble methods like random forest do better than a plain decision tree, it is because there is too much flexibility and it needs to be tamed.
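
As a rough sketch of this trade-off (synthetic data and settings invented purely for illustration, assuming scikit-learn), one can compare cross-validated errors of a linear model, a single tree, and a random forest on a small, mostly linear dataset:

    # Illustrative sketch: on a small, mostly linear dataset, the low-variance
    # linear model tends to win, the single tree overfits, and the forest tames
    # some (but not all) of that excess flexibility.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(size=100)

    models = {
        "linear regression": LinearRegression(),
        "decision tree": DecisionTreeRegressor(random_state=0),
        "random forest": RandomForestRegressor(random_state=0),
    }
    for name, model in models.items():
        # scikit-learn reports negated MSE; flip the sign for readability.
        scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
        print(f"{name:>18}: CV MSE = {-scores.mean():.2f}")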

Why do 'we' still use linear models?

You still get people who try linear models because that is how Kaggle works: people with all sorts of backgrounds give a problem a go. If they end up at the bottom of the ranking, it just means that the problem requires a flexible model like a tree model; it doesn't make linear regression useless for other problems. Linear regression works very well when the true/best model is close to the solution space of a linear model.

Also related: Why do we even bother running regression models?, where it is suggested to fit a plethora of models instead of a single one that makes sense (because we have enough data anyway).

For small problems a linear model might also be more easily interpretable. Such a situation is highlighted in the question Why would you perform transformations over polynomial regression?

You may tackle a bunch of data with all sorts of different models, as in the image below:

[Image: different fits]

But a simple model is sometimes easier to interpret, like this:

[Image: linear fit]

  • 10
    "Kaggle problems often have large datasets" - exactly. Relying on Kaggle to figure out "what works" is susceptible to selection bias for this exact reason. By all means, look to Kaggle to find inspiration, but to figure out what works in a specific situation, one needs to understand the specific situation, and something much simpler may add much more value. – Stephan Kolassa Dec 26 '23 at 12:34
  • 1
    Tree models also give no way to extrapolate. If you have training data that is really chosen iid from the population then trees are great. But if you need to extrapolate then they are of no use at all (I think). – Simd Dec 27 '23 at 11:22
  • @Simd that's a good point, but the occasions where it matters are a niche. The realm where tree models perform well is where linear models become too simple and their extrapolation possibilities are much less valuable, if not outright detrimental. So it must be in intermediate cases where there is a lot of variation away from simple linear models (which favours decision-tree-based models), but not so much as to completely invalidate linear models. In simple words, tree models are good when linear models make no sense, and if linear models make no sense, then extrapolation makes even less sense. – Sextus Empiricus Dec 27 '23 at 12:04
  • So yes, linear models can be used for extrapolation. But in the battles between tree models and linear models, where the outcomes differ only by a small margin, this extrapolation offers little advantage. – Sextus Empiricus Dec 27 '23 at 12:06
13

In addition to the excellent answers given already, a regression model yields parameter estimates, which tree models do not. If all you are interested in is prediction (which, as I understand it, is the case in Kaggle problems) then this might not matter so much. But science is about explanation, as well.
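
A minimal sketch of what such parameter estimates look like in practice (assuming statsmodels; the data are synthetic and purely illustrative):

    # Illustrative sketch: a fitted regression exposes coefficients, standard
    # errors, and confidence intervals; quantities a tree ensemble does not
    # report directly.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=200)

    result = sm.OLS(y, sm.add_constant(X)).fit()
    print(result.summary())    # coefficient estimates, standard errors, p-values
    print(result.conf_int())   # 95% confidence intervals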

Peter Flom
  • 119,535
  • 36
  • 175
  • 383
9

I share your observation that on well-behaved tabular data (many rows, not too many columns), a well-built boosted-trees model usually beats a random forest, and a random forest usually beats a single tree.

One of the main deficits of a tree-based model compared to a linear model is the wiggliness/non-smoothness of the prediction function. Another, of course, is reduced ease of interpretation.

One thing that I particularly like about boosted-trees models is that you can easily modulate their complexity via tree depth (1 = additive, 2 = pairwise interactions) and interaction constraints. Combining tree depth 2 with interaction constraints, for example, allows you to build models in which some features are modeled additively for maximal interpretability, while others bear pairwise interactions. Additionally, carefully set monotonicity constraints help to reduce the wiggliness of the prediction function. You won't find these options used on Kaggle, but they greatly improve the interpretability of the model.
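
As a minimal sketch of such a constrained model (assuming scikit-learn >= 1.2 for interaction_cst in HistGradientBoostingRegressor; the feature roles and constraint values here are invented for illustration, and XGBoost/LightGBM expose analogous parameters):

    # Illustrative sketch: depth-2 boosted trees with interaction and
    # monotonicity constraints (feature 0 enters additively and monotonically;
    # features 1 and 2 may interact with each other only).
    import numpy as np
    from sklearn.ensemble import HistGradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    y = X[:, 0] + np.sin(X[:, 1]) * X[:, 2] + rng.normal(scale=0.1, size=1000)

    model = HistGradientBoostingRegressor(
        max_depth=2,                    # at most pairwise interactions per tree
        interaction_cst=[{0}, {1, 2}],  # feature 0 additive; 1 and 2 may interact
        monotonic_cst=[1, 0, 0],        # prediction non-decreasing in feature 0
        random_state=0,
    )
    model.fit(X, y)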

References on constrained Boosted Trees

  1. Michael Mayer and Christian Lorentzen, Lecture notes "Responsible ML with Insurance Applications" (ETH Zurich), Chapter 3
  2. Python notebook with explanations in https://github.com/samathizer/boosted-trees (work in progress)

Edit

Another advantage of linear models: direct availability of inferential methods, even if they require additional assumptions to be sufficiently met.

Michael M
  • 11,815
  • 5
  • 33
  • 50
  • 2
    +1. Do you know of any tutorials, articles or examples which cover the techniques you describe in the last paragraph? – 8e9yQBKVlIDwoIVegfkJ Dec 26 '23 at 11:22
  • 2
    Good point! I have added two links to my own stuff. Implementations of both types of constraints are available, e.g., in XGBoost, LightGBM, and in Scikit-Learn's HistGradientBoostingRegressor. – Michael M Dec 26 '23 at 12:06
1

You can view XGBoost as a form of stepwise linear regression that successively adds nonlinear inputs (the trees).

So it's more a question of how complicated you want to make the model.
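
A rough way to see this stage-by-stage, additive buildup (a sketch using scikit-learn's GradientBoostingRegressor rather than XGBoost itself; all settings are illustrative):

    # Illustrative sketch: boosting builds its prediction as a sum of terms,
    # adding one tree (one nonlinear "input") at a time, loosely analogous to
    # stepwise regression adding terms.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_squared_error

    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
    model = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)

    # Training error shrinks as each successive tree is added to the ensemble.
    for i, y_stage in enumerate(model.staged_predict(X), start=1):
        if i % 10 == 0:
            print(f"after {i:2d} trees: training MSE = {mean_squared_error(y, y_stage):.1f}")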

seanv507
  • 6,743
0

To add to the mentions of interpretability of linear models: they tend to open the door to analytical explanations, that is, equations connected to other equations and constants for theoretical reasons, as in physics, where theory explains and experiments observe within error, e.g., V = IR.
