
This plot is taken from a gradient boosting regression example in the scikit-learn documentation. What does deviance mean? How should this plot be interpreted? In which cases does it indicate overfitting or underfitting? What improvements can we make to the model parameters based on this plot?

# Author: Peter Prettenhofer <peter.prettenhofer@gmail.com>
#         Maria Telenczuk <https://github.com/maikia>
#         Katrina Ni <https://github.com/nilichen>
#
# License: BSD 3 clause

import matplotlib.pyplot as plt
import numpy as np

from sklearn import datasets, ensemble
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=13
)

params = {
    "n_estimators": 500,
    "max_depth": 4,
    "min_samples_split": 5,
    "learning_rate": 0.01,
    "loss": "squared_error",
}

reg = ensemble.GradientBoostingRegressor(**params)
reg.fit(X_train, y_train)

mse = mean_squared_error(y_test, reg.predict(X_test))
print("The mean squared error (MSE) on test set: {:.4f}".format(mse))

The mean squared error (MSE) on test set: 3025.7877

# Record the test-set loss after each boosting stage.
test_score = np.zeros((params["n_estimators"],), dtype=np.float64)
for i, y_pred in enumerate(reg.staged_predict(X_test)):
    test_score[i] = mean_squared_error(y_test, y_pred)

fig = plt.figure(figsize=(6, 6))
plt.subplot(1, 1, 1)
plt.title("Deviance")
plt.plot(
    np.arange(params["n_estimators"]) + 1,
    reg.train_score_,
    "b-",
    label="Training Set Deviance",
)
plt.plot(
    np.arange(params["n_estimators"]) + 1, test_score, "r-", label="Test Set Deviance"
)
plt.legend(loc="upper right")
plt.xlabel("Boosting Iterations")
plt.ylabel("Deviance")
fig.tight_layout()
plt.show()

[Plot: training and test set deviance versus boosting iterations]

  • If you post some of your code where you specify the deviance or (more likely) the likelihood used in your boosting model, we can say a bit more about what your value means and how it is calculated. Technically, deviance is a function that satisfies a few axioms, but there are typical functions that have statistical meaning. – Dave Nov 28 '23 at 13:26
  • Does something in the documentation say that you are getting "deviance" values? Your code shows that you wrote that word yourself, and while it has a meaning in statistics that makes sense in this context, you just as easily could have used "Squid" as your axis label. – Dave Nov 28 '23 at 14:18

1 Answer


Deviance is a measure of model quality, typically (though not necessarily) related to the likelihood. The lower the deviance, the better the model fit. Perhaps think of it this way: models are good when, in some sense, the predicted values are close to the observed values, and deviance is the quantification of what it means to be close. Deviance functions do not necessarily satisfy the axioms of a metric, so "distance" between true and predicted values need not strictly apply; that is why we speak of the "deviance" between true and predicted values instead.
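
To make "the quantification of what it means to be close" concrete, here is a small sketch with invented toy numbers comparing two common deviance choices: the squared-error deviance used in regression and the binomial deviance used for binary outcomes.

import numpy as np

# Toy regression data: smaller deviance means predictions track observations.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Squared-error deviance: the average squared gap between truth and prediction.
squared_error_deviance = np.mean((y_true - y_pred) ** 2)
print(squared_error_deviance)  # 0.375

# Binomial deviance for binary outcomes: the average negative log-likelihood
# of the predicted probabilities (textbook definitions often multiply by 2).
y_bin = np.array([1, 0, 1, 1])
p_hat = np.array([0.9, 0.2, 0.7, 0.6])
binomial_deviance = -np.mean(
    y_bin * np.log(p_hat) + (1 - y_bin) * np.log(1 - p_hat)
)
print(binomial_deviance)  # ~0.30

In both cases lower is better; the two functions simply encode different notions of closeness for different kinds of outcomes.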

Out of context, it is hard to say if your deviance value of $\sim 3000$ is too high to make for a good model. In general, it is hard to say whether a particular measure of performance is good without some reference to the context. One takeaway from the plot, however, is that out-of-sample performance improves only minimally after about $150$ boosting iterations. You are not any worse off for continuing your boosting beyond $150$ or $200$ iterations, as the test deviance does not start to shoot up (which would signal overfitting), but the time you spend (which might also involve money if you have to buy that time on a service like AWS) is not leading to much improvement and could be argued not to be worth it.
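
If you want to stop paying for iterations that buy nothing, scikit-learn's GradientBoostingRegressor supports early stopping through its n_iter_no_change, validation_fraction, and tol parameters, and you can also scan staged_predict for the stage with the lowest held-out deviance. Below is a hedged sketch of both approaches; the specific values (10 stages of patience, tol=1e-4) are illustrative choices, not recommendations.

import numpy as np
from sklearn import datasets, ensemble
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

diabetes = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.1, random_state=13
)

# Option 1: built-in early stopping. Fitting halts once the score on an
# internal validation split has failed to improve by `tol` for
# `n_iter_no_change` consecutive stages.
reg = ensemble.GradientBoostingRegressor(
    n_estimators=500,
    max_depth=4,
    min_samples_split=5,
    learning_rate=0.01,
    loss="squared_error",
    n_iter_no_change=10,
    validation_fraction=0.1,
    tol=1e-4,
    random_state=13,
)
reg.fit(X_train, y_train)
print("Stages actually fit:", reg.n_estimators_)

# Option 2: fit all 500 stages, then locate the stage with the lowest
# test-set MSE among the staged predictions.
full = ensemble.GradientBoostingRegressor(
    n_estimators=500,
    max_depth=4,
    min_samples_split=5,
    learning_rate=0.01,
    loss="squared_error",
    random_state=13,
)
full.fit(X_train, y_train)
test_mse = [
    mean_squared_error(y_test, y_pred) for y_pred in full.staged_predict(X_test)
]
print("Best number of boosting iterations:", int(np.argmin(test_mse)) + 1)

Note that the second option selects the iteration count on the test set itself, which leaks the test data into model selection; in practice you would use a separate validation split for that choice.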

EDIT (It's mean squared error.)

From the code, it is evident that deviance is calculated as the mean squared error: $\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat y_i\right)^2$. Why they elect to refer to this quantity by the more obscure "deviance" instead of the quite typical "mean squared error" is not clear. Perhaps the goal is to be general, so the axis labels would apply even if a binomial deviance were used, such as in a binary "classification" problem. Still, I am not sold on the pedagogy.
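
As a quick sanity check on that formula, mean_squared_error is nothing more than the average squared residual; a toy example with invented numbers:

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([140.0, 200.0, 90.0])
y_hat = np.array([150.0, 180.0, 100.0])

# The formula from above: (1/n) * sum of (y_i - yhat_i)^2
manual = np.mean((y_true - y_hat) ** 2)
assert np.isclose(manual, mean_squared_error(y_true, y_hat))
print(manual)  # 200.0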

Dave