
When I fit a GBM boosting model to the Caravan data set and predict whether there will be a purchase, I get all positive values. I thought I was supposed to transform the predictions so that anything greater than 50% is "Yes" and otherwise "No". I am not sure how to interpret this: which values correspond to "Yes" and which to "No"?

library(ISLR2)
library(gbm)  # needed for gbm() and predict()
data(Caravan)
Caravan.train = Caravan[1:1000,]
Caravan.test = Caravan[-c(1:1000),]
caravan.boost = gbm(Purchase~., data = Caravan.train,
                    distribution = "bernoulli", n.trees = 1000,
                    interaction.depth = 4, shrinkage = 0.01)
caravan.inf = summary(caravan.boost)
caravan.inf2 = caravan.inf[which(!(caravan.inf$rel.inf==0)),]
caravan.sort = caravan.inf2[order(-caravan.inf2$rel.inf),]
# The most important variables are:
# PPERSAUT, PBRAND, MKOOPKLA, MGODGE, MOPLHOOG, MOSTYPE, MINK3045 and
# MBERMIDD.
caravan.pred = predict(caravan.boost, newdata = Caravan.test, n.trees = 1000)

> caravan.pred
   [1] 1.0371814 1.0948238 1.0139326 1.1473339 1.2249934 1.2160427 1.0242501
   [8] 1.1058666 1.1312732 0.9999499 1.0495137 0.9754117 1.0317667 1.0338068
  [15] 1.0238197 1.0538183 1.1044108 1.1857257 0.9775377 1.0494232 0.9663663
[remaining data removed for clarity]

When I try to log-transform them I get lots of NaNs:

> log(caravan.pred/(1-caravan.pred))
   [1]       NaN       NaN       NaN       NaN       NaN       NaN       NaN
   [8]       NaN       NaN  9.902012       NaN  3.680591       NaN       NaN
  [15]       NaN       NaN       NaN       NaN  3.773200       NaN  3.358013
[remaining data removed for clarity]
Warning message:
In log(caravan.pred/(1 - caravan.pred)) : NaNs produced
  • That is a logit transform, defined only for probabilities (arguments) in (0, 1). I am not familiar with this model and don't use this software, but evidently the logit transformation makes no sense here. – Nick Cox Feb 18 '23 at 08:33
  • It makes no sense, and will be mathematically problematic, to model the logarithms of purchase amounts as a bernoulli variable. But that's what your notation suggests. If that's not the case, please tell us what kinds of numerical quantities are stored in your Purchase variable. – whuber Feb 18 '23 at 15:42

1 Answer


I think the transformation to which you refer is to convert a log-odds $L$ to a probability $p$.

$$ \log\left(\dfrac{ p }{ 1-p }\right)=L\\ \Updownarrow\\ p=\dfrac{ 1 }{ 1+e^{-L} } $$

You seem to have the log-odds of an event, so it is just a bit of a calculation to convert to the probability.
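In R, a minimal sketch of that conversion, using a few of the log-odds values printed in the question for illustration (`plogis` is base R's logistic CDF, which computes exactly this inversion):

```r
# A few of the predicted log-odds from the question, for illustration
caravan.pred <- c(1.0371814, 1.0948238, 0.9999499, 0.9754117)

# Manual inversion of log(p / (1 - p)) = L:
p_manual <- 1 / (1 + exp(-caravan.pred))

# Equivalent built-in logistic function:
p_prob <- plogis(caravan.pred)

all.equal(p_manual, p_prob)  # TRUE
round(p_prob, 3)
```

All of these log-odds are around 1, so the corresponding probabilities are all around 0.73.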

As far as why all of your log-odds are positive: there is imbalance in your $Purchase$ outcome variable, so you are telling the model to expect most of the outcomes to belong to one class. It is quite typical for models of imbalanced problems to return probabilities that greatly favor the majority category. I will venture a guess that, internally, the fitting function codes the majority category as $1$ and the minority category as $0$, resulting in predictions that favor membership in the majority category corresponding to group $1$.
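You can see the imbalance directly in the data (in the ISLR2 `Caravan` data, only about 6% of customers purchased):

```r
library(ISLR2)
data(Caravan)

# Counts of each outcome class
table(Caravan$Purchase)
#   No  Yes
# 5474  348

# As proportions: roughly 94% "No" vs 6% "Yes"
prop.table(table(Caravan$Purchase))
```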

As far as which predictions correspond to which category: all of your predicted log-odds are above $0$, so all of your predicted probabilities will be above $0.5$. If you set $0.5$ as the threshold for assigning an observation to a category, then every categorical prediction is the same. If that is unacceptable, you can consider a different threshold or evaluate the predicted probabilities directly.
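A sketch of the thresholding, using illustrative log-odds values: since a log-odds of $0$ corresponds to $p = 0.5$, thresholding the probabilities at $0.5$ is equivalent to thresholding the log-odds at $0$, and you can lower the threshold if the default labels every observation the same way.

```r
# Illustrative log-odds, like those printed in the question
caravan.pred <- c(1.04, 1.09, 0.98, -0.35)

p <- plogis(caravan.pred)  # convert to probabilities

# Default 0.5 threshold on the probability scale ...
pred_class <- ifelse(p > 0.5, "Yes", "No")

# ... is equivalent to thresholding the log-odds at 0:
pred_class2 <- ifelse(caravan.pred > 0, "Yes", "No")

identical(pred_class, pred_class2)  # TRUE

# With heavy imbalance, a lower threshold (e.g. 0.20, the value
# used in the ISLR lab for this data set) may be more useful:
ifelse(p > 0.20, "Yes", "No")
```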

An important Meta discussion on class imbalance that links to lots of good information

Additional pertinent links

Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?

Academic reference on the drawbacks of accuracy, F1 score, sensitivity and/or specificity

Why is accuracy not the best measure for assessing classification models?

Our Frank Harrell has two good related blog posts, too.

Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules

Classification vs. Prediction

Dave
  • Could you explain how you determined there is "imbalance in your Purchase outcome variable" and what you mean by that? I have been unable even to determine what kinds of numerical quantities are stored in it, but the OP refers to "log-transform the data." – whuber Feb 18 '23 at 15:44
  • @whuber table(Caravan$Purchase) // I do not see a logarithm taken in the code, so I am left to speculate that what is meant is the log-odds like in a logistic regression, and I think the OP is taking the log-odds instead of inverting the log-odds to get probabilities. – Dave Feb 18 '23 at 15:59
  • Thanks -- I didn't see that the data are available with an R package, which indicates Purchase is either "Yes" or "No". – whuber Feb 18 '23 at 16:07