10

Let's say I am building a logistic regression model where the dependent variable is binary and can take the values $0$ or $1$. Let the independent variables be $x_1, x_2, \ldots, x_m$ (there are $m$ independent variables). Say the bivariate analysis for the $k$th independent variable shows a U-shaped trend: if I group $x_k$ into $20$ bins, each containing a roughly equal number of observations, and calculate the 'bad rate' for each bin (the number of observations with $y = 0$ divided by the total number of observations in the bin), then I get a U-shaped curve.

My questions are:

  1. Can I directly use $x_k$ as input while estimating the beta parameters? Are any statistical assumptions violated which might cause significant error in estimating the parameters?
  2. Is it necessary to 'linearize' this variable through a transformation (log, square, product with itself, etc.)?
  • Is $X_k$ one of the predictors you want to include in the model? It sounds like you're saying that $P(Y=1)$ is a U-shaped function of $X_k$. A U-shaped curve can often be well approximated by a quadratic function (especially near the valley) - have you considered including a linear and quadratic term in that variable into the model? – Macro Jul 17 '12 at 17:09
  • @Macro thanks for your suggestion. I have seen some modellers fitting a piecewise-linear function (assuming a sharp U shape) - where each line is estimated from the data with the breaks determined visually and then the output of the linear equation is provided as an input to the logistic regression. I am not a big fan of the approach though. – Mozan Sykol Jul 17 '12 at 18:31
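Macro's quadratic suggestion is straightforward to try. Below is a minimal sketch on simulated data (all numbers and variable names are illustrative, not from the thread), using a plain Newton-Raphson fit rather than any particular package, showing that simply adding an $x_k^2$ column to the design matrix lets an ordinary logistic regression recover a U-shaped relationship:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated U-shaped relationship: P(y=1) is lowest near x = 0.
x = rng.uniform(-3, 3, 2000)
p_true = 1.0 / (1.0 + np.exp(-(-1.0 + 0.8 * x**2)))  # quadratic logit
y = rng.binomial(1, p_true)

# Design matrix with intercept, linear, and quadratic term.
X = np.column_stack([np.ones_like(x), x, x**2])

# Plain Newton-Raphson maximum likelihood fit of the logistic model.
beta = np.zeros(3)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))      # fitted probabilities
    grad = X.T @ (y - mu)                     # score vector
    W = mu * (1.0 - mu)                       # IRLS weights
    hess = X.T @ (X * W[:, None])             # observed information
    beta += np.linalg.solve(hess, grad)

print(beta)  # estimates should be near the true (-1.0, 0.0, 0.8)
```

The point is only that no transformation of $x_k$ is needed beyond adding the squared column; the model stays an ordinary logistic regression.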

3 Answers

14

You would want to use a flexible formulation that would capture non-linearity automatically, e.g., some version of a generalized additive model. A poor man's choice is a polynomial $x_k$, $x_k^2$, ..., $x_k^{p_k}$, but such polynomials produce terrible overswings at the ends of the range of their respective variables. A much better formulation would be to use (cubic) B-splines (see a random intro note from the first page of Google here, and a good book, here). B-splines are a sequence of local humps:

http://ars.sciencedirect.com/content/image/1-s2.0-S0169743911002292-gr2.jpg

The height of the humps is determined from your (linear, logistic, other GLM) regression, as the function you are fitting is simply

$$ \theta = \beta_0 + \sum_{k=1}^K \beta_k B\Bigl( \frac{x-x_k}{h_k} \Bigr) $$

for the specified functional form of your hump $B(\cdot)$. By far the most popular version is a bell-shaped smooth cubic spline:

$$ B(z) = \left\{ \begin{array}{ll} \frac14 (z+2)^3, & -2 \le z \le -1 \\ \frac14 \bigl(3|z|^3 - 6z^2 + 4\bigr), & -1 < z < 1 \\ \frac14 (2-z)^3, & 1 \le z \le 2 \\ 0, & \mbox{otherwise} \end{array} \right. $$

On the implementation side, all you need to do is set up however many knots $x_k$ are reasonable for your application (3, 5, 10, whatever) and create the corresponding variables in the data set holding the values of $B\Bigl( \frac{x-x_k}{h_k} \Bigr)$. Typically, a simple grid of values is chosen, with $h_k$ equal to twice the mesh size of the grid, so that at each point there are two overlapping B-splines, as in the above plot.
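For concreteness, here is one way the recipe above might be coded (the knot grid and bandwidth are illustrative choices, not prescriptions; the basis columns would then be fed into any GLM routine):

```python
import numpy as np

def cubic_bspline(z):
    """The bell-shaped cubic B-spline from the answer, supported on [-2, 2]."""
    a = np.abs(np.asarray(z, dtype=float))
    out = np.zeros_like(a)
    mid = a < 1
    out[mid] = 0.25 * (3 * a[mid]**3 - 6 * a[mid]**2 + 4)
    edge = (a >= 1) & (a <= 2)
    out[edge] = 0.25 * (2 - a[edge])**3
    return out

# Knots on a simple grid over the range of the variable; the bandwidth is
# twice the mesh size, as suggested, so neighbouring humps overlap.
knots = np.linspace(0, 10, 6)      # 6 knots, mesh size 2
h = 2 * (knots[1] - knots[0])      # bandwidth = twice the mesh

# One new column per knot, ready to enter the regression as covariates.
x = np.linspace(0, 10, 101)
basis = np.column_stack([cubic_bspline((x - k) / h) for k in knots])
print(basis.shape)
```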

StasK
  • How can you tell that he needs something this complicated? Certainly a quadratic term in a covariate x that appears to have that shape could be incorporated in the logistic regression model just the way the OP seems to want it. – Michael R. Chernick Jul 17 '12 at 19:49
  • @MichaelChernick Yes I accepted the answer as it taught me a new concept, but I will probably not need such a complicated solution. – Mozan Sykol Jul 17 '12 at 19:58
  • @Michael That's an important point. I am encouraged, though, by a comment by the OP mentioning an ad hoc piecewise linear fitting procedure. Splining works in the same spirit but with greater flexibility and rigor. Quadratic terms might work but that seems like a lot to hope for. – whuber Jul 17 '12 at 19:58
  • Looks more like piecewise constant to me: the mean value of the response within a bin is equivalent to the regression on the dummy variable/indicator of that bin, and that's piecewise constant... @MichaelChernick: there's no telling, generally, but I have yet to see an application where B-splines would be inferior to a polynomial fit. – StasK Jul 19 '12 at 05:05
  • @StasK If inferior means that it doesn't fit the data as well, then I think a small sample size would favor a simple polynomial. – Michael R. Chernick Jul 19 '12 at 09:57
  • How do you choose the "whatever" number of knots? AIC? – cdalitz Jul 04 '22 at 12:31
5

Just like linear regression, logistic regression and, more generally, generalized linear models are required to be linear in the parameters but not necessarily in the covariates, so polynomial terms like the quadratic that Macro suggests can be used. This is a common misunderstanding of the 'linear' in generalized linear models: nonlinear models are models that are nonlinear in the parameters. If the model is linear in the parameters and contains additive noise terms that are IID, the model is linear even if there are covariates like $X^2$, $\log X$, or $\exp(X)$. As I now read the question it seems to have been edited. My specific answer would be: yes to 1, and not necessary for 2.

1

Another viable alternative that the modeling shop I work for routinely employs is binning the continuous independent variables and substituting the 'bad rate'. This forces a linear relationship.
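As a rough sketch of this binning-and-substitution idea (simulated data; 20 quantile bins, as in the question's bivariate analysis; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=5000)
# Simulated U-shaped pattern: 'bad' (y = 0) is most likely at the extremes of x.
p_good = 1.0 / (1.0 + np.exp(0.8 * x**2 - 1.0))
y = rng.binomial(1, p_good)

# Quantile bins with roughly equal counts per bin.
n_bins = 20
edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
edges[0], edges[-1] = -np.inf, np.inf          # cover the full range
bin_id = np.searchsorted(edges, x, side="right") - 1

# Bad rate per bin: share of observations with y = 0.
bad_rate = np.array([(y[bin_id == b] == 0).mean() for b in range(n_bins)])

# Substitute the bin's bad rate for the raw value of x: this single
# transformed column now relates monotonically to the outcome.
x_transformed = bad_rate[bin_id]
```

The U-shape in the raw variable becomes a monotone relationship in the transformed one, which is what "forces a linear relationship" refers to.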

Zelazny7
  • I've also seen this. How do you choose the bins? – Adam Ryczkowski Oct 01 '15 at 13:01
  • There are a variety of discretization methods you can use. R has the disco package. I created my own algorithm that recursively splits a continuous variable based on information value. I put it in an R package here: https://github.com/Zelazny7/binnr (work in progress!). I also would substitute the weight-of-evidence instead of the mean. When paired with LASSO regression the results are fantastic! – Zelazny7 Oct 02 '15 at 02:50
  • Thank you! Can you compare the binnr algorithm with the CRAN's smbinning? – Adam Ryczkowski Oct 02 '15 at 09:47
  • 'binnr' is written in C and is very fast. It also supports real world conditions like override values and monotonicity. It is possible to tell 'binnr' to only make discrete cuts when the bad rates are monotonic. 'smbinning' uses conditional inference trees which, in my experience, take a long time on decent sized data sets because they rely on permutations. 'binnr' also allows you to interactively make adjustments to the bins with a command line interface – Zelazny7 Oct 02 '15 at 12:22
  • Did you use it in production? Are you aware of any hidden bugs or unchecked corner cases? I think about using it for a large customer and I will have only 1 day to test it before applying your code. – Adam Ryczkowski Oct 02 '15 at 12:33
  • My company doesn't implement in R so I created functions that export the code to SAS. I would not consider the package production ready for the reasons you mention. It has not been used enough to find the corner cases yet. But I have been using it to build models for a couple of months and so far so good. But like I said, we convert them to SAS code. You can do that by executing cat(sas(my.bin.object)) – Zelazny7 Oct 02 '15 at 12:50
  • Thanks. In fact we also use SAS, I just lobby a little for R. AFAIK there is no logistic LASSO regression in SAS, so I can't fully duplicate that part of the inference in our SAS environment. – Adam Ryczkowski Oct 02 '15 at 13:32
  • I use the glmnet package in R and simply use it to grab the coefficients. Coding the final equation is just a matter of summing the coefficients multiplied by the predictors. binnr prints out SAS code for this as well when it is passed a vector of coefficients. – Zelazny7 Oct 02 '15 at 13:56
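For readers unfamiliar with the weight-of-evidence substitution mentioned in the comments above, here is a minimal sketch using the usual credit-scoring definition, WoE = log(share of goods in bin / share of bads in bin); the smoothing constant `eps` and the toy data are my own assumptions, not something from the thread:

```python
import numpy as np

def weight_of_evidence(y, bin_id, n_bins, eps=0.5):
    """WoE per bin; eps is a small additive count to avoid log(0)."""
    goods_total = (y == 1).sum()
    bads_total = (y == 0).sum()
    woe = np.empty(n_bins)
    for b in range(n_bins):
        in_bin = bin_id == b
        goods = (y[in_bin] == 1).sum() + eps   # 'good' outcomes in the bin
        bads = (y[in_bin] == 0).sum() + eps    # 'bad' outcomes in the bin
        woe[b] = np.log((goods / goods_total) / (bads / bads_total))
    return woe

# Toy example: two bins, the second much 'worse' than the first.
y = np.array([1, 1, 1, 0, 1, 0, 0, 0])
bin_id = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(weight_of_evidence(y, bin_id, 2))
```

As with the bad-rate substitution, the WoE values would replace the raw binned variable before it enters the logistic regression.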