
I would like to perform inference on a binary classification problem.

I have a logistic regression with a mix of binary and continuous inputs. I would like to perform a feature cross (e.g. add some non-linear interaction terms) in my regression and then infer how these variables increase or decrease the probability of observing a positive result. So, I would like to be able to make statements about my predictors while taking into consideration the non-linear interactions between them.

So if I have a binary variable x1 and a continuous variable x2 (ranging from 0 to 1) and perform a simple logistic regression, using statsmodels for example, I might get back something along the lines of...

| variable | coef | other columns (std error, p-value, etc.) |
| --- | --- | --- |
| constant | -4 | ~ |
| x1 | -2 | ~ |
| x2 | -0.2 | ~ |

From this, assuming everything is significant, I might conclude that increases in x1 and x2 are both associated with a decreased probability of observing a positive result. But suppose I also believed there might be a non-linear effect and wanted to test a logistic regression with the formula ~ a*x1 + b*x2 + c*x1*x2 instead.
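For concreteness, a minimal sketch of how I imagine fitting that interaction model with statsmodels' formula API (the data here are simulated from the made-up coefficients shown in the output below, just so the snippet runs on its own):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical simulated data: binary x1, continuous x2 on [0, 1], and an
# outcome generated from the made-up coefficients used in this question
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({"x1": rng.integers(0, 2, n), "x2": rng.uniform(0, 1, n)})
log_odds = -4 - 3 * df["x1"] - 0.2 * df["x2"] + 6 * df["x1"] * df["x2"]
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# Fit the interaction model; x1:x2 is the product term
result = smf.logit("y ~ x1 + x2 + x1:x2", data=df).fit()
print(result.summary())
```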

I would get the following output from statsmodels

| variable | coef | other columns (std error, p-value, etc.) |
| --- | --- | --- |
| constant | -4 | ~ |
| x1 | -3 | ~ |
| x2 | -0.2 | ~ |
| x1*x2 | 6 | ~ |

I can see that the non-linear term increases the probability of a positive result, but the coefficient for x2 has also changed. That is expected, but the inference is no longer simple. If there were no uncertainty in the coefficient estimates, it would be easy to say that when x1 is 1 and x2 is above 3/5.8, the probability of a positive result increases: -3*x1 - 0.2*x2 + 6*x1*x2 > 0 simplifies to -3 + 5.8*x2 > 0 when x1 = 1. But since my coefficient estimates come with standard errors, I can't rely on this simple algebra.
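To make that point-estimate algebra concrete (using only the hypothetical coefficients above and ignoring their uncertainty):

```python
import numpy as np

# Hypothetical point estimates taken from the table above
b_x1, b_x2, b_x1x2 = -3.0, -0.2, 6.0

x2 = np.linspace(0, 1, 6)
# Contribution of the x1 and x2 terms to the log-odds when x1 = 1
contribution = b_x1 * 1 + b_x2 * x2 + b_x1x2 * 1 * x2
print(np.column_stack([x2, contribution]))  # sign flips near x2 = 3/5.8 ≈ 0.52
```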

Ultimately, is there some way to gauge whether or not variables like x1 and x2 in fact increase the probability of observing a positive result when factoring in non-linear interactions like the one I outlined? Is there perhaps a simpler way to perform such an analysis that I have overlooked (maybe logistic regression is not the way to go)? Finally, would such an approach change if x1 and x2 were both continuous, or both binary?

I am usually more interested in the construction of classifiers than in the inferences one can make from the fitted model, so I apologize if this question comes across as naïve.

  • These kinds of marginal effects are supported in Stata by the margins command. However, the corresponding margeff in statsmodels only supports single-column terms, not multi-column terms such as interaction effects. The marginal effect in nonlinear models depends on the values of the explanatory variables, and the effect might not be monotonic over the space of explanatory variables. – Josef Oct 10 '22 at 17:03

1 Answer


is there some way to gauge whether or not variables like x1 and x2 in fact increase the probability of observing a positive result when factoring in non-linear interactions

When there's a significant interaction, there's no simple way to evaluate the overall association between a single predictor and the outcome. The association between x1 and the outcome depends on the value of x2, and vice versa. Even the single-predictor coefficients become tricky to interpret: each represents the association with the outcome only when its interacting predictors are at their reference levels (categorical predictors) or at 0 (continuous predictors). Just centering a predictor can change the single-predictor coefficients of all the other predictors it interacts with.
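To see the centering point concretely, here is a quick sketch in statsmodels terms, continuing the hypothetical interaction fit from the question (I haven't verified this against statsmodels myself, so treat `df`, `result`, and the column names as assumptions):

```python
import statsmodels.formula.api as smf

# Centering x2 changes the coefficient (and test) reported for x1, because that
# coefficient now describes the x1 association at the mean of x2 rather than at x2 = 0.
df["x2_c"] = df["x2"] - df["x2"].mean()
result_c = smf.logit("y ~ x1 + x2_c + x1:x2_c", data=df).fit()
print(result.params["x1"], result_c.params["x1"])
```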

For inference on particular combinations of predictor values, as you propose, you can use the formula for the variance of a sum of correlated variables to get standard errors and confidence intervals, based on the asymptotic normality of the coefficient estimates in logistic regression. That handles the "range of values" of the coefficient estimates. For inference on the overall association between the outcome and a predictor involved in interactions or other nonlinear terms, you can perform a multiple-parameter Wald test on the set of coefficients involving that predictor. For those analyses you need the covariance matrix of the coefficient estimates, which standard model summaries typically don't display but which is stored in the fitted model object.

There are many tools available for this type of post-modeling analysis in R, for example in the rms, emmeans, and car packages. I don't use statsmodels but I imagine that there is some similar functionality provided by Python packages.
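As a rough, untested sketch of how the pieces above might be assembled with statsmodels, again continuing the hypothetical interaction fit from the question (`result` and the column label `x1:x2` are assumptions to check against your own output):

```python
import numpy as np
from scipy import stats

params = result.params        # coefficient estimates (a pandas Series)
cov = result.cov_params()     # covariance matrix of the coefficient estimates

# Effect of x1 on the log-odds at a chosen value of x2: b_x1 + x2_val * b_x1:x2,
# a linear combination of correlated estimates, so its variance is L' Cov L.
x2_val = 0.8
L = np.zeros(len(params))
L[params.index.get_loc("x1")] = 1.0
L[params.index.get_loc("x1:x2")] = x2_val

est = float(L @ params.values)
se = float(np.sqrt(L @ cov.values @ L))
z = stats.norm.ppf(0.975)
print(est, se, (est - z * se, est + z * se))   # estimate with 95% Wald confidence interval

# Multiple-parameter Wald test that x1 contributes nothing: b_x1 = 0 and b_x1:x2 = 0 jointly
R = np.zeros((2, len(params)))
R[0, params.index.get_loc("x1")] = 1.0
R[1, params.index.get_loc("x1:x2")] = 1.0
print(result.wald_test(R))
```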

EdM
  • Out of curiosity, if I had 10 predictors instead, would I have to test all 45 possible pairwise interaction terms at once in a single model? Or could I still make statements about statistically significant interactions by testing pairs of predictors one at a time in a logistic model with a single interaction term (similar to what I outlined in my original post), focusing on the interaction terms with the largest magnitudes and statistically significant coefficients? – delsaber8 Oct 09 '22 at 18:41
  • @delsaber8 logistic regression has a problem with omitted-variable bias: any restriction to a subset of predictors is likely to lead to bias. After fitting a full model you can get post-modeling Wald chi-square statistics for any set of coefficients, whether for individual predictors or for interaction terms; after subtracting the number of degrees of freedom (the mean under the null hypothesis), these can be used to evaluate relative contributions to the model. See this answer and its links. – EdM Oct 09 '22 at 19:30