The $\chi^2$ distribution describes a sum of squared independent standard normal variables. Although students usually first encounter it in the context of contingency tables, it has much wider use.
It's the basis of likelihood-ratio tests. The Wald test for coefficients of generalized linear models is based on an asymptotic $\chi^2$ distribution. The F-distribution used in analysis of variance and ordinary linear regression is based on a ratio of $\chi^2$-distributed variables.
So a $\chi^2$ value that combines all coefficients involving a predictor (nonlinear terms and interactions, or all levels of a categorical predictor) provides a useful summary of that predictor's contribution to any regression model. If the predictors use up different degrees of freedom, comparison is best done by subtracting the corresponding degrees of freedom (the mean under the null hypothesis) from each $\chi^2$ value.
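To spell out the statistic used below (notation mine): for the block of coefficient estimates $\hat\beta_p$ involving predictor $p$, with $V_p$ the matching block of the coefficient covariance matrix, the Wald statistic is

$$W_p = \hat\beta_p^\top V_p^{-1} \hat\beta_p ,$$

which under the null hypothesis is asymptotically $\chi^2$ with degrees of freedom equal to the number of coefficients in the block, so its null-hypothesis mean is that number of degrees of freedom.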
That said, be very wary of such attempts at automated model selection. Section 5.4 of Frank Harrell's course notes and book illustrates how unstable such $\chi^2$-based variable selection can be.
Illustration of this type of $\chi^2$ for predictor comparison
Other answers have shown that the scikit-learn function in question bins the continuous features to generate a contingency table. Here's an example of how you could use the Wald $\chi^2$ to evaluate predictor importance without binning. With the iris data set, fit a multinomial regression of Species on the continuous predictors.
library(nnet)
mnIris <- multinom(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                   data = iris, Hess = TRUE, maxit = 200)
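As a quick sanity check (my addition, not essential to the calculation), the coefficient matrix shows why each continuous predictor contributes 2 coefficients: multinom() estimates one set of coefficients for each non-reference level of Species.

# One row per non-reference Species level (versicolor, virginica),
# one column per model term, so 2 coefficients per continuous predictor.
coef(mnIris)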
With code (shown below) that extracts the 2 coefficients for each continuous predictor and the corresponding block of the coefficient covariance matrix, display the $\chi^2$ values.
for (pred in names(iris)[1:4]) WaldChisq(mnIris, pred)
# Sepal.Length 1.093174
# Sepal.Width 2.292513
# Petal.Length 3.979784
# Petal.Width 3.525995
These are all based on 2 degrees of freedom, so they can be compared directly. Admittedly, this won't scale to large data sets as efficiently as the scikit-learn binning, but it does demonstrate a use of $\chi^2$ statistics for predictor comparison without a contingency table.
The function to get single-predictor $\chi^2$ statistics from the multinomial model:
WaldChisq <- function(model, predictor) {
  cat(predictor, "\t")
  # Coefficients for this predictor, one per non-reference outcome level
  coefs <- data.frame(coef(model))[, predictor]
  # Matching block of the coefficient covariance matrix
  vcovSub <- vcov(model)[grepl(predictor, rownames(vcov(model))),
                         grepl(predictor, colnames(vcov(model)))]
  # Wald statistic: t(coefs) %*% solve(vcovSub) %*% coefs
  cat(as.numeric(coefs %*% solve(vcovSub, coefs)), "\n")
}
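If you also want the degrees of freedom and a p-value rather than a printed statistic, a minimal variant along the same lines could return them instead (this is my own sketch; the function name is illustrative and not part of the original code):

# Sketch: return the Wald statistic, its df, and a p-value instead of
# printing. Hypothetical helper, shown for illustration only.
WaldChisqTest <- function(model, predictor) {
  coefs <- data.frame(coef(model))[, predictor]
  keep <- grepl(predictor, rownames(vcov(model)))
  vcovSub <- vcov(model)[keep, keep]
  chisq <- as.numeric(coefs %*% solve(vcovSub, coefs))
  df <- length(coefs)
  c(chisq = chisq, df = df,
    p.value = pchisq(chisq, df = df, lower.tail = FALSE))
}
# e.g., sapply(names(iris)[1:4], WaldChisqTest, model = mnIris)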