8

I thought that the chi-squared (χ²) test is to be used when one has an r × c contingency matrix, i.e., when the dependent variables are nonnegative, fall into the same r categories, and typically represent a count, frequency, or boolean value (cf., for example, p. 124 of Practical Statistics for Data Scientists by Bruce and Bruce).

I'm therefore confused by this scikit-learn tutorial, which uses the χ² test to rank the features of the Iris dataset. Aren't those features continuous? Where, then, is the rectangular contingency matrix from which the Pearson residuals, and hence the chi-squared value, are derived?

PS: Here, I'd consider the c flower categories as the independent variables and the four features as the dependent variables. Would that be accurate?

Tfovid
  • flower categories are what you are predicting, so they are the dependent variable and the features are independent. – rep_ho Oct 05 '22 at 12:12
  • @rep_ho Yes, but in the context of feature selection and ranking, aren't we instead taking a generative P(X|Y) approach as opposed to a discriminative P(Y|X) one? Besides, even if this weren't the case, I don't see where the contingency matrix arises, as we still have continuous variables in the picture. – Tfovid Oct 05 '22 at 12:15
  • Glancing at your linked scikit page, it seems to be selecting the $k$ "best features" that do best at predicting the iris species. So presumably it compares the predicted species (using various subsets of features) and the actual species in something equivalent to a contingency table. You do not need actual 0-1 predictions of species; you could do it with probabilities, which give you an expected number in each cell, enabling the $\sum \frac{(O-E)^2}{E}$ calculation for your $\chi^2$ test. – Henry Oct 05 '22 at 13:33

4 Answers

10

The $\chi^2$ distribution describes a sum of squared independent standard normal variables. Although students usually first encounter it in the context of contingency tables, it has much wider use.

It's the basis of likelihood-ratio tests. The Wald test for coefficients of generalized linear models is based on an asymptotic $\chi^2$ distribution. The F-distribution used in analysis of variance and ordinary linear regression is based on a ratio of $\chi^2$-distributed variables.

So a $\chi^2$ value including all coefficients involving a predictor in a model (including nonlinear terms and interactions, or all levels of a categorical predictor) provides a useful summary of the contribution of that predictor to any regression model. If the predictors use up different degrees of freedom, comparison is best done by subtracting the corresponding degrees of freedom (the mean under the null hypothesis) from each $\chi^2$.

That said, be very wary of such attempts at automated model selection. Section 5.4 of Frank Harrell's course notes and book illustrates how unstable such $\chi^2$-based variable selection can be.

Illustration of this type of $\chi^2$ for predictor comparison

Other answers have shown that the scikit-learn function in question bins the continuous features to generate a contingency table. Here's an example of how you could use the Wald $\chi^2$ to evaluate predictor importance without binning. With the iris data set, fit a multinomial regression of Species on the continuous predictors.

library(nnet)
mnIris <- multinom(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                   data = iris, Hess = TRUE, maxit = 200)

Using a helper function (shown below) to extract the 2 coefficients for each continuous predictor and the corresponding block of the coefficient covariance matrix, display the $\chi^2$ values.

for(pred in names(iris)[1:4]) WaldChisq(mnIris,pred)
# Sepal.Length  1.093174 
# Sepal.Width   2.292513 
# Petal.Length  3.979784 
# Petal.Width   3.525995 

These are all based on 2 degrees of freedom so they can be compared directly. Admittedly, this won't scale to large data sets as efficiently as the scikit-learn binning, but it does demonstrate a use of $\chi^2$ statistics for predictor comparison without a contingency table.


The function to get single-predictor $\chi^2$ statistics from the multinomial model:

WaldChisq <- function(model, predictor) {
    cat(predictor, "\t")
    # coefficients for this predictor across the outcome equations
    coefs <- data.frame(coef(model))[, predictor]
    # matching block of the coefficient covariance matrix
    keep <- grepl(predictor, rownames(vcov(model)))
    vcovSub <- vcov(model)[keep, keep]
    # Wald chi-squared statistic: b' V^{-1} b
    cat(as.numeric(coefs %*% solve(vcovSub, coefs)), "\n")
}
EdM
7

So there are two things here: (1) can the chi-squared test be used on data that cannot be represented as a contingency table? and (2) what is scikit-learn doing?

  1. Most commonly, when you search for the chi-squared test, you will find the test on the contingency table; however, the chi-squared test is more general, and the name refers to any test whose statistic follows the chi-squared distribution. With this, you can test the goodness of fit of many different models, not just the contingency table. The most important test here is the likelihood-ratio test, which is also a chi-squared test, because its test statistic asymptotically follows the chi-squared distribution. This test compares the goodness of fit of two nested models, and most of your standard tests can be expressed as comparing two nested models, so they are equivalent to some form of it. So the chi-squared test can be used to test pretty much anything (a minimal sketch follows this list).

  2. It is not uncommon for scikit-learn to implement some statistical procedures badly. This is because, as the developers say, it is a machine learning library and not a stats library, and the developers are machine learners rather than statisticians. And they can get a bit rude and defensive when someone points out that something doesn't work as a stats person would expect (at least in the past).
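Picking up point 1, here is a minimal sketch of a likelihood-ratio test run as a chi-squared test, comparing two nested logistic regressions on the iris data (the binary recoding of the species and the choice of nested models are made up for illustration):

import statsmodels.api as sm
from scipy.stats import chi2 as chi2_dist
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
y_bin = (y == 2).astype(int)  # virginica vs the rest (illustrative recoding)

# nested models: all four features vs the two sepal features only
full = sm.Logit(y_bin, sm.add_constant(X)).fit(disp=0)
reduced = sm.Logit(y_bin, sm.add_constant(X[:, :2])).fit(disp=0)

lr_stat = 2 * (full.llf - reduced.llf)  # likelihood-ratio statistic
df = full.df_model - reduced.df_model   # two extra parameters
p_value = chi2_dist.sf(lr_stat, df)     # asymptotic chi-squared p-value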

Anyway. In the docs for chi2 it says:

This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.

So it seems like the function implements the standard chi-squared test on contingency tables, and in that case its use in the iris tutorial would be wrong. I assume, but I am not going to check the code for it, that the function just calculates the chi-squared statistic according to the formula you would use for a contingency table, but plugging in whatever feature values it is given.
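This assumption is easy to check numerically. The small sketch below (variable names are mine) reproduces sklearn's chi2 scores on the continuous iris features by treating the feature values as if they were cell counts, mirroring the computation in the source quoted in another answer:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

X, y = load_iris(return_X_y=True)
stat, pval = chi2(X, y)

# "observed" = per-class sums of each feature, as if the values were counts
Y = np.eye(3)[y]                    # one-hot class indicators
observed = Y.T @ X                  # n_classes x n_features
# "expected" = class proportions times the per-feature totals
expected = Y.mean(axis=0)[:, None] * X.sum(axis=0)[None, :]
manual = ((observed - expected) ** 2 / expected).sum(axis=0)

print(np.allclose(stat, manual))    # True: the formula is applied to raw values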

rep_ho
  • An example of an sklearn implementation that irks a particular statistician... – Dave Oct 05 '22 at 14:11
  • Thanks for checking the docs in scikit-learn, which I try to avoid. (+1) – EdM Oct 05 '22 at 14:19
  • +1 Your second point hits the hammer on the screw. The chi-squared test, as applied to tables, cannot just be applied everywhere with continuous data. It requires counts. So what the sklearn algorithm does is falsely interpret the continuous values as counts. – Sextus Empiricus Oct 06 '22 at 09:45
3

There are other chi-squared tests, but the scikit-learn function chi2 performs Pearson's chi-squared test for contingency tables.

The function computes the expected and observed frequencies and then passes these to a helper (_chisquare, shown below) that computes the chi-squared statistic with the formula

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

This formula is specific to count data. It should not be used for continuous variables (and not for the frequencies mentioned in the source code comments and in the manual either; frequencies are fractions of counts).

The reason that you must use counts is that the statistic is based on a specific relationship between the mean and the variance of the counts for data that follow a multinomial distribution. With types of data other than counts, the relationship between the mean and the variance can be completely different.
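To make that relationship explicit: for multinomial counts under the null hypothesis, each cell count $O_i$ has mean $E_i$ and variance approximately equal to $E_i$, so each term $(O_i - E_i)/\sqrt{E_i}$ is approximately standard normal and the sum of their squares is approximately $\chi^2$ distributed. Rescaling counts to frequencies (or plugging in continuous measurements) breaks the variance $\approx$ mean relationship, so the statistic is no longer calibrated.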

def _chisquare(f_obs, f_exp):
    """Fast replacement for scipy.stats.chisquare.

    Version from https://github.com/scipy/scipy/pull/2525 with additional
    optimizations.
    """
    f_obs = np.asarray(f_obs, dtype=np.float64)
    k = len(f_obs)
    # Reuse f_obs for chi-squared statistics
    chisq = f_obs
    chisq -= f_exp
    chisq **= 2
    with np.errstate(invalid="ignore"):
        chisq /= f_exp
    chisq = chisq.sum(axis=0)
    return chisq, special.chdtrc(k - 1, chisq)


def chi2(X, y):
    """Compute chi-squared stats between each non-negative feature and class.

    This score can be used to select the n_features features with the
    highest values for the test chi-squared statistic from X, which must
    contain only non-negative features such as booleans or frequencies
    (e.g., term counts in document classification), relative to the classes.

    Recall that the chi-square test measures dependence between stochastic
    variables, so using this function "weeds out" the features that are the
    most likely to be independent of class and therefore irrelevant for
    classification.

    Read more in the :ref:`User Guide <univariate_feature_selection>`.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = (n_samples, n_features_in)
        Sample vectors.
    y : array-like, shape = (n_samples,)
        Target vector (class labels).

    Returns
    -------
    chi2 : array, shape = (n_features,)
        chi2 statistics of each feature.
    pval : array, shape = (n_features,)
        p-values of each feature.

    Notes
    -----
    Complexity of this algorithm is O(n_classes * n_features).

    See also
    --------
    f_classif: ANOVA F-value between label/feature for classification tasks.
    f_regression: F-value between label/feature for regression tasks.
    """
    # XXX: we might want to do some of the following in logspace instead for
    # numerical stability.
    X = check_array(X, accept_sparse='csr')
    if np.any((X.data if issparse(X) else X) < 0):
        raise ValueError("Input X must be non-negative.")

    Y = LabelBinarizer().fit_transform(y)
    if Y.shape[1] == 1:
        Y = np.append(1 - Y, Y, axis=1)

    observed = safe_sparse_dot(Y.T, X)          # n_classes * n_features

    feature_count = X.sum(axis=0).reshape(1, -1)
    class_prob = Y.mean(axis=0).reshape(1, -1)
    expected = np.dot(class_prob.T, feature_count)

    return _chisquare(observed, expected)

  • Great job. Although, as dipetkov pointed out in his answer, the chi-squared calculation on frequencies is wrong, it won't change the ranking of features, so for the purposes of feature selection it doesn't make a difference. Using it for continuous values is still wrong though. – rep_ho Oct 06 '22 at 11:50
  • @rep_ho I would disagree with that. Using the chi-squared test with frequencies is wrong. It only happens to give the same result when the conversion factor to get from counts to frequencies is the same for all cells and when it is being used for variable selection with a fixed number. But just because the result happens to be the same doesn't make it right. – Sextus Empiricus Oct 06 '22 at 16:10
  • But it is a feature selector in this case, so that's what it is used for. Of course, if you need a valid chi-squared statistic, then you should use something else. You can say that feature selection based on this formula is equivalent to feature selection using the correct formula. – rep_ho Oct 06 '22 at 17:05
  • @rep_ho But what if there are different total counts in the different features? Also, the motivation behind using a chi-squared statistic for feature selection is to be statistically principled about it: to rank the features by statistical significance. This can go wrong when frequencies are being used. It doesn't when all the total counts are the same, but that need not be the case and they can often differ. Feature selection also has other methods; use those for the different types of variables. – Sextus Empiricus Oct 06 '22 at 17:32
  • I think you are right. Obviously 1:4 should not have the same significance as 1000:4000. – rep_ho Oct 06 '22 at 20:06
1

This question is in part about programming: How does sklearn.feature_selection use the chi2 criterion to rank features? It's a good question, so not surprisingly it already has a good answer on Stack Overflow: How SelectKBest (chi2) calculates score?

So let's consider another interesting question, this one more appropriate for Cross Validated: In what cases is scikit-learn's chi2 criterion useful for feature ranking and selection, if at all?

The other answers discuss at length that Pearson's chi-squared test is a test (for goodness of fit, homogeneity, or independence) on contingency tables of counts. So if the features are not counts, then the chi2 criterion is not applicable.

A couple of observations that might help to explain the procedure as a heuristic rather than as an appropriate application of statistical theory. (The intended use of SelectKBest?)

  • The ranking of the features won't change if instead of raw counts we use relative frequencies, as long as the counts are normalized by the same total. Since SelectKBest chooses a fixed number of features, the result will be the same. So at least on paper the procedure is fine for counts or frequencies (with the same total), though not for positive continuous variables in general (see the sketch after this list).٭
  • The chi-squared statistic is more sensitive to small differences in large counts. Intuitively, if some features have a small total count and other features have a large total count (e.g., dummy variables vs term counts in document classification), the large-count features are more likely to be selected. The feature totals act a bit like feature weights.
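A quick numerical check of the first observation (a sketch; the iris features stand in for "counts" only to demonstrate that a common rescaling leaves the ranking unchanged):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

X, y = load_iris(return_X_y=True)
raw, _ = chi2(X, y)
scaled, _ = chi2(X / X.sum(), y)   # every cell normalized by the same grand total

print(np.array_equal(np.argsort(raw), np.argsort(scaled)))  # True: same ranking
print(raw / scaled)                # constant ratio, equal to the grand total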

So an appropriate use of the chi2 criterion is when the features are counts and of the same type; for example: all dummy variables or all term frequencies. (If there are different types of features, it may be better to SelectKBest from each type first and FeatureUnion the selected features, not the other way round.)
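A sketch of that per-type selection (the column indices are made up; ColumnTransformer is used here as a convenient stand-in, since FeatureUnion alone does not slice columns):

from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, chi2

# hypothetical layout: columns 0-9 are dummy variables, 10-99 are term counts
per_type = ColumnTransformer([
    ("dummies", SelectKBest(chi2, k=3), list(range(0, 10))),
    ("terms", SelectKBest(chi2, k=20), list(range(10, 100))),
])
# per_type.fit_transform(X, y) returns the 3 + 20 columns selected within each type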

The applicability of the procedure may still be limited though. As @EdM points out, automated feature selection, whatever the selection criterion, has many pitfalls.

٭ SelectKBest is designed to be generic, so it uses the scores, not the p-values, to rank and select the top k features. We can define our own scoring functions as well, as sketched below.
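For instance, a minimal custom score function (the name anova_score is mine; it mirrors the built-in f_classif and suits continuous features better than chi2):

import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import SelectKBest

def anova_score(X, y):
    # one-way ANOVA F-statistic and p-value for each feature
    classes = np.unique(y)
    stats = [f_oneway(*(X[y == c, j] for c in classes)) for j in range(X.shape[1])]
    return (np.array([s.statistic for s in stats]),
            np.array([s.pvalue for s in stats]))

selector = SelectKBest(score_func=anova_score, k=2)
# selector.fit_transform(X, y) keeps the 2 features with the largest F scores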

dipetkov
  • thanks for the update – rep_ho Oct 06 '22 at 11:51
  • "This is the definition" It should be noted that this is not just a definition like some arbitrary choice or convention. The Pearson's chi-squared test is essentially connected to count data because it is derived based on the assumption of categorical distributed data*. The ranking with the formula applied to frequencies instead of counts happens to be the same but it only occurs when the errors in all chi-squared values have the same relative error. This might happen in special cases but it makes the use of the chi-squared statistic still wrong. – Sextus Empiricus Oct 06 '22 at 16:43
  • *(this makes the requirements actually more strict than just counts data, and the process should also relate to an iid categorical distribution) – Sextus Empiricus Oct 06 '22 at 16:43
  • @SextusEmpiricus I disagree with the strong language about "right" and "wrong". SelectKBest is a heuristic rule, not an application of well founded statistical theory. And it's designed to do feature ranking and selection (whether that's a fine idea or not). To me at least it's not a discrepancy to use a heuristic rule with a heuristic criterion. Personally I won't use either. – dipetkov Oct 06 '22 at 16:52