Working with the scikit-learn library for Python, consider a linear regression model such as the elastic net (the `ElasticNet` class).
Further assume that one wishes to work with a normalised feature space, whatever the reason. Two options naturally come to mind:
1. Instantiate an `ElasticNet` object with the `normalize` attribute set to true (should one simultaneously set the `fit_intercept` attribute, he/she should make sure it is not set to false, in which case the `normalize` argument would be ignored; see the relevant docstring).
2. Create a `Pipeline` consisting of a `Normalizer` (pre-processing step) and an `ElasticNet` with the `normalize` attribute set to false.
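In code, the two options would look roughly as follows (a sketch assuming a scikit-learn version in which `ElasticNet` still accepts the `normalize` argument, as it did at the time of writing):

```python
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# Option 1: let the estimator handle normalisation internally
enet = ElasticNet(normalize=True, fit_intercept=True)

# Option 2: make normalisation an explicit pre-processing step
pipe = Pipeline([("norm", Normalizer()),
                 ("enet", ElasticNet(normalize=False))])
```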
The two approaches are similar. However, the user community seems to prefer the second option.
This is because, when cross-validation is applied to a pipeline object rather than to a model object, for instance through `cross_val_score(pipe, X, y)`, the feature-space preprocessing is part of the full learning process (i.e. it is refitted and applied within each CV fold).
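For instance, a minimal sketch (toy data from `make_regression`; same caveat about the `normalize` argument as above):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

X, y = make_regression(n_samples=100, n_features=10, random_state=0)

pipe = Pipeline([("norm", Normalizer()),
                 ("enet", ElasticNet(normalize=False))])

# Cross-validating the whole pipeline: in each fold, the Normalizer
# is (re)fitted on the training part only and then applied to the
# held-out part before scoring.
scores = cross_val_score(pipe, X, y, cv=5)
```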
Now, suppose that instead of working with the 'naive' elastic net, one were to work with an elastic net whose hyperparameters are determined by cross-validation (for instance the `ElasticNetCV` class).
In that case, option 2 above does not seem to be the right way to go. More specifically, since the normaliser is fitted on the whole training set, the internal cross-validation (hyperparameter selection) works with folds that have been normalised using data from outside the fold, which is a typical instance of data snooping.
In other words, the pipeline way of doing things seems fine for simple cross-validation, but could be dangerous for nested cross-validation, since it could produce optimistically biased cross-validation scores.
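For concreteness, the setup I am worried about is something like this sketch (hypothetical toy data again):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

X, y = make_regression(n_samples=100, n_features=10, random_state=0)

# The Normalizer step is fitted once on all of X before ElasticNetCV
# runs its internal CV, so the internal folds see data transformed by
# a step fitted on the full training set -- the suspected snooping.
pipe_cv = Pipeline([("norm", Normalizer()),
                    ("enet", ElasticNetCV(cv=5))])
pipe_cv.fit(X, y)
```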
Can someone confirm this or am I missing something?
Comments:

- (1) `normalize=True` is equivalent to using a `sklearn.preprocessing.Normalizer` on the features (i.e. $X_i / \lVert X_i \rVert$), not a `sklearn.preprocessing.StandardScaler` (zero mean and unit variance). (2) I don't get "the transformation is independent of other data": to me, `Normalizer` and `StandardScaler` are both transforms applied to some data $X$. Data snooping appears when you later try to perform CV on $X$ (contaminated folds), with either one or the other. – Quantuple Jan 26 '17 at 15:06
- `sklearn.linear_model.base._preprocess_data` centers the data and then normalizes it, which is equivalent to using `StandardScaler(with_std=False)` followed by `Normalizer`. Fixing the answer. – Danica Jan 26 '17 at 15:24
- With `Normalizer`, the transformation is applied to each point independently: $X_i \mapsto X_i / \lVert X_i \rVert$ no matter what other points are in the training set. So doing it on the full dataset or on a fold at a time is exactly equivalent, other than the latter being slightly slower. When you center first, like `ElasticNet(normalize=True)` does, it then does depend on the dataset you do it as a part of. – Danica Jan 26 '17 at 15:26
- […] (a) elastic net with `normalize=False` on the pre-processed data, (b) elastic net with `normalize=True` directly on the raw data. Of course I've checked that the raw data are not centred. – Quantuple Jan 26 '17 at 15:46
- […] with `normalize=True` the `ElasticNetCV` is subject to data snooping. – Quantuple Jan 26 '17 at 17:55
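To illustrate the point made in the last two Danica comments, here is a quick check (my own sketch, not from the thread) that `Normalizer` is stateless, i.e. each row is rescaled by its own norm, so which data it was "fitted" on cannot matter:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

rng = np.random.RandomState(0)
X = rng.randn(6, 3)

# fit() learns nothing for Normalizer: each row is divided by its
# own L2 norm, independently of any other row.
full = Normalizer().fit(X).transform(X)
fold = Normalizer().fit(X[:3]).transform(X)  # "fitted" on a subset only

assert np.allclose(full, fold)
```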