Working with the scikit-learn library for Python, consider a linear regression model such as the elastic net (the `ElasticNet` class).
Further assume that one wishes to work with a normalised feature space, whatever the reason. Two options naturally come to mind:
1. Instantiate an `ElasticNet` object with the `normalize` attribute set to true (should one simultaneously set the `fit_intercept` attribute, he/she should make sure it is not set to false, in which case the `normalize` argument would be ignored; see the relevant docstring).
2. Create a `Pipeline` consisting of a `Normalizer` (pre-processing step) and an `ElasticNet` with the `normalize` attribute set to false.
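In code, the two options would look roughly as follows (a sketch assuming a scikit-learn version in which `ElasticNet` still accepts the `normalize` argument, as it did at the time of writing):

```python
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# Option 1: let the estimator handle normalisation internally
enet = ElasticNet(normalize=True, fit_intercept=True)

# Option 2: make normalisation an explicit pre-processing step
pipe = Pipeline([("norm", Normalizer()),
                 ("enet", ElasticNet(normalize=False))])
```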
The two approaches are similar. However, the user community seems to prefer the second option.
This is because, when cross-validation is applied to a pipeline object rather than to a model object, for instance through `cross_val_score(pipe, X, y)`, the feature-space preprocessing is part of the full learning process (i.e. it is refitted and applied within each CV fold).
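For instance, a minimal sketch (toy data from `make_regression`; same caveat about the `normalize` argument as above):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

X, y = make_regression(n_samples=100, n_features=10, random_state=0)

pipe = Pipeline([("norm", Normalizer()),
                 ("enet", ElasticNet(normalize=False))])

# Cross-validating the whole pipeline: in each fold, the Normalizer
# is (re)fitted on the training part only and then applied to the
# held-out part before scoring.
scores = cross_val_score(pipe, X, y, cv=5)
```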
Now, suppose that instead of working with the 'naive' elastic net, one were to work with an elastic net whose hyperparameters are determined by cross-validation (for instance the `ElasticNetCV` class).
In that case, option 2 above does not seem to be the right way to go. More specifically, since the normaliser is fitted on the whole training set, the internal cross-validation (hyperparameter selection) works with folds that have been normalised using data from outside the fold, which is a typical instance of data snooping.
In other words, the pipeline way of doing things seems fine for simple cross-validation, but could be dangerous for nested cross-validation, since it could produce optimistically biased cross-validation scores.
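For concreteness, the setup I am worried about is something like this sketch (hypothetical toy data again):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

X, y = make_regression(n_samples=100, n_features=10, random_state=0)

# The Normalizer step is fitted once on all of X before ElasticNetCV
# runs its internal CV, so the internal folds see data transformed by
# a step fitted on the full training set -- the suspected snooping.
pipe_cv = Pipeline([("norm", Normalizer()),
                    ("enet", ElasticNetCV(cv=5))])
pipe_cv.fit(X, y)
```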
Can someone confirm this or am I missing something?
Comments:

- (1) `normalize=True` is equivalent to using a `sklearn.preprocessing.Normalizer` on the features (i.e. $X_i / \lVert X_i \rVert$), not a `sklearn.preprocessing.StandardScaler` (zero mean and unit variance). (2) I don't get "the transformation is independent of other data": to me, `Normalizer` and `StandardScaler` are both transforms applied to some data $X$. Data snooping appears when you later try to perform CV on $X$ (contaminated folds), with either one or the other. – Quantuple Jan 26 '17 at 15:06
- `sklearn.linear_model.base._preprocess_data` centers the data and then normalizes it, which is equivalent to using `StandardScaler(with_std=False)` followed by `Normalizer`. Fixing the answer. – Danica Jan 26 '17 at 15:24
- With `Normalizer`, the transformation is applied to each point independently: $X_i \mapsto X_i / \lVert X_i \rVert$ no matter what other points are in the training set. So doing it on the full dataset or on a fold at a time is exactly equivalent, other than the latter being slightly slower. When you center first, like `ElasticNet(normalize=True)` does, it then does depend on the dataset you do it as a part of. – Danica Jan 26 '17 at 15:26
- […] (a) elastic net with `normalize=False` on the pre-processed data, (b) elastic net with `normalize=True` directly on the raw data. Of course I've checked that the raw data are not centred. – Quantuple Jan 26 '17 at 15:46
- […] with `normalize=True` the `ElasticNetCV` is subject to data snooping. – Quantuple Jan 26 '17 at 17:55
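To illustrate the point made in the last two Danica comments, here is a quick check (my own sketch, not from the thread) that `Normalizer` is stateless, i.e. each row is rescaled by its own norm, so which data it was "fitted" on cannot matter:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

rng = np.random.RandomState(0)
X = rng.randn(6, 3)

# fit() learns nothing for Normalizer: each row is divided by its
# own L2 norm, independently of any other row.
full = Normalizer().fit(X).transform(X)
fold = Normalizer().fit(X[:3]).transform(X)  # "fitted" on a subset only

assert np.allclose(full, fold)
```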