I am trying to include steps in a pipeline that transform the data, e.g. balancing the dataset. The pipeline is intended to be used with cross_val_score. However, I want some transformations to be applied only to the training fold, since it may make no sense to apply them to the test fold before scoring (resampling the data, for example). In the code below, I observed that cross_val_score calls transform twice for each partition (once for the training fold and once for the test fold).

Is there a built-in solution for this in scikit-learn, so that I can avoid writing an ad-hoc transformer with patches such as counting calls?

from sklearn import linear_model
from sklearn.base import BaseEstimator, TransformerMixin
# make_blobs is imported from sklearn.datasets; the samples_generator module is gone in newer releases
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline

class CustomTransformer(BaseEstimator, TransformerMixin):
    _calls = 0

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        '''Pass-through transformer, just for illustrating the problem.'''
        self._calls += 1
        print('Shape: {0}  call: {1}'.format(X.shape[0], self._calls))
        return X

X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=0)

pipe = Pipeline(steps=[('ct', CustomTransformer()),
                       ('logistic', linear_model.LogisticRegression(solver='lbfgs'))])

# random_state only has an effect when shuffle=True, so it is omitted here
scores = cross_val_score(pipe, X, y, cv=KFold(n_splits=4))

Thanks!

jias

1 Answer

I don't know if your custom transformer works, but it looks like you are doing things correctly. For each fold, cross_val_score fits the pipeline on the training split (which runs transform on the training data) and then runs transform again on the test split before the model scores it. That is why you see two transform calls per partition.
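
To make the two calls per partition visible, here is a minimal sketch that writes the per-fold work as an explicit loop. RecordingTransformer is a hypothetical stand-in for the transformer in the question, and the loop only approximates what cross_val_score does internally; it is not the library's actual code.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline


class RecordingTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical pass-through transformer that records the size of each input."""

    def fit(self, X, y=None):
        self.seen_ = []  # start a fresh record on every fit
        return self

    def transform(self, X):
        self.seen_.append(X.shape[0])
        return X


X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=0)
pipe = Pipeline([("rec", RecordingTransformer()),
                 ("logistic", LogisticRegression())])

calls_per_fold = []
for train_idx, test_idx in KFold(n_splits=4).split(X):
    fold_pipe = clone(pipe)                     # unfitted copy for this fold
    fold_pipe.fit(X[train_idx], y[train_idx])   # fit, then transform the training split
    fold_pipe.score(X[test_idx], y[test_idx])   # transform the test split, then score
    calls_per_fold.append(list(fold_pipe.named_steps["rec"].seen_))

print(calls_per_fold)
```

Each inner list holds the row counts passed to transform in one fold: first the training split, then the test split. A plain transform step has no way to tell the two apart, which is why a resampling step does not fit into sklearn.pipeline.Pipeline as an ordinary transformer.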

  • Thanks for your response. However, it seems that using imblearn.pipeline.Pipeline instead of sklearn.pipeline.Pipeline solves my problem. – jias Nov 19 '19 at 20:41
  • Hi there, does imblearn.pipeline.Pipeline work even with custom transformers (ones that have .fit() and .transform() methods)? – JeromeK Nov 19 '20 at 15:56
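
For readers who want to see the idea behind train-only resampling without an extra dependency, the sketch below uses only scikit-learn and NumPy. It is an illustration under stated assumptions, not imblearn's implementation: oversample_minority is a hypothetical helper, the imbalanced class sizes (60/30/10) are made up for the demo, and StratifiedKFold is swapped in for KFold so every fold contains all classes.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def oversample_minority(X, y, rng):
    """Hypothetical helper: keep all rows and add random duplicates of the
    smaller classes until every class matches the largest one."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        extra = rng.choice(members, size=target - n, replace=True)
        parts.append(np.concatenate([members, extra]))
    idx = np.concatenate(parts)
    return X[idx], y[idx]


# Assumed toy data: three classes with 60, 30 and 10 samples.
X, y = make_blobs(n_samples=[60, 30, 10], n_features=2, random_state=0)
rng = np.random.default_rng(0)
model = LogisticRegression()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores, train_sizes, test_sizes = [], [], []
for train_idx, test_idx in cv.split(X, y):
    # Resample the training fold only; the test fold is scored untouched.
    X_train, y_train = oversample_minority(X[train_idx], y[train_idx], rng)
    fold_model = clone(model).fit(X_train, y_train)
    scores.append(fold_model.score(X[test_idx], y[test_idx]))
    train_sizes.append(len(y_train))
    test_sizes.append(len(test_idx))

print(train_sizes, test_sizes)
```

The training folds grow because of the duplicated minority rows, while the test folds keep their original size and class proportions, so the scores reflect the real class distribution.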