8

I am implementing a pipeline that first selects the important features and then uses those same features to train my random forest classifier. Here is my code:

from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

m = ExtraTreesClassifier(n_estimators=10)
m.fit(train_cv_x, train_cv_y)
sel = SelectFromModel(m, prefit=True)
X_new = sel.transform(train_cv_x)
clf = RandomForestClassifier(n_estimators=5000)

model = Pipeline([('m', m), ('sel', sel), ('X_new', X_new), ('clf', clf)])
params = {'clf__max_features': ['auto', 'sqrt', 'log2']}

gs = GridSearchCV(model, params)
gs.fit(train_cv_x, train_cv_y)

So X_new holds the new features selected via SelectFromModel and sel.transform, and I then want to train my RF on these selected features.

I am getting the following error:

All intermediate steps should be transformers and implement fit and transform, ExtraTreesClassifier ...

Brad Solomon
Abdul Karim Khan

2 Answers

17

Like the traceback says: each step in your pipeline needs to have a fit() and transform() method, except the last, which only needs fit(). This is because a pipeline chains together transformations of your data at each step.

sel.transform(train_cv_x) returns an array of transformed data, not an estimator, so it doesn't meet this criterion.

In fact, based on what you're trying to do, you can leave this step out entirely: the ('sel', sel) step already does this transformation internally; that's why it's included in the pipeline.
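To see why the extra step is redundant, here is roughly what fitting a two-step pipeline does under the hood. This is a simplified sketch on toy data from make_classification, not your data, and the real Pipeline also handles things like parameter routing and caching:

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=100, n_features=20, random_state=0)
sel = SelectFromModel(ExtraTreesClassifier(n_estimators=10, random_state=0))
clf = RandomForestClassifier(n_estimators=10, random_state=0)

# Pipeline([('sel', sel), ('clf', clf)]).fit(X, y) boils down to:
Xt = sel.fit_transform(X, y)   # intermediate step: needs fit AND transform
clf.fit(Xt, y)                 # final step: only needs fit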

Secondly, ExtraTreesClassifier (the first step in your pipeline) doesn't have a transform() method either, as you can verify in its class docstring. Supervised learning models aren't made for transforming data; they're made for fitting to it and predicting from that fit.
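A quick way to check this yourself (a small verification I'm adding, assuming scikit-learn 0.19 or later, where the deprecated transform was removed from the tree ensembles):

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# A bare classifier exposes fit()/predict() but no transform();
# SelectFromModel wraps it and supplies the transform() the pipeline needs.
print(hasattr(ExtraTreesClassifier(), 'transform'))                   # False
print(hasattr(SelectFromModel(ExtraTreesClassifier()), 'transform'))  # True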

What type of classes are able to do transformations?

Without reading between the lines too much about what you're trying to do here, this would work for you:

  1. First split x and y using train_test_split. The test dataset produced by this is held out for final testing, and the train dataset within GridSearchCV's cross-validation will be further broken out into smaller train and validation sets.
  2. Build a pipeline that satisfies what your traceback is trying to tell you.
  3. Pass that pipeline to GridSearchCV, .fit() that grid search on X_train/y_train, then .score() it on X_test/y_test.

Roughly, that would look like this:

from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=444)

sel = SelectFromModel(ExtraTreesClassifier(n_estimators=10, random_state=444), 
                      threshold='mean')
clf = RandomForestClassifier(n_estimators=5000, random_state=444)

model = Pipeline([('sel', sel), ('clf', clf)])
params = {'clf__max_features': ['auto', 'sqrt', 'log2']}

gs = GridSearchCV(model, params)
gs.fit(X_train, y_train)

# How well do your hyperparameter optimizations generalize
# to unseen test data?
gs.score(X_test, y_test)
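If you then want to see which hyperparameters won and which features survived selection, you can inspect the fitted grid search (a short follow-up sketch; with the default refit=True, gs.best_estimator_ is the pipeline refit on all of X_train):

print(gs.best_params_)

# Boolean mask of the features SelectFromModel kept:
mask = gs.best_estimator_.named_steps['sel'].get_support()
print(mask.sum(), 'of', mask.size, 'features selected')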


Brad Solomon
  • If `ExtraTreesClassifier` doesn't have the transform method, then what can I use instead to select important features for my classification? @Brad – Abdul Karim Khan Feb 13 '18 at 02:14
  • You're already using `SelectFromModel`. Anything within [`sklearn.feature_selection`](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection) would work. – Brad Solomon Feb 13 '18 at 02:18
  • `SelectFromModel` takes an estimator object as its argument. Can I give it one? If I don't give it any estimator, as you suggested, which estimator does it use by default? Moreover, is there any way to know the exact features selected? – Abdul Karim Khan Feb 13 '18 at 03:03
  • Yes, sorry, you're right; I updated my answer with an estimator. As for the features, you can call `model.named_steps['sel'].get_support()`, I believe. – Brad Solomon Feb 13 '18 at 03:09
  • Is there any way to know the names and number of the selected features before they automatically go into my Random Forest Classifier? I want to replace the RF with a neural network, and a Keras model requires the number of input features to be known before training. – Abdul Karim Khan Feb 13 '18 at 03:16
  • @AbdulKarimKhan You can wrap the Random Forest Classifier in a custom transformer, similar to this [post](http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html), the section "Custom Transformers", second example. – Marcus V. Feb 13 '18 at 08:15
  • @MarcusV. I have posted a separate question [here](https://stackoverflow.com/questions/48762580/getting-names-and-number-of-selected-features-before-giving-to-a-classifier-in-s) with a detailed explanation. I would appreciate your answer there. – Abdul Karim Khan Feb 13 '18 at 08:43
0

This happens because every step you pass to a pipeline, except the last, must have both a fit and a transform method.

m = ExtraTreesClassifier(n_estimators=10)
m.fit(train_cv_x, train_cv_y)

Here the pipeline fails because m, an ExtraTreesClassifier, does not have a transform method.

So change the order of the pipeline and use a transformer for the first step, as sketched below.
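For example, wrapping the ExtraTreesClassifier in SelectFromModel (as in the answer above) gives the first step the transform method the pipeline requires; a minimal sketch:

from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    # SelectFromModel is a transformer: it implements both fit() and transform()
    ('sel', SelectFromModel(ExtraTreesClassifier(n_estimators=10))),
    # The final step only needs fit()
    ('clf', RandomForestClassifier(n_estimators=100)),
])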

Albin