Merging results from model.predict() with original pandas DataFrame?

Question

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df['class'] = data.target

X = np.matrix(df.loc[:, [0, 1, 2, 3]])
y = np.array(df['class'])

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

To merge these predictions back with the original df, I try this:

df['y_hats'] = y_hats

But that raises:

ValueError: Length of values does not match length of index

I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in the X_test and y_test is lost? Or will I be relegated to splitting dataframes into train-test first, and then building feature matrices? I'd like to just fill the rows included in train with np.nan values in the dataframe.

I believe that `sklearn` accepts `DataFrames` and `Series` as arguments to `train_test_split`, so passing a subsection of your df should work. The returned splits keep their original index, so you can use it to index back into your df with `loc` (or `iloc`, as long as df has the default 0..n-1 index); see the docs: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html — EdChum, Nov 21 '16 at 20:56
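A minimal sketch of the idea in the comment above, assuming the iris setup from the question: when DataFrame and Series objects are passed to `train_test_split`, each split keeps its original row labels, so the predictions can be written straight into the matching rows. The `random_state` values are only there to make the sketch reproducible.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
df = pd.DataFrame(iris.data)
df["class"] = iris.target

# Passing DataFrame/Series objects keeps the original row labels on every split.
X_train, X_test, y_train, y_test = train_test_split(
    df[[0, 1, 2, 3]], df["class"], train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Write predictions only into the test rows; every other row is left as NaN.
df.loc[X_test.index, "y_hats"] = model.predict(X_test)
```

After this runs, `df["y_hats"]` holds a prediction for each test row and NaN for each train row, which is the layout the question asks for.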

flyingmeatball · Accepted Answer · 2016-11-21T21:11:57.127

Your y_hats array is only as long as the test data (20%) because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by comparing the predictions on X_test against the true values in y_test), you should rerun predict on the full dataset (X). Add these two lines to the bottom:

y_hats2 = model.predict(X)

df['y_hats'] = y_hats2
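If predictions are stored for every row like this, it may help to also record which rows were held out, so that evaluation can still be restricted to unseen data. The following is only a sketch under the assumption that the split was made on DataFrame objects; the `in_test` column name is made up for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
df = pd.DataFrame(iris.data)
df["class"] = iris.target

features = df[[0, 1, 2, 3]]
X_train, X_test, y_train, y_test = train_test_split(
    features, df["class"], train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Predict for every row, then flag the rows the model never saw in training.
df["y_hats"] = model.predict(features)
df["in_test"] = df.index.isin(X_test.index)  # hypothetical helper column

# Score only the held-out rows.
held_out = df[df["in_test"]]
test_accuracy = (held_out["y_hats"] == held_out["class"]).mean()
```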

EDIT: per your comment, here is an updated version that returns the dataset with the predictions appended to the rows that were in the test dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# put the outcome variable in its own dataframe; it shares df's index
df_class = pd.DataFrame(data = data.target)

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

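# work on a copy so the column assignment does not trigger SettingWithCopyWarning
y_test = y_test.copy()
y_test['preds'] = y_hats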

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)

That doesn't really solve my issue of merging only those data that were in `test` to begin with. If you merge back predictions for every row, how do you know which were in the original `test` matrices? For all I know, I could run the lines you added, but have no idea whether the model already saw some of the rows in X (therefore kind of invalidating the whole purpose of train-test). — blacksite, Nov 21 '16 at 21:07
@flyingmeatball Hi, I am trying to do the exact same thing, but when y_hats is stored as a variable it is a numpy array rather than a dataframe, so it needs to be converted to pandas before the merge. At that point, the merge on indices cannot be done. What am I missing? — bernando_vialli, May 04 '18 at 15:56
`y_test['preds'] = y_hats` raises this error: [ValueError: Wrong number of items passed 2, placement implies 1] — asmgx, Jun 20 '19 at 23:05
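For the case raised in these comments, where the features are already a NumPy array (for example after normalizing the whole matrix), one possible approach is to split the DataFrame index together with X and y, since `train_test_split` accepts any number of equal-length arrays. This is a sketch, not the accepted answer's method; `StandardScaler` is only an assumed stand-in for whatever normalization is actually applied.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
df = pd.DataFrame(iris.data)
df["class"] = iris.target

# Features as a plain NumPy array, normalized before splitting (assumed step).
X = StandardScaler().fit_transform(df[[0, 1, 2, 3]].to_numpy())
y = df["class"].to_numpy()

# The row labels are shuffled and split in exactly the same way as X and y.
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, df.index.to_numpy(), train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Re-attach the row labels to the NumPy predictions, then align on the index.
y_hats = pd.Series(model.predict(X_test), index=idx_test, name="y_hats")
df_out = df.join(y_hats)  # left join: train rows get NaN
```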

score 4 · Answer 2 · answered Jun 20 '19 at 23:40

I had (almost) the same problem and fixed it this way:

...
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

y_hats = model.predict(X_test)

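# wrap the predictions in a DataFrame (default 0..n-1 index, single column 0)
y_hats  = pd.DataFrame(y_hats)

# reset_index() moves the original row labels into an "index" column,
# so the three pieces line up by position
df_out = X_test.reset_index()
df_out["Actual"] = y_test.reset_index()["Columns_Name"]  # use your target column's name here
df_out["Prediction"] = y_hats.reset_index()[0]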

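
# The two lines from the accepted answer are left disabled here: y_hats is now
# a DataFrame with a 0..n-1 index, so the assignment would be aligned by label
# rather than by position, and the merge would overwrite df_out.
# y_test['preds'] = y_hats
# df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)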
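To illustrate the positional approach in this answer, here is a small sketch on toy data. The frames, the column name `target` (standing in for "Columns_Name"), and the values are all made up; the point is only that `reset_index()` keeps the original row labels in an "index" column, which can be restored afterwards.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for a shuffled test split (labels 7, 2, 5 from the original df).
X_test = pd.DataFrame({"f": [0.1, 0.2, 0.3]}, index=[7, 2, 5])
y_test = pd.DataFrame({"target": [1, 0, 1]}, index=[7, 2, 5])
y_hats = np.array([1, 1, 1])  # predictions, in the same row order as X_test

# The old labels move into an "index" column; rows are now numbered 0..n-1.
df_out = X_test.reset_index()
df_out["Actual"] = y_test["target"].to_numpy()
df_out["Prediction"] = y_hats

# Restore the original labels so df_out can be joined back to the full df.
df_out = df_out.set_index("index")
```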

score 1 · Answer 3 · answered Oct 29 '19 at 18:24

You can create a y_hat dataframe copying indices from X_test then merge with the original data.

y_hats_df = pd.DataFrame(data = y_hats, columns = ['y_hats'], index = X_test.index.copy())
df_out = pd.merge(df, y_hats_df, how = 'left', left_index = True, right_index = True)

Note that the left join keeps the train rows as well (with NaN in y_hats). Omitting the 'how' parameter falls back to the default inner join, which keeps only the test rows.
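The difference between the two joins can be seen on a toy example. The frames below are made-up stand-ins: five original rows, of which rows 1 and 3 are assumed to have been in the test split.

```python
import pandas as pd

df = pd.DataFrame({"feature": [10, 20, 30, 40, 50]})
y_hats_df = pd.DataFrame({"y_hats": [0, 1]}, index=[1, 3])

# Left join: all five rows survive, train rows get NaN in y_hats.
left = pd.merge(df, y_hats_df, how="left", left_index=True, right_index=True)

# Default (inner) join: only the rows present in both frames survive.
inner = pd.merge(df, y_hats_df, left_index=True, right_index=True)
```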

score 1 · Answer 4 · edited Dec 08 '19 at 04:15

Try this:

y_hats2 = model.predict(X)
df['y_hats'] = y_hats2

answered Dec 08 '19 at 00:20 by PATRICK KANYI · edited Dec 08 '19 at 04:15 by James Mchugh

Welcome to Stack Overflow. Thank you for contributing an answer. I think your answer could be further improved using this [article](https://stackoverflow.com/help/how-to-answer). Any chance you can add more context to this? – James Mchugh Dec 08 '19 at 03:11

score 0 · Answer 5 · answered Apr 28 '19 at 20:35

You can probably make a new dataframe and add to it the test data along with the predicted values:

data['y_hats'] = y_hats
data.to_csv('data1.csv')
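A sketch of what building such a new DataFrame could look like, assuming X_test is a NumPy array and y_hats has one prediction per test row. The values and the column names `feature_0` and `feature_1` are invented for illustration, and the CSV is written to an in-memory buffer rather than to 'data1.csv'.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-ins for the test features and their predictions.
X_test = np.array([[5.1, 3.5], [6.2, 2.9]])
y_hats = np.array([0, 1])

# New frame holding only the test rows, with the predictions as an extra column.
results = pd.DataFrame(X_test, columns=["feature_0", "feature_1"])
results["y_hats"] = y_hats

# Write to an in-memory buffer; pass a file path instead to save to disk.
buffer = io.StringIO()
results.to_csv(buffer, index=False)
```

This works because both arrays have the same length; assigning y_hats to an object with a different number of rows is what fails.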

answered Apr 28 '19 at 20:35 by Nidhi Garg

data['y_hats'] = y_hats causing this error [ValueError: Wrong number of items passed 2, placement implies 1] – asmgx Jun 20 '19 at 23:10

score 0 · Answer 6 · edited Aug 13 '20 at 13:52

predicted = m.predict(X_valid)
predicted_df = pd.DataFrame(data=predicted, columns=['y_hat'], 
                            index=X_valid.index.copy())
df_out = pd.merge(X_valid, predicted_df, how ='left', left_index=True, 
                 right_index=True)
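Since `predict` returns one value per row of X_valid, in the same order, a shorter alternative may be `DataFrame.assign`, which attaches the array by position and keeps X_valid's index. The sketch below uses made-up data and compares the result with the merge above.

```python
import numpy as np
import pandas as pd

# Toy validation frame with non-contiguous labels, plus its predictions.
X_valid = pd.DataFrame({"feature": [1.5, 2.5, 3.5]}, index=[12, 3, 40])
predicted = np.array([0, 1, 1])

# assign() returns a new frame; the array is attached row by row, by position.
df_out = X_valid.assign(y_hat=predicted)

# The merge-based version from the answer, for comparison.
predicted_df = pd.DataFrame(data=predicted, columns=["y_hat"],
                            index=X_valid.index.copy())
merged = pd.merge(X_valid, predicted_df, how="left", left_index=True,
                  right_index=True)
```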

answered Aug 13 '20 at 11:05 by Reshma2k · edited Aug 13 '20 at 13:52 by Ayoub ZAROU

score 0 · Answer 7 · answered Mar 22 '21 at 14:36

This worked well for me. It maintains the indexing positions, provided my_old_df has the default 0..n-1 index and its rows are in the same order as X_test.

pred_prob = model.predict(X_test) # calculate prediction probabilities
pred_class  = np.where(pred_prob >0.5, "Yes", "No") #for binary(Yes/No) category
predictions = pd.DataFrame(pred_class, columns=['Prediction'])
my_new_df = pd.concat([my_old_df, predictions], axis =1)
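One thing to watch with this approach is that `pd.concat(..., axis=1)` aligns on index labels, not on row position. The sketch below uses made-up data to show what can happen when my_old_df carries non-default labels (as a shuffled test split would), and how reusing its index avoids the problem.

```python
import numpy as np
import pandas as pd

# Toy frame whose labels are not 0..n-1, as after a shuffled split.
my_old_df = pd.DataFrame({"feature": [0.2, 0.9, 0.7]}, index=[4, 8, 15])
pred_prob = np.array([0.2, 0.9, 0.7])
pred_class = np.where(pred_prob > 0.5, "Yes", "No")

# Default labels 0, 1, 2 share nothing with 4, 8, 15, so concat builds the
# union of both indexes and pads the gaps with NaN.
misaligned = pd.concat(
    [my_old_df, pd.DataFrame(pred_class, columns=["Prediction"])], axis=1
)

# Reusing my_old_df's index lines the predictions up row for row.
predictions = pd.DataFrame(pred_class, columns=["Prediction"],
                           index=my_old_df.index)
aligned = pd.concat([my_old_df, predictions], axis=1)
```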

score -3 · Answer 8 · edited Dec 20 '17 at 14:30

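You can also use

y_hats = model.predict(X)

# model.predict returns a NumPy array, which has no reset_index();
# wrap it in a Series that shares df's index instead
df['y_hats'] = pd.Series(y_hats, index=df.index)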

answered Dec 20 '17 at 13:58 by ambar003 · edited Dec 20 '17 at 14:30 by api55
