32

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df['class'] = data.target

X = np.matrix(df.loc[:, [0, 1, 2, 3]])
y = np.array(df['class'])

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

To merge these predictions back with the original df, I try this:

df['y_hats'] = y_hats

But that raises:

ValueError: Length of values does not match length of index

I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, since the y_hats array is zero-indexed and seemingly all information about which rows were included in the X_test and y_test is lost? Or will I be relegated to splitting dataframes into train-test first, and then building feature matrices? I'd like to just fill the rows included in train with np.nan values in the dataframe.

blacksite
  • 1
    I believe that `sklearn` supports `DataFrames` and `Series` as args to `train_test_split` so it should work by passing a sub-section of your df, besides what is returned are the indices so you can use these to index back into your df using `iloc`, see docs: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html – EdChum Nov 21 '16 at 20:56
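The suggestion in this comment can be sketched as follows. It is a minimal example, not the asker's actual pipeline: it assumes the feature columns are named with `iris.feature_names` (the question uses integer column names) and fixes `random_state` only for reproducibility. Because the returned pieces keep the original row labels, the sketch uses label-based `loc` rather than `iloc` to write the predictions back.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# Assumption: string column names, to keep scikit-learn's feature-name checks simple.
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['class'] = iris.target

# Passing pandas objects means each returned piece keeps its original index labels.
X_train, X_test, y_train, y_test = train_test_split(
    df[iris.feature_names], df['class'], train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Label-based assignment: only the test rows receive a value,
# every other row in the new column is left as NaN.
df.loc[X_test.index, 'y_hats'] = model.predict(X_test)
```

Rows that were used for training end up with `NaN` in `y_hats`, which is what the question asks for.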

8 Answers

35

Your y_hats array is only as long as the test data (20% of the rows) because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by comparing the predictions on X_test against the true values in y_test), you should rerun predict on the full dataset (X). Add these two lines to the bottom:

y_hats2 = model.predict(X)

df['y_hats'] = y_hats2

EDIT: per your comment, here is an updated version that returns the dataset with the predictions appended only on the rows that were in the test dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df_class = pd.DataFrame(data = data.target)

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

# work on a copy so the assignment avoids a possible SettingWithCopyWarning
y_test = y_test.copy()
y_test['preds'] = y_hats

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
flyingmeatball
  • That doesn't really solve my issue of merging only those data that were in `test` to begin with. If you merge back predictions for every row, how do you know which were in the original `test` matrices? For all I know, I could run the lines you added, but have no idea whether the model already saw some of the rows in X (therefore kind of invalidating the whole purpose of train-test). – blacksite Nov 21 '16 at 21:07
  • 2
    @flyingmeatball hi I am trying to do the exact same thing but when you have the y_hats stored as a variable it becomes a numpy array rather then a dataframe that needs to get converted to pandas to do the merge. At that point, the merge on indices can not be done. I am not sure what am I missing? – bernando_vialli May 04 '18 at 15:56
  • 1
    y_test['preds'] = y_hats causing this error [ValueError: Wrong number of items passed 2, placement implies 1] – asmgx Jun 20 '19 at 23:05
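For the concern raised in the comments above (knowing which rows were in the test split when X and y are plain matrices), one possible approach is to split the row labels together with the data. This is a sketch under assumptions: `idx_train`, `idx_test` and the `in_test` column are names introduced here for illustration, and `random_state` is fixed only for reproducibility. `train_test_split` accepts any number of equal-length arrays and splits them all the same way.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
df = pd.DataFrame(iris.data)
df['class'] = iris.target

# Plain arrays, as in the question: they carry no index information themselves.
X = df[[0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# Split the row labels alongside X and y so they travel with the data.
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, df.index.to_numpy(), train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Start with NaN everywhere, then fill in only the rows that were in the test split.
df['y_hats'] = np.nan
df.loc[idx_test, 'y_hats'] = model.predict(X_test)

# Hypothetical helper column: flags the rows the model did not see during training.
df['in_test'] = df.index.isin(idx_test)
```

Any preprocessing applied to X before the split (such as the normalization mentioned in the question) does not affect this, since only the labels are tracked.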
4

I had almost the same problem, and I fixed it this way:

...
.
.
.
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

y_hats = model.predict(X_test)

y_hats  = pd.DataFrame(y_hats)

df_out = X_test.reset_index()
df_out["Actual"] = y_test.reset_index()["Columns_Name"]
df_out["Prediction"] = y_hats.reset_index()[0]


y_test['preds'] = y_hats

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
asmgx
1

You can create a y_hats DataFrame that copies the index from X_test, then merge it with the original data.

y_hats_df = pd.DataFrame(data = y_hats, columns = ['y_hats'], index = X_test.index.copy())
df_out = pd.merge(df, y_hats_df, how = 'left', left_index = True, right_index = True)

Note that the left join keeps the training rows as well (with NaN in the y_hats column). Omitting the 'how' parameter falls back to the default inner join, which returns only the test rows.
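The difference between the two joins can be seen on a small toy frame. This is only an illustration: the five-row `df` and the choice of rows 1 and 3 as the "test" rows are made up, not taken from the question's data.

```python
import pandas as pd

df = pd.DataFrame({'feature': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Hypothetical predictions for the rows labelled 1 and 3 (the pretend test split).
y_hats_df = pd.DataFrame({'y_hats': [0, 1]}, index=[1, 3])

# Left join: all five rows are kept, y_hats is NaN where no prediction exists.
left = pd.merge(df, y_hats_df, how='left', left_index=True, right_index=True)

# Default (inner) join: only the rows present in both frames are kept.
inner = pd.merge(df, y_hats_df, left_index=True, right_index=True)

print(left)
print(inner)
```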

Adam Milecki
1

Try this:

y_hats2 = model.predict(X)
df[['y_hats']] = y_hats2
James Mchugh
  • Welcome to Stack Overflow. Thank you for contributing an answer. I think your answer could be further improved using this [article](https://stackoverflow.com/help/how-to-answer). Any chance you can add more context to this? – James Mchugh Dec 08 '19 at 03:11
0

You can probably make a new dataframe and add to it the test data along with the predicted values:

data['y_hats'] = y_hats
data.to_csv('data1.csv')
Nidhi Garg
  • data['y_hats'] = y_hats causing this error [ValueError: Wrong number of items passed 2, placement implies 1] – asmgx Jun 20 '19 at 23:10
0
predicted = m.predict(X_valid)
predicted_df = pd.DataFrame(data=predicted, columns=['y_hat'], 
                            index=X_valid.index.copy())
df_out = pd.merge(X_valid, predicted_df, how ='left', left_index=True, 
                 right_index=True)
Ayoub ZAROU
Reshma2k
0

This worked well for me. It maintains the indexing positions.

pred_prob = model.predict(X_test) # calculate prediction probabilities
pred_class  = np.where(pred_prob >0.5, "Yes", "No") #for binary(Yes/No) category
predictions = pd.DataFrame(pred_class, columns=['Prediction'])
my_new_df = pd.concat([my_old_df, predictions], axis =1)
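One caveat worth checking with this approach: `pd.concat(..., axis=1)` aligns rows by index label, not by position. The sketch below uses made-up data (a three-row `X_test` with shuffled labels and hypothetical predictions) to show what happens when the predictions frame keeps its default 0..n-1 index, and how passing `index=X_test.index` keeps the rows lined up.

```python
import numpy as np
import pandas as pd

# Hypothetical test split whose row labels are not 0..n-1, as after a shuffle.
X_test = pd.DataFrame({'feature': [10.0, 20.0, 30.0]}, index=[7, 2, 5])
pred_class = np.array(['Yes', 'No', 'Yes'])  # made-up predictions, in X_test row order

# Default index (0, 1, 2) on the predictions: concat matches labels, so the
# result has extra rows and the prediction at label 2 belongs to another row.
misaligned = pd.concat(
    [X_test, pd.DataFrame(pred_class, columns=['Prediction'])], axis=1
)

# Reusing X_test's index keeps each prediction next to the row it was made for.
aligned = pd.concat(
    [X_test, pd.DataFrame(pred_class, columns=['Prediction'], index=X_test.index)],
    axis=1,
)

print(misaligned)
print(aligned)
```

If the old frame already has a default 0..n-1 index in the same order as the predictions, the two variants give the same result.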
user115916
-3

You can also use:

y_hats = model.predict(X)

# model.predict returns a NumPy array, which has no reset_index(), so assign it directly
df['y_hats'] = y_hats
api55
ambar003