
Please help me. I am trying to give as much information as I can about my query.

I have a query regarding the preprocessing functions in the scikit-learn library. My data set is divided into three parts: train, test, and holdout.

I am using the StandardScaler function for preprocessing the data.

### preprocessing of data

scaler = preprocessing.StandardScaler().fit(x_train)

X_train = scaler.transform(x_train)

X_holdout = scaler.transform(x_holdout) 

X_test = scaler.transform(x_test)

I applied the same scaler to the train, test, and holdout data, as discussed in the scikit-learn documentation, and I am getting the required results. But when I try to predict on new data with a saved model, I notice two things:

1) When I use the same scaler obtained from the training data on the new data, I get good accuracy.

# save the model to disk
import pickle

filename = 'mySVM_100ms_rbf_model.sav'

pickle.dump(svclassifier_rbf, open(filename, 'wb'))

# load the model from disk
loaded_model_rbf1 = pickle.load(open(filename, 'rb'))

# Predicting the new data with the saved model
df2 = pd.read_csv("A5.csv", na_values=['NA','?'])  ## reading new data
## Preprocessing using the previous scaler
df_2 = scaler.transform(df2)  # with the previous scaler
## Prediction with the trained scaler
newdata_pred_rbf1 = loaded_model_rbf1.predict(df_2)  ## 90 % accuracy

2) But after closing my program, when I reload the trained model and try to predict on the new data, I get 1 percent accuracy.

## New scaler, because the previous one was not saved before closing the program
scaler = preprocessing.StandardScaler().fit(df2)  ## preprocessing done again

df_2 = scaler.transform(df2)  # with the new scaler
## Prediction with the new scaler
newdata_pred_rbf1 = loaded_model_rbf1.predict(df_2)  ## 1 % accuracy

When I predict the new data using the scaler fitted on the training data I get good results, but after I close the program and load the saved model again, I have to redo the preprocessing on my new data for prediction.

So please explain: what mistake am I making?

Shivam Soni

1 Answer


sklearn also includes sklearn.externals.joblib, which accomplishes the same thing as pickle but is better optimized for sklearn objects (in recent scikit-learn versions this has been removed; import joblib directly instead). It can be used to save your scaler as well as your model.

from sklearn.externals import joblib

scaler = preprocessing.StandardScaler().fit(x_train)

# Save it
scaler_file = "my_scaler.save"
joblib.dump(scaler, scaler_file) 

# Load it 
scaler = joblib.load(scaler_file) 

Then the same idea for the model; just change the file names.
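Putting it together, a minimal end-to-end sketch with toy stand-in data (the file names follow the question; `x_train`, `y_train`, and the random data are illustrative assumptions):

```python
import joblib  # recent scikit-learn; older versions used: from sklearn.externals import joblib
import numpy as np
from sklearn import preprocessing, svm

# toy stand-ins for the question's training data
x_train = np.random.RandomState(0).rand(20, 3)
y_train = np.array([0, 1] * 10)

# fit the scaler on the TRAINING data only, then train on the scaled data
scaler = preprocessing.StandardScaler().fit(x_train)
model = svm.SVC(kernel="rbf").fit(scaler.transform(x_train), y_train)

# save BOTH objects, not just the model
joblib.dump(scaler, "my_scaler.save")
joblib.dump(model, "mySVM_100ms_rbf_model.sav")

# later, in a fresh session: load both and reuse the *training* scaler on new data
scaler = joblib.load("my_scaler.save")
model = joblib.load("mySVM_100ms_rbf_model.sav")
preds = model.predict(scaler.transform(x_train))
```

The key point is that the loaded scaler carries the training set's mean and standard deviation, so new data is transformed exactly the way the model expects.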

This A5.csv you're using is totally new data, right? Ideally it would have the same mean and standard deviation for each feature, but it looks like they may be just different enough to throw off the model. You should also be doing cross-validation on the data you have and checking for signs of over-fitting (like training accuracy >>> test accuracy).
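A quick sketch of that cross-validation check, using synthetic data as a stand-in for the question's data set; bundling the scaler and classifier in a `Pipeline` also means the scaling is re-fit inside each fold:

```python
import numpy as np
from sklearn import preprocessing, svm
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# synthetic stand-in data (illustrative only)
rng = np.random.RandomState(0)
X = rng.rand(60, 3)
y = (X[:, 0] > 0.5).astype(int)

# scaler + model as one estimator: one object to fit, save, and load
clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, cv=5)

# a large gap between training accuracy and these CV scores is a sign of over-fitting
print(scores.mean(), scores.std())
```

As a bonus, pickling or joblib-dumping the fitted pipeline saves the scaler and the model together, which avoids the problem in the question entirely.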

shabuki