I have tried to include as much information as possible about my problem.
I have a question about the preprocessing functions in the scikit-learn library. My data set is divided into three parts: train, test, and holdout.
I am using StandardScaler to preprocess the data.
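### preprocessing of data
from sklearn import preprocessing
# fit the scaler on the training data only, then reuse it for every split
scaler = preprocessing.StandardScaler().fit(x_train)
X_train = scaler.transform(x_train)
X_holdout = scaler.transform(x_holdout)
X_test = scaler.transform(x_test)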
I applied the same fitted scaler to the train, test, and holdout data, as described in the scikit-learn documentation, and I get the expected results. However, when I try to predict on new data with the saved model, I noticed two things:
1) When I transform the new data with the same scaler that was fitted on the training data, I get good accuracy.
# save the model to disk
import pickle
filename = 'mySVM_100ms_rbf_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(svclassifier_rbf, f)
# load the model from disk
with open(filename, 'rb') as f:
    loaded_model_rbf1 = pickle.load(f)
# Predicting the new data with the saved model
import pandas as pd
df2 = pd.read_csv("A5.csv", na_values=['NA', '?'])  # reading new data
# Preprocessing using the previous scaler (fitted on the training data)
df_2 = scaler.transform(df2)
# Prediction with the training scaler
newdata_pred_rbf1 = loaded_model_rbf1.predict(df_2)  # 90 % accuracy
2) But after I close my program, reload the trained model, and try to predict on the new data, I get only 1 percent accuracy.
# New preprocessing, because the previous scaler was not saved when the program closed
scaler = preprocessing.StandardScaler().fit(df2)  # scaler is fitted again, this time on the new data
df_2 = scaler.transform(df2)  # with the new scaler
# Prediction with the newly fitted scaler
newdata_pred_rbf1 = loaded_model_rbf1.predict(df_2)  # 1 % accuracy
In short, when I predict on the new data using the scaler variable from the training session, I get good results. But once I close the program and load the saved model again, the scaler is gone, so I have to fit a new scaler on the new data before prediction, and the accuracy collapses.
Could you please explain what mistake I am making?
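One way this situation is usually avoided is to save the fitted scaler together with the model, so that nothing has to be refitted on the new data after a restart. The following is only a minimal sketch under stated assumptions: the arrays are synthetic stand-ins for x_train and the contents of A5.csv, and the file name svm_with_scaler.pkl is hypothetical. It wraps the scaler and an RBF SVM in a scikit-learn pipeline and pickles the whole pipeline as one object.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical; takes the place of x_train and A5.csv)
rng = np.random.default_rng(0)
x_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_train = (x_train[:, 0] > 50.0).astype(int)
x_new = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# The pipeline fits the scaler on the training data and keeps it next to the SVM
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(x_train, y_train)
pred_before = model.predict(x_new)

# Save scaler and classifier together in a single file
path = os.path.join(tempfile.mkdtemp(), "svm_with_scaler.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later, e.g. in a fresh session: load and predict on raw (unscaled) new data
with open(path, "rb") as f:
    loaded_model = pickle.load(f)
pred_after = loaded_model.predict(x_new)

# The reloaded pipeline reuses the training statistics, so predictions match
same_predictions = bool(np.array_equal(pred_before, pred_after))
```

With this layout the new data is passed to predict unscaled, because the pipeline applies the training-time mean and variance itself. Pickling the existing scaler object to its own file next to the model file would be an alternative with the same effect.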