
I'm trying to use a batch gradient descent approach to do linear regression on a large dataset: I load up as much data as my computer can handle, do a partial fit, print some diagnostics to a CSV, and repeat.

Python pseudocode:

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

m_scaler = StandardScaler()
m_data = <3600*100*24*9 samples, with 15 X channels, and one y channel>
# note: partial_fit runs one epoch per call, so n_iter (max_iter in newer versions) is ignored
sgdR_01 = SGDRegressor(n_iter=m_iter, alpha=10.0**-3)
for i in range(100):
    df = <select_360000_samples from m_data>
    df = <preprocess>
    # incrementally update the scaler on this chunk, then standardize the chunk
    m_scaler.partial_fit(df[x_channels])
    df[x_channels] = m_scaler.transform(df[x_channels])
    # train_test_split returns X_train, X_test, y_train, y_test in that order
    X_train, X_test, y_train, y_test = train_test_split(df[x_channels], df[y_channel])
    sgdR_01.partial_fit(X_train, y_train)
    <track train/test score, train/test MSE, and coefficients for sgdR_01>
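
The <track ...> step just appends one row per chunk to a CSV, roughly like this (the file name and column layout are simply what I picked, nothing special):

import csv
from sklearn.metrics import mean_squared_error

with open('sgd_diagnostics.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([
        i,
        sgdR_01.score(X_train, y_train),                        # train R^2
        sgdR_01.score(X_test, y_test),                          # test R^2
        mean_squared_error(y_train, sgdR_01.predict(X_train)),  # train MSE
        mean_squared_error(y_test, sgdR_01.predict(X_test)),    # test MSE
        *sgdR_01.coef_, sgdR_01.intercept_[0],                  # current coefficients
    ])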



preprocessing steps (rough sketch below):
    add polynomial combinations of certain channels
    oversample so y has a 'flat histogram'
    randomly select ~200,000 samples so my computer can handle the data
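
For what it's worth, here is roughly what <preprocess> does (the function, column, and parameter names are simplified stand-ins, not my exact code):

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

def preprocess(df, poly_channels, y_channel, n_keep=200000, n_bins=50):
    # add polynomial combinations of certain channels
    poly = PolynomialFeatures(degree=2, include_bias=False)
    poly_df = pd.DataFrame(poly.fit_transform(df[poly_channels]),
                           columns=poly.get_feature_names_out(poly_channels),
                           index=df.index)
    df = df.join(poly_df, rsuffix='_poly')

    # oversample so y has a roughly flat histogram:
    # draw the same number of rows (with replacement) from each y bin
    bins = pd.cut(df[y_channel], bins=n_bins)
    df = (df.groupby(bins, group_keys=False)
            .apply(lambda g: g.sample(n=n_keep // n_bins, replace=True)
                   if len(g) > 0 else g))

    # randomly select ~200,000 samples so my computer can handle the data
    return df.sample(n=min(n_keep, len(df)))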

Right now I'm hovering around 0.65 for my R^2 score, but every few iterations the score drops to something like -50 or -900, and at the next iteration it's back around 0.65. What's going on when that happens? Why is SGDRegressor so erratic?
