
I have a model of some financial data that achieves an R² of ~0.01 (1%) using RidgeCV, which is about what I expect. I'm exploring building the equivalent model with SGDRegressor so I can use partial_fit to train incrementally on larger-than-memory datasets. Unfortunately, my SGDRegressor does not converge to the same R² as RidgeCV, even with the equivalent L2 penalty, squared-error loss and alpha value. I've tried fiddling with many parameters, but it's not clear how to pick them.

Any advice on getting SGDRegressor to converge similarly to RidgeCV? E.g., if I have a 4M-row dataset, is it better to run tens of thousands of epochs over many small (32-row) batches with partial_fit, or to use large batches (500,000 rows)?


    from sklearn.linear_model import SGDRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # df_targets (DataFrame), features (list of column names) and hyper (dict)
    # are defined elsewhere.
    EPOCH_COUNT = 10000
    CHUNK_SIZE = 500000

    # Initialize SGDRegressor and StandardScaler
    sgd = SGDRegressor(
        verbose=0,
        tol=1e-4,
        loss='squared_error',
        penalty='l2',
        alpha=0.1,
        learning_rate='constant',
        eta0=0.01,
        fit_intercept=True,
        # play with this
        shuffle=True,
    )

    scaler = StandardScaler()
    regmodel = make_pipeline(StandardScaler(), sgd)  # not used below

    r2s = []
    r2_test = []

    # Load data in chunks
    for i in range(0, len(df_targets), CHUNK_SIZE):
        df_chunk = df_targets.iloc[i:i + CHUNK_SIZE]

        # Stop at the first incomplete chunk
        if len(df_chunk) != CHUNK_SIZE:
            print('skipping chunk {} as len(df_chunk) = {} != CHUNK_SIZE = {}'.format(i, len(df_chunk), CHUNK_SIZE))
            break

        # Split df_chunk into train and test sets using sklearn
        chunk, chunk_test = train_test_split(df_chunk, test_size=0.1, shuffle=False)

        y = chunk[hyper['target_name']].values
        X = chunk[features].values

        X_test = chunk_test[features].values
        y_test = chunk_test[hyper['target_name']].values

        # Hack: fit the scaler on the first chunk only
        if i == 0:
            print('fitting scaler on first chunk')
            scaler = scaler.fit(X)
        X = scaler.transform(X, copy=False)
        X_test = scaler.transform(X_test, copy=False)

        # Each partial_fit call is one pass over the current chunk
        for j in range(EPOCH_COUNT):
            sgd.partial_fit(X, y)
            level_2_r2_score = sgd.score(X, y)
            r2s.append(level_2_r2_score)

            r2_score_test = sgd.score(X_test, y_test)
            r2_test.append(r2_score_test)

            print('chunk {}  - EPOCH {} - ===  > r2_train = {} - r2_test = {} '.format(i, j, level_2_r2_score, r2_score_test))
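
On the scaler hack above: StandardScaler also has a partial_fit method, so one possible alternative is a cheap first pass over all chunks to accumulate the mean and variance before any SGD training. A minimal sketch under that assumption, with synthetic arrays standing in for the DataFrame chunks (`fit_scaler_incrementally` is a hypothetical helper, not part of scikit-learn):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def fit_scaler_incrementally(chunks):
    """Accumulate mean/variance over an iterable of 2-D arrays."""
    scaler = StandardScaler()
    for chunk in chunks:
        scaler.partial_fit(chunk)
    return scaler


# Synthetic stand-in for the feature chunks
rng = np.random.default_rng(0)
X_all = rng.normal(loc=3.0, scale=2.0, size=(1000, 3))
chunks = np.array_split(X_all, 4)

scaler = fit_scaler_incrementally(chunks)

# The incremental statistics should agree with the full-data statistics
print("incremental mean:", scaler.mean_)
print("full-data mean:  ", X_all.mean(axis=0))
```

This costs an extra pass over the data, but every chunk is then scaled with statistics from the whole dataset rather than from the first chunk alone.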
