Getting SGDRegressor to converge to equivalent RidgeCV R2 results

Question

I have a model of some financial data that achieves an R2 of ~0.01 (1%) using RidgeCV, which is about what I expect. I'm exploring building the equivalent model using SGDRegressor so I can use partial_fit to train incrementally on larger-than-memory data sets. Unfortunately, my SGDRegressor does not converge to the same R2 as RidgeCV, even with equivalent settings (L2 penalty, squared-error loss, and the same alpha). I've tried fiddling with many parameters, but it's not clear how to pick them.

Any advice on getting SGDRegressor to converge similarly to RidgeCV? E.g., if I have a 4M-row data set, is it better to run tens of thousands of epochs over many small batches (32 rows) with partial_fit, or to use large batches (500,000 rows)?


from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# df_targets (DataFrame), features (list of column names) and
# hyper['target_name'] are assumed to be defined earlier (not shown).

EPOCH_COUNT = 10000
CHUNK_SIZE = 500000

# Initialize SGDRegressor and StandardScaler
sgd = SGDRegressor(verbose=0,
                   tol=1e-4,
                   loss='squared_error',
                   penalty='l2',
                   alpha=0.1,
                   learning_rate='constant',
                   eta0=0.01,
                   fit_intercept=True,  # play with this
                   shuffle=True,
                   )

scaler = StandardScaler()
regmodel = make_pipeline(StandardScaler(), sgd)  # not used below; scaling is applied manually
r2s = []
r2_test = []

# Load data in chunks
for i in range(0, len(df_targets), CHUNK_SIZE):
    df_chunk = df_targets.iloc[i:i+CHUNK_SIZE]
    # split df_chunk into train and test sets using sklearn
    chunk, chunk_test = train_test_split(df_chunk, test_size=0.1, shuffle=False)

    y = chunk[hyper['target_name']].values
    X = chunk[features].values

    X_test = chunk_test[features].values
    y_test = chunk_test[hyper['target_name']].values

    # Stop at the final, partial chunk
    if len(df_chunk) != CHUNK_SIZE:
        print('skipping chunk {} as len(X) = {} != CHUNK_SIZE = {}'.format(i, len(X), CHUNK_SIZE))
        break

    # Hack
    if i == 0:
        # Fit the scaler on the first chunk only
        print('fitting scaler on first chunk')
        scaler = scaler.fit(X)
    X = scaler.transform(X, copy=False)
    X_test = scaler.transform(X_test, copy=False)

    # Repeated passes over the same chunk
    for j in range(0, EPOCH_COUNT):
        sgd.partial_fit(X, y)
        level_2_r2_score = sgd.score(X, y)
        r2s.append(level_2_r2_score)

        r2_score_test = sgd.score(X_test, y_test)
        r2_test.append(r2_score_test)

        print('chunk {}  - EPOCH {} - ===  > r2_train = {} - r2_test = {} '.format(i, j, level_2_r2_score, r2_score_test))
